OSWorld AI Agents Outperform Humans as GPT-5.2 Struggles with Censorship & Physics
Model Performance & Benchmarks
GPT-5.2 Underperforms on Physics and Censorship Benchmarks
- OpenAI’s GPT-5.2 (xhigh) scored 0% on the CritPt physics reasoning benchmark, trailing behind Gemini Pro Preview (9.1%) and DeepSeek V3.3 (7.4%). The model also ranked as the most censored on the Sansa benchmark, with Llama-3-8b-Instruct and Mistral-8b showing the least censorship.
OSWorld: AI Agents Match Human Performance on Real Computer Tasks
- A new OSWorld benchmark shows an AI agent ("agent s3 w/ Opus 4.5 + GPT-5 bBoN") achieving a 72.6% success rate across 369 real computer tasks, nearly matching reported human performance (72.36%).
New Model Releases & Optimizations
Devstral 2 & Devstral Small 2 Now Available in LM Studio for Apple Silicon
- Devstral 2 and Devstral Small 2 are now optimized for MLX on Apple Silicon, enabling local execution on Apple devices via LM Studio.
Qwen3 Next-Gen Optimization Boosts Speed by 40%
- Qwen3 received optimizations (e.g., short-circuiting recurrent decay, reshaping improvements) resulting in a 40% generation speed upgrade. Users are encouraged to test the updated version.
Local LLM & Hardware Performance
Running Large LLMs Efficiently on NVIDIA Thor
- A Qwen3-Next-80B-A3B-Instruct-NVFP4 (MOE model) was successfully deployed on NVIDIA Thor using VLLM + Docker, achieving fast inference. The setup supports DGX Spark and includes QWEN Image for multimodal tasks.
Mistral 3 Benchmarks on llama.cpp (Vulkan Backend)
- Mistral 3 performance metrics (tokens/sec) were benchmarked across multiple GPUs (e.g., RX 7900 GRE, GTX 1080 Ti) using llama.cpp with Vulkan backend and quantization. Results highlight efficiency gains on various hardware setups.
AI Tools & Developer Experiences
Mistral Vibe: First Impressions as a Coding Assistant
- A user shared their first experience with Mistral Vibe, a coding assistant, praising its easy installation and ability to generate a functional C# app from scratch. The tool is positioned as a ChatGPT alternative for developers.