Sign in Subscribe

14 Dec 2025 2 min read

OSWorld AI Agents Outperform Humans as GPT-5.2 Struggles with Censorship & Physics

Model Performance & Benchmarks

GPT-5.2 Underperforms on Physics and Censorship Benchmarks

OpenAI’s GPT-5.2 (xhigh) scored 0% on the CritPt physics reasoning benchmark, trailing behind Gemini Pro Preview (9.1%) and DeepSeek V3.3 (7.4%). The model also ranked as the most censored on the Sansa benchmark, with Llama-3-8b-Instruct and Mistral-8b showing the least censorship.

OSWorld: AI Agents Match Human Performance on Real Computer Tasks

A new OSWorld benchmark shows an AI agent ("agent s3 w/ Opus 4.5 + GPT-5 bBoN") achieving a 72.6% success rate across 369 real computer tasks, nearly matching reported human performance (72.36%).
- OSWorld result: 72.6% success on 369 real computer tasks

New Model Releases & Optimizations

Devstral 2 & Devstral Small 2 Now Available in LM Studio for Apple Silicon

Devstral 2 and Devstral Small 2 are now optimized for MLX on Apple Silicon, enabling local execution on Apple devices via LM Studio.
- Devstral 2 and Devstral Small 2 in LM Studio - Now in MLX on Apple Silicon
  - LM Studio

Qwen3 Next-Gen Optimization Boosts Speed by 40%

Qwen3 received optimizations (e.g., short-circuiting recurrent decay, reshaping improvements) resulting in a 40% generation speed upgrade. Users are encouraged to test the updated version.
- Qwen3 Next generation optimization

Local LLM & Hardware Performance

Running Large LLMs Efficiently on NVIDIA Thor

A Qwen3-Next-80B-A3B-Instruct-NVFP4 (MOE model) was successfully deployed on NVIDIA Thor using VLLM + Docker, achieving fast inference. The setup supports DGX Spark and includes QWEN Image for multimodal tasks.
- Success on running a large, useful LLM fast on NVIDIA Thor!

Mistral 3 Benchmarks on llama.cpp (Vulkan Backend)

Mistral 3 performance metrics (tokens/sec) were benchmarked across multiple GPUs (e.g., RX 7900 GRE, GTX 1080 Ti) using llama.cpp with Vulkan backend and quantization. Results highlight efficiency gains on various hardware setups.
- Mistral 3 llama.cpp benchmarks
  - Ministral-3 GGUF Model
  - llama.cpp GitHub

AI Tools & Developer Experiences

Mistral Vibe: First Impressions as a Coding Assistant

A user shared their first experience with Mistral Vibe, a coding assistant, praising its easy installation and ability to generate a functional C# app from scratch. The tool is positioned as a ChatGPT alternative for developers.
- [Usage experience] First experience with Vibe
  - Mistral Vibe Docs
  - Mistral Console