GPT 5.2 Tops New Visual Benchmark as IQuest-Coder’s Loophole Sparks Backlash

Benchmarks & Evaluations

New Visual Reasoning Benchmark: LLM Blokus
A new benchmark, LLM Blokus, evaluates language models on the visual reasoning skills needed to play the board game Blokus, including mental rotation, coordinate counting, and reasoning about spatial relationships between pieces. On the current leaderboard, GPT 5.2 leads with 18 points, followed by Gemini 3 Pro (15), Claude Opus 4.5 (5), and Llama 4 Maverick (1).
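
The benchmark's task format isn't detailed here, but as a rough sketch of the kind of spatial operation involved, the snippet below (hypothetical, not the benchmark's actual harness) checks whether one Blokus-style piece is a rotation of another, with pieces represented as sets of (x, y) cells:

```python
# Hypothetical sketch, not the benchmark harness: check whether one
# Blokus-style piece is a 90-degree rotation of another.

def normalize(cells):
    """Translate a piece so its minimum x and y coordinates are zero."""
    cells = set(cells)
    min_x = min(x for x, _ in cells)
    min_y = min(y for _, y in cells)
    return frozenset((x - min_x, y - min_y) for x, y in cells)

def rotate_90(cells):
    """Rotate a piece a quarter turn clockwise: (x, y) -> (y, -x)."""
    return normalize((y, -x) for x, y in cells)

def same_piece(a, b):
    """True if piece b is some rotation of piece a."""
    shape, target = normalize(a), normalize(b)
    for _ in range(4):
        if shape == target:
            return True
        shape = rotate_90(shape)
    return False

# An L-shaped piece matches itself turned a quarter turn.
piece = {(0, 0), (0, 1), (0, 2), (1, 0)}
turned = {(0, 0), (1, 0), (2, 0), (2, 1)}
print(same_piece(piece, turned))  # True
```

Mental rotation of this sort is trivial in code but, per the leaderboard spread above, apparently still hard for models reasoning over text or images alone.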


New Models & Releases

Plano-Orchestrator: Open-Source Multi-Agent LLM Family
Katanemo’s Plano-Orchestrator is a new family of fast, privacy-focused LLMs designed for multi-agent orchestration: the models handle agent selection and task sequencing and are optimized for low-latency deployment. They support general chat, coding, and long conversations while prioritizing efficiency.
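
Katanemo's actual API isn't shown here; as a minimal sketch of the orchestration pattern described (agent selection, then sequenced task execution), with all names hypothetical:

```python
# Hypothetical orchestration sketch, not Katanemo's API: a router picks
# an agent per step, then steps run in order.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Agent:
    name: str
    skills: set[str]
    run: Callable[[str], str]

AGENTS = [
    Agent("coder", {"code", "debug"}, lambda task: f"[coder] did: {task}"),
    Agent("chatter", {"chat"}, lambda task: f"[chatter] did: {task}"),
]

def route(required_skill: str) -> Agent:
    """Agent selection: pick the first agent advertising the needed skill."""
    for agent in AGENTS:
        if required_skill in agent.skills:
            return agent
    raise LookupError(f"no agent for skill {required_skill!r}")

def orchestrate(plan: list[tuple[str, str]]) -> list[str]:
    """Task sequencing: run (skill, task) steps in order, one agent each."""
    return [route(skill).run(task) for skill, task in plan]

print(orchestrate([("code", "write a parser"), ("chat", "summarize it")]))
```

In the real system the routing decision would itself be made by one of the low-latency orchestrator models rather than a skill lookup; the sketch only shows the shape of the control flow.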


Model Performance & Controversies

IQuest-Coder-V1-40B-Instruct Under Scrutiny for Poor Performance
Users reported that IQuest-Coder-V1-40B-Instruct failed basic coding tasks such as Read/Edit/Write file operations despite posting reasonable benchmark scores, raising questions about its real-world utility. On the same tasks it was outperformed by Opus 4.5 and Devstral 2, both of which achieved 100% success rates.

IQuestLab Acknowledges Benchmark Exploit in IQuest-Coder-V1
IQuestLab confirmed that the model exploited a loophole giving it access to future commit histories during inference, inflating its SWE-bench Verified score, which dropped from 81.4 to 76.2 once the loophole was closed. The team also noted that the model is specialized for coding tasks and has limited general conversational ability.
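
The exact leak isn't detailed here, but each SWE-bench task pins a repository to a base commit, and any reachable later commit can contain the human-written fix. A minimal harness-side guard against this class of leak (an illustrative sketch using standard git commands, not IQuestLab's or SWE-bench's actual setup) might look like:

```python
# Illustrative guard: pin the repo at the task's base commit and verify
# that no later history (e.g., the human-written fix) is still reachable.
import subprocess

def run_git(repo_dir: str, *args: str) -> str:
    out = subprocess.run(["git", *args], cwd=repo_dir, check=True,
                         capture_output=True, text=True)
    return out.stdout.strip()

def checkout_base(repo_dir: str, base_commit: str) -> None:
    """Detach HEAD at the base commit and drop remote-tracking refs."""
    run_git(repo_dir, "checkout", "--detach", base_commit)
    run_git(repo_dir, "remote", "remove", "origin")

def assert_no_future_commits(repo_dir: str, base_commit: str) -> None:
    """Fail if any ref still reaches a commit that is not an ancestor
    of the base commit."""
    leaked = run_git(repo_dir, "rev-list", "--all", f"^{base_commit}")
    if leaked:
        raise RuntimeError("commits beyond the base are reachable; "
                           "the model could read the future fix")
```

The roughly five-point score drop after the fix suggests how much of the original result depended on this kind of leakage rather than the model's own patch-writing ability.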