AI Fails Radiology Benchmark (30% vs. Humans’ 83%) as GLM 4.6 Tops Coding Leaderboards
AI Benchmarks & Evaluations
“Radiology’s Last Exam” Launched: AI Struggles in Specialized Medical Benchmark
A new radiology benchmark, "Radiology’s Last Exam," reveals significant gaps in AI performance: board-certified radiologists scored 83%, trainees 45%, while GPT-5 achieved only 30% and Claude Opus 4.1 just 1%. The results underscore current AI limitations in high-stakes medical domains.
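The summary doesn't describe the benchmark's scoring methodology, but results like these typically come from an accuracy harness over gold-labeled cases. Below is a minimal sketch assuming a hypothetical multiple-choice format with one gold answer per case; `Case` and `ask_model` are illustrative names, not part of the actual benchmark.

```python
# Hypothetical sketch of scoring a model on a multiple-choice radiology benchmark.
# The real "Radiology's Last Exam" format may differ; this only illustrates the
# accuracy computation behind headline numbers like 83% vs. 30%.

from dataclasses import dataclass

@dataclass
class Case:
    question: str
    options: dict[str, str]  # e.g. {"A": "pneumothorax", "B": "pleural effusion"}
    answer: str              # gold option key, e.g. "A"

def ask_model(case: Case) -> str:
    """Placeholder: send the question and options to a model, return its chosen key."""
    raise NotImplementedError

def accuracy(cases: list[Case]) -> float:
    # Fraction of cases where the model's pick matches the gold answer.
    correct = sum(1 for c in cases if ask_model(c) == c.answer)
    return correct / len(cases)
```

Under this framing, the reported scores map directly to the `accuracy` value: roughly 0.83 for board-certified radiologists versus 0.30 for GPT-5.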
New AI Models
Qwen3-VL-30B-A3B Series Released: Instruct & Thinking Variants
Alibaba’s Qwen team launched two new vision-language models, Qwen3-VL-30B-A3B-Instruct and Qwen3-VL-30B-A3B-Thinking, now available on Hugging Face. These models expand open-source options for multimodal AI applications.
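For readers who want to try the Instruct variant, here is a minimal loading sketch. It assumes the repo id follows Qwen's usual Hugging Face naming and that the model works with the generic image-text-to-text auto classes in recent `transformers`; check the model card for the officially supported loading path.

```python
# Sketch: loading Qwen3-VL-30B-A3B-Instruct from Hugging Face (assumed repo id;
# verify the exact id and supported classes on the model card).

import torch
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "Qwen/Qwen3-VL-30B-A3B-Instruct"  # assumption, per Qwen naming convention

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # 30B parameters: expect multi-GPU or offloading
    device_map="auto",
)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "url": "https://example.com/chart.png"},  # placeholder URL
        {"type": "text", "text": "Describe this chart."},
    ],
}]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(model.device)

output = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(output, skip_special_tokens=True)[0])
```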
GLM 4.6 Dominates Leaderboards: Top Open-Weight Model in Coding & Reasoning
GLM 4.6 has surged to the top of multiple leaderboards, including LM Arena (overall and text) and the Berkeley Function Calling Leaderboard (BFCL v4), excelling in coding, hard prompts, and creative writing. Users highlight its autonomy and tool-call accuracy as standout features; a minimal tool-calling sketch follows the reactions below.
- Community post: “GLM 4.6 IS A FUKING AMAZING MODEL AND NOBODY CAN TELL ME OTHERWISE”
- “GLM 4.6 new best open weight overall on lmarena”
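What BFCL-style evaluations measure is whether a model picks the right function and emits well-formed arguments. The sketch below exercises that round trip through an OpenAI-compatible chat endpoint; the base URL and `glm-4.6` model identifier are assumptions to be replaced with your provider's actual values.

```python
# Sketch of a tool-call round trip of the kind BFCL evaluates: the model must
# select the correct function and produce valid JSON arguments.
# Endpoint URL and model name are placeholders, not confirmed values.

from openai import OpenAI

client = OpenAI(base_url="https://api.example.com/v1", api_key="YOUR_KEY")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="glm-4.6",  # assumed model identifier
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
)

# A correct response calls get_weather with arguments {"city": "Paris"}.
for call in resp.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)
```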
AI Products & Service Updates
Perplexity AI October Updates: Comet Browser, Slack Integration, and New Models
Perplexity rolled out multiple features on October 3rd, including:
- Global launch of the Comet browser (Perplexity’s AI-native browser).
- Background assistants for Max subscribers and a Slack connector.
- Integration of Anthropic’s Claude Sonnet 4.5 and 4.5 Thinking, plus maps for place results, currency conversion, and Study Mode for all users.
Source: [Perplexity Changelog] What We Shipped – October 3rd 🚢