AI Fails Radiology Benchmark (30% vs. Humans’ 83%) as GLM 4.6 Tops Coding Leaderboards
AI Benchmarks & Evaluations
“Radiology’s Last Exam” Launched: AI Struggles in Specialized Medical Benchmark
A new radiology benchmark, "Radiology’s Last Exam," reveals significant gaps in AI performance: board-certified radiologists scored 83%, trainees 45%, while GPT-5 achieved only 30% and Claude Opus 4.1 just 1%. The results underscore current AI limitations in high-stakes medical domains.
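The summary doesn't describe the benchmark's scoring methodology, but results like these typically come from an accuracy harness over gold-labeled cases. Below is a minimal sketch assuming a hypothetical multiple-choice format with one gold answer per case; `Case` and `ask_model` are illustrative names, not part of the actual benchmark.

```python
# Hypothetical sketch of scoring a model on a multiple-choice radiology benchmark.
# The real "Radiology's Last Exam" format may differ; this only illustrates the
# accuracy computation behind headline numbers like 83% vs. 30%.

from dataclasses import dataclass

@dataclass
class Case:
    question: str
    options: dict[str, str]  # e.g. {"A": "pneumothorax", "B": "pleural effusion"}
    answer: str              # gold option key, e.g. "A"

def ask_model(case: Case) -> str:
    """Placeholder: send the question and options to a model, return its chosen key."""
    raise NotImplementedError

def accuracy(cases: list[Case]) -> float:
    # Fraction of cases where the model's pick matches the gold answer.
    correct = sum(1 for c in cases if ask_model(c) == c.answer)
    return correct / len(cases)
```

Under this framing, the reported scores map directly to the `accuracy` value: roughly 0.83 for board-certified radiologists versus 0.30 for GPT-5.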
New AI Models
Qwen3-VL-30B-A3B Series Released: Instruct & Thinking Variants
Alibaba’s Qwen team launched two new vision-language models, Qwen3-VL-30B-A3B-Instruct and Qwen3-VL-30B-A3B-Thinking, now available on Hugging Face. These models expand open-source options for multimodal AI applications.
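For readers who want to try the Instruct variant, here is a minimal loading sketch. It assumes the repo id follows Qwen's usual Hugging Face naming and that the model works with the generic image-text-to-text auto classes in recent `transformers`; check the model card for the officially supported loading path.

```python
# Sketch: loading Qwen3-VL-30B-A3B-Instruct from Hugging Face (assumed repo id;
# verify the exact id and supported classes on the model card).

import torch
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "Qwen/Qwen3-VL-30B-A3B-Instruct"  # assumption, per Qwen naming convention

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # 30B parameters: expect multi-GPU or offloading
    device_map="auto",
)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "url": "https://example.com/chart.png"},  # placeholder URL
        {"type": "text", "text": "Describe this chart."},
    ],
}]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(model.device)

output = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(output, skip_special_tokens=True)[0])
```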
GLM 4.6 Dominates Leaderboards: Top Open-Weight Model in Coding & Reasoning
GLM 4.6 has surged to the top of multiple leaderboards, including LM Arena (overall and text) and the Berkeley Function Calling Leaderboard (BFCL v4), excelling in coding, hard prompts, and creative writing. Users highlight its autonomy and tool-call accuracy as standout features; a minimal tool-calling sketch follows the reactions below.
- Community post: “GLM 4.6 IS A FUKING AMAZING MODEL AND NOBODY CAN TELL ME OTHERWISE”
- “GLM 4.6 new best open weight overall on lmarena”
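What BFCL-style evaluations measure is whether a model picks the right function and emits well-formed arguments. The sketch below exercises that round trip through an OpenAI-compatible chat endpoint; the base URL and `glm-4.6` model identifier are assumptions to be replaced with your provider's actual values.

```python
# Sketch of a tool-call round trip of the kind BFCL evaluates: the model must
# select the correct function and produce valid JSON arguments.
# Endpoint URL and model name are placeholders, not confirmed values.

from openai import OpenAI

client = OpenAI(base_url="https://api.example.com/v1", api_key="YOUR_KEY")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="glm-4.6",  # assumed model identifier
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
)

# A correct response calls get_weather with arguments {"city": "Paris"}.
for call in resp.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)
```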
AI Products & Service Updates
Perplexity AI October Updates: Comet Browser, Slack Integration, and New Models
Perplexity rolled out multiple features on October 3rd, including:
- Global launch of the Comet browser (Perplexity’s AI-native browser).
- Background assistants for Max subscribers and a Slack connector.
- Integration of Anthropic’s Claude Sonnet 4.5 and 4.5 Thinking, plus maps for place results, currency conversion, and Study Mode for all users.
Source: [Perplexity Changelog] What We Shipped – October 3rd 🚢