AI Fails Radiology Benchmark (30% vs. Humans’ 83%) as GLM 4.6 Tops Coding Leaderboards

AI Benchmarks & Evaluations

“Radiology’s Last Exam” Launched: AI Struggles in Specialized Medical Benchmark
A new radiology benchmark, “Radiology’s Last Exam,” reveals a wide gap between AI models and human experts: board-certified radiologists scored 83% and trainees 45%, while GPT-5 reached only 30% and Claude Opus 4.1 just 1%. The results underscore current AI limitations in high-stakes medical domains.


New AI Models

Qwen3-VL-30B-A3B Series Released: Instruct & Thinking Variants
Alibaba’s Qwen team launched two new vision-language models, Qwen3-VL-30B-A3B-Instruct and Qwen3-VL-30B-A3B-Thinking, now available on Hugging Face. These models expand open-source options for multimodal AI applications.


GLM 4.6 Dominates Leaderboards: Top Open-Weight Model in Coding & Reasoning
GLM 4.6 now ranks as the top open-weight model on multiple leaderboards, including LM Arena (overall/text) and the Berkeley Function Calling Leaderboard (BFCL v4), with strong results in coding, hard prompts, and creative writing. Users highlight its autonomy and tool-call accuracy as standout features.


AI Products & Service Updates

Perplexity AI October Updates: Comet Browser, Slack Integration, and New Models
Perplexity rolled out multiple features on October 3rd, including: