Anthropic’s ‘Mythos’ Emerges Amid Allegations of Data Contamination in Frontier AI Benchmarks

Large Language Models & Frontier Research

Anthropic Testing New 'Mythos' Model: Anthropic has begun testing a new high-end model named 'Mythos,' which reportedly surpasses the capabilities of its current Opus line. Part of a new 'Capybara' tier, the model is said to show major improvements in reasoning and coding, though its rollout is being handled cautiously because of its powerful cybersecurity capabilities.

ARC-AGI 3 Paper Questions Frontier Model Benchmarks: A new research paper alleges that frontier models, including Gemini 3, may have inflated their ARC-AGI scores through memorization of training data rather than true reasoning. This potential data contamination raises concerns about the validity of current benchmarks used to measure AI generalization.

Audio & Multimodal AI

Mistral AI Releases Voxtral TTS: Mistral AI has launched Voxtral TTS, a 3-billion-parameter open-weight text-to-speech model that supports nine languages. The model reportedly outperforms ElevenLabs Flash v2.5 in human preference tests and is optimized for local use, requiring only 3 GB of RAM.

AI Performance & Efficiency

Qwen 3.5 Achieves 1.1 Million Tokens Per Second: Google Cloud has run the Qwen 3.5 27B model at an aggregate throughput of over 1.1 million tokens per second. The result was achieved using 96 NVIDIA B200 GPUs on Google Kubernetes Engine and demonstrates how far inference throughput can scale in large deployments.
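
To put the aggregate figure in perspective, it can be divided across the GPUs. The sketch below is back-of-the-envelope arithmetic only: it assumes the load was spread evenly over all 96 GPUs and treats 1.1 million as a lower bound, since the reported figure is "over 1.1 million." The function name is illustrative and not taken from any Google Cloud tooling.

```python
def per_gpu_throughput(total_tokens_per_second: float, gpu_count: int) -> float:
    """Average tokens per second per GPU, assuming an even split across GPUs."""
    if gpu_count <= 0:
        raise ValueError("gpu_count must be positive")
    return total_tokens_per_second / gpu_count


# Figures from the item above; 1.1M is a lower bound on the reported throughput.
TOTAL_TOKENS_PER_SECOND = 1_100_000
GPU_COUNT = 96

per_gpu = per_gpu_throughput(TOTAL_TOKENS_PER_SECOND, GPU_COUNT)
print(f"{per_gpu:,.0f} tokens/s per GPU")  # prints "11,458 tokens/s per GPU"
```

This is an average across many concurrent requests, not the speed any single request would see.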

TurboQuant Integration in Llama.cpp: Google's new TurboQuant quantization method is being benchmarked in llama.cpp, offering a way to drastically reduce KV cache size. This technology promises more efficient local inference by enabling extreme compression with minimal impact on model performance.
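
Why KV cache size matters for local inference can be seen with a rough size estimate. The sketch below uses the standard accounting for a transformer KV cache (one key and one value vector per layer, per KV head, per token). The model dimensions and the 4-bit setting are hypothetical values chosen for illustration; they are not TurboQuant's actual configuration or results, and the estimate ignores any per-block metadata (such as scales) that a real quantization scheme stores.

```python
def kv_cache_bytes(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    seq_len: int,
    bits_per_value: float,
) -> float:
    """Approximate KV cache size in bytes for a single sequence."""
    # Factor of 2: each token stores both a key and a value vector.
    total_values = 2 * num_layers * num_kv_heads * head_dim * seq_len
    return total_values * bits_per_value / 8


# Hypothetical model shape, for illustration only.
shape = dict(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=32_768)

GIB = 2**30
fp16_size = kv_cache_bytes(**shape, bits_per_value=16)
four_bit_size = kv_cache_bytes(**shape, bits_per_value=4)  # illustrative bit width

print(f"16-bit cache: {fp16_size / GIB:.2f} GiB")     # 4.00 GiB
print(f"4-bit cache:  {four_bit_size / GIB:.2f} GiB")  # 1.00 GiB
```

Under these assumed dimensions, lowering the bits per cached value shrinks the cache proportionally, which is the kind of saving that makes longer contexts feasible on memory-constrained local hardware.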