27 Mar 2026

Anthropic’s ‘Mythos’ Emerges Amid Allegations of Data Contamination in Frontier AI Benchmarks

Large Language Models & Frontier Research

Anthropic Testing New 'Mythos' Model: Anthropic has begun testing a new high-end model named 'Mythos,' which reportedly surpasses the capabilities of its current Opus line. Part of a new 'Capybara' tier, the model shows major improvements in reasoning and coding, though Anthropic is handling the rollout cautiously because of the model's powerful cybersecurity capabilities.

Anthropic is testing 'Mythos,' its 'most powerful AI model ever developed' | Fortune
- https://fortune.com/2026/03/26/anthropic-says-testing-mythos-powerful-new-ai-model-after-data-leak-reveals-its-existence-step-change-in-capabilities/

ARC-AGI 3 Paper Questions Frontier Model Benchmarks: A new research paper alleges that frontier models, including Gemini 3, may have inflated their ARC-AGI 1 and 2 scores by memorizing similar benchmark tasks seen during training rather than by true reasoning. This potential data contamination raises concerns about the validity of current benchmarks used to measure AI generalization.

ARC-AGI 3 Paper alleges that Gemini 3 (and other frontier models) intentionally or not “cheated” their ARC-AGI 1 and 2 scores through memorisation of similar benchmark tasks during training

Audio & Multimodal AI

Mistral AI Releases Voxtral TTS: Mistral AI has launched Voxtral TTS, a 3-billion-parameter open-weight text-to-speech model that supports nine languages. The model reportedly outperforms ElevenLabs Flash v2.5 in human preference tests and is optimized for local use, requiring only 3 GB of RAM.
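As a rough sanity check on the 3 GB figure, the sketch below estimates weight storage for a 3-billion-parameter model at a few precisions. The bit widths are illustrative assumptions; the summary above does not say which precision Voxtral TTS ships in, and the estimate ignores activations and runtime overhead.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes).

    Ignores activations, caches, and runtime overhead.
    """
    return num_params * bits_per_param / 8 / 1e9


if __name__ == "__main__":
    # Bit widths here are assumptions for illustration, not reported values.
    for bits in (16, 8, 4):
        print(f"{bits}-bit weights: {weight_memory_gb(3e9, bits):.1f} GB")
```

Under these assumptions, 3 GB corresponds to roughly one byte per parameter, so the reported requirement would be consistent with 8-bit (or smaller) weights.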

AI Performance & Efficiency

Qwen 3.5 Achieves 1.1 Million Tokens Per Second: Google Cloud has demonstrated extreme scalability by running the Qwen 3.5 27B model at a throughput of over 1.1 million tokens per second. This was achieved using 96 NVIDIA B200 GPUs on Google Kubernetes Engine, showcasing significant advancements in inference speed for large-scale deployments.

Qwen 3.5 27B at 1.1M tok/s on B200s, all configs on GitHub
- https://medium.com/google-cloud/1-million-tokens-per-second-qwen-3-5-27b-on-gke-with-b200-gpus-161da5c1b592
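For a sense of scale, the reported aggregate throughput can be divided across the 96 GPUs. This is simple arithmetic on the two figures quoted above; it assumes load is spread evenly, and since the source says "over" 1.1 million, the result is a lower bound.

```python
def per_gpu_throughput(total_tokens_per_second: float, num_gpus: int) -> float:
    """Average tokens per second per GPU, assuming an even split."""
    return total_tokens_per_second / num_gpus


if __name__ == "__main__":
    # 1.1M tok/s and 96 GPUs are the figures quoted in the summary.
    rate = per_gpu_throughput(1_100_000, 96)
    print(f"about {rate:,.0f} tokens/s per GPU")
```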

TurboQuant Integration in llama.cpp: Google's new TurboQuant quantization method is being benchmarked in llama.cpp, offering a way to drastically reduce KV cache size. This technology promises more efficient local inference by enabling extreme compression with minimal impact on model performance.

TurboQuant in Llama.cpp benchmarks
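To illustrate why KV cache compression matters for local inference, the sketch below estimates cache size from the standard formula (keys plus values, per layer, per KV head, per token). The model shape and bit widths are hypothetical, chosen only for illustration; they are not TurboQuant's actual settings, and the estimate ignores any per-block scale or metadata overhead a real quantization scheme adds.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, bits_per_value: float) -> float:
    """Approximate KV cache size in bytes for one sequence.

    The factor of 2 accounts for storing both keys and values.
    """
    values = 2 * n_layers * n_kv_heads * head_dim * seq_len
    return values * bits_per_value / 8


if __name__ == "__main__":
    # Hypothetical model shape, not taken from the source.
    shape = dict(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=32_768)
    for bits in (16, 8, 4):  # illustrative bit widths
        gib = kv_cache_bytes(**shape, bits_per_value=bits) / 2**30
        print(f"{bits}-bit cache: {gib:.1f} GiB")
```

With this hypothetical shape, a 16-bit cache takes 4 GiB at a 32K-token context, and each halving of the bit width halves that figure.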