04 Jun 2026

Google Debuts Gemma 4 and Gemini Beats Law Professors Amidst DeepSWE Benchmark Scrutiny

New Model Releases

Google Launches Gemma 4 12B and Teases Larger Variants: Google has officially released Gemma 4 12B, a multimodal, encoder-free model designed for high performance on consumer-grade hardware. The model features a 256K context window and support for over 140 languages, though initial community benchmarks show it trailing the smaller Qwen3.5-9B in overall efficiency despite strong coding performance.

Research and Performance

Gemini 2.5 Pro Outperforms Law Professors in Stanford Study: A study conducted by Stanford University found that Google’s Gemini 2.5 Pro beat 16 law professors at answering legal questions 75% of the time. The AI's responses were rated higher and were less likely to be flagged as harmful, suggesting LLMs are becoming viable tools for scalable evaluation in complex professional domains.

AI Beat Law Professors At Answering Questions, Study Finds—And It Wasn’t Close
- https://www.forbes.com/sites/aliciapark/2026/06/02/stanford-study-finds-ai-beats-law-professors-75-of-the-time/

Industry Benchmarks

DeepSWE Benchmark Reliability Questioned Following Audit: A recent audit of the DeepSWE benchmark found significant flaws and suggests that the benchmark was rushed. According to the findings, it needs substantial improvements before it can be considered a reliable industry standard for measuring model quality.

Someone did an audit on the new DeepSWE, the results aren't pretty