09 Jun 2026 2 min read

OpenAI Confidentially Files for IPO as Xiaomi Claims 1,000 TPS on 1T Model

High-Performance Models

Xiaomi Claims Performance Breakthrough with MiMo-V2.5-Pro UltraSpeed: Xiaomi says it has a 1-trillion-parameter Mixture-of-Experts (MoE) model running at over 1,000 tokens per second on a standard 8-GPU server. If the claim holds up, it marks a significant step toward running massive models efficiently on commodity hardware rather than specialized AI accelerators.

Xiaomi just claimed 1,000+ tps on a 1T model using a standard 8-GPU server

Optimization & Local Inference

Packed-Twin-Inference Technique Nearly Doubles Token Speeds: A new inference method called "packed-twin-inference" exploits compute that small quantized models leave idle to nearly double token throughput on a single GPU, from 19.4 to 38.1 tokens per second on one MI50 in the author's test. The approach resembles speculative decoding, but instead of adding a side model it runs multiple computations side by side, as if the same model were loaded twice in memory.

2X tk/s (from 19.4 -> 38.1 tk/s on 1 x MI50) Playing with a hypothesis like speculative decoding.. but instead of an additional side model, exploiting that I can run multiple computations side-by-side AS IF I had Qwen3.6-27B loaded twice in memory - small quants don't use all the available compute.
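To put the quoted numbers in perspective, here is a minimal sketch of the arithmetic. Only the two throughput figures (19.4 and 38.1 tk/s) come from the post; the helper names `speedup` and `ms_per_token` are illustrative, and the latency figure assumes a single decode stream.

```python
# Throughput figures quoted in the post (tokens per second on 1 x MI50).
BASELINE_TPS = 19.4
PACKED_TPS = 38.1


def speedup(before_tps: float, after_tps: float) -> float:
    """Ratio of new throughput to old throughput."""
    return after_tps / before_tps


def ms_per_token(tps: float) -> float:
    """Average time per generated token in milliseconds (single stream assumed)."""
    return 1000.0 / tps


if __name__ == "__main__":
    print(f"speedup: {speedup(BASELINE_TPS, PACKED_TPS):.2f}x")   # 1.96x
    print(f"before:  {ms_per_token(BASELINE_TPS):.1f} ms/token")  # 51.5
    print(f"after:   {ms_per_token(PACKED_TPS):.1f} ms/token")    # 26.2
```

The ratio comes out at about 1.96x, so "nearly 2X" is the accurate reading of the headline figure.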

Llama.cpp WebGPU Update Significantly Boosts Prefill Speeds: A new pull request for the llama.cpp repository optimizes the WebGPU backend, improving prefill speeds for k-quants and refactoring matrix multiplication for Q4/Q5/Q8 and k-quant formats. Reported benchmarks show up to a 3.78x prefill speedup for models like Gemma, further improving the viability of browser-based AI inference.

ggml-webgpu: Improve prefill speeds for k-quants + refactor matmul for Q4/Q5/Q8 and k-quants by yomaytk · Pull Request #24225 · ggml-org/llama.cpp
- https://github.com/ggml-org/llama.cpp/pull/24225
- https://github.com/ggml-org/llama.cpp/tree/ad1b88ca0d37a2171efba1c04f1a3531c78f1b52

Specialized AI Applications

Omi Health Releases Local Medical Speech-to-Text Model: Omi Med STT v1 is a fine-tuned version of NVIDIA’s Parakeet model designed specifically for medical transcription. The 0.6B parameter model offers open weights and runs locally on consumer hardware, providing a private alternative to cloud-based medical ASR services.

I fine-tuned Parakeet 0.6B for medical ASR — open weights, local Mac/CUDA/CPU

Business & Industry

OpenAI Confidentially Files for IPO: OpenAI has reportedly filed confidentially for an initial public offering, following similar confidential filings by SpaceX and Anthropic. A move to public markets would mark a pivotal moment for the AI industry, as its leaders seek massive capital to fund further model scaling.

OpenAI Confidentially Files for IPO on the Heels of SpaceX and Anthropic
- https://www.wired.com/story/openai-confidentially-files-for-ipo/