Speech 2.5 key stats
- Latency (measured): 325 ms
- Vapi streaming benchmark (50 trials per model; checked 2026-06-10): measured as speech-2.5-turbo-preview (its realtime tier); median of 50 sequential live streaming trials, June 2026, including network RTT.
- platform.minimax.io/docs/api-reference/speech-t2a-http (checked 2026-06-10): the vendor lists 40+ languages for the current speech generation.
- platform.minimax.io/docs/guides/pricing-paygo (checked 2026-06-10): Speech 2.5 turbo costs $60 per 1M characters, pay-as-you-go; the arena clips are the 2.5 generation, turbo tier (matching the measured realtime latency).
- minimax.io/news/minimax-speech-25 (checked 2026-06-10): Speech 2.5; the arena clips are this generation, turbo tier. The Speech-02 series preceded it in 2025-04.
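To put the listed price in concrete terms, here is a minimal sketch of the arithmetic. It assumes billing is a flat per-character rate applied to the raw character count of the input text; the function name estimate_cost_usd is hypothetical, and how the vendor actually counts characters (for example for Chinese text) is not stated here.

```python
# Rate taken from the pricing note above: $60 per 1M characters
# (Speech 2.5 turbo, pay-as-you-go).
USD_PER_MILLION_CHARS = 60.0


def estimate_cost_usd(text: str, rate_per_million: float = USD_PER_MILLION_CHARS) -> float:
    """Estimate synthesis cost, ASSUMING a flat rate on len(text)."""
    return len(text) * rate_per_million / 1_000_000


# A hypothetical customer support reply.
reply = "Thanks for calling. I can help you reset your password today."
print(f"{len(reply)} characters cost about ${estimate_cost_usd(reply):.6f}")
```

At this rate, 1,000 characters come to $0.06.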
Background
MiniMax's speech models moved fast through 2025, with the Speech-02 series arriving in April and Speech 2.5 following in August. The current generation supports more than 40 languages and clones a voice from roughly six to ten seconds of reference audio, using a learnable speaker encoder that needs no transcript. MiniMax is widely regarded as the strongest text-to-speech provider for Chinese, and the 2.5 generation brought English accuracy and rhythm up alongside it.
Sources: minimax.io
At a glance
The arena clips on this Index were generated with the Speech 2.5 generation, turbo tier, which is the same realtime tier we measured for latency. In our 50-trial streaming benchmark it returned first audio in a median of 325 ms.
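The measurement described above (time to first audio, taken as the median over sequential streaming trials) can be sketched as follows. This is not the benchmark's actual harness: open_stream is a hypothetical callable standing in for whatever client opens a streaming synthesis request, and fake_stream only simulates a delay so the sketch runs without network access.

```python
import statistics
import time
from typing import Callable, Iterable, Iterator


def time_to_first_chunk_ms(open_stream: Callable[[], Iterable[bytes]]) -> float:
    """Milliseconds from request start to the first non-empty audio chunk."""
    start = time.perf_counter()
    for chunk in open_stream():
        if chunk:
            return (time.perf_counter() - start) * 1000.0
    raise RuntimeError("stream ended without returning audio")


def median_ttfb_ms(open_stream: Callable[[], Iterable[bytes]], trials: int = 50) -> float:
    """Run the trials one after another and return the median time to first audio."""
    samples = [time_to_first_chunk_ms(open_stream) for _ in range(trials)]
    return statistics.median(samples)


def fake_stream() -> Iterator[bytes]:
    """Stand-in for a real streaming call (hypothetical): waits, then yields audio."""
    time.sleep(0.01)
    yield b"\x00" * 320


print(f"median time to first audio: {median_ttfb_ms(fake_stream, trials=5):.1f} ms")
```

Timing from the moment the request is opened means a real run would include network round-trip time, as the published figure does. The median is less sensitive than the mean to the occasional slow trial.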
Sources: platform.minimax.io
Position in the rankings
Standings as of Jun 13, 2026, 00:15 UTC
Frequently asked questions
- How is Speech 2.5 tested on the Humanness Index™?
- Listeners hear Speech 2.5 against another model in a blind head-to-head round, with both voices reading the same customer support prompt from the same cloned source voice, and they pick whichever sounds more human. Its Humanness score derives purely from those votes.
- Which MiniMax generation do the arena clips use?
- The clips were generated with the Speech 2.5 generation, turbo tier, the realtime tier we also measured for latency (325 ms median TTFB).
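As an illustration of how pairwise votes can be turned into a score, here is a minimal sketch. The Index's actual scoring formula is not stated on this page, so the plain win rate used below is an assumption, and the model names are placeholders.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def win_rates(votes: Iterable[Tuple[str, str]]) -> Dict[str, float]:
    """Each vote is (winner, loser). Return the share of rounds each model won.

    ASSUMPTION: a plain win rate, not necessarily the Index's real formula.
    """
    wins: Counter = Counter()
    rounds: Counter = Counter()
    for winner, loser in votes:
        wins[winner] += 1
        rounds[winner] += 1
        rounds[loser] += 1
    return {model: wins[model] / rounds[model] for model in rounds}


# Placeholder votes from three blind head-to-head rounds.
votes = [("model-a", "model-b"), ("model-b", "model-a"), ("model-a", "model-c")]
for model, rate in sorted(win_rates(votes).items()):
    print(f"{model}: {rate:.2f}")
```

A plain win rate ignores how strong each opponent was; rating schemes such as Elo or Bradley-Terry account for that, at the cost of more machinery.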