Speech 2 Turbo key stats
- Latency (measured): 315 ms
- Vapi streaming benchmark (50 trials per model) (checked 2026-06-11) Median of 50 sequential live streaming trials, June 2026; includes network RTT from the benchmark machine.
- arxiv.org/abs/2505.07916 (checked 2026-06-11) MiniMax-Speech technical report (the Speech-02 architecture) lists 32 languages.
- platform.minimax.io/docs/guides/pricing-paygo (checked 2026-06-11) T2A pay-as-you-go: speech-02-turbo at $60 per 1M characters.
- minimax.io/news/speech-02-series (checked 2026-06-11) Series launch post dated April 2, 2025; rollout coverage ran through May 2025.
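The pay-as-you-go rate above translates into a per-request cost with simple arithmetic. The sketch below is a rough estimate only: it assumes billing is linear in character count and that Python's `len()` matches how characters are counted for billing, neither of which is confirmed by the sources listed here. The function name `estimate_tts_cost` is our own, not part of any MiniMax SDK.

```python
# Rate taken from the pay-as-you-go pricing page cited above.
PRICE_PER_MILLION_CHARS_USD = 60.0


def estimate_tts_cost(text: str,
                      price_per_million: float = PRICE_PER_MILLION_CHARS_USD) -> float:
    """Estimate the USD cost of synthesizing `text`.

    Assumption: cost scales linearly with len(text), i.e. Unicode code
    points. The provider's actual character-counting rules may differ.
    """
    return len(text) * price_per_million / 1_000_000


# Hypothetical voice-agent reply, used only to illustrate the arithmetic.
reply = "Thanks for calling. How can I help you today?"
print(f"{len(reply)} characters cost about ${estimate_tts_cost(reply):.6f}")
```

At this rate, 1,000 characters come to roughly $0.06.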
Background
Speech 2 Turbo (speech-02-turbo in the API) is the realtime tier of the MiniMax Speech 2 generation, launched in April 2025 and optimized for low-latency interactive applications such as voice agents and live translation. It shares the generation's learnable speaker encoder, which clones a voice from roughly ten seconds of audio with no transcript required and supports 32 languages. At release it ranked third on the Artificial Analysis Speech Arena, while its HD sibling held first.
Sources: minimax.io, arxiv.org
At a glance
Speech 2 Turbo is the realtime member of the Speech 2 pair and the direct ancestor of the Speech 2.5 entry already on the Index. In our 50-trial streaming benchmark it returned first audio in a median of 315 ms, including network time, in line with the 325 ms we measured for its 2.5 successor.
Sources: platform.minimax.io
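A median time-to-first-audio figure like the one above can be computed along the following lines. This is a minimal sketch, not the benchmark's actual harness: `fake_stream` is a hypothetical stand-in for a real streaming TTS client, and the helper names are ours. Because the timer starts before the request is issued, network round-trip time is included in each sample, as it is in the reported number.

```python
import statistics
import time
from typing import Callable, Iterable, Iterator


def time_to_first_audio_ms(stream_fn: Callable[[str], Iterable[bytes]],
                           text: str) -> float:
    """Milliseconds from issuing the request to the first non-empty chunk."""
    start = time.perf_counter()
    for chunk in stream_fn(text):
        if chunk:  # ignore empty keep-alive chunks, if any
            return (time.perf_counter() - start) * 1000.0
    raise RuntimeError("stream ended without producing audio")


def median_first_audio_ms(stream_fn: Callable[[str], Iterable[bytes]],
                          text: str,
                          trials: int = 50) -> float:
    """Run `trials` sequential requests and return the median latency in ms."""
    samples = [time_to_first_audio_ms(stream_fn, text) for _ in range(trials)]
    return statistics.median(samples)


def fake_stream(text: str) -> Iterator[bytes]:
    """Hypothetical streamer: waits ~10 ms, then yields two audio chunks."""
    time.sleep(0.01)
    yield b"\x00" * 320
    yield b"\x00" * 320


# Small trial count to keep the demo fast; the benchmark above used 50.
print(f"median first audio: {median_first_audio_ms(fake_stream, 'hello', trials=5):.1f} ms")
```

The median is used rather than the mean so that a few slow outliers, such as a cold connection, do not dominate the result.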
Frequently asked questions
- How is Speech 2 Turbo tested on the Humanness Index™?
- Listeners hear Speech 2 Turbo against another model in a blind head-to-head round, with both voices reading the same customer support prompt from the same cloned source voice, and they pick whichever sounds more human. Its Humanness score derives purely from those votes.
- How does Speech 2 Turbo relate to the Speech 2.5 entry?
- The Speech 2.5 entry on the Index runs the turbo tier of the newer generation; Speech 2 Turbo is the April 2025 realtime tier that preceded it. Both share the MiniMax cloning stack, so the blind tests compare generations of the same family directly.
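To illustrate how a score can be derived purely from blind head-to-head votes, here is a minimal sketch that computes each model's share of rounds won. This is an assumption for illustration only: the page does not state how the Humanness Index aggregates votes, so the real score may use a different method (for example, a rating model). The model names and votes below are hypothetical placeholders, not benchmark data.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def win_rates(votes: Iterable[Tuple[str, str]]) -> Dict[str, float]:
    """Return each model's fraction of rounds won.

    `votes` is a sequence of (winner, loser) pairs, one per blind round.
    """
    wins: Counter = Counter()
    rounds: Counter = Counter()
    for winner, loser in votes:
        wins[winner] += 1
        rounds[winner] += 1
        rounds[loser] += 1
    return {model: wins[model] / rounds[model] for model in rounds}


# Hypothetical votes between two placeholder models.
example_votes = [
    ("model-a", "model-b"),
    ("model-a", "model-b"),
    ("model-b", "model-a"),
]
for model, rate in sorted(win_rates(example_votes).items()):
    print(f"{model}: {rate:.2f}")
```

A plain win rate ignores the strength of each opponent, which is one reason a production leaderboard might prefer a pairwise rating model instead.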