The Humanness Index™

The open benchmark for how human voice AI sounds. Built and operated by Vapi.

Code is Apache-2.0. Standings data is CC BY 4.0. Audio clips and source voices are licensed recordings, all rights reserved. Provider logomarks belong to their respective owners and are used nominatively. “The Humanness Index™” name and logo are Vapi trademarks; see TRADEMARKS.md.

Humanness Index™ · TTS model

MiniMax

Speech 2.5

by MiniMax

MiniMax's speech models moved fast through 2025, with the Speech-02 series arriving in April and Speech 2.5 following in August.

Rank: #8
Humanness: 71
Likely rank: #3–16
Blind votes: 98

Standings as of Jun 13, 2026, 00:15 UTC

A real arena clip: a cloned source voice reading a customer support prompt at phone quality.

Speech 2.5 key stats

Latency (measured): 325 ms [1]
Languages: 40+ [2]
Price / 1M chars: $60 [3]
Released: August 7, 2025 [4]

  1. Vapi streaming benchmark (50 trials per model) (checked 2026-06-10). Measured as speech-2.5-turbo-preview (its realtime tier); median of 50 sequential live streaming trials, June 2026, including network RTT.
  2. platform.minimax.io/docs/api-reference/speech-t2a-http (checked 2026-06-10). Vendor lists 40+ languages for the current speech generation.
  3. platform.minimax.io/docs/guides/pricing-paygo (checked 2026-06-10). Speech 2.5 turbo is $60 per 1M characters pay-as-you-go; the arena clips are the 2.5 generation, turbo tier (matches the measured realtime latency).
  4. minimax.io/news/minimax-speech-25 (checked 2026-06-10). Speech 2.5; the arena clips are this generation, turbo tier. The Speech-02 series preceded it in 2025-04.
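As a rough illustration of the pay-as-you-go price in footnote 3, the sketch below estimates what a given amount of text would cost to synthesize. The function name and the sample character count are hypothetical, and the sketch assumes simple linear pricing with no minimums, tiers, or rounding, which the vendor's billing may not match exactly.

```python
# Assumption: flat linear pricing at the $60 per 1M characters figure cited
# in footnote 3. Real invoices may add minimums, tiers, or rounding.
PRICE_PER_MILLION_CHARS_USD = 60.0


def synthesis_cost_usd(
    num_chars: int,
    price_per_million: float = PRICE_PER_MILLION_CHARS_USD,
) -> float:
    """Estimate the cost in USD of synthesizing `num_chars` characters."""
    if num_chars < 0:
        raise ValueError("num_chars must be non-negative")
    return num_chars / 1_000_000 * price_per_million


if __name__ == "__main__":
    # Hypothetical example: a 2,500-character support script.
    print(f"${synthesis_cost_usd(2_500):.2f}")
```

At this rate, a short support script costs a fraction of a dollar, while a full million characters costs the listed $60.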

Background

MiniMax's speech models moved fast through 2025, with the Speech-02 series arriving in April and Speech 2.5 following in August. The current generation supports more than 40 languages and clones a voice from roughly six to ten seconds of reference audio, using a learnable speaker encoder that needs no transcript. MiniMax is widely regarded as the strongest text-to-speech provider for Chinese, and the 2.5 generation brought its English accuracy and rhythm up to a comparable level.

Sources: minimax.io

At a glance

The arena clips on this Index were generated with the turbo tier of the Speech 2.5 generation, the same realtime tier we measured for latency. In our 50-trial streaming benchmark, it returned first audio in a median of 325 ms.

Sources: platform.minimax.io
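To make the latency figure concrete, here is a minimal sketch of how a median time to first audio over sequential trials could be computed. It is not Vapi's benchmark harness: `open_stream` is a hypothetical callable standing in for a real streaming TTS request, and `fake_stream` merely simulates one with a fixed delay.

```python
import statistics
import time
from typing import Callable, Iterable, Iterator


def time_to_first_chunk_ms(open_stream: Callable[[], Iterable[bytes]]) -> float:
    """Time from issuing the request until the first non-empty audio chunk."""
    start = time.perf_counter()
    for chunk in open_stream():
        if chunk:
            return (time.perf_counter() - start) * 1000.0
    raise RuntimeError("stream ended before any audio arrived")


def median_ttfb_ms(
    open_stream: Callable[[], Iterable[bytes]], trials: int = 50
) -> float:
    """Run `trials` sequential requests and return the median latency in ms."""
    samples = [time_to_first_chunk_ms(open_stream) for _ in range(trials)]
    return statistics.median(samples)


def fake_stream() -> Iterator[bytes]:
    """Hypothetical stand-in for a live TTS stream (about 10 ms to first chunk)."""
    time.sleep(0.01)
    yield b"\x00" * 320


if __name__ == "__main__":
    print(f"median first-audio latency: {median_ttfb_ms(fake_stream, trials=5):.1f} ms")
```

Reporting the median rather than the mean keeps a few slow trials, for example a network hiccup, from skewing the headline number.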

Position in the rankings

Standings as of Jun 13, 2026, 00:15 UTC

| Rank | Provider | Model | Humanness | Latency |
|------|----------|-------|-----------|---------|
| #6 | ElevenLabs | Turbo v2.5 | 75 | 265 ms |
| #7 | ElevenLabs | Flash v2.5 | 72 | 197 ms |
| #8 | MiniMax | Speech 2.5 | 71 | 325 ms |
| #9 | Inworld | TTS-2 | 66 | 288 ms |
| #10 | MiniMax | Speech 2 HD | 64 | 357 ms |

Frequently asked questions

How is Speech 2.5 tested on the Humanness Index™?
Listeners hear Speech 2.5 against another model in a blind head-to-head round. Both voices read the same customer support prompt from the same cloned source voice, and listeners pick whichever sounds more human. Its Humanness score derives purely from those votes.

Which MiniMax generation do the arena clips use?
The clips were generated with the turbo tier of the Speech 2.5 generation, the same realtime tier we measured for latency (325 ms median TTFB).
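This page does not describe how blind votes are converted into a Humanness score, so the following is only an illustrative sketch of the simplest possible tally: each model's share of the head-to-head rounds it won. It is not the Index's scoring formula, and the model names and votes are made up for the example.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

# Each vote records one blind round as (winner, loser).
Vote = Tuple[str, str]


def win_rates(votes: Iterable[Vote]) -> Dict[str, float]:
    """Return each model's fraction of rounds won, out of rounds it appeared in."""
    wins: Counter = Counter()
    rounds: Counter = Counter()
    for winner, loser in votes:
        wins[winner] += 1
        rounds[winner] += 1
        rounds[loser] += 1
    return {model: wins[model] / rounds[model] for model in rounds}


if __name__ == "__main__":
    # Hypothetical votes between three placeholder models.
    sample_votes = [("A", "B"), ("A", "C"), ("B", "A"), ("C", "B")]
    for model, rate in sorted(win_rates(sample_votes).items()):
        print(f"{model}: {rate:.2f}")
```

A raw win rate ignores how strong each opponent was, and with a small number of votes it is noisy, which is one reason a ranking like this one reports a "likely rank" range alongside the point estimate.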

Keep exploring

All MiniMax models on the Index
MiniMax Speech 2 HD: Rank #10 · Humanness 64
MiniMax Speech 2 Turbo: Rank #12 · Humanness 63

How human does your model really sound?

The benchmark is open source. Suggest a model, read the methodology, or ask us to put your voice in the arena.
