Humanness Index™ · TTS model

MiniMax Speech 2 HD

by MiniMax

Rank: #5
Humanness: 89
Likely rank: #2–7
Blind votes: 1,044

Standings as of Jul 28, 2026, 05:14 UTC

A real arena clip: a cloned source voice reading a customer support prompt at phone quality.

Speech 2 HD key stats

Latency (measured): 357 ms¹
Languages: 32²
Price / 1M chars: $100³
Released: April 2025⁴

¹ Vapi streaming benchmark (50 trials per model) (checked 2026-06-11) Median of 50 sequential live streaming trials, June 2026; includes network RTT from the benchmark machine.
² arxiv.org/abs/2505.07916 (checked 2026-06-11) MiniMax-Speech technical report (the Speech-02 architecture) lists 32 languages.
³ platform.minimax.io/docs/guides/pricing-paygo (checked 2026-06-11) T2A pay-as-you-go: speech-02-hd at $100 per 1M characters.
⁴ minimax.io/news/speech-02-series (checked 2026-06-11) Series launch post dated April 2, 2025; rollout coverage ran through May 2025.

Background

Speech 2 HD (speech-02-hd in the API) is the high-fidelity tier of the MiniMax Speech 2 generation, launched in April 2025 for voiceover and audiobook work where rhythm consistency and clarity matter most. The underlying architecture, documented in the MiniMax-Speech technical report, pairs an autoregressive Transformer with a learnable speaker encoder that clones a voice from roughly ten seconds of reference audio with no transcript, across 32 languages. On release, Speech 2 HD topped the Artificial Analysis Speech Arena Elo rankings, ahead of OpenAI and ElevenLabs.

Sources: minimax.io, arxiv.org

At a glance

Speech 2 HD is the quality tier of the Speech 2 pair; Speech 2 Turbo is the realtime tier. In our 50-trial streaming benchmark, Speech 2 HD returned first audio in a median of 357 ms including network time, close behind its Turbo sibling (315 ms) despite the fidelity focus.
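For readers who want to produce a comparable number on their own setup, the following is a minimal sketch of a median time-to-first-audio measurement over sequential trials. It is not the harness used for the benchmark above: `fake_stream` is a hypothetical stand-in for a real streaming TTS client, and every function name here is an assumption made for illustration.

```python
import statistics
import time
from typing import Callable, Iterable, Iterator


def time_to_first_audio_ms(make_stream: Callable[[], Iterable[bytes]]) -> float:
    """Milliseconds from starting the request until the first non-empty audio chunk."""
    start = time.perf_counter()
    # The clock starts before the stream is created, so connection setup
    # and network round trips are included in the measurement.
    for chunk in make_stream():
        if chunk:
            return (time.perf_counter() - start) * 1000.0
    raise RuntimeError("stream ended without producing audio")


def median_ttfa_ms(make_stream: Callable[[], Iterable[bytes]], trials: int = 50) -> float:
    """Run the trials one after another and return the median latency in ms."""
    samples = [time_to_first_audio_ms(make_stream) for _ in range(trials)]
    return statistics.median(samples)


def fake_stream() -> Iterator[bytes]:
    """Hypothetical placeholder: waits briefly, then yields two silent chunks."""
    time.sleep(0.01)
    yield b"\x00" * 320
    yield b"\x00" * 320


if __name__ == "__main__":
    print(f"median time to first audio: {median_ttfa_ms(fake_stream, trials=5):.1f} ms")
```

Using the median rather than the mean keeps a single slow trial (a cold connection, for instance) from skewing the result.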

Sources: platform.minimax.io

Position in the rankings

Standings as of Jul 28, 2026, 05:14 UTC

Rank	Provider	Model	Humanness	Latency
#3	MiniMax	Speech 2.8	91	325 ms
#4	Canopy Labs	Orpheus	89	—
#5	MiniMax	Speech 2 HD	89	357 ms
#6	xAI	Grok TTS (Streaming)	86	285 ms
#7	Speechify	Simba 3.2	83	—

See the full Humanness Index™ rankings

Frequently asked questions

How is Speech 2 HD tested on the Humanness Index™?

Listeners hear Speech 2 HD against another model in a blind head-to-head round, with both voices reading the same customer support prompt from the same cloned source voice, and pick whichever sounds more human. Its Humanness score derives purely from those votes.

How does Speech 2 HD differ from Speech 2 Turbo?

Same generation, different tuning. Speech 2 HD targets high-fidelity output for voiceovers and audiobooks and bills at $100 per 1M characters; Speech 2 Turbo targets realtime interaction at $60 per 1M characters. In our 50-trial benchmark they measured 357 ms and 315 ms median time to first audio, respectively.
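To make the price gap concrete, here is a small sketch that estimates spend from a character count using the two per-1M-character rates quoted above. The function and dictionary names are hypothetical, and the sketch assumes simple linear billing with no minimums, free tiers, or volume discounts, which the pricing page may handle differently.

```python
from decimal import Decimal

# Rates in USD per 1M characters, as quoted in the answer above.
PRICE_PER_MILLION_CHARS = {
    "Speech 2 HD": Decimal("100"),
    "Speech 2 Turbo": Decimal("60"),
}


def estimate_cost_usd(model: str, characters: int) -> Decimal:
    """Estimate cost assuming purely linear per-character billing (an assumption)."""
    if characters < 0:
        raise ValueError("characters must be non-negative")
    return PRICE_PER_MILLION_CHARS[model] * characters / Decimal(1_000_000)


if __name__ == "__main__":
    script_length = 250_000  # e.g. a short audiobook chapter set
    for model in PRICE_PER_MILLION_CHARS:
        print(model, estimate_cost_usd(model, script_length))
```

Under these assumptions, a 250,000-character script costs $25 on Speech 2 HD and $15 on Speech 2 Turbo.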

Keep exploring

MiniMax: All MiniMax models on the Index
Speech 2.8: Rank #3 · Humanness 91
Speech 2 Turbo: Rank #13 · Humanness 70

