xAI's TTS is the voice to beat for naturalness. In blind, side-by-side comparisons, listeners pick it as the more human-sounding option more often than any other model on the Index, and it holds that edge at phone quality, where most voices start to sound synthetic. For teams where sounding human is the whole point, it's where we'd start.
The Humanness Index™
How human does your voice AI really sound?
Humanness is how much a voice feels like a real person. Help rank the models with your preferences.
Which one is human?
Listen to both blind samples, then cast your vote.
Most Human Models
| Latency | Languages |
|---|---|
| 285 ms | 20 |
| 128 ms | 42 |
| — | English |
| 265 ms | 32 |
| 758 ms | 70+ |
| 197 ms | 32 |
| 325 ms | 40 |
| 226 ms | English |
| 302 ms | English |
| 337 ms | 15 |
What we listen for
What makes a human voice
Listen to a synthetic voice and you can usually name what gave it away. The four tells below come up the most. Still, nobody can fully define what makes a voice feel human, so we play voices blind and let people judge. Both sides of every battle speak with the same cloned source voice, so votes compare the models, not the voices. Every score on this page comes from those votes.
Expressiveness
Emotion and emphasis. Stressing the right words, sounding like it means what it says instead of reading text aloud.
Tone & prosody
The intonation, rhythm, and melody of speech. The natural rise and fall of how people actually talk.
Artifacts
The little human sounds: breaths, stutters, natural pauses. A voice with none of them sounds too clean to be real.
Latency
How quickly a voice starts to respond. Once a reply lags past a beat, the conversation stops feeling live.
Humanness Deep Dive
Humanness distribution
Not plotted (no public streaming API to measure): Canopy Labs Orpheus.
| Rank | Likely Rank | Provider | Model | Score | Latency | Price | |
|---|---|---|---|---|---|---|---|
| #1 | #1–2 | xAI | Grok TTS | 100 | 460 ms | $15 | 71 |
| #2 | #1–2 | xAI | Grok TTS (Streaming) | 96 | 285 ms | $15 | 58 |
| #3 | #3–10 | Cartesia | Sonic 3.5 | 82 | 128 ms | $50 | 56 |
| #4 | #3–10 | Canopy Labs | Orpheus | 82 | — | Open source | 49 |
| #5 | #3–12 | ElevenLabs | Turbo v2.5 | 74 | 265 ms | $50 | 55 |
| #6 | #3–13 | ElevenLabs | Eleven v3 | 73 | 758 ms | $100 | 55 |
| #7 | #3–13 | ElevenLabs | Flash v2.5 | 71 | 197 ms | $50 | 52 |
| #8 | #5–13 | MiniMax | Speech 2.5 | 70 | 325 ms | $60 | 48 |
| #9 | #5–13 | ElevenLabs | Flash v2 | 65 | 226 ms | $50 | 48 |
| #10 | #6–13 | ElevenLabs | Turbo v2 | 62 | 302 ms | $50 | 46 |
The Index only includes models that support voice cloning: each battle plays the same cloned source voice through both models, so the comparison is head to head and fair. Don't see your model on this list? Contact us at humannessindex@vapi.ai.
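This page does not say how blind votes become a score. As a rough illustration only, here is one common way to rank models from head-to-head battles: a Bradley-Terry fit, rescaled so the top model reads 100. The choice of Bradley-Terry, the rescaling, the function names, and the model names `model_a`, `model_b`, and `model_c` are all assumptions made for this sketch, not the Index's published method.

```python
from collections import defaultdict


def bradley_terry(votes, iterations=200):
    """Estimate relative strengths from (winner, loser) vote pairs.

    Assumption: a plain Bradley-Terry model fitted with the standard
    iterative update. Each vote must name two different models.
    """
    wins = defaultdict(float)         # total wins per model
    pair_counts = defaultdict(float)  # number of battles per unordered pair
    models = set()
    for winner, loser in votes:
        wins[winner] += 1
        pair_counts[frozenset((winner, loser))] += 1
        models.update((winner, loser))

    strength = {m: 1.0 for m in models}
    for _ in range(iterations):
        updated = {}
        for m in models:
            denom = 0.0
            for pair, n in pair_counts.items():
                if m in pair:
                    (other,) = pair - {m}
                    denom += n / (strength[m] + strength[other])
            updated[m] = wins[m] / denom if denom else strength[m]
        # Renormalise so the strengths keep a stable overall scale.
        total = sum(updated.values())
        strength = {m: s * len(models) / total for m, s in updated.items()}
    return strength


def to_index(strength):
    """Rescale strengths so the strongest model scores 100 (an assumption)."""
    top = max(strength.values())
    return {m: round(100 * s / top) for m, s in strength.items()}


# Hypothetical votes: each tuple is (preferred model, other model).
votes = (
    [("model_a", "model_b")] * 3 + [("model_b", "model_a")] * 1
    + [("model_b", "model_c")] * 3 + [("model_c", "model_b")] * 1
    + [("model_a", "model_c")] * 3 + [("model_c", "model_a")] * 1
)
index = to_index(bradley_terry(votes))
for model, score in sorted(index.items(), key=lambda kv: -kv[1]):
    print(model, score)
```

A pairwise model like this fits the battle format because it only needs to know which side won each comparison, not an absolute rating from each listener. Resampling the votes and refitting would be one way to produce a rank range like the "Likely Rank" column, though that too is a guess about the method.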
How human does your model really sound?
The benchmark is open source. Suggest a model, read the methodology, or ask us to put your voice in the arena.


