The Humanness Index™

The open benchmark for how human voice AI sounds. Built and operated by Vapi.

Code is Apache-2.0. Standings data is CC BY 4.0. Audio clips and source voices are licensed recordings, all rights reserved. Provider logomarks belong to their respective owners and are used nominatively. “The Humanness Index™” name and logo are Vapi trademarks; see TRADEMARKS.md.

Humanness Index™ · TTS model

Grok TTS (Streaming)

by xAI

Streaming is the WebSocket variant of xAI's Grok TTS, built for real-time agents that need audio flowing before the full input has been processed.

Rank: #2
Humanness: 98
Likely rank: #1–7
Blind votes: 76

Standings as of Jun 13, 2026, 00:15 UTC

A real arena clip: a cloned source voice reading a customer support prompt at phone quality.

Grok TTS (Streaming) key stats

Latency (measured): 285 ms [1]
Languages: 20 [2]
Price / 1M chars: $15 [3]
Streaming: Yes [4]
Released: April 17, 2026 [5]

  1. Vapi streaming benchmark (50 trials per model) (checked 2026-06-11). Median of 50 sequential live streaming trials, June 2026; includes network RTT from the benchmark machine.
  2. docs.x.ai/developers/model-capabilities/audio/voice (checked 2026-06-10).
  3. x.ai/news/grok-stt-and-tts-apis (checked 2026-06-10). $15.00 per 1M characters per the launch post (x.ai/api/voice; docs.x.ai/developers/models/text-to-speech). Secondary coverage reported $4.20 per 1M; the launch post figure is used.
  4. docs.x.ai/developers/model-capabilities/audio/voice (checked 2026-06-10). WebSocket transport with no input length limit.
  5. x.ai/news/grok-stt-and-tts-apis (checked 2026-06-10). Same stack as the REST TTS API; WebSocket variant.
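
Note 1 describes the latency figure as a median over 50 sequential live trials. A minimal sketch of that kind of measurement follows. It assumes "latency" means time from opening a stream to the first audio chunk, which the page does not state. `open_stream` is a hypothetical callable standing in for a real provider connection, and `fake_stream` is a stand-in used only so the sketch runs offline.

```python
import statistics
import time
from typing import Callable, Iterable


def time_to_first_chunk_ms(open_stream: Callable[[], Iterable[bytes]]) -> float:
    """Milliseconds from opening a stream until its first audio chunk arrives."""
    start = time.perf_counter()
    for _chunk in open_stream():
        # Assumption: only time to first audio is measured, so stop here.
        return (time.perf_counter() - start) * 1000.0
    raise RuntimeError("stream ended without yielding any audio")


def median_latency_ms(open_stream: Callable[[], Iterable[bytes]], trials: int = 50) -> float:
    """Run the trials one after another and report the median, as note 1 describes."""
    samples = [time_to_first_chunk_ms(open_stream) for _ in range(trials)]
    return statistics.median(samples)


def fake_stream() -> Iterable[bytes]:
    """Hypothetical stand-in for a provider stream: ~10 ms delay, then audio chunks."""
    time.sleep(0.01)
    yield b"\x00" * 320
    yield b"\x00" * 320


if __name__ == "__main__":
    print(f"median: {median_latency_ms(fake_stream, trials=5):.1f} ms")
```

Reporting the median instead of the mean keeps a single slow trial, for example a network hiccup, from dominating the figure.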

Background

Streaming is the WebSocket variant of xAI's Grok TTS, built for real-time agents that need audio flowing before the full input has been processed. It accepts unbounded text over a persistent connection, which makes it a natural fit for live voice applications. It shares the Grok Voice stack and its five expressive voices, and it ranks #2 on the Humanness Index™, directly behind its REST sibling at #1.

Sources: docs.x.ai

Release history

Streaming shares the Grok TTS stack that reached general availability in April 2026. Unlike REST requests, which are capped at 15,000 characters, the WebSocket endpoint accepts unbounded text over a persistent connection. It bills at the same flat per-character rate.
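
To make the practical difference concrete, here is a rough sketch of what a REST caller would have to do that a WebSocket caller would not: split long text to fit the 15,000-character cap. It also estimates cost at the $15.00 per 1M characters rate cited in the key stats. The function names are hypothetical, the splitting rule (break on spaces) is an illustrative choice, and the cost estimate assumes every character, including whitespace, is billed, which the page does not confirm.

```python
from typing import List

REST_CHAR_CAP = 15_000         # per-request cap on REST, per the text above
USD_PER_MILLION_CHARS = 15.00  # launch-post price cited in the key stats


def split_for_rest(text: str, cap: int = REST_CHAR_CAP) -> List[str]:
    """Split text into pieces of at most `cap` characters, preferring breaks at spaces."""
    chunks: List[str] = []
    rest = text
    while len(rest) > cap:
        cut = rest.rfind(" ", 0, cap + 1)
        if cut <= 0:
            cut = cap  # no usable space in range: fall back to a hard split
        chunks.append(rest[:cut])
        rest = rest[cut:].lstrip(" ")
    if rest:
        chunks.append(rest)
    return chunks


def estimate_cost_usd(n_chars: int) -> float:
    """Flat per-character pricing; assumes all characters are billable."""
    return n_chars * USD_PER_MILLION_CHARS / 1_000_000


if __name__ == "__main__":
    script = "word " * 7000  # 35,000 characters
    pieces = split_for_rest(script)
    print(len(pieces), [len(p) for p in pieces])
    print(f"${estimate_cost_usd(len(script)):.3f}")
```

Because the rate is flat, the estimated cost is the same whichever transport is used; only the request count changes.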

Sources: x.ai

Position in the rankings

Standings as of Jun 13, 2026, 00:15 UTC

Rank  Provider     Model                 Humanness  Latency
#1    xAI          Grok TTS              100        460 ms
#2    xAI          Grok TTS (Streaming)  98         285 ms
#3    Cartesia     Sonic 3.5             82         128 ms
#4    Canopy Labs  Orpheus               78         —

Frequently asked questions

How is Grok TTS (Streaming) tested on the Humanness Index™?
Listeners hear Grok TTS (Streaming) against another model in a blind head-to-head round. Both voices read the same customer support prompt from the same cloned source voice, and listeners pick whichever sounds more human. Its Humanness score derives purely from those votes.
How does Grok TTS (Streaming) differ from Grok TTS?
Same Grok Voice stack and voices; the difference is transport. Streaming holds a WebSocket open, accepts unbounded text, and starts returning audio before the full input is processed, which suits live agents.
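
The first answer above says the score derives purely from blind pairwise votes, but this page does not say how votes are turned into a number. As an illustration only, the sketch below fits a Bradley-Terry model, a common choice for head-to-head data, and rescales so the strongest model reads 100. Both the model and the rescaling are assumptions made for this example, not the Index's published method, and the vote data is invented.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def bradley_terry(votes: List[Tuple[str, str]], iterations: int = 200) -> Dict[str, float]:
    """Fit Bradley-Terry strengths from (winner, loser) pairs with the classic MM update."""
    wins: Dict[str, int] = defaultdict(int)
    pair_counts: Dict[frozenset, int] = defaultdict(int)
    models = set()
    for winner, loser in votes:
        wins[winner] += 1
        pair_counts[frozenset((winner, loser))] += 1
        models.update((winner, loser))
    if not models:
        return {}

    strength = {m: 1.0 for m in models}
    for _ in range(iterations):
        updated = {}
        for m in models:
            # Sum over opponents of (rounds played together) / (combined strength).
            denom = sum(
                pair_counts[frozenset((m, other))] / (strength[m] + strength[other])
                for other in models
                if other != m
            )
            updated[m] = wins[m] / denom if denom > 0 else strength[m]
        total = sum(updated.values())
        # Renormalize each pass; only the ratios between strengths are meaningful.
        strength = {m: s * len(models) / total for m, s in updated.items()}
    return strength


def to_index_scale(strength: Dict[str, float]) -> Dict[str, int]:
    """Hypothetical rescaling: the strongest model maps to 100."""
    top = max(strength.values())
    return {m: round(100 * s / top) for m, s in strength.items()}


if __name__ == "__main__":
    # Invented votes: model "A" beats "B" three times, "B" beats "A" once.
    toy_votes = [("A", "B")] * 3 + [("B", "A")]
    print(to_index_scale(bradley_terry(toy_votes)))
```

With few votes per model, fitted strengths carry wide uncertainty, which is consistent with the page reporting a "likely rank" range alongside the point rank.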
