Humanness Index™ · TTS model

xAI Grok TTS (Streaming)

Grok TTS (Streaming) is the WebSocket variant of xAI's Grok TTS, built for real-time agents that need audio to start flowing before the full input has been processed.

Rank: #6
Humanness: 86
Likely rank: #3–7
Blind votes: 1,049

Standings as of Jul 28, 2026, 00:49 UTC

A real arena clip: a cloned source voice reading a customer support prompt at phone quality.

Grok TTS (Streaming) key stats

Latency (measured): 285 ms¹
Languages: 20²
Price / 1M chars: $15³
Streaming: Yes⁴
Released: April 17, 2026⁵

¹ Vapi streaming benchmark, 50 trials per model (checked 2026-06-11). Median of 50 sequential live streaming trials, June 2026; includes network RTT from the benchmark machine.
² docs.x.ai/developers/model-capabilities/audio/voice (checked 2026-06-10).
³ x.ai/news/grok-stt-and-tts-apis (checked 2026-06-10). $15.00 per 1M characters per the launch post (x.ai/api/voice; docs.x.ai/developers/models/text-to-speech). Secondary coverage reported $4.20 per 1M; the launch post figure is used.
⁴ docs.x.ai/developers/model-capabilities/audio/voice (checked 2026-06-10). WebSocket transport with no input length limit.
⁵ x.ai/news/grok-stt-and-tts-apis (checked 2026-06-10). Same stack as the REST TTS API; WebSocket variant.
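
To make the listed price concrete, here is a minimal sketch that turns a character count into an estimated charge at the launch-post rate of $15.00 per 1M characters. The function name is illustrative, and the sketch assumes every character, including spaces and punctuation, is billed; the source does not spell out how characters are counted.

```python
from decimal import Decimal

# Launch-post list price cited above: $15.00 per 1,000,000 characters.
PRICE_PER_MILLION_CHARS = Decimal("15.00")


def estimate_cost_usd(text: str) -> Decimal:
    """Estimate the charge for synthesizing `text` at a flat per-character rate."""
    # Assumption: every character, including spaces and punctuation, is billed.
    return PRICE_PER_MILLION_CHARS * len(text) / Decimal(1_000_000)


reply = "Thanks for calling. I can help you reset your password today."
print(f"{len(reply)} chars -> ${estimate_cost_usd(reply):.6f}")
```

At this rate a short support reply costs a small fraction of a cent, so for a live agent the total mostly tracks conversation volume rather than any single utterance.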

Background

Streaming is the WebSocket variant of xAI's Grok TTS, built for real-time agents that need audio flowing before the full input has been processed. It accepts unbounded text over a persistent connection and begins returning audio while input is still arriving (measured median latency: 285 ms), which makes it a natural fit for live voice applications. It shares the Grok Voice stack and its five expressive voices with the REST API. On the Humanness Index™ it ranks #6 with a score of 86, behind its REST sibling, Grok TTS, which ranks #2 with 94.
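
The interleaving described above, where audio comes back while text is still being sent, can be illustrated with a transport-free sketch. Nothing here calls xAI's API: `fake_synthesize` is a hypothetical placeholder that returns dummy bytes, and `agent_text` stands in for an agent producing its reply in fragments. The real endpoint, message format, and authentication are documented at docs.x.ai and are not reproduced here.

```python
import asyncio
from typing import AsyncIterator, List


async def agent_text(log: List[str]) -> AsyncIterator[str]:
    """Stand-in for an agent producing its reply fragment by fragment."""
    for fragment in ["Thanks for calling. ", "Let me pull up ", "your account."]:
        await asyncio.sleep(0.01)  # simulated generation delay
        log.append("text in")
        yield fragment


async def fake_synthesize(fragment: str) -> bytes:
    """Hypothetical placeholder for the TTS service: dummy bytes, not audio."""
    await asyncio.sleep(0.01)  # simulated synthesis delay
    return fragment.encode("utf-8")


async def stream_speech(
    fragments: AsyncIterator[str], log: List[str]
) -> AsyncIterator[bytes]:
    """Yield a chunk for each fragment as soon as it has been synthesized."""
    async for fragment in fragments:
        chunk = await fake_synthesize(fragment)
        log.append("audio out")
        yield chunk


async def run_demo() -> List[str]:
    log: List[str] = []
    async for _chunk in stream_speech(agent_text(log), log):
        pass  # a real client would hand each chunk to the audio player here
    return log


events = asyncio.run(run_demo())
print(events)
```

The event log alternates between "text in" and "audio out", so the first audio chunk is available before the last text fragment has been sent. A one-shot request would have to wait for the complete text before returning anything.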

Sources: docs.x.ai

Release history

Streaming shares the Grok TTS stack that reached general availability in April 2026. Unlike REST requests, which are capped at 15,000 characters, the WebSocket endpoint accepts unbounded text over a persistent connection. Both are billed at the same flat per-character rate.
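
The practical consequence of the 15,000-character REST cap is that long text has to be split client-side before it is sent over REST, whereas the streaming endpoint can take it in one session. The sketch below shows one possible splitter; the function name and the choice to prefer breaks at sentence ends are assumptions for illustration, not something either API prescribes.

```python
from typing import List

REST_CHAR_CAP = 15_000  # per-request cap on REST, as stated above


def split_for_rest(text: str, cap: int = REST_CHAR_CAP) -> List[str]:
    """Split text into pieces of at most `cap` characters, preferring sentence ends."""
    if cap <= 0:
        raise ValueError("cap must be positive")
    pieces: List[str] = []
    start = 0
    while start < len(text):
        end = min(start + cap, len(text))
        if end < len(text):
            # Back up to the last ". " inside the window, if there is one.
            cut = text.rfind(". ", start, end)
            if cut > start:
                end = cut + 1  # keep the period with its sentence
        pieces.append(text[start:end])
        start = end
    return pieces


long_text = "A sentence. " * 2000  # 24,000 characters
pieces = split_for_rest(long_text)
print([len(p) for p in pieces])
```

The pieces always concatenate back to the original text, and when no sentence break falls inside a window the splitter falls back to a hard cut at the cap.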

Sources: x.ai

Position in the rankings

Standings as of Jul 28, 2026, 00:49 UTC

Rank	Provider	Model	Humanness	Latency
#4	Canopy Labs	Orpheus	89	—
#5	MiniMax	Speech 2 HD	89	357 ms
#6	xAI	Grok TTS (Streaming)	86	285 ms
#7	Speechify	Simba 3.2	83	—
#8	Inworld	TTS-1.5-max	78	337 ms


Frequently asked questions

How is Grok TTS (Streaming) tested on the Humanness Index™?

Listeners hear Grok TTS (Streaming) against another model in a blind head-to-head round. Both voices read the same customer support prompt from the same cloned source voice, and listeners pick whichever sounds more human. Its Humanness score derives purely from those votes.

How does Grok TTS (Streaming) differ from Grok TTS?

Both use the same Grok Voice stack and voices; the difference is transport. Streaming holds a WebSocket open, accepts unbounded text, and starts returning audio before the full input is processed, which suits live agents.

Keep exploring

Grok TTS: Rank #2 · Humanness 94

