Background
Streaming is the WebSocket variant of xAI's Grok TTS, built for real-time agents that need audio flowing before the full input has been processed. It accepts unbounded text over a persistent connection and begins returning audio while the rest of the text is still arriving, which makes it a natural fit for live voice applications. It shares the Grok Voice stack and its five expressive voices, and it ranks alongside its REST sibling at the very top of the Humanness Index™.
Sources: docs.x.ai
Release history
Streaming shares the Grok TTS stack that reached general availability in April 2026. The WebSocket endpoint accepts unbounded text over a persistent connection, whereas REST requests are capped at 15,000 characters. Both are billed at the same flat per-character rate.
Sources: x.ai
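To make the practical effect of the 15,000-character REST cap concrete, here is a minimal sketch of the client-side splitting a long script would need before being sent over REST, and which the streaming endpoint makes unnecessary. The function name `split_for_rest` and the sentence-boundary heuristic are illustrative assumptions, not part of xAI's API; only the cap value comes from the text above.

```python
import re

# Cap on REST requests as stated above; everything else here is hypothetical.
REST_CHAR_CAP = 15_000


def split_for_rest(text: str, cap: int = REST_CHAR_CAP) -> list[str]:
    """Split text into pieces of at most `cap` characters.

    Prefers sentence boundaries; a single sentence longer than `cap`
    is hard-split at the cap.
    """
    if cap <= 0:
        raise ValueError("cap must be positive")

    # Split after sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())

    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        # Hard-split any sentence that cannot fit in one request.
        while len(sentence) > cap:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:cap])
            sentence = sentence[cap:]
        if not sentence:
            continue

        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= cap:
            current = candidate
        else:
            chunks.append(current)
            current = sentence

    if current:
        chunks.append(current)
    return chunks
```

Each returned piece would go out as its own REST request, and the audio would have to be stitched together afterwards. With the WebSocket variant the same text can be sent over one connection without this step.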
Frequently asked questions
- How is Grok TTS (Streaming) tested on the Humanness Index™?
- Listeners hear Grok TTS (Streaming) against another model in a blind head-to-head round, with both voices reading the same customer support prompt from the same cloned source voice, and they pick whichever sounds more human. Its Humanness score derives purely from those votes.
- How does Grok TTS (Streaming) differ from Grok TTS?
- Same Grok Voice stack and voices; the difference is transport. Streaming holds a WebSocket open, accepts unbounded text, and starts returning audio before the full input is processed, which suits live agents.
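As a rough illustration of how blind head-to-head votes, as described in the first answer, can be turned into a per-model number, the sketch below computes a plain win rate. This is an assumption for illustration only: the page does not state the formula behind the Humanness score, and the function name `win_rates` and the model labels in the example are hypothetical.

```python
from collections import Counter


def win_rates(votes):
    """Compute each model's share of rounds won.

    `votes` is an iterable of (winner, loser) pairs, one per blind round.
    This is a simple win rate, NOT the Humanness Index's actual formula,
    which is not specified in the text.
    """
    wins = Counter()
    rounds = Counter()
    for winner, loser in votes:
        wins[winner] += 1
        rounds[winner] += 1
        rounds[loser] += 1
    return {model: wins[model] / rounds[model] for model in rounds}


# Hypothetical votes between three placeholder models.
example_votes = [("A", "B"), ("A", "B"), ("B", "A"), ("A", "C")]
rates = win_rates(example_votes)
```

A plain win rate ignores the strength of the opponent in each round; rating schemes that account for it exist, but nothing in the source indicates which approach the benchmark uses.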