Grok TTS key stats
- Latency (measured)
- 460 ms
- Vapi streaming benchmark (50 trials per model) (checked 2026-06-11). Median of 50 sequential live WebSocket trials with optimize_streaming_latency disabled (the flag is the only difference from the Streaming configuration), June 2026; includes network round-trip time.
- docs.x.ai/developers/model-capabilities/audio/voice (checked 2026-06-10)
- x.ai/news/grok-stt-and-tts-apis (checked 2026-06-10). $15.00 per 1M characters per the launch post (x.ai/api/voice; docs.x.ai/developers/models/text-to-speech). Secondary coverage reported $4.20 per 1M characters; the launch post figure is used.
- x.ai/news/grok-stt-and-tts-apis (checked 2026-06-10). Standalone TTS API generally available; the Grok Voice stack has been public since 2025-12-17.
Background
Grok TTS is the text-to-speech model behind Grok Voice, the assistant that ships on Grok mobile apps, Tesla vehicles, and Starlink customer support. xAI built the stack in-house, from voice activity detection to the audio models themselves, and opened it to developers through the Grok Voice Agent API in December 2025 and a standalone TTS API in April 2026. The API offers five expressive voices across 20 languages, with inline speech tags like [laugh] and [whisper] for fine-grained delivery control. On the Humanness Index™, it is the voice to beat: listeners pick it as the more human option more often than any other model in the field.
Sources: x.ai
Release history
The Grok Voice stack went public in December 2025 with the Grok Voice Agent API at $0.05 per minute. The standalone TTS API reached general availability on April 17, 2026 at a flat $15.00 per 1M characters, with REST requests accepting up to 15,000 characters.
Sources: docs.x.ai
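The two numbers above (a flat $15.00 per 1M characters and a 15,000-character ceiling on REST requests) are enough to budget a job before sending it. The following is a minimal sketch, not xAI's billing logic: it assumes characters are counted the way Python's len() counts them, and the helper names estimate_cost_usd and split_for_rest are hypothetical. Whether inline speech tags count toward billed characters is not stated in the source, so the sketch counts them.

```python
PRICE_PER_MILLION_CHARS_USD = 15.00  # launch price quoted above
MAX_CHARS_PER_REQUEST = 15_000       # REST request limit quoted above


def estimate_cost_usd(text: str) -> float:
    """Estimated cost, assuming billing is linear in len(text)."""
    return len(text) * PRICE_PER_MILLION_CHARS_USD / 1_000_000


def split_for_rest(text: str, limit: int = MAX_CHARS_PER_REQUEST) -> list[str]:
    """Greedily pack whitespace-separated words into chunks of at most `limit` chars."""
    chunks: list[str] = []
    current = ""
    for word in text.split():
        # A single token longer than the limit has to be cut mid-token.
        while len(word) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(word[:limit])
            word = word[limit:]
        candidate = f"{current} {word}" if current else word
        if len(candidate) <= limit:
            current = candidate
        else:
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    script = "Thanks for calling. [laugh] How can I help today? " * 1000
    parts = split_for_rest(script)
    print(f"{len(script)} chars, {len(parts)} requests, "
          f"about ${estimate_cost_usd(script):.2f}")
```

Under these assumptions, a request that fills the 15,000-character limit costs about $0.225. Splitting on whitespace keeps words intact but ignores sentence boundaries, so a production splitter would likely prefer sentence breaks to avoid audible seams between chunks.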
Position in the rankings
Standings as of Jun 13, 2026, 00:15 UTC
Frequently asked questions
- How is Grok TTS tested on the Humanness Index™?
- Listeners hear Grok TTS against another model in a blind head-to-head round, both voices reading the same customer support prompt from the same cloned source voice, and they pick whichever sounds more human. Its Humanness score derives purely from those votes.
- What latency does Grok TTS have?
- We measured a median time to first audio of 460 ms over 50 live trials in June 2026. The two Grok entries share one WebSocket API and differ only in the optimize_streaming_latency flag; Grok TTS is measured with the flag disabled, and the Streaming entry, measured with it enabled, came in at 285 ms.
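The latency figure is a median of time to first audio across sequential trials, which can be sketched in a few lines. This is an illustration of the statistic only, not the benchmark's actual harness: start_stream is a hypothetical callable standing in for opening the WebSocket and yielding audio chunks, and fake_stream is a stand-in so the sketch runs without network access.

```python
import statistics
import time
from typing import Callable, Iterable


def time_to_first_audio_ms(start_stream: Callable[[], Iterable[bytes]]) -> float:
    """Milliseconds from starting the request to the first non-empty audio chunk."""
    started = time.perf_counter()
    for chunk in start_stream():
        if chunk:
            return (time.perf_counter() - started) * 1000.0
    raise RuntimeError("stream ended without producing audio")


def median_latency_ms(start_stream: Callable[[], Iterable[bytes]],
                      trials: int = 50) -> float:
    """Run the trials one after another and report the median."""
    samples = [time_to_first_audio_ms(start_stream) for _ in range(trials)]
    return statistics.median(samples)


def fake_stream() -> Iterable[bytes]:
    """Hypothetical stand-in for a live stream: ~10 ms of delay, then audio."""
    time.sleep(0.01)
    yield b"\x00\x01"


if __name__ == "__main__":
    print(f"median time to first audio: {median_latency_ms(fake_stream, 5):.1f} ms")
```

Because the clock starts before the request is issued, a measurement taken this way includes network round-trip time, as the published figure does. The median is less sensitive than the mean to the occasional slow trial.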