Humanness Index™ · TTS model

Inworld TTS-2

by Inworld

Realtime TTS-2 is Inworld's frontier voice model, released in May 2026 as a research preview.

Rank: #12
Humanness: 72
Likely rank: #9–16
Blind votes: 986

Standings as of Jul 31, 2026, 21:01 UTC

A real arena clip: a cloned source voice reading a customer support prompt at phone quality.

TTS-2 key stats

Latency (measured): 288 ms¹
Languages: 100+²
Price / 1M chars: $25³
Released: May 5, 2026⁴

¹ Vapi streaming benchmark (50 trials per model; checked 2026-06-10): median of 50 sequential live streaming trials, June 2026; includes network RTT from the benchmark machine.
² inworld.ai/blog/realtime-tts-2 (checked 2026-06-10): one voice identity held across more than 100 languages.
³ inworld.ai/pricing (checked 2026-06-10): on-demand rate, $25 per 1M characters (cheaper tier than TTS 1.5 Max).
⁴ inworld.ai/blog/realtime-tts-2 (checked 2026-06-10): research preview.
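
The $25 per 1M characters rate can be turned into a per-request estimate with simple arithmetic. The following is a minimal sketch, assuming every input character (including whitespace and punctuation) is billed linearly and that no minimums or volume discounts apply. The helper name is hypothetical and not part of any Inworld SDK.

```python
from decimal import Decimal

# On-demand rate quoted in the key stats: $25 per 1M characters.
PRICE_PER_MILLION_CHARS_USD = Decimal("25")


def estimated_cost_usd(char_count: int) -> Decimal:
    """Estimated on-demand cost for `char_count` characters (hypothetical helper).

    Assumes linear billing on every character, with no minimums or discounts.
    """
    if char_count < 0:
        raise ValueError("char_count must be non-negative")
    return PRICE_PER_MILLION_CHARS_USD * char_count / Decimal(1_000_000)


if __name__ == "__main__":
    reply = "Thanks for calling. How can I help you today?"
    print(estimated_cost_usd(len(reply)))
```

Under these assumptions, a 400-character reply costs about one cent.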

Background

Realtime TTS-2 is Inworld's frontier voice model, released in May 2026 as a research preview. It conditions on the audio of prior conversation turns, picking up the user's tone, pacing, and emotional state before deciding not just what to say but how to say it. Developers direct it with plain-English steering instructions instead of fixed emotion enums, and it holds a single voice identity across more than 100 languages. Teams on TTS-1.5 upgrade by changing one model identifier.

Sources: inworld.ai

At a glance

TTS-2 offers conversation-aware synthesis, plain-English steering tags, Advanced Voice Design personas, and integrations with partners that include Vapi. In our 50-trial streaming benchmark it returned first audio in a median of 288 ms.

Sources: inworld.ai
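
The latency figure is a median time to first audio over sequential streaming trials. A minimal sketch of that kind of measurement follows; it is not the Index's actual benchmark harness. `open_stream` is a hypothetical callable standing in for a real streaming TTS request, `fake_stream` is a stand-in that only simulates a delay, and the injectable `clock` parameter is an assumption added for testability.

```python
import statistics
import time
from typing import Callable, Iterable, Iterator


def time_to_first_audio_ms(
    open_stream: Callable[[], Iterable[bytes]],
    clock: Callable[[], float] = time.perf_counter,
) -> float:
    """Milliseconds from issuing the request until the first non-empty chunk."""
    start = clock()
    for chunk in open_stream():
        if chunk:  # ignore empty chunks that carry no audio
            return (clock() - start) * 1000.0
    raise RuntimeError("stream ended before any audio arrived")


def median_first_audio_ms(
    open_stream: Callable[[], Iterable[bytes]],
    trials: int = 50,
    clock: Callable[[], float] = time.perf_counter,
) -> float:
    """Run the trials one after another and report the median latency."""
    samples = [time_to_first_audio_ms(open_stream, clock) for _ in range(trials)]
    return statistics.median(samples)


def fake_stream() -> Iterator[bytes]:
    """Hypothetical stand-in for a streaming TTS call; sleeps, then yields audio."""
    time.sleep(0.01)
    yield b"\x00\x01"
    yield b"\x02\x03"


if __name__ == "__main__":
    print(f"{median_first_audio_ms(fake_stream, trials=5):.0f} ms")
```

Because the timer starts before the request is opened, a measurement taken this way includes network round-trip time from the measuring machine, as the published 288 ms figure does. The median is used rather than the mean so that a few slow trials do not dominate the result.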

Position in the rankings

Standings as of Jul 31, 2026, 21:01 UTC

Rank	Provider	Model	Humanness	Latency
#10	ElevenLabs	Flash v2	77	226 ms
#11	ElevenLabs	Turbo v2	76	302 ms
#12	Inworld	TTS-2	72	288 ms
#13	Cartesia	Sonic 3.5	71	128 ms
#14	MiniMax	Speech 2 Turbo	70	315 ms

Frequently asked questions

How is TTS-2 tested on the Humanness Index™?

Listeners hear TTS-2 against another model in a blind head-to-head round. Both voices read the same customer support prompt from the same cloned source voice, and listeners pick whichever sounds more human. The Humanness score derives purely from those votes.

What makes Realtime TTS-2 different?

It conditions on the audio of prior conversation turns, picking up the user's tone and pacing before deciding how to speak. Developers steer it with plain-English instructions, and it holds one voice identity across more than 100 languages.

Keep exploring

Inworld: All Inworld models on the Index
TTS-1.5-max: Rank #9 · Humanness 78

