Voice AI has a dirty secret: most of it was never designed for dialogue. The dominant paradigm (feed text in, get audio out) traces its lineage to audiobook narration and voiceover production, where the model never hears the person on the other end. That’s fine when you’re producing a podcast intro. It’s not fine when a frustrated user is trying to get support from an AI agent at 11pm.
Inworld AI is calling that out today with the launch of Realtime TTS-2, a new voice model released as a research preview through its Inworld API and Inworld Realtime API. The model hears the full audio of the exchange, picks up the user’s tone, pacing, and emotional state, and takes voice direction in plain English, the same way developers prompt an LLM.
What’s Actually Different Here
The meaningful architectural difference in TTS-2 is that it operates as a closed-loop system. The model takes the actual audio of the conversation’s prior turns as input, not just a transcript: it hears how the user actually sounded. That’s a non-trivial distinction. A transcript of “okay, fine” gives you the words. The audio of “okay, fine” tells you whether the person is relieved, resigned, or sarcastic. TTS-2 is designed to use that signal.
The same line lands differently after a joke than after bad news, and the model knows the difference because it heard the prior turn. Tone, pacing, and emotional state carry forward automatically. Practically speaking, audio context flows across turns within a Realtime session without developers needing to pass explicit prior_audio fields or build extra plumbing.
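To make the "no extra plumbing" claim concrete, here is a minimal sketch of what that contract looks like from the client side. Everything in it (the frame type, field names, and session keying) is a hypothetical illustration, not Inworld's actual wire schema; the point is simply that each frame carries only the new turn, with no prior-audio field.

```python
import json

# Hypothetical message builder for a realtime TTS session: the server keeps
# conversational audio context per session, so each client frame carries only
# the new turn. Field names are illustrative, not Inworld's documented schema.
def tts_turn(session_id: str, text: str) -> str:
    frame = {
        "type": "tts.generate",
        "session_id": session_id,  # context is keyed server-side by session
        "text": text,              # only the new line to speak
        # note: no prior_audio / history field is needed; earlier turns
        # were already heard on this session
    }
    return json.dumps(frame)

first = tts_turn("sess-123", "I'm sorry to hear that.")
second = tts_turn("sess-123", "Let's see what we can do.")
```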
Four Capabilities, One Model
The Inworld team is shipping TTS-2 with four key features, positioning the combination, rather than any individual piece, as the differentiation.
- Voice Direction: Developers steer delivery using plain-language prompts inline at inference time. Instead of picking from a fixed emotion enum like [sad] or [excited], developers pass a bracket tag like [speak sadly, as if something bad just happened] directly in the text. Long, descriptive prompts beat short labels; the model responds far better to full context than to single-word tags. Inline non-verbal markers like [laugh], [sigh], [breathe], [clear_throat], and [cough] can be dropped anywhere in the text where the moment should occur, and the model renders them as audio events, not pronounced words.
- Conversational Awareness: This is the closed-loop architecture described above, the shift that separates TTS-2 from prior-generation models that treat each sentence as a stateless generation call.
- Crosslingual support: One voice identity is preserved across more than 100 languages, including mid-utterance language switches within a single generation. No language flag is required; the model handles transitions automatically, keeping timbre, pitch, and character constant across the switch. The top-tier languages ship at native-speaker quality, while the long tail is described as launch-window experimental, consistent with the model releasing as a research preview.
- Advanced Voice Design: It generates a saved voice from a written prompt, with no reference audio required. Developers can describe a person in prose, save the result as a reusable voice, and call it like any other voice in the app. Voice Design ships with three stability modes: Expressive (for live consumer dialogue and companions), Balanced (the default for most agent workloads), and Stable (for IVR and professional deployments where pitch drift is unacceptable).
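As a rough illustration of how the bracketed direction syntax composes with non-verbal markers: the tag strings ([speak …], [sigh]) come from the feature description above, but the request structure and the stability field name are assumptions for illustration, not documented Inworld parameters.

```python
# Compose a TTS input string with an inline plain-language direction tag and
# a non-verbal marker. Bracketed spans are described as being treated as
# direction or audio events, not words to pronounce.
def directed_text(direction: str, *segments: str) -> str:
    return f"[{direction}] " + " ".join(segments)

text = directed_text(
    "speak sadly, as if something bad just happened",
    "I checked with the billing team.",
    "[sigh]",
    "Your refund was declined.",
)

# Hypothetical request body; "stability" mirrors the three Voice Design modes
# named above (Expressive / Balanced / Stable), but the field name itself is
# an assumption.
request = {
    "voice": "support-agent",
    "stability": "Balanced",
    "text": text,
}
```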
The Conversational Layer Beneath
Beyond the four key features, Inworld calls out a set of behaviors that push speech further into what it describes as “person paying attention” territory. The most technically interesting is disfluencies: the model generates natural uh and um, self-corrections, mid-noun-phrase pauses, and trailing thoughts that signal warmth and recall rather than malfunction. Critically, different speaker profiles cluster fillers differently, and the model follows the rhythm: filler-as-energy sounds different from filler-as-hesitation. Voice cloning is also supported via a two-step API: upload a reference sample (5–15 seconds, clean, single speaker) to /voices/v1/voices:clone, get a voice ID, and use it like any other voice.
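The two-step cloning flow can be sketched as follows. The /voices/v1/voices:clone path is from the description above; the host, auth header, and file-field names are assumptions, so treat this as a request skeleton against a hypothetical deployment rather than a working client.

```python
API_BASE = "https://api.inworld.ai"  # hypothetical host, not from the article

def build_clone_request(api_key: str, sample_path: str) -> dict:
    """Step 1 skeleton: POST a 5-15s clean, single-speaker reference sample
    to the clone endpoint. Send with any HTTP client; the header and
    file-field names here are assumptions."""
    return {
        "method": "POST",
        "url": f"{API_BASE}/voices/v1/voices:clone",
        "headers": {"Authorization": f"Bearer {api_key}"},
        "files": {"reference_audio": sample_path},
    }

req = build_clone_request("MY_KEY", "reference.wav")
# Step 2: the response carries a voice ID, then usable like any other voice,
# e.g. {"voice": "<returned voice_id>", "text": "Thanks for calling back."}
```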
Where It Fits in the Stack
TTS-2 is one layer in Inworld’s broader Realtime API pipeline. The full stack consists of Realtime STT, which transcribes and profiles the speaker in a single pass, capturing age, accent, pitch, vocal style, emotional tone, and pacing as structured signals on the same connection; a Realtime Router that routes across 200+ models, picking the right model and tools based on the user’s state and conversation context; and TTS-2 at the output layer. The pipeline runs over a single persistent WebSocket connection, with sub-200ms median time-to-first-audio for the TTS layer.
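A minimal sketch of what that single-connection pipeline looks like from the client's perspective. The frame types and field names below are invented for illustration; the article does not document Inworld's actual wire protocol.

```python
import json

# One persistent connection carries all three stages: audio frames go up,
# and transcript, routing, and TTS audio frames come back down. The frame
# schema here is hypothetical.
def client_frames(session_id: str, audio_chunk_b64: str) -> list[str]:
    return [
        json.dumps({"type": "session.start", "session_id": session_id}),
        json.dumps({"type": "stt.audio", "session_id": session_id,
                    "audio": audio_chunk_b64}),
    ]

# Downstream, the server is described as emitting (1) a transcript plus a
# structured speaker profile, (2) a routing decision across 200+ models,
# then (3) TTS-2 audio with sub-200ms median time-to-first-audio.
expected_server_frame_types = ["stt.transcript", "router.decision", "tts.audio"]

frames = client_frames("sess-123", "UklGRg==")
```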
Artificial Analysis text-to-speech leaderboard: https://artificialanalysis.ai/text-to-speech/leaderboard (data as of May 5, 2026)
The Broader Context
Realtime TTS 1.5 already ranks #1 on the Artificial Analysis Speech Arena (as of May 5, 2026), ahead of Google (#2) and ElevenLabs (#3). The launch of TTS-2 signals that Inworld considers raw audio quality a solved problem, and is now competing on the behavioral layer: context-awareness, steerability, and identity consistency across languages.
Check out the Docs and technical details.

