Inworld

Inworld AI provides expressive TTS designed for conversational AI and game characters. The plugin defaults to Inworld’s TTS-2 model, which adds natural-language steering, 100+ languages (15 GA, 90+ experimental), and high-quality instant voice cloning.

Vision Agents requires a Stream account for real-time transport. Most providers offer free tiers to get started.

Inworld also offers a Realtime speech-to-speech API over WebRTC.

Installation

uv add "vision-agents[inworld]"

Get your API key from the Inworld Portal and set INWORLD_API_KEY in your environment (or pass api_key= explicitly).
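The lookup order described above can be sketched as a small helper. This is a minimal illustration that assumes an explicit argument should take precedence over the environment; `resolve_api_key` is a hypothetical name and is not part of the plugin, whose own handling of a missing key may differ.

```python
import os
from typing import Mapping, Optional


def resolve_api_key(
    api_key: Optional[str] = None,
    env: Mapping[str, str] = os.environ,
) -> str:
    """Return the explicit key if given, else INWORLD_API_KEY from env.

    Hypothetical helper for illustration; not part of the plugin.
    """
    key = api_key or env.get("INWORLD_API_KEY", "")
    if not key.strip():
        raise ValueError(
            "No Inworld API key: set INWORLD_API_KEY or pass api_key= explicitly."
        )
    return key.strip()
```

The returned value can then be passed as `api_key=` when constructing the TTS, so a missing key is reported at startup with a clear message.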

Quick Start

from vision_agents.core import Agent, User
from vision_agents.plugins import inworld, gemini, deepgram, getstream

agent = Agent(
    edge=getstream.Edge(),
    agent_user=User(name="Assistant", id="agent"),
    instructions="You are a helpful assistant.",
    llm=gemini.LLM("gemini-3-flash-preview"),
    stt=deepgram.STT(),
    tts=inworld.TTS(),  # defaults to model_id="inworld-tts-2", voice_id="Sarah"
)

Parameters

| Name | Type | Default | Description |
| --- | --- | --- | --- |
| `api_key` | `str` | `None` | API key (defaults to `INWORLD_API_KEY` env var) |
| `voice_id` | `str` | `"Sarah"` | Voice ID (`"Sarah"`, `"Dennis"`, `"Ashley"`, `"Olivia"`, `"Clive"`, or custom/cloned voices) |
| `model_id` | `str` | `"inworld-tts-2"` | Model (`"inworld-tts-2"`, `"inworld-tts-1.5-max"`, `"inworld-tts-1.5-mini"`) |
| `sample_rate` | `int` | `16000` | Desired PCM output sample rate in Hz |
| `temperature` | `float` | `1.1` | Randomness when sampling audio tokens (0–2) |
| `speaking_rate` | `float` | `None` | Speech speed multiplier (0.5–1.5). `None` uses the server default |
| `auto_mode` | `bool` | `True` | Let Inworld decide optimal flush behavior for streamed input |
| `apply_text_normalization` | `"ON" \| "OFF"` | `None` | Optional text normalization behavior |
| `ws_url` | `str` | Inworld endpoint | Inworld bidirectional WebSocket endpoint |

inworld-tts-1 and inworld-tts-1-max are deprecated by Inworld — migrate to inworld-tts-2 or inworld-tts-1.5-*.
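The ranges in the table and the deprecation note can be checked before an agent is built. The following is a sketch only: `check_tts_settings` is a hypothetical pre-flight helper, not part of the plugin, and it assumes the documented ranges are inclusive at both ends.

```python
from typing import Optional

# Deprecated model IDs, taken from the note above.
DEPRECATED_MODELS = {"inworld-tts-1", "inworld-tts-1-max"}


def check_tts_settings(
    model_id: str = "inworld-tts-2",
    temperature: float = 1.1,
    speaking_rate: Optional[float] = None,
) -> list[str]:
    """Return a list of problems; an empty list means nothing was flagged.

    Hypothetical helper; ranges are assumed to be inclusive.
    """
    problems = []
    if model_id in DEPRECATED_MODELS:
        problems.append(
            f"{model_id} is deprecated; use inworld-tts-2 or inworld-tts-1.5-*"
        )
    if not 0.0 <= temperature <= 2.0:
        problems.append(f"temperature {temperature} is outside 0-2")
    # None means "use the server default", so only check explicit values.
    if speaking_rate is not None and not 0.5 <= speaking_rate <= 1.5:
        problems.append(f"speaking_rate {speaking_rate} is outside 0.5-1.5")
    return problems
```

Returning a list instead of raising on the first issue lets a caller report every misconfigured value in one pass.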

Steering (TTS-2)

TTS-2 takes natural-language stage directions inline with your text. Place the instruction in square brackets before the segment it should apply to:

# `tts` is an inworld.TTS instance; run this inside an async function.
text = (
    "[whisper in a hushed style] I have to tell you something. "
    "[laugh] Just kidding! [say with force] Now let's get to work."
)
async for chunk in await tts.stream_audio(text):
    ...

Steering covers articulation, intonation, volume, pitch, range, speed, and vocal style — and supports non-verbal sounds like [laugh], [breathe], [clear throat], [sigh], [cough], [yawn]. Combining dimensions ([whisper in a hushed style], [say playfully and very fast]) produces better results than bare single-word tags. See Inworld’s steering docs and prompting guide for the full reference.
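When steered text is assembled programmatically, a small builder keeps the bracket placement consistent. This sketch assumes only the convention described above (a bracketed direction placed before the segment it applies to); `steer` and `script` are hypothetical helper names, not plugin functions.

```python
def steer(direction: str, text: str) -> str:
    """Prefix text with a bracketed stage direction, e.g. "[laugh] Hi"."""
    direction = direction.strip()
    if not direction or "[" in direction or "]" in direction:
        raise ValueError("direction must be non-empty and contain no brackets")
    return f"[{direction}] {text.strip()}"


def script(*segments: tuple[str, str]) -> str:
    """Join (direction, text) pairs into one steered string."""
    return " ".join(steer(d, t) for d, t in segments)


# Rebuilds the same string as the hand-written example above.
text = script(
    ("whisper in a hushed style", "I have to tell you something."),
    ("laugh", "Just kidding!"),
    ("say with force", "Now let's get to work."),
)
```

Keeping directions as data also makes it easy to prefer combined phrases such as `"say playfully and very fast"` over bare single-word tags.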

Inworld TTS supports up to 2,000 characters per request. The plugin connects to Inworld’s bidirectional WebSocket endpoint and streams 16-bit PCM audio at the configured sample_rate — no extra configuration needed.
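Whether the plugin splits long input itself is not stated here. Assuming it does not, text longer than the 2,000-character limit can be split before synthesis. The sketch below prefers sentence boundaries and falls back to a hard cut; `split_for_tts` is a hypothetical helper, and a steering direction is not carried over from one chunk to the next.

```python
import re

MAX_CHARS = 2000  # per-request limit stated above


def split_for_tts(text: str, limit: int = MAX_CHARS) -> list[str]:
    """Split text into chunks of at most `limit` characters.

    Prefers sentence boundaries; hard-cuts any single sentence
    that is longer than `limit`.
    """
    if limit <= 0:
        raise ValueError("limit must be positive")
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        # Break oversized sentences into limit-sized pieces.
        pieces = [sentence[i:i + limit] for i in range(0, len(sentence), limit)]
        for piece in pieces:
            candidate = f"{current} {piece}" if current else piece
            if len(candidate) <= limit:
                current = candidate
            else:
                chunks.append(current)
                current = piece
    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk can then be synthesized in order. Sentences are rejoined with a single space, so runs of whitespace in the input are normalized.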

Next Steps

Build a Voice Agent

Build a Video Agent