Inworld AI provides expressive TTS designed for conversational AI and game characters. The plugin defaults to Inworld’s TTS-2 model, which adds natural-language steering, 100+ languages (15 GA, 90+ experimental), and high-quality instant voice cloning.Documentation Index
Fetch the complete documentation index at: https://visionagents.ai/llms.txt
Use this file to discover all available pages before exploring further.
Vision Agents requires a Stream account
for real-time transport. Most providers offer free tiers to get started.
Installation
INWORLD_API_KEY in your environment (or pass api_key= explicitly).
Quick Start
Parameters
| Name | Type | Default | Description |
|---|---|---|---|
api_key | str | None | API key (defaults to INWORLD_API_KEY env var) |
voice_id | str | "Sarah" | Voice ID ("Sarah", "Dennis", "Ashley", "Olivia", "Clive", or custom/cloned voices) |
model_id | str | "inworld-tts-2" | Model ("inworld-tts-2", "inworld-tts-1.5-max", "inworld-tts-1.5-mini") |
sample_rate | int | 16000 | Desired PCM output sample rate in Hz |
temperature | float | 1.1 | Randomness when sampling audio tokens (0–2) |
speaking_rate | float | None | Speech speed multiplier (0.5–1.5). None uses the server default |
auto_mode | bool | True | Let Inworld decide optimal flush behavior for streamed input |
apply_text_normalization | "ON" | "OFF" | None | Optional text normalization behavior |
ws_url | str | Inworld endpoint | Inworld bidirectional WebSocket endpoint |
inworld-tts-1 and inworld-tts-1-max are deprecated by Inworld — migrate to inworld-tts-2 or inworld-tts-1.5-*.Steering (TTS-2)
TTS-2 takes natural-language stage directions inline with your text. Place the instruction in square brackets before the segment it should apply to:[laugh], [breathe], [clear throat], [sigh], [cough], [yawn]. Combining dimensions ([whisper in a hushed style], [say playfully and very fast]) produces better results than bare single-word tags.
See Inworld’s steering docs and prompting guide for the full reference.
Inworld TTS supports up to 2,000 characters per request. The plugin connects to Inworld’s bidirectional WebSocket endpoint and streams 16-bit PCM audio at the configured
sample_rate — no extra configuration needed.Next Steps
Build a Voice Agent
Get started with voice
Build a Video Agent
Add video processing

