Inworld AI provides expressive TTS designed for conversational AI and game characters. The plugin defaults to Inworld’s TTS-2 model, which adds natural-language steering, 100+ languages (15 GA, 90+ experimental), and high-quality instant voice cloning.
Vision Agents requires a Stream account for real-time transport. Most providers offer free tiers to get started.
Inworld also offers a Realtime speech-to-speech API over WebRTC.

Installation

uv add "vision-agents[inworld]"
Get your API key from the Inworld Portal and set INWORLD_API_KEY in your environment (or pass api_key= explicitly).
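The lookup order described above (explicit argument first, then the environment) can be sketched as a small fail-fast check. `resolve_api_key` is a hypothetical helper written for illustration, not part of the plugin, which performs its own lookup.

```python
import os
from typing import Mapping, Optional


def resolve_api_key(
    api_key: Optional[str] = None,
    env: Mapping[str, str] = os.environ,
) -> str:
    """Return the explicit key if given, else INWORLD_API_KEY from env.

    Hypothetical helper for illustration; not part of the plugin.
    """
    key = api_key or env.get("INWORLD_API_KEY")
    if not key:
        # Fail early with a clear message instead of at the first TTS request.
        raise RuntimeError(
            "No Inworld API key: set INWORLD_API_KEY or pass api_key=..."
        )
    return key
```

The returned value could then be passed on as `inworld.TTS(api_key=resolve_api_key())`.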

Quick Start

from vision_agents.core import Agent, User
from vision_agents.plugins import inworld, gemini, deepgram, getstream

agent = Agent(
    edge=getstream.Edge(),
    agent_user=User(name="Assistant", id="agent"),
    instructions="You are a helpful assistant.",
    llm=gemini.LLM("gemini-3-flash-preview"),
    stt=deepgram.STT(),
    tts=inworld.TTS(),  # defaults to model_id="inworld-tts-2", voice_id="Sarah"
)

Parameters

| Name | Type | Default | Description |
| --- | --- | --- | --- |
| api_key | str | None | API key (defaults to INWORLD_API_KEY env var) |
| voice_id | str | "Sarah" | Voice ID ("Sarah", "Dennis", "Ashley", "Olivia", "Clive", or custom/cloned voices) |
| model_id | str | "inworld-tts-2" | Model ("inworld-tts-2", "inworld-tts-1.5-max", "inworld-tts-1.5-mini") |
| sample_rate | int | 16000 | Desired PCM output sample rate in Hz |
| temperature | float | 1.1 | Randomness when sampling audio tokens (0–2) |
| speaking_rate | float | None | Speech speed multiplier (0.5–1.5). None uses the server default |
| auto_mode | bool | True | Let Inworld decide optimal flush behavior for streamed input |
| apply_text_normalization | "ON" \| "OFF" | None | Optional text normalization behavior |
| ws_url | str | Inworld endpoint | Inworld bidirectional WebSocket endpoint |
inworld-tts-1 and inworld-tts-1-max are deprecated by Inworld — migrate to inworld-tts-2 or inworld-tts-1.5-*.
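To catch out-of-range settings before an agent starts, the documented ranges and the deprecation note can be checked up front. The sketch below is a hypothetical helper (`build_tts_kwargs` is not part of the plugin) that only encodes the values listed on this page; whether the plugin or the Inworld API validates these itself is not assumed.

```python
from typing import Any, Dict, Optional

# Deprecated model IDs named in the note above.
DEPRECATED_MODELS = {"inworld-tts-1", "inworld-tts-1-max"}


def build_tts_kwargs(
    model_id: str = "inworld-tts-2",
    voice_id: str = "Sarah",
    temperature: float = 1.1,
    speaking_rate: Optional[float] = None,
) -> Dict[str, Any]:
    """Validate settings against the documented ranges (hypothetical helper)."""
    if model_id in DEPRECATED_MODELS:
        raise ValueError(
            f"{model_id} is deprecated; use inworld-tts-2 or inworld-tts-1.5-*"
        )
    if not 0.0 <= temperature <= 2.0:
        raise ValueError("temperature must be within 0-2")
    if speaking_rate is not None and not 0.5 <= speaking_rate <= 1.5:
        raise ValueError("speaking_rate must be within 0.5-1.5")

    kwargs: Dict[str, Any] = {
        "model_id": model_id,
        "voice_id": voice_id,
        "temperature": temperature,
    }
    if speaking_rate is not None:
        # Leave the key out entirely so the server default applies.
        kwargs["speaking_rate"] = speaking_rate
    return kwargs
```

The resulting dictionary is meant to be unpacked into the constructor, for example `inworld.TTS(**build_tts_kwargs(voice_id="Dennis", speaking_rate=1.2))`.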

Steering (TTS-2)

TTS-2 takes natural-language stage directions inline with your text. Place the instruction in square brackets before the segment it should apply to:
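from vision_agents.plugins import inworld

tts = inworld.TTS()  # default model is inworld-tts-2, which supports steering
text = (
    "[whisper in a hushed style] I have to tell you something. "
    "[laugh] Just kidding! [say with force] Now let's get to work."
)
# Run inside an async function: stream_audio is awaited, then iterated.
async for chunk in await tts.stream_audio(text):
    ...  # handle each audio chunk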
Steering covers articulation, intonation, volume, pitch, range, speed, and vocal style — and supports non-verbal sounds like [laugh], [breathe], [clear throat], [sigh], [cough], [yawn]. Combining dimensions ([whisper in a hushed style], [say playfully and very fast]) produces better results than bare single-word tags. See Inworld’s steering docs and prompting guide for the full reference.
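When steered text is assembled programmatically, a small helper keeps the bracket syntax consistent. This is a minimal sketch using plain string formatting; `steer` and `build_script` are hypothetical names, not plugin functions.

```python
from typing import Iterable, Tuple


def steer(direction: str, text: str) -> str:
    """Prefix a text segment with a bracketed stage direction."""
    direction = direction.strip()
    if not direction or "[" in direction or "]" in direction:
        raise ValueError("direction must be non-empty and contain no brackets")
    return f"[{direction}] {text.strip()}"


def build_script(segments: Iterable[Tuple[str, str]]) -> str:
    """Join (direction, text) pairs into one steered string."""
    return " ".join(steer(direction, text) for direction, text in segments)


script = build_script(
    [
        ("whisper in a hushed style", "I have to tell you something."),
        ("laugh", "Just kidding!"),
        ("say with force", "Now let's get to work."),
    ]
)
```

Here `script` has the same shape as the hand-written example above and can be passed to the TTS in the same way.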
Inworld TTS supports up to 2,000 characters per request. The plugin connects to Inworld’s bidirectional WebSocket endpoint and streams 16-bit PCM audio at the configured sample_rate — no extra configuration needed.
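For text longer than the per-request limit, one option is to split it at sentence boundaries before synthesis. The sketch below is an assumption-laden illustration: `split_for_tts` is a hypothetical helper, the limit is counted with Python's `len()` (the exact counting rule used by Inworld is not stated here), and whether the plugin already splits long input itself is not assumed.

```python
import re
from typing import List

MAX_CHARS = 2000  # per-request limit stated above


def split_for_tts(text: str, max_chars: int = MAX_CHARS) -> List[str]:
    """Greedily pack whole sentences into chunks of at most max_chars."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        # Hard-split any single sentence that exceeds the limit on its own.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each chunk could then be synthesized in turn. Because a steering direction applies to the segment that follows it, a split may separate a direction from later sentences it was meant to cover, so long steered passages may need the direction repeated per chunk.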

Next Steps

Build a Voice Agent: get started with voice
Build a Video Agent: add video processing