Text To Speech (TTS)

Text-to-Speech (TTS) transforms written words into spoken audio, allowing your applications to “speak” to users naturally. With Vision Agents, you can add voice capabilities to your video calls and applications, so that text becomes lifelike speech in real time.

How does text to speech work?

Read text
Generate speech patterns
Create audio

Using Text-To-Speech with Stream

The Vision Agents SDK simplifies text-to-speech integration by providing a plugin-based system that handles the provider calls and audio routing for you. Here’s how it works in practice:

Choose Your Voice: Pick from popular TTS providers like ElevenLabs, Cartesia, Inworld AI or Kokoro (for offline processing).
Send Your Text: Call the agent.say() method with the text you want spoken; the plugin handles the rest.
Automatic Audio: The TTS service converts your text to speech and sends back high-quality audio.
Seamless Integration: The SDK automatically routes the audio into your Stream call, so everyone hears it immediately.
Real-time Experience: The speech plays to all call participants with minimal delay, creating a natural conversation flow.
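The flow above (pick a plugin, call say(), route the returned audio into the call) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the Vision Agents API: TTSPlugin, ToneTTS, FakeCall, SketchAgent, synthesize() and publish_audio() are all hypothetical stand-ins, and the "speech" is a plain 440 Hz tone so the example runs without any provider account. Only the name say() comes from the text above, and making it async is an assumption.

```python
import asyncio
import math
import struct
from typing import List, Protocol


class TTSPlugin(Protocol):
    """Hypothetical plugin interface: text in, 16-bit mono PCM bytes out."""

    async def synthesize(self, text: str) -> bytes: ...


class ToneTTS:
    """Stand-in provider: emits a 100 ms tone per word instead of real speech."""

    def __init__(self, sample_rate: int = 16000) -> None:
        self.sample_rate = sample_rate

    async def synthesize(self, text: str) -> bytes:
        samples_per_word = self.sample_rate // 10  # 100 ms of audio per word
        frames = bytearray()
        for _ in text.split():
            for n in range(samples_per_word):
                value = int(8000 * math.sin(2 * math.pi * 440 * n / self.sample_rate))
                frames += struct.pack("<h", value)  # little-endian signed 16-bit sample
        return bytes(frames)


class FakeCall:
    """Stand-in for a call's outgoing audio track; it only records what it receives."""

    def __init__(self) -> None:
        self.audio_chunks: List[bytes] = []

    async def publish_audio(self, pcm: bytes) -> None:
        self.audio_chunks.append(pcm)


class SketchAgent:
    """Hypothetical agent: say() asks the plugin for audio, then routes it to the call."""

    def __init__(self, tts: TTSPlugin, call: FakeCall) -> None:
        self.tts = tts
        self.call = call

    async def say(self, text: str) -> None:
        pcm = await self.tts.synthesize(text)  # the TTS service converts text to audio
        await self.call.publish_audio(pcm)     # the audio is routed into the call


async def main() -> FakeCall:
    call = FakeCall()
    agent = SketchAgent(tts=ToneTTS(), call=call)
    await agent.say("Thank you for calling")
    return call


if __name__ == "__main__":
    finished_call = asyncio.run(main())
    print(len(finished_call.audio_chunks[0]), "bytes of PCM published")
```

Because the agent depends only on the small synthesize() interface, swapping one provider for another in this sketch means passing a different object to the constructor; the calling code that uses say() does not change.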

Worked example

Let’s walk through a real-world example to see how TTS works in your application. Imagine you’re building a customer support system where callers get placed in a queue. Here’s how TTS makes this experience feel personal and professional.

The Scenario: A customer calls your support line and gets placed in a queue.

What Happens:

Your system detects the caller and generates a friendly message: “Thank you for calling TechCorp Support. Your estimated wait time is 5 minutes.”
Instead of showing this as text on screen (which the caller can’t see), your TTS plugin converts it to natural speech that sounds like a real person.
The voice speaks directly to the caller through the Stream call, creating an immediate human connection.
As the queue updates, new messages are automatically spoken: “Your wait time is now 3 minutes” or “We’re connecting you to an agent now.”

The Result: Instead of a silent, frustrating wait, customers get a conversational experience that feels like they’re being personally attended to, even when they’re waiting in line.
