How does text-to-speech work?
- Read text
- Generate speech patterns
- Create audio

Using Text-To-Speech with Stream
The Vision Agents SDK simplifies text-to-speech integration by providing a clean, plugin-based system that handles all the complexity for you. Here’s how it works in practice:
- Choose Your Voice: Pick from popular TTS providers like ElevenLabs (for ultra-realistic voices), Cartesia, or Kokoro (for offline processing).
- Send Your Text: Simply call the agent.say() method with whatever text you want spoken, and the plugin handles the rest.
- Automatic Audio: The TTS service converts your text to speech and sends back high-quality audio.
- Seamless Integration: The SDK automatically routes the audio into your Stream call, so everyone hears it immediately.
- Real-time Experience: The speech plays instantly to all call participants, creating a natural conversation flow.
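The flow above can be sketched in a few lines of Python. This is a minimal, offline illustration of the plugin pattern, not the SDK's actual classes: TTSPlugin, FakeTTS, FakeCall, synthesize(), and publish_audio() are hypothetical names invented for this sketch, and only the say() method name comes from the text above. It also assumes say() is asynchronous, which may not match the real API.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Protocol


class TTSPlugin(Protocol):
    """Hypothetical interface: anything that turns text into audio bytes."""

    async def synthesize(self, text: str) -> bytes: ...


class FakeTTS:
    """Stand-in provider; a real plugin would call ElevenLabs, Cartesia, or Kokoro."""

    async def synthesize(self, text: str) -> bytes:
        # Pretend the UTF-8 encoded text is audio so the sketch runs offline.
        return text.encode("utf-8")


@dataclass
class FakeCall:
    """Stand-in for a Stream call; it just records the audio it receives."""

    received: list[bytes] = field(default_factory=list)

    async def publish_audio(self, audio: bytes) -> None:
        self.received.append(audio)


@dataclass
class Agent:
    """Hypothetical agent; only the say() name is taken from the article."""

    tts: TTSPlugin
    call: FakeCall

    async def say(self, text: str) -> None:
        audio = await self.tts.synthesize(text)  # text -> speech
        await self.call.publish_audio(audio)     # speech -> call participants


async def main() -> FakeCall:
    call = FakeCall()
    agent = Agent(tts=FakeTTS(), call=call)
    await agent.say("Thank you for calling TechCorp Support.")
    return call


call = asyncio.run(main())
print(len(call.received))  # number of audio chunks delivered to the call
```

Because the agent depends only on the small TTSPlugin interface, switching providers in this sketch means passing a different object as tts, which is the point of a plugin-based design.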

Worked example
Let’s walk through a real-world example to see how TTS works in your application. Imagine you’re building a customer support system where callers get placed in a queue. Here’s how TTS makes this experience feel personal and professional:
The Scenario: A customer calls your support line and gets placed in a queue.
What Happens:
- Your system detects the caller and generates a friendly message: “Thank you for calling TechCorp Support. Your estimated wait time is 5 minutes.”
- Instead of showing this as text on screen (which the caller can’t see), your TTS plugin converts it to natural speech that sounds like a real person.
- The voice speaks directly to the caller through the Stream call, creating an immediate human connection.
- As the queue updates, new messages are automatically spoken: “Your wait time is now 3 minutes” or “We’re connecting you to an agent now.”
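The queue scenario above might look roughly like this in code. This is a sketch under stated assumptions: queue_message() and announce_queue_updates() are hypothetical helpers, and the say argument is any callable that speaks a string. In a real application it would wrap agent.say(); here it simply collects the messages so the example runs on its own.

```python
from typing import Callable, Optional


def queue_message(wait_minutes: int, first_contact: bool = False) -> str:
    """Build the announcement text for the caller's current queue state."""
    if wait_minutes <= 0:
        return "We're connecting you to an agent now."
    unit = "minute" if wait_minutes == 1 else "minutes"
    if first_contact:
        return (
            "Thank you for calling TechCorp Support. "
            f"Your estimated wait time is {wait_minutes} {unit}."
        )
    return f"Your wait time is now {wait_minutes} {unit}."


def announce_queue_updates(wait_times: list[int], say: Callable[[str], None]) -> None:
    """Speak a message each time the estimated wait changes."""
    last: Optional[int] = None
    for minutes in wait_times:
        if minutes == last:
            continue  # nothing changed, so stay silent
        say(queue_message(minutes, first_contact=last is None))
        last = minutes


# Hypothetical usage: collect the messages instead of speaking them.
spoken: list[str] = []
announce_queue_updates([5, 5, 3, 0], spoken.append)
for line in spoken:
    print(line)
```

Keeping the message wording in a pure function separate from the speaking step makes the announcements easy to test without a live call or a TTS provider.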