ElevenLabs is a voice AI platform that offers advanced Text-to-Speech (TTS) and Speech-to-Text (STT) capabilities with highly realistic and expressive voices.
It supports multiple languages and voices, making it ideal for real-time conversational agents, narrated content, accessibility tools, and voice-enabled applications.
The ElevenLabs plugin for the Stream Python AI SDK allows you to add both TTS and STT functionality to your project.

Installation

Install the Stream ElevenLabs plugin with:
uv add vision-agents[elevenlabs]
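If you use pip rather than uv, the equivalent install is the standard extras syntax. Quoting the package name avoids shells (such as zsh) expanding the square brackets:

```shell
# With uv, as above (quotes are optional in most shells but safe everywhere)
uv add "vision-agents[elevenlabs]"

# Or with pip
pip install "vision-agents[elevenlabs]"
```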

Example

Check out our ElevenLabs example to see a practical implementation of the plugin and get inspiration for your own projects, or read on for the key details.

Text-to-Speech (TTS)

Initialisation

The ElevenLabs TTS plugin exists in the form of the TTS class:
from vision_agents.plugins import elevenlabs

tts = elevenlabs.TTS()
To initialise without passing in the API key, make sure the ELEVENLABS_API_KEY is available as an environment variable. You can do this either by defining it in a .env file or exporting it directly in your terminal.

Parameters

These are the parameters available in the ElevenLabs TTS plugin for you to customise:
| Name | Type | Default | Description |
| --- | --- | --- | --- |
| api_key | str or None | None | Your ElevenLabs API key. If not provided, the plugin will look for the ELEVENLABS_API_KEY environment variable. |
| voice_id | str | "VR6AewLTigWG4xSOukaG" | The ID of the voice to use for TTS. You can use any voice from your ElevenLabs account. |
| model_id | str | "eleven_multilingual_v2" | The ID of the ElevenLabs TTS model to use. Controls the language and tone model for synthesis. |
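For example, to pick a specific voice and model, pass the parameters from the table above to the constructor. The IDs shown here are the documented defaults; substitute your own. Since this fragment only configures the plugin, it assumes the package is installed:

```python
from vision_agents.plugins import elevenlabs

tts = elevenlabs.TTS(
    voice_id="VR6AewLTigWG4xSOukaG",    # default voice; use any from your account
    model_id="eleven_multilingual_v2",  # default multilingual model
)
```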

Functionality

Send text to convert to speech

The send() method submits the given text to the service for synthesis. The resulting audio is then played through the configured output track.
tts.send("Demo text you want AI voice to say")

Speech-to-Text (STT)

ElevenLabs provides real-time speech-to-text capabilities through their Scribe v2 model, which offers low latency (~150ms) transcription with support for 99 languages.

Initialisation

The ElevenLabs STT plugin uses the STT class:
from vision_agents.plugins import elevenlabs

stt = elevenlabs.STT()
To initialise without passing in the API key, make sure the ELEVENLABS_API_KEY is available as an environment variable. You can do this either by defining it in a .env file or exporting it directly in your terminal.

Parameters

These are the parameters available in the ElevenLabs STT plugin for you to customise:
| Name | Type | Default | Description |
| --- | --- | --- | --- |
| api_key | str or None | None | Your ElevenLabs API key. If not provided, the plugin will look for the ELEVENLABS_API_KEY environment variable. |
| model_id | str | "scribe_v2_realtime" | The model to use for transcription. Defaults to the Scribe v2 realtime model. |
| language_code | str | "en" | Language code for transcription (e.g., "en", "es", "fr"). Supports 99 languages. |
| vad_silence_threshold_secs | float | 1.5 | VAD silence threshold in seconds before committing a transcript. |
| vad_threshold | float | 0.4 | VAD threshold for speech detection (0.0-1.0). |
| min_speech_duration_ms | int | 100 | Minimum speech duration in milliseconds to trigger transcription. |
| min_silence_duration_ms | int | 100 | Minimum silence duration in milliseconds to detect speech boundaries. |
| audio_chunk_duration_ms | int | 100 | Duration of audio chunks to send (100-1000ms recommended). |
| client | AsyncElevenLabs or None | None | Optional pre-configured AsyncElevenLabs client instance. |
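As an example, you can tune the VAD behaviour from the table above at construction time. The values here are illustrative, not recommendations, and the fragment assumes the package is installed:

```python
from vision_agents.plugins import elevenlabs

stt = elevenlabs.STT(
    language_code="es",              # transcribe Spanish
    vad_silence_threshold_secs=1.0,  # commit transcripts after 1s of silence
    vad_threshold=0.5,               # require stronger evidence of speech
)
```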

Features

  • Real-time transcription: Low latency (~150ms) speech recognition
  • Multi-language support: 99 languages supported
  • VAD-based commit strategy: Automatic transcript segmentation based on voice activity detection
  • Automatic reconnection: Built-in exponential backoff for connection failures
  • Audio resampling: Automatically resamples audio to 16kHz mono for optimal quality
The Scribe v2 model does not support turn detection. The turn_detection property is set to False for this implementation.
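To make the VAD parameters concrete, here is a toy segmenter over per-frame speech probabilities, showing how the threshold and minimum-duration settings interact. This is an illustration written for this page, not the plugin's actual implementation:

```python
def segment_speech(
    frame_probs: list[float],
    vad_threshold: float = 0.4,
    frame_ms: int = 100,
    min_speech_duration_ms: int = 100,
    min_silence_duration_ms: int = 100,
) -> list[tuple[int, int]]:
    """Return half-open (start, end) frame ranges judged to contain speech."""
    segments: list[tuple[int, int]] = []
    start: int | None = None
    silence_ms = 0
    for i, p in enumerate(frame_probs):
        if p >= vad_threshold:
            if start is None:
                start = i          # speech onset
            silence_ms = 0         # brief silences inside speech are bridged
        elif start is not None:
            silence_ms += frame_ms
            if silence_ms >= min_silence_duration_ms:
                # Enough silence: close the segment, dropping very short blips.
                end = i - silence_ms // frame_ms + 1
                if (end - start) * frame_ms >= min_speech_duration_ms:
                    segments.append((start, end))
                start = None
                silence_ms = 0
    if start is not None:
        # Speech ran to the end of the buffer.
        end = len(frame_probs)
        if (end - start) * frame_ms >= min_speech_duration_ms:
            segments.append((start, end))
    return segments
```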