ElevenLabs STT

ElevenLabs provides real-time speech-to-text via Scribe v2 with ~150ms latency, 99 languages, and built-in VAD-based turn detection. No separate turn detection plugin is needed.

Vision Agents requires a Stream account for real-time transport. Most providers offer free tiers to get started.

ElevenLabs also provides highly realistic text-to-speech. You can use both in the same agent.

Installation

uv add "vision-agents[elevenlabs]"

Quick Start

from vision_agents.core import Agent, User
from vision_agents.plugins import elevenlabs, gemini, getstream

agent = Agent(
    edge=getstream.Edge(),
    agent_user=User(name="Assistant", id="agent"),
    instructions="You are a helpful assistant.",
    llm=gemini.LLM("gemini-3-flash-preview"),
    stt=elevenlabs.STT(),
    tts=elevenlabs.TTS(),
)

Set ELEVENLABS_API_KEY in your environment or pass api_key directly.
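
If you pass the key explicitly, it can help to fail fast when the variable is missing rather than at the first transcription request. The following is a minimal sketch; `resolve_api_key` is a hypothetical helper written for illustration and is not part of Vision Agents.

```python
import os


def resolve_api_key(env_var: str = "ELEVENLABS_API_KEY") -> str:
    """Return the API key from the environment, or raise a clear error.

    Hypothetical helper for illustration; not part of Vision Agents.
    """
    # Treat an unset variable and a blank value the same way.
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"{env_var} is not set; export it or pass api_key directly"
        )
    return key


# Usage (requires the elevenlabs plugin to be installed):
# stt = elevenlabs.STT(api_key=resolve_api_key())
```

The returned string can then be passed as the `api_key` argument described under Parameters.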

Parameters

stt = elevenlabs.STT(
    model_id="scribe_v2_realtime",
    language_code="en",
)

Name	Type	Default	Description
`model_id`	`str`	`"scribe_v2_realtime"`	Scribe model
`language_code`	`str`	`"en"`	Language code
`api_key`	`str`	`None`	API key (defaults to `ELEVENLABS_API_KEY` env var)
`vad_silence_threshold_secs`	`float`	`0.3`	Silence duration (seconds) before VAD commits
`vad_threshold`	`float`	`0.4`	VAD sensitivity threshold for speech detection
`min_speech_duration_ms`	`int`	`100`	Minimum speech duration in milliseconds
`min_silence_duration_ms`	`int`	`100`	Minimum silence duration in milliseconds
`audio_chunk_duration_ms`	`int`	`100`	Audio chunk size sent to the server (100-1000ms)
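
To keep these settings in one place and catch out-of-range values before they reach the plugin, the table can be mirrored in a small settings object. This is a sketch under stated assumptions: `SttSettings` and `to_kwargs` are hypothetical names, the defaults are copied from the table above, and only the documented 100-1000 ms chunk range is enforced.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class SttSettings:
    """Hypothetical container mirroring the documented defaults."""

    model_id: str = "scribe_v2_realtime"
    language_code: str = "en"
    vad_silence_threshold_secs: float = 0.3
    vad_threshold: float = 0.4
    min_speech_duration_ms: int = 100
    min_silence_duration_ms: int = 100
    audio_chunk_duration_ms: int = 100

    def __post_init__(self) -> None:
        # The parameter table documents a 100-1000 ms range for chunk size.
        if not 100 <= self.audio_chunk_duration_ms <= 1000:
            raise ValueError(
                "audio_chunk_duration_ms must be between 100 and 1000"
            )

    def to_kwargs(self) -> dict:
        """Return the settings as keyword arguments keyed by parameter name."""
        return asdict(self)


settings = SttSettings(language_code="de", audio_chunk_duration_ms=250)

# Usage (requires the elevenlabs plugin to be installed):
# stt = elevenlabs.STT(**settings.to_kwargs())
```

Validating locally means a typo such as a 50 ms chunk size raises a `ValueError` at startup instead of surfacing later as a server-side error.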

ElevenLabs STT includes built-in turn detection via VAD. When you use elevenlabs.STT, the Agent automatically ignores any external TurnDetector plugin to prevent conflicts, so you do not need to add one.
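
To build intuition for how `vad_threshold` and `vad_silence_threshold_secs` interact, the toy model below scans per-frame speech probabilities and reports the frame at which a turn would be committed. It is a conceptual sketch only: the actual detection runs on the ElevenLabs side and its algorithm is not described here, and `find_commit_frame`, the per-frame probabilities, and the 100 ms frame size are assumptions made for illustration.

```python
from typing import Optional, Sequence


def find_commit_frame(
    speech_probs: Sequence[float],
    frame_ms: int = 100,
    vad_threshold: float = 0.4,
    vad_silence_threshold_secs: float = 0.3,
) -> Optional[int]:
    """Return the index of the frame where a turn would commit, or None.

    Toy model, not the ElevenLabs implementation: a frame counts as speech
    when its probability reaches vad_threshold, and the turn commits once
    enough consecutive silence follows some speech.
    """
    threshold_ms = round(vad_silence_threshold_secs * 1000)
    silence_ms = 0
    heard_speech = False
    for index, prob in enumerate(speech_probs):
        if prob >= vad_threshold:
            # Speech resets the running silence counter.
            heard_speech = True
            silence_ms = 0
        elif heard_speech:
            # Only silence that follows speech counts toward a commit.
            silence_ms += frame_ms
            if silence_ms >= threshold_ms:
                return index
    return None


probs = [0.1, 0.9, 0.8, 0.2, 0.1, 0.05, 0.9]
commit_at = find_commit_frame(probs)
```

In this model, with the default values and 100 ms frames, three consecutive silent frames after speech trigger the commit. Raising `vad_silence_threshold_secs` makes the model wait longer before ending a turn, and lowering `vad_threshold` makes it treat quieter audio as speech.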

Next Steps

ElevenLabs TTS

Expressive text-to-speech

Build a Voice Agent

Get started with voice
