ElevenLabs provides real-time speech-to-text via Scribe v2 with ~150ms latency, 99 languages, and built-in VAD-based turn detection. No separate turn detection plugin is needed.
Vision Agents requires a Stream account for real-time transport. Most providers offer free tiers to get started.
Installation
Quick Start
Parameters
| Name | Type | Default | Description |
|---|---|---|---|
| `model_id` | `str` | `"scribe_v2_realtime"` | Scribe model |
| `language_code` | `str` | `"en"` | Language code |
| `api_key` | `str` | `None` | API key (defaults to the `ELEVENLABS_API_KEY` env var) |
| `vad_silence_threshold_secs` | `float` | `0.3` | Silence duration (seconds) before VAD commits |
| `vad_threshold` | `float` | `0.4` | VAD sensitivity threshold for speech detection |
| `min_speech_duration_ms` | `int` | `100` | Minimum speech duration in milliseconds |
| `min_silence_duration_ms` | `int` | `100` | Minimum silence duration in milliseconds |
| `audio_chunk_duration_ms` | `int` | `100` | Audio chunk size sent to the server (100-1000 ms) |
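
To make the defaults and the documented chunk-size range concrete, here is a minimal sketch of a settings container that mirrors the table. `ScribeSTTConfig` and `chunk_bytes` are hypothetical names introduced for illustration and are not part of the plugin. The 16 kHz, 16-bit mono PCM format used in `chunk_bytes` is also an assumption, not something this page states.

```python
import os
from dataclasses import dataclass
from typing import Optional


@dataclass
class ScribeSTTConfig:
    """Hypothetical container mirroring the parameter table (not the plugin's class)."""

    model_id: str = "scribe_v2_realtime"
    language_code: str = "en"
    api_key: Optional[str] = None
    vad_silence_threshold_secs: float = 0.3
    vad_threshold: float = 0.4
    min_speech_duration_ms: int = 100
    min_silence_duration_ms: int = 100
    audio_chunk_duration_ms: int = 100

    def __post_init__(self) -> None:
        # Fall back to the environment variable named in the table.
        if self.api_key is None:
            self.api_key = os.environ.get("ELEVENLABS_API_KEY")
        # The table documents a 100-1000 ms range for the chunk size.
        if not 100 <= self.audio_chunk_duration_ms <= 1000:
            raise ValueError("audio_chunk_duration_ms must be between 100 and 1000")

    def chunk_bytes(self, sample_rate_hz: int = 16000, bytes_per_sample: int = 2) -> int:
        # Assumed audio format: 16 kHz, 16-bit mono PCM.
        return sample_rate_hz * bytes_per_sample * self.audio_chunk_duration_ms // 1000


if __name__ == "__main__":
    cfg = ScribeSTTConfig(vad_silence_threshold_secs=0.5)
    print(cfg.model_id, cfg.chunk_bytes())
```

Because the field names match the table, a dictionary built with `dataclasses.asdict(cfg)` could presumably be forwarded as keyword arguments to the plugin's STT constructor, but check the plugin source before relying on that.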
Next Steps
- ElevenLabs TTS: Expressive text-to-speech
- Build a Voice Agent: Get started with voice

