Vision Agents ships with 30+ plugins that connect AI providers to your real-time voice and video applications. Each plugin wraps a provider’s API with a consistent interface — swap providers without rewriting your agent logic.
Vision Agents requires a Stream account for real-time transport. Most providers offer free tiers to get started.

Which plugin do I need?

Pick based on what your agent needs to do:
| I want to… | Start here | What you get |
| --- | --- | --- |
| Handle calls and respond naturally by voice | Realtime | End-to-end voice agent with multimodal support, unified under one plugin and model |
| Connect to my own tools, APIs, or knowledge base | Language Models | Function calling, RAG, and full control over STT/TTS choices |
| Transcribe what users say in real time | Speech-to-Text | Streaming transcription, some with built-in turn detection |
| Give my agent a distinct, natural voice | Text-to-Speech | Cloud and local options, from expressive to ultra-low latency |
| See and understand what’s on camera | Vision & Video | Object detection, video analysis, and style transfer |
| Put a face on my agent | Avatars | Real-time lip-synced visual characters |
| Make conversations feel natural, not robotic | Turn Detection | Smart interruption handling and silence detection |
| Run open-source models on my own infrastructure | Infrastructure | Self-hosted inference, model routing, and vector search |
| Deploy agents over Tencent’s network in China | Edge Transport | Alternative transport layer with low latency in mainland China |

Installation

Plugins install as extras. Add only the ones you need:
```bash
uv add "vision-agents[gemini,deepgram,elevenlabs]"
```
See the Installation guide for the full list of available extras.

Browse by Category

Language Models

Text generation with function calling. Requires separate STT/TTS plugins.
| Provider | Notes |
| --- | --- |
| Gemini | Built-in tools: search, code execution, RAG |
| OpenAI | Responses API (GPT-5+) and ChatCompletions |
| xAI (Grok) | Advanced reasoning, function calling |
| OpenRouter | Unified API for Claude, Gemini, GPT, and more |
| Kimi AI | OpenAI-compatible via ChatCompletions |
| Qwen | DashScope API via ChatCompletions |

Realtime

End-to-end speech-to-speech with built-in STT/TTS. Lowest latency, simplest setup.
| Provider | Notes |
| --- | --- |
| Gemini Realtime | WebSocket, optional video, built-in VAD |
| OpenAI Realtime | WebRTC, built-in STT/TTS |
| Qwen Realtime | Native audio I/O, video support |
| xAI Realtime | WebSocket, server VAD, web + X search |
| AWS Bedrock | Amazon Nova models, auto session management |

Speech-to-Text

Real-time transcription. Some include built-in turn detection.
| Provider | Notes |
| --- | --- |
| Deepgram | Nova-3, built-in turn detection |
| ElevenLabs | Scribe v2, ~150ms latency, built-in VAD |
| AssemblyAI | Punctuation-based turn detection |
| Fish Audio | Auto language detection |
| Mistral Voxtral | WebSocket streaming, requires separate turn detection |
| Fast-Whisper | Local, CPU/GPU accelerated |
| Wizper | Whisper v3, on-the-fly translation |

Text-to-Speech

Voice synthesis for agent responses.
| Provider | Notes |
| --- | --- |
| ElevenLabs | Highly realistic, multilingual |
| Cartesia | Low-latency Sonic model |
| Deepgram | Aura-2, low-latency |
| OpenAI | gpt-4o-mini-tts, streaming |
| Fish Audio | Prosody control, voice cloning |
| Inworld | Expressive game character voices |
| Kokoro | Local, runs on CPU, no API key |
| Pocket TTS | Local, ~200ms latency, voice cloning |
| xAI | Five expressive voices, speech tags |
| AWS Polly | Standard and neural engines |

Vision & Video

Video understanding, object detection, and video transformation.
| Provider | Notes |
| --- | --- |
| Moondream | Zero-shot detection, VQA, cloud or local |
| NVIDIA | Cosmos Reason2, real-time video understanding |
| Roboflow | Pre-trained and custom detection models |
| Ultralytics YOLO | Pose estimation, object detection |
| Decart | Real-time AI video style transfer |

Avatars

Visual AI characters with synchronized lip-sync.
| Provider | Notes |
| --- | --- |
| HeyGen | Realistic AI avatars, automatic lip-sync |
| LemonSlice | Real-time interactive avatars |

Turn Detection

Controls when the agent should start and stop speaking.
| Provider | Notes |
| --- | --- |
| Smart Turn | Silero VAD + Whisper features |
| Vogent | Neural turn completion prediction |
Deepgram and ElevenLabs STT include built-in turn detection — no separate plugin needed.
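To illustrate the simplest idea in this space, silence detection can be sketched as an energy threshold over audio frames. This is a toy example, not how Smart Turn or Vogent actually work (both use neural models on top of VAD):

```python
# Toy end-of-turn detector: the speaker is considered finished after
# N consecutive low-energy frames. Real turn-detection plugins combine
# VAD with acoustic and linguistic features instead of raw energy.

def detect_turn_end(frames, energy_threshold=0.01, silence_frames=5):
    """Return the index of the frame where the turn ends, or None."""
    silent = 0
    for i, frame in enumerate(frames):
        energy = sum(s * s for s in frame) / len(frame)
        silent = silent + 1 if energy < energy_threshold else 0
        if silent >= silence_frames:
            return i
    return None

# Ten frames of speech (high energy) followed by six frames of silence:
speech = [[0.5, -0.4, 0.3]] * 10
silence = [[0.001, -0.001, 0.0]] * 6
print(detect_turn_end(speech + silence))  # → 14
```

The weakness of this naive approach, and the reason dedicated plugins exist, is that a thoughtful pause looks identical to an end of turn.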

Infrastructure

Inference platforms and data services for running models on your own terms.
| Provider | Notes |
| --- | --- |
| Baseten | OpenAI-compatible endpoints for open-source models |
| HuggingFace Inference | Unified API routing to Together, Groq, Cerebras, and more |
| TurboPuffer | Vector database for RAG with hybrid search |

Edge Transport

Alternative real-time transport layers for deploying agents in specific regions.
| Provider | Notes |
| --- | --- |
| Tencent RTC | Low-latency in China, frontend SDKs (early access) |

Consistent Interface

Plugins of the same type share a common interface — swap providers in one line:
```python
# Any STT plugin works the same way
stt = deepgram.STT()
stt = elevenlabs.STT()
stt = fish.STT()

# Any TTS plugin works the same way
tts = elevenlabs.TTS()
tts = cartesia.TTS()
tts = kokoro.TTS()

# Any LLM plugin works the same way
llm = gemini.LLM("gemini-3-flash-preview")
llm = openai.LLM(model="gpt-5.4")
llm = openrouter.LLM(model="anthropic/claude-sonnet-4")
```
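This interchangeability works because plugins of a given type share a base interface, so agent logic never touches provider-specific code. A self-contained sketch of the idea using a Python Protocol — the class and method names here are hypothetical, not the actual vision-agents base classes:

```python
from typing import Protocol

class TTSProvider(Protocol):
    """Anything with a synthesize() method qualifies as a TTS plugin."""
    def synthesize(self, text: str) -> bytes: ...

class FakeElevenLabsTTS:
    def synthesize(self, text: str) -> bytes:
        return f"[elevenlabs audio for: {text}]".encode()

class FakeCartesiaTTS:
    def synthesize(self, text: str) -> bytes:
        return f"[cartesia audio for: {text}]".encode()

def speak(tts: TTSProvider, text: str) -> bytes:
    # Agent logic depends only on the interface, never the provider.
    return tts.synthesize(text)

# Swapping providers is a one-line change at the call site:
print(speak(FakeElevenLabsTTS(), "hello"))
print(speak(FakeCartesiaTTS(), "hello"))
```

Because the Protocol is structural, any provider exposing the right method works without inheriting from a common base class.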

Creating Custom Plugins

Don’t see your provider? Build your own plugin to connect additional services. See the Create Your Own Plugin guide.
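The guide covers the actual base classes and registration details; as a rough illustration of the pattern, a custom plugin is typically a subclass that implements the type's core hook. Everything below (`BaseSTT`, `MyProviderSTT`) is hypothetical, not part of the real SDK:

```python
# Rough shape of a custom STT plugin: subclass a base class and
# implement the transcription hook. Names are illustrative only;
# the real base API is documented in the Create Your Own Plugin guide.
import abc

class BaseSTT(abc.ABC):
    @abc.abstractmethod
    def transcribe(self, audio: bytes) -> str:
        """Convert an audio chunk to text."""

class MyProviderSTT(BaseSTT):
    def __init__(self, api_key: str):
        self.api_key = api_key

    def transcribe(self, audio: bytes) -> str:
        # Here you would call your provider's transcription API.
        return f"<{len(audio)} bytes transcribed>"

stt = MyProviderSTT(api_key="sk-...")
print(stt.transcribe(b"\x00" * 320))  # → <320 bytes transcribed>
```

Implementing the shared interface is what lets a custom plugin drop into the same one-line-swap pattern as the built-in providers.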