Vision Agents ships with 30+ plugins that connect AI providers to your real-time voice and video applications. Each plugin wraps a provider’s API with a consistent interface, so you can swap providers without rewriting your agent logic.
Vision Agents requires a Stream account for real-time transport. Most providers offer free tiers to get started.

Which plugin do I need?

Pick based on what your agent needs to do:
| I want to… | Start here | What you get |
| --- | --- | --- |
| Handle calls and respond naturally by voice | Realtime | End-to-end voice agent with multimodal support, unified under one plugin and model |
| Connect to my own tools, APIs, or knowledge base | Language Models | Function calling, RAG, and full control over STT/TTS choices |
| Transcribe what users say in real time | Speech-to-Text | Streaming transcription, some with built-in turn detection |
| Give my agent a distinct, natural voice | Text-to-Speech | Cloud and local options, from expressive to ultra-low latency |
| See and understand what’s on camera | Vision & Video | Object detection, video analysis, and style transfer |
| Put a face on my agent | Avatars | Real-time lip-synced visual characters |
| Make conversations feel natural, not robotic | Turn Detection | Smart interruption handling and silence detection |
| Run open-source models on my own infrastructure | Infrastructure | Self-hosted inference, model routing, and vector search |
| Connect users to my agent over WebRTC | Edge Transport | Stream’s global edge network, sub-500ms latency with frontend SDKs |
| Deploy agents over Tencent’s network in China | Edge Transport | Alternative transport layer with low latency in mainland China |
| Connect phone calls (PSTN) to my agent | Telephony | Inbound/outbound calls via Twilio or Telnyx + media streaming |

Installation

New project

Scaffold a ready-to-run agent project with the CLI. Requires uv on your PATH:
uvx vision-agents init my-agent
cd my-agent
cp .env.example .env   # fill in API keys
uv run agent.py run
See the Quickstart for a full walkthrough.

Add plugins

Plugins install as extras. Add only the ones you need to an existing project:
uv add "vision-agents[gemini,deepgram,elevenlabs]"
You can also add explicit plugin packages (the style used by init):
uv add vision-agents-plugins-deepgram vision-agents-plugins-elevenlabs
Browse the categories below for available plugins and their install commands.
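The two install styles above follow a regular naming pattern, which the sketch below turns into command strings. It is illustrative only: the helper names are hypothetical, and it assumes every plugin follows the same `vision-agents[<name>]` and `vision-agents-plugins-<name>` pattern shown in the examples, which may not hold for all plugins. Check each plugin's page for its actual install command.

```python
def extras_command(plugins):
    """Build the `uv add` command that installs plugins as extras."""
    names = ",".join(plugins)
    return f'uv add "vision-agents[{names}]"'


def packages_command(plugins):
    """Build the `uv add` command that installs explicit plugin packages.

    Assumes the `vision-agents-plugins-<name>` pattern from the example above.
    """
    packages = " ".join(f"vision-agents-plugins-{name}" for name in plugins)
    return f"uv add {packages}"


chosen = ["deepgram", "elevenlabs"]
print(extras_command(chosen))
print(packages_command(chosen))
```

Both commands add the same plugins; the extras form keeps a single dependency entry, while the explicit form lists each plugin package separately.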

Browse by Category

Language Models

Text generation with function calling. Requires separate STT/TTS plugins.
| Provider | Notes |
| --- | --- |
| Anthropic (Claude) | Messages API, streaming, function calling |
| Gemini | Built-in tools: search, code execution, RAG |
| OpenAI | Responses API (GPT-5+) and ChatCompletions |
| xAI (Grok) | Advanced reasoning, function calling |
| OpenRouter | Unified API for Claude, Gemini, GPT, and more |
| Kimi AI | OpenAI-compatible via ChatCompletions |
| MiniMax | MiniMax-M3 and M-series, OpenAI-compatible |
| Qwen | DashScope API via ChatCompletions |

Realtime

End-to-end speech-to-speech with built-in STT/TTS. Lowest latency, simplest setup.
| Provider | Notes |
| --- | --- |
| Gemini Realtime | WebSocket, optional video, built-in VAD |
| Inworld Realtime | WebRTC, protocol-compatible with OpenAI |
| OpenAI Realtime | WebRTC, built-in STT/TTS |
| Qwen Realtime | Native audio I/O, video support |
| xAI Realtime | WebSocket, server VAD, web + X search |
| AWS Bedrock | Amazon Nova models, auto session management |

Speech-to-Text

Real-time transcription. Some include built-in turn detection.
| Provider | Notes |
| --- | --- |
| Deepgram | Nova-3, built-in turn detection |
| ElevenLabs | Scribe v2, ~150ms latency, built-in VAD |
| AssemblyAI | Punctuation-based turn detection |
| Cartesia | Ink model, streaming PCM, turn detection |
| Fish Audio | Auto language detection |
| Mistral Voxtral | WebSocket streaming, requires separate turn detection |
| Fast-Whisper | Local, CPU/GPU accelerated |
| Wizper | Whisper v3, on-the-fly translation |

Text-to-Speech

Voice synthesis for agent responses.
| Provider | Notes |
| --- | --- |
| ElevenLabs | Highly realistic, multilingual |
| Cartesia | Low-latency Sonic model |
| Deepgram | Aura-2, low-latency |
| OpenAI | gpt-4o-mini-tts, streaming |
| Fish Audio | Prosody control, voice cloning |
| Inworld | Expressive game character voices |
| Kokoro | Local, runs on CPU, no API key |
| Pocket TTS | Local, ~200ms latency, voice cloning |
| xAI | Five expressive voices, speech tags |
| AWS Polly | Standard and neural engines |

Vision & Video

Video understanding, object detection, and video transformation.
| Provider | Notes |
| --- | --- |
| Moondream | Zero-shot detection, VQA, cloud or local |
| NVIDIA | Cosmos Reason2, real-time video understanding |
| Roboflow | Pre-trained and custom detection models |
| Ultralytics YOLO | Pose estimation, object detection |
| Decart | Real-time AI video style transfer |

Avatars

Visual AI characters with synchronized lip-sync.
| Provider | Notes |
| --- | --- |
| Anam | Real-time conversational avatars |
| LiveAvatar | Realistic AI avatars (HeyGen), automatic lip-sync |
| LemonSlice | Real-time interactive avatars |

Turn Detection

Controls when the agent should start and stop speaking.
| Provider | Notes |
| --- | --- |
| Smart Turn | Silero VAD + Whisper features |
| Vogent | Neural turn completion prediction |
The Deepgram and ElevenLabs STT plugins include built-in turn detection, so agents that use them need no separate turn detection plugin.
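As a rough planning aid, the notes on this page can be written down as a lookup that tells you whether an STT choice also calls for a turn detection plugin. This is a sketch, not library behavior: the dictionary is transcribed by hand from the note above and the Speech-to-Text table, the keys are plain labels rather than module names, and the function name is hypothetical. Providers whose table notes say nothing about turn detection return `None` so that you check their docs rather than guess.

```python
# Transcribed from this page's notes; NOT read from the library at runtime.
# True means the STT provider is listed with its own turn detection.
BUILT_IN_TURN_DETECTION = {
    "deepgram": True,          # "built-in turn detection"
    "elevenlabs": True,        # named in the note above
    "assemblyai": True,        # "punctuation-based turn detection"
    "cartesia": True,          # "turn detection"
    "mistral voxtral": False,  # "requires separate turn detection"
}


def needs_separate_turn_detection(stt_provider):
    """Return True or False when this page says so, otherwise None."""
    key = stt_provider.strip().lower()
    if key not in BUILT_IN_TURN_DETECTION:
        return None  # this page does not say; check the provider's docs
    return not BUILT_IN_TURN_DETECTION[key]


for name in ["Deepgram", "Mistral Voxtral", "Wizper"]:
    print(name, needs_separate_turn_detection(name))
```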

Infrastructure

Inference platforms and data services for running models on your own terms.
| Provider | Notes |
| --- | --- |
| Baseten | OpenAI-compatible endpoints for open-source models |
| HuggingFace Inference | Unified API routing to Together, Groq, Cerebras, and more |
| TurboPuffer | Vector database for RAG with hybrid search |

Edge Transport

Real-time transport layers that connect users to your agent: Stream’s default transport, plus local and region-specific alternatives.
| Provider | Notes |
| --- | --- |
| Stream Video RTC | Default transport: global WebRTC, chat-backed conversation, frontend SDKs |
| Local transport | Microphone, speakers, and camera as the agent edge |
| Tencent RTC | Low-latency in China, frontend SDKs |

Telephony

PSTN phone call integration: bridge inbound and outbound calls into a Stream call.
| Provider | Notes |
| --- | --- |
| Twilio | Media Streams, TwiML, built-in webhook helpers |
| Telnyx | Call Control, bidirectional media streaming, Stream bridge |

Consistent Interface

Plugins of the same type share a common interface. Swap providers in one line:
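# Assumes the matching plugin extras are installed
from vision_agents.plugins import (
    cartesia, deepgram, elevenlabs, fish, gemini, kokoro, openai, openrouter,
)

# Any STT plugin works the same way (each line is an alternative; keep one)
stt = deepgram.STT()
stt = elevenlabs.STT()
stt = fish.STT()

# Any TTS plugin works the same way
tts = elevenlabs.TTS()
tts = cartesia.TTS()
tts = kokoro.TTS()

# Any LLM plugin works the same way
llm = gemini.LLM("gemini-3-flash-preview")
llm = openai.LLM(model="gpt-5.4")
llm = openrouter.LLM(model="anthropic/claude-sonnet-4")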
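To see why a shared interface makes the swap a one-line change, here is a self-contained sketch of the idea using stand-in classes. Nothing in it is the Vision Agents API: `SpeechToText`, `transcribe`, `FakeProviderA`, and `FakeProviderB` are hypothetical names chosen for illustration, and the real plugin base classes and methods may look quite different.

```python
from typing import Protocol


class SpeechToText(Protocol):
    """Hypothetical shared interface; the real plugin base class may differ."""

    def transcribe(self, audio: bytes) -> str: ...


class FakeProviderA:
    """Stand-in for one provider's plugin."""

    def transcribe(self, audio: bytes) -> str:
        return f"provider-a heard {len(audio)} bytes"


class FakeProviderB:
    """Stand-in for a second provider with the same method signature."""

    def transcribe(self, audio: bytes) -> str:
        return f"provider-b heard {len(audio)} bytes"


def handle_audio(stt: SpeechToText, audio: bytes) -> str:
    # Agent logic depends only on the shared method, never on a provider class.
    return stt.transcribe(audio)


stt = FakeProviderA()  # swap to FakeProviderB() and nothing else changes
print(handle_audio(stt, b"\x00\x01\x02"))
```

Because `handle_audio` is written against the interface, changing providers touches only the line that constructs `stt`.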

Creating Custom Plugins

Don’t see your provider? Build your own plugin to connect additional services. See the Create Your Own Plugin guide.