Vision Agents ships with 25+ plugins that connect AI providers to your real-time voice and video applications. Each plugin wraps a provider’s API with a consistent interface—swap providers without rewriting your agent logic.
Vision Agents requires a Stream account for real-time transport. Most providers offer free tiers to get started.
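
Because plugins of the same type share an interface, switching providers is typically a one-line change. The sketch below wires an STT, LLM, and TTS plugin into an agent; the Agent constructor arguments and the vision_agents import paths are assumptions made for illustration, so check the quickstart for the exact setup (including the Stream edge transport).

# Minimal sketch: Agent constructor arguments and import paths are assumed,
# not taken from this page; see the quickstart for the exact setup.
from vision_agents.core import Agent                             # assumed import path
from vision_agents.plugins import deepgram, elevenlabs, gemini   # assumed import path

agent = Agent(
    llm=gemini.LLM("gemini-2.5-flash"),   # swap for openai.LLM(...) or openrouter.LLM(...)
    stt=deepgram.STT(),                   # swap for fish.STT() or fast_whisper.STT()
    tts=elevenlabs.TTS(),                 # swap for cartesia.TTS() or kokoro.TTS()
)

Swapping Deepgram for Fish, for example, only changes the stt= line; the rest of the agent logic stays the same.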

Plugin Categories

| Category | Plugins | Description |
| --- | --- | --- |
| Realtime | OpenAI, Gemini, Qwen, AWS Bedrock | Native speech-to-speech over WebRTC/WebSocket |
| LLM | OpenAI, Gemini, OpenRouter, xAI, HuggingFace | Text generation with function calling |
| VLM | NVIDIA, HuggingFace, Moondream, OpenRouter | Video understanding via chat completions |
| STT | Deepgram, ElevenLabs, Fish, Fast-Whisper, Wizper | Speech-to-text transcription |
| TTS | ElevenLabs, Deepgram, Cartesia, Kokoro, Pocket, AWS Polly, Inworld | Text-to-speech synthesis |
| Turn Detection | Smart Turn, Vogent | Neural turn-taking detection |
| Video Processors | Ultralytics, Roboflow, Moondream, Decart, HeyGen | Detection, pose, style transfer, avatars |
| RAG | TurboPuffer, Gemini FileSearch | Vector search and knowledge retrieval |

Installation

Plugins install as extras. Add only the ones you need:
uv add "vision-agents[gemini,deepgram,elevenlabs]"
See the Installation guide for the full list of available plugins.
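
If you use pip instead of uv, the same extras syntax applies, assuming you install the published vision-agents package from PyPI:
pip install "vision-agents[gemini,deepgram,elevenlabs]"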

Consistent Interface

Plugins of the same type share a common interface:
from vision_agents.plugins import (  # assumed: plugin modules share this namespace
    cartesia, deepgram, elevenlabs, fast_whisper, fish,
    gemini, kokoro, openai, openrouter,
)

# STT plugins implement process_audio() and emit transcript events
stt = deepgram.STT()
stt = fish.STT()
stt = fast_whisper.STT()

# TTS plugins implement send()
tts = elevenlabs.TTS()
tts = cartesia.TTS()
tts = kokoro.TTS()

# LLM plugins implement simple_response() and register_function()
llm = gemini.LLM("gemini-2.5-flash")
llm = openai.LLM(model="gpt-4o")
llm = openrouter.LLM(model="anthropic/claude-sonnet-4")
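
Function calling follows from the shared LLM interface. The sketch below registers a plain Python function as a tool and asks for a response; register_function() and simple_response() are named on this page, but their exact signatures (and whether simple_response() is awaitable) are assumptions, so treat this as a shape rather than a drop-in snippet.

import asyncio

from vision_agents.plugins import gemini  # assumed import path


def get_weather(city: str) -> str:
    """Hypothetical tool: return a canned weather summary for a city."""
    return f"Sunny and 22 C in {city}"


async def main() -> None:
    llm = gemini.LLM("gemini-2.5-flash")
    llm.register_function(get_weather)  # assumed: registers a callable as a tool
    # assumed: simple_response() is awaitable and accepts a plain text prompt
    reply = await llm.simple_response("What's the weather in Boulder?")
    print(reply)


asyncio.run(main())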

Creating Custom Plugins

Build your own plugins to connect additional providers. See the Create Your Own Plugin guide.
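
As a rough shape, a custom plugin subclasses the base class for its category and implements that category's methods (process_audio() for STT, send() for TTS, and so on). The base class name and module path below are hypothetical placeholders; the Create Your Own Plugin guide documents the real ones.

# Hypothetical sketch: the base class and its module path are placeholders,
# not the library's actual API; see the Create Your Own Plugin guide.
from vision_agents.core.tts import TTS  # placeholder import


class AcmeTTS(TTS):
    """TTS plugin for a hypothetical provider called Acme."""

    def __init__(self, voice: str = "default"):
        super().__init__()
        self.voice = voice

    async def send(self, text: str) -> None:
        # Call Acme's synthesis API here and forward the resulting
        # audio to the agent's output track.
        ...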