Installation

Install Vision Agents from PyPI. We recommend uv as the package manager and Python 3.12 (CPython) on your machine. For the best development experience, add our MCP server and Skill.md to your preferred coding tools.

uv add vision-agents

The SDK installs without provider packages by default. Add the ones you need:

uv add "vision-agents[getstream,gemini,deepgram,elevenlabs]" python-dotenv
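
To confirm the install took effect, a short check along these lines can help. This is a minimal sketch that uses only the standard library's importlib.metadata; the helper name installed_version is hypothetical, and it assumes the distribution is registered under the name vision-agents, as in the commands above.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # "vision-agents" is the distribution name used in the uv commands above.
    found = installed_version("vision-agents")
    print(f"vision-agents: {found or 'not installed'}")
```

Saved as a script inside the project, it can be run with uv run so that it sees the project's environment rather than the system Python.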

You’ll need API keys for Stream and each provider you use. Stream offers 333,000 free participant minutes monthly, plus additional credits through the Maker Program for indie developers.
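
Because each provider needs its own key, it can help to fail fast when one is missing. The sketch below assumes the keys live in a .env file loaded with python-dotenv (installed above). The variable names in REQUIRED_KEYS are placeholders for illustration and are not confirmed by this page, so check each provider's documentation for the exact names it expects. The helper missing_keys is likewise hypothetical.

```python
import os
from typing import Iterable, List, Mapping

# Placeholder names: assumptions for illustration, not confirmed by this page.
REQUIRED_KEYS = ["STREAM_API_KEY", "STREAM_API_SECRET", "GOOGLE_API_KEY"]


def missing_keys(required: Iterable[str], env: Mapping[str, str]) -> List[str]:
    """Return the required variable names that are unset or blank in env."""
    return [name for name in required if not env.get(name, "").strip()]


if __name__ == "__main__":
    try:
        from dotenv import load_dotenv

        load_dotenv()  # loads variables from a .env file if one is found
    except ImportError:
        pass  # python-dotenv not installed; rely on the process environment

    missing = missing_keys(REQUIRED_KEYS, os.environ)
    if missing:
        raise SystemExit(f"Missing API keys: {', '.join(missing)}")
    print("All required API keys are set.")
```

Keeping the check as a pure function over a mapping makes it easy to test without touching the real environment.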

Available Plugins

LLMs & Realtime

Plugin	Description	Docs
`gemini`	Realtime API (WebSocket) + LLM with function calling	Gemini
`openai`	Realtime API (WebRTC) + LLM + TTS	OpenAI
`openrouter`	Unified access to Claude, Gemini, GPT, and more	OpenRouter
`anthropic`	Claude models with function calling	—
`xai`	Grok models	xAI
`huggingface`	LLM and VLM via HuggingFace Inference API	HuggingFace
`qwen`	Qwen 3 Realtime with native audio I/O	Qwen
`aws`	Nova Realtime + Polly TTS	AWS Bedrock

Speech (STT & TTS)

Plugin	STT	TTS	Description	Docs
`deepgram`	✓	✓	Fast transcription with turn detection	Deepgram
`elevenlabs`		✓	Expressive voices for conversational AI	ElevenLabs
`cartesia`		✓	Low-latency TTS with audio markup	Cartesia
`pocket`		✓	CPU-based TTS with voice cloning	Pocket
`fish`	✓	✓	Voice cloning and auto language detection	Fish Audio
`fast_whisper`	✓		Local Whisper with CTranslate2	Fast-Whisper
`wizper`	✓		STT with real-time translation	Wizper
`kokoro`		✓	Local TTS for offline use	Kokoro
`inworld`		✓	Streaming expressive voices for realtime applications	Inworld

Vision & Video

Plugin	Description	Docs
`nvidia`	Cosmos 2 VLM for video understanding	NVIDIA
`ultralytics`	YOLO detection, pose, segmentation	Ultralytics
`roboflow`	Cloud or local detection with RF-DETR	Roboflow
`moondream`	Detection, captioning, VQA	Moondream
`decart`	Real-time video style transfer	Decart
`heygen`	Interactive avatars	HeyGen

Turn Detection

Plugin	Description	Docs
`smart_turn`	Neural turn detection with Silero VAD	Smart Turn
`vogent`	Intelligent turn-taking	Vogent

Infrastructure

Plugin	Description	Docs
`getstream`	Edge network for low-latency transport	—
`twilio`	Phone integration (inbound/outbound)	Calling Guide
`turbopuffer`	Vector search for RAG	Turbopuffer

Next Steps

Build a Voice Agent

Build a Video Agent
