To get started with the Vision Agents framework, install the package from PyPI. We recommend using uv, a free, open-source package manager:
uv add vision-agents 
By default, the SDK installs without any plugins. To install the plugins you need, list them as extras:
uv add "vision-agents[getstream, openai, elevenlabs, deepgram]"
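After installing, it can be useful to confirm that the package is visible in the active environment. The following is a minimal sketch using only the Python standard library; the helper name `installed_version` is hypothetical and not part of the SDK, and it assumes the distribution name matches the one used in the install command (`vision-agents`).

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(distribution: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        # The distribution is not installed in the current environment.
        return None


if __name__ == "__main__":
    # Assumption: the distribution name is the same as in `uv add vision-agents`.
    found = installed_version("vision-agents")
    if found is None:
        print("vision-agents is not installed in this environment")
    else:
        print(f"vision-agents {found} is installed")
```

Inside a uv-managed project, running the script with `uv run` should check the project's environment rather than the system interpreter.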
Before running, you will also need a free API key from Stream. Developers building with Stream receive 333,000 free participant minutes each month, and indie developers and small businesses can apply to our Maker Program, which includes an additional $500 worth of credits each month. The other providers also offer free development keys on their respective websites.
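Because each provider needs its own key, a quick pre-flight check can catch a missing one before anything runs. The sketch below assumes the keys are supplied through environment variables. The variable names in `ASSUMED_KEYS` are illustrative guesses rather than names confirmed by this page, so check them against each plugin's documentation. The helper `missing_keys` is hypothetical.

```python
import os
from typing import List, Mapping, Sequence

# Assumed environment variable names (not confirmed by this page);
# adjust them to whatever each plugin's documentation specifies.
ASSUMED_KEYS = (
    "STREAM_API_KEY",
    "STREAM_API_SECRET",
    "OPENAI_API_KEY",
    "ELEVENLABS_API_KEY",
    "DEEPGRAM_API_KEY",
)


def missing_keys(env: Mapping[str, str], required: Sequence[str] = ASSUMED_KEYS) -> List[str]:
    """Return the required names that are unset or blank in env, in order."""
    return [name for name in required if not env.get(name, "").strip()]


if __name__ == "__main__":
    missing = missing_keys(os.environ)
    if missing:
        print("Missing API keys: " + ", ".join(missing))
    else:
        print("All expected API keys are set")
```

Passing the environment in as a mapping keeps the check easy to test and lets you validate only the keys for the plugins you actually installed.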
| Plugin Name | Description | Docs Link |
| --- | --- | --- |
| AWS Polly | TTS plugin using Amazon’s cloud-based service with natural-sounding voices and neural engine support | AWS Polly |
| Cartesia | TTS plugin for realistic voice synthesis in real-time voice applications | Cartesia |
| Decart | Real-time AI video transformation service for applying artistic styles and effects to video streams | Decart |
| Deepgram | STT plugin for fast, accurate real-time transcription with speaker diarization | Deepgram |
| ElevenLabs | TTS plugin with highly realistic and expressive voices for conversational agents | ElevenLabs |
| Fast-Whisper | High-performance STT plugin using OpenAI’s Whisper model with CTranslate2 for fast inference | Fast-Whisper |
| Fish Audio | STT and TTS plugin with automatic language detection and voice cloning capabilities | Fish Audio |
| Gemini | Realtime API for building conversational agents with support for both voice and video | Gemini |
| HeyGen | Realtime interactive avatars powered by HeyGen | Heygen |
| Inworld | TTS plugin with high-quality streaming voices for real-time conversational AI agents | Inworld |
| Kokoro | Local TTS engine for offline voice synthesis with low latency | Kokoro |
| Moondream | Moondream provides realtime detection and VLM capabilities. Developers can choose from using the hosted API or running locally on their CUDA devices. Vision Agents supports Moondream’s Detect, Caption and VQA skills out-of-the-box. | Moondream |
| OpenAI | Realtime API for building conversational agents with out of the box support for real-time video directly over WebRTC, LLMs and Open AI TTS | OpenAI |
| Smart Turn | Advanced turn detection system combining Silero VAD, Whisper, and neural models for natural conversation flow | Smart Turn |
| Vogent | Neural turn detection system for intelligent turn-taking in voice conversations | Vogent |
| Wizper | STT plugin with real-time translation capabilities powered by Whisper v3 | Wizper |
| xAI | LLM plugin using xAI’s Grok models with advanced reasoning and real-time knowledge | xAI |