Vision Agents requires a Stream account for real-time transport. Most plugin providers offer free tiers, which are enough to get started.

Plugin Categories

| Category | Plugins | Description |
|---|---|---|
| Realtime | OpenAI, Gemini, Qwen, AWS Bedrock | Native speech-to-speech over WebRTC/WebSocket |
| LLM | OpenAI, Gemini, OpenRouter, xAI, HuggingFace | Text generation with function calling |
| VLM | NVIDIA, HuggingFace, Moondream, OpenRouter | Video understanding via chat completions |
| STT | Deepgram, ElevenLabs, Fish, Fast-Whisper, Wizper | Speech-to-text transcription |
| TTS | ElevenLabs, Deepgram, Cartesia, Kokoro, Pocket, AWS Polly, Inworld | Text-to-speech synthesis |
| Turn Detection | Smart Turn, Vogent | Neural turn-taking detection |
| Video Processors | Ultralytics, Roboflow, Moondream, Decart, HeyGen | Detection, pose, style transfer, avatars |
| RAG | TurboPuffer, Gemini FileSearch | Vector search and knowledge retrieval |
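
Several providers appear in more than one row of the table. As a minimal sketch, the table can be encoded as a plain mapping and inverted to see which categories each provider covers. The names `PLUGIN_CATEGORIES`, `categories_by_provider`, and `multi_category_providers` are hypothetical and used for illustration only; they are not part of the Vision Agents API.

```python
from collections import defaultdict

# Plugin categories copied from the table above. This dict is an
# illustrative data structure, not something Vision Agents exposes.
PLUGIN_CATEGORIES = {
    "Realtime": ["OpenAI", "Gemini", "Qwen", "AWS Bedrock"],
    "LLM": ["OpenAI", "Gemini", "OpenRouter", "xAI", "HuggingFace"],
    "VLM": ["NVIDIA", "HuggingFace", "Moondream", "OpenRouter"],
    "STT": ["Deepgram", "ElevenLabs", "Fish", "Fast-Whisper", "Wizper"],
    "TTS": ["ElevenLabs", "Deepgram", "Cartesia", "Kokoro", "Pocket",
            "AWS Polly", "Inworld"],
    "Turn Detection": ["Smart Turn", "Vogent"],
    "Video Processors": ["Ultralytics", "Roboflow", "Moondream", "Decart",
                         "HeyGen"],
    "RAG": ["TurboPuffer", "Gemini FileSearch"],
}


def categories_by_provider(categories):
    """Invert a category -> plugins mapping into provider -> categories."""
    index = defaultdict(list)
    for category, plugins in categories.items():
        for plugin in plugins:
            index[plugin].append(category)
    return dict(index)


def multi_category_providers(categories):
    """Return only the providers listed under more than one category."""
    return {
        provider: cats
        for provider, cats in categories_by_provider(categories).items()
        if len(cats) > 1
    }


if __name__ == "__main__":
    for provider, cats in sorted(multi_category_providers(PLUGIN_CATEGORIES).items()):
        print(f"{provider}: {', '.join(cats)}")
```

A lookup like this can help when choosing a stack: a provider that covers both STT and TTS, for example, may mean one account to manage instead of two.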

