> ## Documentation Index
> Fetch the complete documentation index at: https://visionagents.ai/llms.txt
> Use this file to discover all available pages before exploring further.

# Introduction to Integrations

Vision Agents ships with 30+ plugins that connect AI providers to your real-time voice and video applications. Each plugin wraps a provider's API with a consistent interface — swap providers without rewriting your agent logic.

<Info>
  Vision Agents uses [Stream](https://getstream.io/try-for-free/) as its default real-time transport, which requires a Stream account. Most providers offer free tiers to get started.
</Info>

## Which plugin do I need?

Pick based on what your agent needs to do:

| I want to...                                     | Start here                          | What you get                                                                       |
| ------------------------------------------------ | ----------------------------------- | ---------------------------------------------------------------------------------- |
| Handle calls and respond naturally by voice      | [Realtime](#realtime)               | End-to-end voice agent with multimodal support, unified under one plugin and model |
| Connect to my own tools, APIs, or knowledge base | [Language Models](#language-models) | Function calling, RAG, and full control over STT/TTS choices                       |
| Transcribe what users say in real time           | [Speech-to-Text](#speech-to-text)   | Streaming transcription, some with built-in turn detection                         |
| Give my agent a distinct, natural voice          | [Text-to-Speech](#text-to-speech)   | Cloud and local options, from expressive to ultra-low latency                      |
| See and understand what's on camera              | [Vision & Video](#vision--video)    | Object detection, video analysis, and style transfer                               |
| Put a face on my agent                           | [Avatars](#avatars)                 | Real-time lip-synced visual characters                                             |
| Make conversations feel natural, not robotic     | [Turn Detection](#turn-detection)   | Smart interruption handling and silence detection                                  |
| Run open-source models on my own infrastructure  | [Infrastructure](#infrastructure)   | Self-hosted inference, model routing, and vector search                            |
| Connect users to my agent over WebRTC            | [Edge Transport](#edge-transport)   | Stream's global edge network — sub-500ms latency with frontend SDKs                |
| Deploy agents over Tencent's network in China    | [Edge Transport](#edge-transport)   | Alternative transport layer with low latency in mainland China                     |

## Installation

Plugins install as extras. Add only the ones you need:

```sh
uv add "vision-agents[gemini,deepgram,elevenlabs]"
```

See the [Installation guide](/introduction/installation) for the full list of available extras.
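
Because each plugin ships as an optional extra, it can help to confirm which ones are actually installed before building an agent. The sketch below assumes plugins are importable as `vision_agents.plugins.<name>`; the helper name `plugin_available` is illustrative and not part of the library.

```python
import importlib.util


def plugin_available(name: str, package: str = "vision_agents.plugins") -> bool:
    """Return True if the module `<package>.<name>` can be located.

    The default package path is an assumption about how plugins are laid out.
    """
    try:
        return importlib.util.find_spec(f"{package}.{name}") is not None
    except ModuleNotFoundError:
        # Raised when the parent package itself is not installed.
        return False


if __name__ == "__main__":
    # Names mirror the extras used in the install command above.
    for extra in ("gemini", "deepgram", "elevenlabs"):
        status = "installed" if plugin_available(extra) else "missing"
        print(f"{extra}: {status}")
```

A missing entry usually means the matching extra was not included in the `uv add` command.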

## Browse by Category

### Language Models

Text generation with function calling. Requires separate STT/TTS plugins.

| Provider                                          | Notes                                         |
| ------------------------------------------------- | --------------------------------------------- |
| [Anthropic (Claude)](/integrations/llm/anthropic) | Messages API, streaming, function calling     |
| [Gemini](/integrations/llm/gemini)                | Built-in tools: search, code execution, RAG   |
| [OpenAI](/integrations/llm/openai)                | Responses API (GPT-5+) and ChatCompletions    |
| [xAI (Grok)](/integrations/llm/xai)               | Advanced reasoning, function calling          |
| [OpenRouter](/integrations/llm/openrouter)        | Unified API for Claude, Gemini, GPT, and more |
| [Kimi AI](/integrations/llm/kimi)                 | OpenAI-compatible via ChatCompletions         |
| [MiniMax](/integrations/llm/minimax)              | MiniMax-M3 and M-series, OpenAI-compatible    |
| [Qwen](/integrations/llm/qwen)                    | DashScope API via ChatCompletions             |

### Realtime

End-to-end speech-to-speech with built-in STT/TTS. Lowest latency, simplest setup.

| Provider                                           | Notes                                       |
| -------------------------------------------------- | ------------------------------------------- |
| [Gemini Realtime](/integrations/realtime/gemini)   | WebSocket, optional video, built-in VAD     |
| [Inworld Realtime](/integrations/realtime/inworld) | WebRTC, protocol-compatible with OpenAI     |
| [OpenAI Realtime](/integrations/realtime/openai)   | WebRTC, built-in STT/TTS                    |
| [Qwen Realtime](/integrations/realtime/qwen)       | Native audio I/O, video support             |
| [xAI Realtime](/integrations/realtime/xai)         | WebSocket, server VAD, web + X search       |
| [AWS Bedrock](/integrations/realtime/aws-bedrock)  | Amazon Nova models, auto session management |
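
The two categories above imply two agent shapes: a single Realtime plugin, or a Language Model paired with separate STT and TTS plugins. The following is a minimal sketch of that rule as a pre-flight check. `AgentPlan` and `missing_pieces` are hypothetical names used for illustration, not Vision Agents APIs.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class AgentPlan:
    """Hypothetical record of which provider fills each slot."""

    llm: Optional[str] = None
    stt: Optional[str] = None
    tts: Optional[str] = None
    realtime: Optional[str] = None


def missing_pieces(plan: AgentPlan) -> List[str]:
    """List the slots still needed for a working voice pipeline."""
    if plan.realtime is not None:
        # A Realtime plugin handles speech in and out by itself.
        return []
    # Otherwise the LLM needs separate STT and TTS plugins.
    return [slot for slot in ("llm", "stt", "tts") if getattr(plan, slot) is None]


if __name__ == "__main__":
    print(missing_pieces(AgentPlan(realtime="openai")))
    print(missing_pieces(AgentPlan(llm="gemini", stt="deepgram")))
```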

### Speech-to-Text

Real-time transcription. Some include built-in turn detection.

| Provider                                       | Notes                                                 |
| ---------------------------------------------- | ----------------------------------------------------- |
| [Deepgram](/integrations/stt/deepgram)         | Nova-3, built-in turn detection                       |
| [ElevenLabs](/integrations/stt/elevenlabs)     | Scribe v2, \~150ms latency, built-in VAD              |
| [AssemblyAI](/integrations/stt/assemblyai)     | Punctuation-based turn detection                      |
| [Cartesia](/integrations/stt/cartesia)         | Ink model, streaming PCM, turn detection              |
| [Fish Audio](/integrations/stt/fish)           | Auto language detection                               |
| [Mistral Voxtral](/integrations/stt/mistral)   | WebSocket streaming, requires separate turn detection |
| [Fast-Whisper](/integrations/stt/fast-whisper) | Local, CPU/GPU accelerated                            |
| [Wizper](/integrations/stt/wizper)             | Whisper v3, on-the-fly translation                    |

### Text-to-Speech

Voice synthesis for agent responses.

| Provider                                   | Notes                                 |
| ------------------------------------------ | ------------------------------------- |
| [ElevenLabs](/integrations/tts/elevenlabs) | Highly realistic, multilingual        |
| [Cartesia](/integrations/tts/cartesia)     | Low-latency Sonic model               |
| [Deepgram](/integrations/tts/deepgram)     | Aura-2, low-latency                   |
| [OpenAI](/integrations/tts/openai)         | gpt-4o-mini-tts, streaming            |
| [Fish Audio](/integrations/tts/fish)       | Prosody control, voice cloning        |
| [Inworld](/integrations/tts/inworld)       | Expressive game character voices      |
| [Kokoro](/integrations/tts/kokoro)         | Local, runs on CPU, no API key        |
| [Pocket TTS](/integrations/tts/pocket)     | Local, \~200ms latency, voice cloning |
| [xAI](/integrations/tts/xai)               | Five expressive voices, speech tags   |
| [AWS Polly](/integrations/tts/aws-polly)   | Standard and neural engines           |

### Vision & Video

Video understanding, object detection, and video transformation.

| Provider                                             | Notes                                         |
| ---------------------------------------------------- | --------------------------------------------- |
| [Moondream](/integrations/vision/moondream)          | Zero-shot detection, VQA, cloud or local      |
| [NVIDIA](/integrations/vision/nvidia)                | Cosmos Reason2, real-time video understanding |
| [Roboflow](/integrations/vision/roboflow)            | Pre-trained and custom detection models       |
| [Ultralytics YOLO](/integrations/vision/ultralytics) | Pose estimation, object detection             |
| [Decart](/integrations/vision/decart)                | Real-time AI video style transfer             |

### Avatars

Visual AI characters with real-time lip-sync.

| Provider                                       | Notes                                             |
| ---------------------------------------------- | ------------------------------------------------- |
| [Anam](/integrations/avatars/anam)             | Real-time conversational avatars                  |
| [LiveAvatar](/integrations/avatars/liveavatar) | Realistic AI avatars (HeyGen), automatic lip-sync |
| [LemonSlice](/integrations/avatars/lemonslice) | Real-time interactive avatars                     |

### Turn Detection

Controls when the agent should start and stop speaking.

| Provider                                              | Notes                             |
| ----------------------------------------------------- | --------------------------------- |
| [Smart Turn](/integrations/turn-detection/smart-turn) | Silero VAD + Whisper features     |
| [Vogent](/integrations/turn-detection/vogent)         | Neural turn completion prediction |

<Tip>
  [Deepgram](/integrations/stt/deepgram) and [ElevenLabs](/integrations/stt/elevenlabs) STT include built-in turn detection — no separate plugin needed.
</Tip>
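
The tip above can be expressed as a small lookup that decides whether a separate turn detection plugin is needed. This is only a sketch: the set contains just the two providers named in the tip, and the function name and the `"smart_turn"` label are illustrative placeholders rather than library identifiers.

```python
from typing import Optional

# Providers the tip above names as having built-in turn detection.
# Anything else is treated as needing a separate plugin (an assumption).
STT_WITH_BUILT_IN_TURN_DETECTION = frozenset({"deepgram", "elevenlabs"})


def pick_turn_detector(stt: str, fallback: str = "smart_turn") -> Optional[str]:
    """Return None when the STT handles turns itself, else a fallback label."""
    if stt.lower() in STT_WITH_BUILT_IN_TURN_DETECTION:
        return None
    return fallback


if __name__ == "__main__":
    for provider in ("deepgram", "mistral"):
        print(provider, "->", pick_turn_detector(provider))
```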

### Infrastructure

Inference platforms and data services for running models on your own terms.

| Provider                                                          | Notes                                                     |
| ----------------------------------------------------------------- | --------------------------------------------------------- |
| [Baseten](/integrations/infrastructure/baseten)                   | OpenAI-compatible endpoints for open-source models        |
| [HuggingFace Inference](/integrations/infrastructure/huggingface) | Unified API routing to Together, Groq, Cerebras, and more |
| [TurboPuffer](/integrations/infrastructure/turbopuffer)           | Vector database for RAG with hybrid search                |

### Edge Transport

Real-time transport layers that carry audio and video between users and your agent. Stream Video RTC is the default; the others cover local devices and region-specific deployment.

| Provider                                                   | Notes                                                                      |
| ---------------------------------------------------------- | -------------------------------------------------------------------------- |
| [Stream Video RTC](/integrations/edge-transport/getstream) | Default transport — global WebRTC, chat-backed conversation, frontend SDKs |
| [Local transport](/integrations/edge-transport/local)      | Microphone, speakers, and camera as the agent edge                         |
| [Tencent RTC](/integrations/edge-transport/tencent)        | Low-latency in China, frontend SDKs                                        |

## Consistent Interface

Plugins of the same type share a common interface — swap providers in one line:

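```python
# Import only the plugins whose extras you installed
from vision_agents.plugins import cartesia, deepgram, elevenlabs, fish
from vision_agents.plugins import gemini, kokoro, openai, openrouter

# Any STT plugin works the same way (each line is an alternative; keep one)
stt = deepgram.STT()
stt = elevenlabs.STT()
stt = fish.STT()

# Any TTS plugin works the same way
tts = elevenlabs.TTS()
tts = cartesia.TTS()
tts = kokoro.TTS()

# Any LLM plugin works the same way
llm = gemini.LLM("gemini-3-flash-preview")
llm = openai.LLM(model="gpt-5.4")
llm = openrouter.LLM(model="anthropic/claude-sonnet-4")
```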
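
To show why a shared interface makes swapping cheap, here is a self-contained sketch that selects a provider by name from configuration. Everything in it is hypothetical: the `SpeechToText` protocol, the stub classes, and the `transcribe` method are stand-ins for illustration and do not reflect the real plugin classes or their method names.

```python
from typing import Callable, Dict, Protocol


class SpeechToText(Protocol):
    """Hypothetical shared interface, NOT the real Vision Agents base class."""

    def transcribe(self, audio: bytes) -> str: ...


class StubDeepgramSTT:
    """Stand-in for a provider plugin; performs no real transcription."""

    def transcribe(self, audio: bytes) -> str:
        return f"[deepgram-stub] {len(audio)} bytes"


class StubElevenLabsSTT:
    """Second stand-in with the same method signature."""

    def transcribe(self, audio: bytes) -> str:
        return f"[elevenlabs-stub] {len(audio)} bytes"


# Maps a config string to a factory; swapping providers changes one key.
STT_FACTORIES: Dict[str, Callable[[], SpeechToText]] = {
    "deepgram": StubDeepgramSTT,
    "elevenlabs": StubElevenLabsSTT,
}


def make_stt(provider: str) -> SpeechToText:
    """Build the STT stub registered under `provider`."""
    try:
        factory = STT_FACTORIES[provider]
    except KeyError:
        known = ", ".join(sorted(STT_FACTORIES))
        raise ValueError(
            f"unknown STT provider {provider!r}; expected one of: {known}"
        ) from None
    return factory()


def handle_audio(stt: SpeechToText, audio: bytes) -> str:
    """Agent-side logic that depends only on the shared interface."""
    return stt.transcribe(audio)


if __name__ == "__main__":
    print(handle_audio(make_stt("deepgram"), b"\x00\x01\x02"))
```

Since `handle_audio` depends only on the protocol, changing the provider string is the only edit needed to switch implementations.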

## Creating Custom Plugins

Don't see your provider? Build your own plugin to connect additional services. See the [Create Your Own Plugin](/integrations/create-your-own-plugin) guide.