# Video Agents

> Build video AI agents with realtime models, VLMs, and computer vision processors

Build real-time video AI agents that process video with computer vision models, analyze frames with VLMs, or stream directly to realtime models. [Deploy to production](/guides/deployment) with [built-in metrics](/core/telemetry).

<Prompt description="Copy this prompt into Claude Code, Cursor, Windsurf, or any coding agent to scaffold your project." actions={["copy", "cursor"]}>
  {`Create a Python project for a Vision Agents video AI agent using uv and Python 3.12.

    Steps:
    1. Initialize: uv init && uv add "vision-agents[getstream,gemini,ultralytics]" python-dotenv
    2. Create .env with: STREAM_API_KEY, STREAM_API_SECRET (from getstream.io), GOOGLE_API_KEY (from aistudio.google.com)
    3. Create main.py:

    from dotenv import load_dotenv
    from vision_agents.core import Agent, AgentLauncher, User, Runner
    from vision_agents.plugins import getstream, gemini, ultralytics

    load_dotenv()

    async def create_agent(**kwargs) -> Agent:
      return Agent(
          edge=getstream.Edge(),
          agent_user=User(name="Coach", id="agent"),
          instructions="Analyze what you see on camera and provide real-time feedback on the user's form and technique.",
          llm=gemini.Realtime(fps=3),
          processors=[
              ultralytics.YOLOPoseProcessor(model_path="yolo26n-pose.pt")
          ],
      )

    async def join_call(agent: Agent, call_type: str, call_id: str, **kwargs) -> None:
      call = await agent.create_call(call_type, call_id)
      async with agent.join(call):
          await agent.simple_response("Greet the user and let them know you can see them")
          await agent.finish()

    if __name__ == "__main__":
      Runner(AgentLauncher(create_agent=create_agent, join_call=join_call)).cli()

    4. Run with: uv run main.py run

    Reference docs: https://visionagents.ai
    MCP server: https://visionagents.ai/mcp
    Skill.md: https://visionagents.ai/skill.md`}
</Prompt>

<Info>
  Vision Agents requires a [Stream](https://getstream.io/try-for-free/) account for real-time transport. Stream offers 333,000 free participant minutes monthly, plus additional credits through the [Maker Program](https://getstream.io/chat/pricing/#free-for-maker) for indie developers. Most AI providers also offer free tiers.
</Info>

**Prerequisites:** Complete the [Quickstart](/introduction/quickstart) first.

## Three Approaches

| Mode                | Best For                      | How It Works                                |
| ------------------- | ----------------------------- | ------------------------------------------- |
| **Realtime Models** | Lowest latency, native video  | WebRTC/WebSocket direct to OpenAI or Gemini |
| **VLMs**            | Video understanding, analysis | Frame buffering + chat completions API      |
| **Processors**      | Computer vision, detection    | Custom ML pipelines alongside the LLM       |

## Realtime Mode

Stream video directly to models with native vision support. The `fps` parameter controls how many frames per second are sent to the model:

```python
from dotenv import load_dotenv

from vision_agents.core import Agent, AgentLauncher, User, Runner
from vision_agents.plugins import getstream, gemini

load_dotenv()


async def create_agent(**kwargs) -> Agent:
    return Agent(
        edge=getstream.Edge(),
        agent_user=User(name="Assistant", id="agent"),
        instructions="Describe what you see. Be concise.",
        llm=gemini.Realtime(fps=3),
    )


async def join_call(agent: Agent, call_type: str, call_id: str, **kwargs) -> None:
    call = await agent.create_call(call_type, call_id)
    async with agent.join(call):
        await agent.simple_response("What do you see?")
        await agent.finish()


if __name__ == "__main__":
    Runner(AgentLauncher(create_agent=create_agent, join_call=join_call)).cli()
```

Swap providers in one line:

```python
# Import the matching plugin, then change only the llm argument:
llm=openai.Realtime(fps=3)   # OpenAI
llm=gemini.Realtime(fps=3)   # Gemini
llm=qwen.Realtime(fps=1)     # Qwen3-Omni
```
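
To make the effect of `fps` concrete, here is a minimal sketch of frame throttling. It is not the library's implementation: `FrameThrottle` and `should_forward` are hypothetical names, and the sketch only assumes that frames arriving sooner than `1 / fps` seconds after the last forwarded frame are dropped.

```python
from dataclasses import dataclass, field


@dataclass
class FrameThrottle:
    """Hypothetical helper: forwards at most `fps` frames per second."""

    fps: float
    # Earliest timestamp (in seconds) at which the next frame may be forwarded.
    _next_due: float = field(default=0.0, init=False)

    def should_forward(self, timestamp: float) -> bool:
        if timestamp < self._next_due:
            return False  # too soon after the last forwarded frame
        self._next_due = timestamp + 1.0 / self.fps
        return True


# Simulate two seconds of a 30 fps camera feed.
camera_timestamps = [i / 30 for i in range(60)]
throttle = FrameThrottle(fps=3)
forwarded = [t for t in camera_timestamps if throttle.should_forward(t)]
print(len(forwarded))  # 6 of the 60 camera frames are forwarded
```

Under this assumption, a lower `fps` sends fewer frames to the model, and a higher value gives it finer temporal detail.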

## Vision Language Models (VLMs)

For video understanding and analysis, use a VLM that supports the chat completions API. Vision Agents automatically buffers recent frames and includes them with each request. Install the plugins used in this example:

```bash
uv add "vision-agents[nvidia,deepgram,elevenlabs]"
```

Add to your `.env`:

```bash
NVIDIA_API_KEY=your_nvidia_api_key
DEEPGRAM_API_KEY=your_deepgram_api_key
ELEVENLABS_API_KEY=your_elevenlabs_api_key
```

Then create the agent:

```python
from dotenv import load_dotenv

from vision_agents.core import Agent, AgentLauncher, User, Runner
from vision_agents.plugins import nvidia, getstream, deepgram, elevenlabs

load_dotenv()


async def create_agent(**kwargs) -> Agent:
    return Agent(
        edge=getstream.Edge(),
        agent_user=User(name="Assistant", id="agent"),
        instructions="Analyze the video and answer questions.",
        llm=nvidia.VLM(
            model="nvidia/cosmos-reason2-8b",
            fps=1,
            frame_buffer_seconds=10,
        ),
        stt=deepgram.STT(),
        tts=elevenlabs.TTS(),
    )


async def join_call(agent: Agent, call_type: str, call_id: str, **kwargs) -> None:
    call = await agent.create_call(call_type, call_id)
    async with agent.join(call):
        await agent.simple_response("Describe what you see")
        await agent.finish()


if __name__ == "__main__":
    Runner(AgentLauncher(create_agent=create_agent, join_call=join_call)).cli()
```
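
The `fps` and `frame_buffer_seconds` arguments in the example above suggest a rolling window of sampled frames. The sketch below shows one way such a buffer could behave. It is an assumption about the semantics, not the library's code, and `RollingFrameBuffer` with its `add` and `snapshot` methods is a hypothetical name chosen for illustration.

```python
from collections import deque


class RollingFrameBuffer:
    """Hypothetical sketch: keep sampled frames from the last `window_seconds`."""

    def __init__(self, fps: float, window_seconds: float) -> None:
        self.min_interval = 1.0 / fps
        self.window_seconds = window_seconds
        self._frames: deque[tuple[float, bytes]] = deque()

    def add(self, timestamp: float, frame: bytes) -> None:
        # Sample: skip frames arriving sooner than 1/fps after the last kept one.
        if self._frames and timestamp - self._frames[-1][0] < self.min_interval:
            return
        self._frames.append((timestamp, frame))
        # Evict frames that have fallen out of the time window.
        while timestamp - self._frames[0][0] >= self.window_seconds:
            self._frames.popleft()

    def snapshot(self) -> list[bytes]:
        """Return the buffered frames, oldest first, e.g. to attach to a request."""
        return [frame for _, frame in self._frames]


buffer = RollingFrameBuffer(fps=1, window_seconds=10)
for second in range(25):
    buffer.add(float(second), f"frame-{second}".encode())

frames = buffer.snapshot()
print(len(frames), frames[0], frames[-1])  # 10 b'frame-15' b'frame-24'
```

Under this reading, `fps=1` with `frame_buffer_seconds=10` means each request carries at most ten frames, so raising either value increases the payload per request.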

Supported VLM providers:

| Provider                                                    | Use Case                                            |
| ----------------------------------------------------------- | --------------------------------------------------- |
| **[NVIDIA](/integrations/vision/nvidia)**                   | Cosmos Reason 2 for advanced video reasoning        |
| **[HuggingFace](/integrations/infrastructure/huggingface)** | Open-source VLMs (Qwen2-VL, etc.) via inference API |
| **[OpenRouter](/integrations/llm/openrouter)**              | Unified access to Claude, Gemini, and more          |

## Video Processors

For computer vision tasks like object detection, pose estimation, or custom ML models, use processors. They intercept video frames, run inference, and forward results to the LLM.

```bash
uv add "vision-agents[ultralytics]"
```

```python
from dotenv import load_dotenv

from vision_agents.core import Agent, AgentLauncher, User, Runner
from vision_agents.plugins import getstream, gemini, ultralytics

load_dotenv()


async def create_agent(**kwargs) -> Agent:
    return Agent(
        edge=getstream.Edge(),
        agent_user=User(name="Golf Coach", id="agent"),
        instructions="Analyze the user's golf swing and provide feedback.",
        llm=gemini.Realtime(fps=3),
        processors=[
            ultralytics.YOLOPoseProcessor(model_path="yolo26n-pose.pt")
        ],
    )


async def join_call(agent: Agent, call_type: str, call_id: str, **kwargs) -> None:
    call = await agent.create_call(call_type, call_id)
    async with agent.join(call):
        await agent.simple_response("Say hi and offer to analyze their swing")
        await agent.finish()


if __name__ == "__main__":
    Runner(AgentLauncher(create_agent=create_agent, join_call=join_call)).cli()
```

Available processors:

| Processor            | What It Does                                    |
| -------------------- | ----------------------------------------------- |
| **Ultralytics YOLO** | Object detection, pose estimation, segmentation |
| **Roboflow**         | Cloud or local detection with RF-DETR           |
| **Custom**           | Extend `VideoProcessor` for any ML model        |

Processors can be chained: run detection first, then pass the annotated frames to the LLM.
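
The chaining idea can be sketched without the library. Everything below is hypothetical: `Frame`, `FrameProcessor`, `FakeDetector`, `DetectionCounter`, and `run_chain` are illustration-only names, and the `process` method is an assumed interface rather than the actual `VideoProcessor` API.

```python
from dataclasses import dataclass, field
from typing import Protocol


@dataclass
class Frame:
    """Stand-in for a video frame; a real one would hold pixel data."""

    index: int
    annotations: list[str] = field(default_factory=list)


class FrameProcessor(Protocol):
    """Hypothetical interface, not the library's `VideoProcessor` API."""

    def process(self, frame: Frame) -> Frame: ...


class FakeDetector:
    """Pretends to detect a person on every even-numbered frame."""

    def process(self, frame: Frame) -> Frame:
        if frame.index % 2 == 0:
            frame.annotations.append("person")
        return frame


class DetectionCounter:
    """Runs after the detector and records how many detections it found."""

    def process(self, frame: Frame) -> Frame:
        frame.annotations.append(f"detections={len(frame.annotations)}")
        return frame


def run_chain(processors: list[FrameProcessor], frame: Frame) -> Frame:
    # Each processor receives the previous processor's output.
    for processor in processors:
        frame = processor.process(frame)
    return frame


chain = [FakeDetector(), DetectionCounter()]
result = run_chain(chain, Frame(index=0))
print(result.annotations)  # ['person', 'detections=1']
```

Order matters in a chain like this: the counter only sees detections because the detector runs before it.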

## Custom Pipeline with VLM

Combine VLMs with separate STT and TTS for full control:

```python
from vision_agents.core import Agent, User
from vision_agents.plugins import huggingface, getstream, deepgram, elevenlabs

# Inside create_agent:
agent = Agent(
    edge=getstream.Edge(),
    agent_user=User(name="Assistant", id="agent"),
    instructions="You're a visual assistant.",
    llm=huggingface.VLM(
        model="Qwen/Qwen2-VL-7B-Instruct",
        fps=1,
        frame_buffer_seconds=10,
    ),
    stt=deepgram.STT(),
    tts=elevenlabs.TTS(),
)
```

## What's Next

<CardGroup cols={2}>
  <Card title="Video Processors" icon="eye" href="/guides/video-processors">
    Build custom detection and analysis pipelines
  </Card>

  <Card title="Docker Deployment" icon="docker" href="/guides/deployment">
    Docker setup and environment configuration
  </Card>
</CardGroup>

## Examples

* [Golf Coach](/examples/golf-coach) — Realtime pose detection + coaching
* [Security Camera](/examples/security-camera) — Face recognition + package detection
* [Football Commentator](/examples/football-commentator) — Object detection + live commentary
