In addition to voice agents, which we discussed in the previous section, developers can also build fast, realtime video AI applications using Vision Agents. Video AI agents on Stream can be configured in two main ways:
  • WebRTC: Natively send realtime video at full FPS to LLM models over WebRTC, no intervals or images necessary
  • Interval-based processing: A Video Processor intercepts video frames at a set interval, runs them through custom ML models, and then forwards the results to the LLM for further processing.
Like voice agents, the Agent class automatically handles a lot of this logic for you under the hood. Both Gemini Live and OpenAI Realtime support native WebRTC video by default, while LLMs configured with dedicated STT, TTS, and Processors will also automatically forward video frames. Video agents are a great fit for applications such as real-time coaching, manufacturing, healthcare, retail, virtual avatars, and more.

Building with OpenAI Realtime over WebRTC

Let’s get started by adding the dependencies required for our project. This example assumes a fresh Python project set up with Python 3.12 or newer. In this guide, we also use uv as our package manager of choice.
# Initialize a project in your working directory
uv init

uv add "vision-agents[getstream, openai]"
Next, in our main.py file, we can start by importing the packages required for our project:
import asyncio
import logging
from uuid import uuid4

from dotenv import load_dotenv

from vision_agents.core.edge.types import User
from vision_agents.plugins import getstream, openai
from vision_agents.core import agents, cli

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

load_dotenv()
This sets up some basic logging and loads in the .env variables required for our sample. Since we are running the OpenAI model in this example, you will need to have the following in your .env:
# Stream API credentials
STREAM_API_KEY=
STREAM_API_SECRET=

# OpenAI
OPENAI_API_KEY=
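Optionally, you can fail fast when a key is missing. The snippet below is not part of the sample, just a small guard you could add right after load_dotenv():
import os

# Optional guard: fail early if a required key is missing from the environment.
for key in ("STREAM_API_KEY", "STREAM_API_SECRET", "OPENAI_API_KEY"):
    if not os.environ.get(key):
        raise RuntimeError(f"Missing required environment variable: {key}")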
Next, let’s set up the Agent with some basic instructions, configure our edge layer, and instantiate the LLM we are using:
async def start_agent() -> None:
    # create an agent to run with Stream's edge, OpenAI llm
    agent = agents.Agent(
        edge=getstream.Edge(),  # low latency edge. clients for React, iOS, Android, RN, Flutter etc.
        agent_user=User(name="My happy AI friend", id="agent"),  # the user object for the agent (name, image etc)
        instructions="You're a video AI assistant. Keep responses short and conversational. Don't use special characters or formatting. Be friendly and helpful. Your main job is to describe the world you see to the user. Make it fun!",
        llm=openai.Realtime(),
    )

    # Create a call
    call = agent.edge.client.video.call("default", str(uuid4()))

    # Open the demo UI
    agent.edge.open_demo(call)

    # Have the agent join the call/room
    with await agent.join(call):
        await agent.llm.simple_response("Tell me what you see in the frame")
        await agent.finish()


if __name__ == "__main__":
    asyncio.run(cli.start_dispatcher(start_agent))
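With the file in place, you can run the agent from the project directory (assuming your .env is populated; cli.start_dispatcher provides the entry point when the script is executed directly):
# Run the agent from the project root
uv run main.py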
Rather than passing in instructions directly in the agent creation step, you can also use @mention syntax in the instructions string, like so:
instructions="Read @voice-agent-instructions.md"
Since we are using OpenAI directly over WebRTC, we automatically benefit from OpenAI’s voices, turn detection, and more. Under the hood, the raw WebRTC tracks go directly to OpenAI; there are no intermediate processing steps. The result is an LLM that can see and hear the world around you and respond to the user with minimal delay. This approach is fantastic for building games and applications where advanced image processing isn’t needed before the model. In the next section, we will look at building an advanced video AI pipeline that does interval processing before the model.
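Before we do, note that swapping the realtime provider is a one-line change, since Gemini Live also supports native WebRTC video. The sketch below is a minimal example, assuming the gemini plugin exposes a Realtime() constructor analogous to openai.Realtime() and that the corresponding Gemini API key is present in your .env:
# Same Agent wiring, different realtime provider (assumes gemini.Realtime() mirrors openai.Realtime()).
from vision_agents.plugins import getstream, gemini

agent = agents.Agent(
    edge=getstream.Edge(),
    agent_user=User(name="My happy AI friend", id="agent"),
    instructions="You're a video AI assistant. Keep responses short and conversational. Describe the world you see to the user and make it fun!",
    llm=gemini.Realtime(),  # Gemini Live over WebRTC instead of OpenAI Realtime
)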

Building a custom Video AI pipeline

A powerful component of the Vision Agents SDK is the ability to stream realtime video to any external computer vision model or provider through our processor pipeline. Processors are special classes that allow developers to interact directly with the raw frames. In this section, we will look at building an advanced video AI pipeline capable of detecting poses made by the user. For our processor, we will use the out-of-the-box integration with Ultralytics’ YOLO Pose Detection; however, as we will discuss further in the Processors section, the same approach can be used to integrate with any generic AI solution capable of processing images. To get started, let’s make a few modifications to our original sample:
from vision_agents.core.processors import YOLOPoseProcessor  # add import for the YOLO pose processor
from vision_agents.plugins import deepgram, cartesia  # add imports for the Deepgram and Cartesia plugins

    agent = agents.Agent(
        edge=getstream.Edge(),
        agent_user=User(name="My happy AI friend", id="agent"),  # the user object for the agent (name, image etc)
        instructions="You're a video AI assistant built to detect poses. Keep responses short and conversational. Don't use special characters or formatting. Be friendly and helpful. Make it fun!",
        llm=openai.LLM(),
        stt=deepgram.STT(), 
        tts=cartesia.TTS(),
        processors=[
            YOLOPoseProcessor()
        ],  # processors can fetch extra data, check images/audio data or transform video
    )
In the above snippet, we made a few changes to the code:
  1. Instead of using the OpenAI Realtime model, we are now using the standard openai.LLM class.
  2. STT and TTS are broken out to use Deepgram and Cartesia directly
  3. We pass in YOLOPoseProcessor to the processors list on the Agent.
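Because processors operate on raw frames, the same slot can also host your own model instead of (or alongside) the built-in YOLO integration. The sketch below is purely illustrative: the class name, method name, and return shape are assumptions rather than the SDK's actual processor interface, which is documented in the Processors section.
import numpy as np

class SharpnessProcessor:
    """Hypothetical processor that flags frames too blurry to be worth forwarding."""

    def process_frame(self, frame: np.ndarray) -> dict:
        # Crude sharpness proxy: variance of row-to-row pixel differences.
        gray = frame.mean(axis=2) if frame.ndim == 3 else frame
        sharpness = float(np.var(np.diff(gray, axis=0)))
        return {"blurry": sharpness < 50.0, "sharpness": sharpness}
In a real integration, a class like this would implement the SDK's processor interface and be passed to the processors list on the Agent.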
Don’t forget to also update your .env with API keys for Deepgram and Cartesia. Both offer free developer plans, and the keys can be found on their respective dashboards.
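Assuming the plugins read the providers’ standard environment variable names, the additions to your .env would look something like this:
# Deepgram
DEEPGRAM_API_KEY=

# Cartesia
CARTESIA_API_KEY=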
In this example, we are only using one processor; however, it is possible to pass in multiple processors and chain them together. Processors are also not limited to video: they can process audio as well, allowing you to manipulate the user’s audio stream. For more on Processors, LLMs, and Realtime, check out the other guides in our docs. Building something with Vision Agents? Tell us about it; we love seeing (and sharing) projects from the community.