- WebRTC: Natively send realtime video at full FPS to LLMs over WebRTC, with no intervals or image snapshots required
- Interval-based processing: A Video Processor intercepts video frames at a set interval, runs them through custom ML models, and then forwards the output to the LLM for further processing.
The `Agent` class automatically handles much of this logic for you under the hood. Both Gemini Live and OpenAI Realtime support native WebRTC video by default, while LLMs configured with dedicated STT, TTS, and Processors will also automatically forward video frames. These capabilities are a great fit for applications across real-time coaching, manufacturing, healthcare, retail, virtual avatars, and more.
Building with OpenAI Realtime over WebRTC
Let’s get started by adding the dependencies required for our project. In this example, we assume you have a fresh Python project set up on Python 3.12 or newer. In this guide, we also use `uv` as our package manager of choice.
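As a rough sketch, setting up the project and adding the SDK with `uv` might look like the following; the package names here are assumptions, so check the Vision Agents docs for the exact install command.

```bash
# Package names are assumptions -- confirm against the Vision Agents docs.
uv init
uv add vision-agents python-dotenv
```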
Inside our `main.py` file, we can start by importing the packages required for our project:
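As a sketch, assuming the SDK exposes a core module and per-provider plugins (the exact import paths below are assumptions), the top of `main.py` might look like this:

```python
from dotenv import load_dotenv

# Import paths are assumptions -- check the Vision Agents docs for the exact modules.
from vision_agents.core import Agent, User
from vision_agents.plugins import getstream, openai

# Load the API keys from the .env file described below.
load_dotenv()
```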
Next, let’s set up the `.env` variables required for our sample. Since we are running the OpenAI model in this example, you will need the following in your `.env`:
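A minimal `.env` might look like the sketch below; `OPENAI_API_KEY` is the standard OpenAI variable, while the credentials for the edge layer are assumptions, so confirm the exact names in the docs.

```bash
# Variable names other than OPENAI_API_KEY are assumptions -- confirm in the docs.
OPENAI_API_KEY=sk-...
STREAM_API_KEY=...
STREAM_API_SECRET=...
```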
Next, we can give our `Agent` some basic instructions, configure our edge layer, and instantiate the LLM we are using:
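A sketch of that step is below; the constructor arguments (`edge`, `agent_user`, `instructions`, `llm`) and the `openai.Realtime()` plugin name are assumptions rather than the confirmed API, so treat this as illustrative.

```python
# Argument names and the openai.Realtime() plugin are assumptions -- treat this as
# illustrative rather than the exact API.
agent = Agent(
    edge=getstream.Edge(),                       # edge layer handling the WebRTC transport
    agent_user=User(name="Realtime Assistant"),  # identity the agent joins the call with
    instructions="You are a friendly assistant. Describe what you see and help the user.",
    llm=openai.Realtime(),                       # OpenAI Realtime model over WebRTC
)
```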
Rather than passing in instructions directly in the agent creation step, you can also use @mention syntax in the instructions string, like so:
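We won’t claim the exact form here; as a purely hypothetical illustration, the mention might pull in an external instructions file referenced directly in the string:

```python
# Hypothetical illustration of @mention syntax -- consult the docs for the real form.
agent = Agent(
    edge=getstream.Edge(),
    agent_user=User(name="Realtime Assistant"),
    instructions="Read @instructions.md and follow it when helping the user.",
    llm=openai.Realtime(),
)
```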
Building a custom Video AI pipeline
A powerful component of the Vision Agents SDK is the ability to connect realtime video to any external computer vision model or provider through our processor pipeline. Processors are special classes that let developers work directly with the raw frames. In this section, we will build an advanced video AI pipeline capable of detecting poses made by the user. For our processor, we will use the out-of-the-box integration with Ultralytics’ YOLO Pose Detection; however, as we discuss further in the Processors section, this approach can be used to integrate with any generic AI solution capable of processing images.

To get started, let’s make a few modifications to our original sample (a sketch of the updated agent follows the list):

- Instead of using the OpenAI Realtime model, we now use a standalone LLM, with `STT` and `TTS` broken out to use Deepgram and Cartesia directly.
- We pass a `YOLOPoseProcessor` in the `processors` list on the `Agent`.
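A sketch of the updated agent is below; the plugin namespaces (`deepgram`, `cartesia`, `ultralytics`), the constructor signatures, and the choice of `gpt-4o` as the standalone LLM are all assumptions, so adjust them to the actual plugin APIs.

```python
# Plugin names, constructor arguments, and the model choice are assumptions --
# check the Vision Agents docs for the exact API.
from vision_agents.plugins import cartesia, deepgram, getstream, openai, ultralytics

agent = Agent(
    edge=getstream.Edge(),
    agent_user=User(name="Pose Coach"),
    instructions="Watch the user's form and give short, encouraging feedback.",
    llm=openai.LLM("gpt-4o"),        # standalone LLM instead of the Realtime model
    stt=deepgram.STT(),              # speech-to-text broken out to Deepgram
    tts=cartesia.TTS(),              # text-to-speech broken out to Cartesia
    processors=[ultralytics.YOLOPoseProcessor()],  # pose detection on the raw frames
)
```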
Don’t forget to also update your `.env` with the keys for Cartesia and Deepgram. Both are free for developers and can be found on their respective dashboards.

In this example, we pass in only a single processor; however, it is possible to pass in multiple and chain them together. Processors are also not limited to video: they can handle audio as well, allowing you to manipulate the user’s audio too.
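As a purely hypothetical illustration of chaining, extra processors are simply added to the same list; `NoiseGateProcessor` below is a made-up audio processor used only to show the shape.

```python
# NoiseGateProcessor is hypothetical -- only YOLOPoseProcessor appears in this guide.
agent = Agent(
    edge=getstream.Edge(),
    agent_user=User(name="Pose Coach"),
    instructions="Watch the user's form and give short, encouraging feedback.",
    llm=openai.LLM("gpt-4o"),
    stt=deepgram.STT(),
    tts=cartesia.TTS(),
    processors=[
        ultralytics.YOLOPoseProcessor(),  # video: pose detection on each frame
        NoiseGateProcessor(),             # audio: hypothetical noise-gate step
    ],
)
```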
For more on Processors, LLMs, and Realtime, check out some of the other guides in our docs. Building something with Vision Agents? Tell us about it; we love seeing (and sharing) projects from the community.