OpenAI Realtime is a low-latency API that combines real-time video analysis, transcription, text-to-speech synthesis, and more into a single streamlined pipeline. The OpenAI Realtime plugin in the Vision Agents SDK is a native integration for realtime video and audio, with out-of-the-box support for OpenAI’s realtime models. With it, you can stream both video and audio to OpenAI over WebRTC and receive responses in real time. It also supports MCP and function calling, so agents can take actions on your behalf. This makes it ideal for building conversational agents, AI avatars, fitness coaches, visual accessibility assistants, remote support tools with visual guidance, interactive tutors, and much more!

Installation

Install the Stream OpenAI plugin with:
uv add "vision-agents[openai]"

Tutorials

The Voice AI quickstart and Video AI quickstart pages have examples to get you up and running.

Example

Check out our OpenAI example to see a practical implementation of the plugin and get inspiration for your own projects, or read on for some key details.

Usage

The OpenAI Realtime plugin is used as the LLM component of an Agent. Here’s a complete example:
from uuid import uuid4

from vision_agents.plugins import openai, getstream
from vision_agents.core.agents import Agent
from getstream import AsyncStream

# Create Stream client and user
client = AsyncStream()
agent_user = await client.create_user(name="AI Assistant")

# Create agent with OpenAI Realtime
agent = Agent(
    edge=getstream.Edge(),
    agent_user=agent_user,
    instructions="You are a helpful voice assistant.",
    llm=openai.Realtime(model="gpt-realtime", voice="marin", fps=1),
    processors=[]
)

# Create and join a call
call_id = str(uuid4())
call = client.video.call("default", call_id)
await call.get_or_create(data={"created_by_id": agent.agent_user.id})

with await agent.join(call):
    # Ask the model to greet the user
    await agent.llm.simple_response(text="Please greet the user.")
    # Keep running until call ends
    await agent.finish()
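
The example above uses await at the top level for readability. In a standalone script you would normally wrap the same steps in an async entrypoint and run it with asyncio; the sketch below just repackages the code above in that form:

import asyncio
from uuid import uuid4

from vision_agents.plugins import openai, getstream
from vision_agents.core.agents import Agent
from getstream import AsyncStream


async def main() -> None:
    # Same setup as the example above, wrapped in a coroutine
    client = AsyncStream()
    agent_user = await client.create_user(name="AI Assistant")

    agent = Agent(
        edge=getstream.Edge(),
        agent_user=agent_user,
        instructions="You are a helpful voice assistant.",
        llm=openai.Realtime(model="gpt-realtime", voice="marin", fps=1),
        processors=[],
    )

    call = client.video.call("default", str(uuid4()))
    await call.get_or_create(data={"created_by_id": agent.agent_user.id})

    with await agent.join(call):
        await agent.llm.simple_response(text="Please greet the user.")
        await agent.finish()


if __name__ == "__main__":
    asyncio.run(main())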

Parameters

These are the parameters available in the OpenAI Realtime plugin:
Name    Type   Default          Description
model   str    "gpt-realtime"   The OpenAI model to use for speech-to-speech. Supports real-time models only.
voice   str    "marin"          The voice to use for spoken responses (e.g., "marin", "alloy", "echo").
fps     int    1                Number of video frames per second to send (for video-enabled agents).
The API key is read from the OPENAI_API_KEY environment variable. Instructions are set via the Agent’s instructions parameter.
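
For example, with the key exported in your shell, a customized Realtime LLM could be configured like this (a minimal sketch; the voice and fps values are only illustrative choices):

import os

from vision_agents.plugins import openai

# The plugin reads the key from the environment, e.g. `export OPENAI_API_KEY=...` in your shell
assert "OPENAI_API_KEY" in os.environ, "Set OPENAI_API_KEY before creating the Realtime LLM"

# Lower fps reduces bandwidth and cost; higher fps gives the model a fresher view of the video
llm = openai.Realtime(model="gpt-realtime", voice="alloy", fps=2)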

Methods

connect()

Establishes the WebRTC connection to OpenAI’s Realtime API. This is called automatically when the agent joins a call and should not be called directly in most cases.
await agent.llm.connect()

simple_response(text)

Sends a text message to the OpenAI Realtime session. The model will respond with audio output.
await agent.llm.simple_response(text="What do you see in the video?")

simple_audio_response(pcm_data)

Sends raw PCM audio data to OpenAI. Audio should be 48 kHz, 16-bit PCM format.
await agent.llm.simple_audio_response(pcm_data)
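
As a rough illustration of the expected format, the snippet below generates one second of 48 kHz, 16-bit mono PCM with numpy and sends it. The sine wave is only a stand-in for real microphone audio, and it assumes simple_audio_response accepts the raw PCM bytes directly:

import numpy as np

# One second of a 440 Hz tone: 48 kHz sample rate, mono, 16-bit signed PCM
sample_rate = 48_000
t = np.arange(sample_rate) / sample_rate
samples = (0.2 * np.sin(2 * np.pi * 440 * t) * 32767).astype(np.int16)
pcm_data = samples.tobytes()

# Assumption: raw little-endian PCM bytes are accepted; convert as your SDK version requires
await agent.llm.simple_audio_response(pcm_data)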

request_session_info()

Requests session information from the OpenAI API.
await agent.llm.request_session_info()

Properties

output_track

The output_track property provides access to the audio output stream from OpenAI. This is an AudioStreamTrack that contains the synthesized speech responses.
audio_track = agent.llm.output_track
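
If you need to consume the synthesized audio yourself (for recording or custom playback), the track can be read frame by frame. This sketch assumes the aiortc-style MediaStreamTrack interface, where recv() yields audio frames asynchronously:

audio_track = agent.llm.output_track

async def drain_audio():
    # Pull synthesized speech frames as they arrive; here we just log their size
    while True:
        frame = await audio_track.recv()
        print(f"Got audio frame with {frame.samples} samples")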

is_connected

Returns True if the realtime session is currently active.
if agent.llm.is_connected:
    print("Connected to OpenAI Realtime API")

Function Calling

You can give the model the ability to call functions in your code while using the Realtime plugin via the main Agent class. Follow the instructions in the MCP tool calling guide, replacing the LLM with the OpenAI Realtime class.

Events

The OpenAI Realtime plugin emits various events during conversations that you can subscribe to. The plugin wraps OpenAI’s native events into a strongly-typed event system with better ergonomics.
from vision_agents.core.llm.events import (
    RealtimeConnectedEvent,
    RealtimeResponseEvent,
    RealtimeTranscriptEvent,
    RealtimeAudioOutputEvent,
    RealtimeErrorEvent
)

# Subscribe to events
@agent.llm.events.on(RealtimeConnectedEvent)
async def on_connected(event: RealtimeConnectedEvent):
    print(f"Connected! Session ID: {event.session_id}")
    print(f"Capabilities: {event.capabilities}")

@agent.llm.events.on(RealtimeTranscriptEvent)
async def on_transcript(event: RealtimeTranscriptEvent):
    print(f"Transcript: {event.text}")
    print(f"Role: {event.user_metadata.get('role')}")

@agent.llm.events.on(RealtimeResponseEvent)
async def on_response(event: RealtimeResponseEvent):
    print(f"Response: {event.text}")
    print(f"Complete: {event.is_complete}")

@agent.llm.events.on(RealtimeAudioOutputEvent)
async def on_audio_output(event: RealtimeAudioOutputEvent):
    # Handle audio output, e.g. record it or forward it elsewhere
    audio_data = event.audio_data
    sample_rate = event.sample_rate
    print(f"Received audio chunk at {sample_rate} Hz")
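
RealtimeErrorEvent is imported above but not shown; a minimal handler might simply log the event. The exact fields on the event are not listed here, so this sketch just prints the event object:

@agent.llm.events.on(RealtimeErrorEvent)
async def on_error(event: RealtimeErrorEvent):
    # Log errors from the realtime session; inspect the event for details
    print(f"Realtime error: {event}")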