Realtime Class

The Realtime component provides end-to-end speech-to-speech communication, combining STT, LLM, and TTS functionality in a single, optimized interface. It delivers ultra-low latency speech processing, direct audio streaming without intermediate text conversion, provider-specific optimizations, and support for multiple modalities (audio, video, text).

Overview

The Realtime class is an abstract base class that enables real-time AI communication through various providers. It eliminates the need for separate STT and TTS services by handling speech-to-speech communication directly, resulting in lower latency and more natural conversations.

Supported Providers

  • OpenAI Realtime API: WebRTC-based real-time communication with GPT models
  • Google Gemini Live: Native audio processing with multimodal capabilities

Basic Usage

from vision_agents.plugins import openai, gemini, getstream
from vision_agents.core.agents import Agent
from vision_agents.core.edge.types import User

# OpenAI Realtime
llm = openai.Realtime(model="gpt-realtime", voice="marin")

# Gemini Live
llm = gemini.Realtime(model="gemini-2.5-flash-native-audio-preview-09-2025")

# Use with Agent
agent = Agent(
    edge=getstream.Edge(),
    agent_user=User(name="AI Assistant", id="agent"),
    instructions="You're a helpful voice assistant",
    llm=llm,  # Realtime LLM replaces STT/TTS
    processors=[]
)

Abstract Base Class

Core Methods

async connect()

Establishes a connection to the realtime provider. Each provider must implement this method.
await llm.connect()

async simple_audio_response(pcm: PcmData)

Sends audio data to the realtime provider for processing.
await llm.simple_audio_response(audio_pcm_data)

async simple_response(text: str, processors=None, participant=None)

Sends a text message to the realtime provider.
await llm.simple_response("Hello, how can I help you?")

async close()

Closes the realtime connection and cleans up resources.
await llm.close()
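
Taken together, these methods form a simple lifecycle: connect, exchange audio or text, then close. A minimal sketch, assuming llm and audio_pcm_data from the earlier examples:

await llm.connect()

# Send captured user audio, or a plain text turn, to the provider
await llm.simple_audio_response(audio_pcm_data)
await llm.simple_response("Hello, how can I help you?")

# Release the connection when the session ends
await llm.close()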

Properties

is_connected: bool

Returns True if the realtime session is currently active.

output_track: AudioStreamTrack

WebRTC audio track for outputting synthesized speech.

fps: int

Frames per second for video processing (default: 1).
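
These properties can be inspected at runtime. A brief sketch using the attribute names listed above:

if llm.is_connected:
    print(f"Session active, video sampled at {llm.fps} fps")
    # output_track carries the synthesized speech; the Agent normally
    # publishes it to the call on your behalf
    track = llm.output_track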

Provider Implementations

OpenAI Realtime

from vision_agents.plugins.openai import Realtime

llm = Realtime(
    model="gpt-realtime",
    voice="marin",
    fps=1
)

Gemini Live

from vision_agents.plugins.gemini import Realtime

llm = Realtime(
    model="gemini-2.5-flash-native-audio-preview-09-2025",
    api_key="your_google_api_key",
    config={
        "response_modalities": ["AUDIO"],
        "speech_config": {
            "voice_config": {
                "prebuilt_voice_config": {
                    "voice_name": "Leda"
                }
            }
        }
    }
)

Event System

The Realtime class emits various events for monitoring and integration:

Connection Events

  • RealtimeConnectedEvent: Emitted when connection is established
  • RealtimeDisconnectedEvent: Emitted when connection is lost

Audio Events

  • RealtimeAudioInputEvent: Audio data received from user
  • RealtimeAudioOutputEvent: Audio data sent to user

Transcript Events

  • RealtimeTranscriptEvent: Final transcript of user speech
  • RealtimePartialTranscriptEvent: Partial transcript during speech

Response Events

  • RealtimeResponseEvent: Complete response from AI
  • StandardizedTextDeltaEvent: Streaming text deltas

Error Events

  • RealtimeErrorEvent: Errors during processing

Example Event Handling

@llm.events.subscribe
async def on_connected(event: RealtimeConnectedEvent):
    print(f"Connected to {event.provider}")

@llm.events.subscribe
async def on_transcript(event: RealtimeTranscriptEvent):
    print(f"User said: {event.text}")

@llm.events.subscribe
async def on_response(event: RealtimeResponseEvent):
    print(f"AI responded: {event.text}")

Video Support

Video Processing

Some providers support video input for multimodal interactions:
# Watch a video track (provider-specific)
await llm._watch_video_track(video_track)

# Stop watching video
await llm._stop_watching_video_track()

Configuration

Provider-Specific Settings

OpenAI Realtime

llm = Realtime(
    model="gpt-realtime",
    voice="marin",
    fps=1,  # Video frames per second
    instructions="You are a helpful assistant"
)

Gemini Live

llm = Realtime(
    model="gemini-2.5-flash-native-audio-preview-09-2025",
    config={
        "response_modalities": ["AUDIO"],
        "speech_config": {
            "voice_config": {
                "prebuilt_voice_config": {
                    "voice_name": "Leda"
                }
            },
            "language_code": "en-US"
        },
        "enable_affective_dialog": True
    }
)

Integration with Agent

Agent Configuration

When using Realtime with an Agent, separate STT, TTS, and VAD services are not needed:
agent = Agent(
    edge=getstream.Edge(),
    agent_user=User(name="AI Assistant", id="agent"),
    instructions="You're a helpful voice assistant",
    llm=openai.Realtime(),  # Realtime handles audio
    # No STT, TTS, or VAD needed
    processors=[]
)

Automatic Mode Detection

The Agent automatically detects Realtime mode and adjusts behavior accordingly:
if agent.realtime_mode:
    # Realtime mode - direct audio processing
    pass
else:
    # Traditional mode - STT → LLM → TTS
    pass

Complete Realtime Example

import asyncio
from vision_agents.core.agents import Agent
from vision_agents.plugins import openai, getstream
from vision_agents.core.edge.types import User
from vision_agents.core.events import CallSessionParticipantJoinedEvent

async def main():
    # Create Realtime LLM
    llm = openai.Realtime(
        model="gpt-realtime",
        voice="marin"
    )
    
    # Create agent with Realtime LLM
    agent = Agent(
        edge=getstream.Edge(),
        agent_user=User(name="AI Assistant", id="agent"),
        instructions="You're a helpful voice assistant. Keep responses conversational and natural.",
        llm=llm,
        processors=[]
    )
    
    # Set up event handlers
    @agent.subscribe
    async def on_participant_joined(event: CallSessionParticipantJoinedEvent):
        await agent.simple_response(f"Hello {event.participant.user.name}!")
    
    # Create and join call
    await agent.create_user()
    call = agent.edge.client.video.call("default", "realtime-demo")
    agent.edge.open_demo(call)
    
    with await agent.join(call):
        await agent.finish()
    
    await agent.close()

if __name__ == "__main__":
    asyncio.run(main())