Realtime Class

The Realtime component provides end-to-end speech-to-speech communication, combining STT, LLM, and TTS functionality in a single, optimized interface. It delivers ultra-low latency speech processing, direct audio streaming without intermediate text conversion, provider-specific optimizations, and support for multiple modalities (audio, video, text).

Overview

The Realtime class is an abstract base class that enables real-time AI communication through various providers. It eliminates the need for separate STT and TTS services by handling speech-to-speech communication directly, resulting in lower latency and more natural conversations.

Supported Providers

  • OpenAI Realtime API: WebRTC-based real-time communication with GPT models
  • Google Gemini Live: Native audio processing with multimodal capabilities

Basic Usage

from vision_agents.plugins import openai, gemini, getstream
from vision_agents.core.agents import Agent
from vision_agents.core.edge.types import User

# OpenAI Realtime
llm = openai.Realtime(model="gpt-realtime", voice="marin")

# Gemini Live
llm = gemini.Realtime(model="gemini-2.5-flash-native-audio-preview-09-2025")

# Use with Agent
agent = Agent(
    edge=getstream.Edge(),
    agent_user=User(name="AI Assistant", id="agent"),
    instructions="You're a helpful voice assistant",
    llm=llm,  # Realtime LLM replaces STT/TTS
    processors=[]
)

Abstract Base Class

Core Methods

async connect()

Establishes a connection to the realtime provider. Each provider must implement this method.
await llm.connect()

async simple_audio_response(pcm: PcmData)

Sends audio data to the realtime provider for processing.
await llm.simple_audio_response(audio_pcm_data)

async simple_response(text: str, processors=None, participant=None)

Sends a text message to the realtime provider.
await llm.simple_response("Hello, how can I help you?")

async close()

Closes the realtime connection and cleans up resources.
await llm.close()
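
Taken together, these methods form a simple lifecycle: connect, exchange audio or text, then close. A minimal sketch, assuming llm and audio_pcm_data from the earlier examples:

await llm.connect()

# Send captured user audio, or a plain text turn, to the provider
await llm.simple_audio_response(audio_pcm_data)
await llm.simple_response("Hello, how can I help you?")

# Release the connection when the session ends
await llm.close()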

Properties

is_connected: bool

Returns True if the realtime session is currently active.

output_track: AudioStreamTrack

WebRTC audio track for outputting synthesized speech.

fps: int

Frames per second for video processing (default: 1).
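
These properties can be inspected at runtime. A brief sketch using the attribute names listed above:

if llm.is_connected:
    print(f"Session active, video sampled at {llm.fps} fps")
    # output_track carries the synthesized speech; the Agent normally
    # publishes it to the call on your behalf
    track = llm.output_track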

Provider Implementations

OpenAI Realtime

from vision_agents.plugins.openai import Realtime

llm = Realtime(
    model="gpt-realtime",
    voice="marin",
    fps=1
)

Gemini Live

from vision_agents.plugins.gemini import Realtime

llm = Realtime(
    model="gemini-2.5-flash-native-audio-preview-09-2025",
    api_key="your_google_api_key",
    config={
        "response_modalities": ["AUDIO"],
        "speech_config": {
            "voice_config": {
                "prebuilt_voice_config": {
                    "voice_name": "Leda"
                }
            }
        }
    }
)

Event System

The Realtime class emits various events for monitoring and integration:

Connection Events

  • RealtimeConnectedEvent: Emitted when connection is established
  • RealtimeDisconnectedEvent: Emitted when connection is lost

Audio Events

  • RealtimeAudioInputEvent: Audio data received from user
  • RealtimeAudioOutputEvent: Audio data sent to user

Transcript Events

  • RealtimeTranscriptEvent: Final transcript of user speech
  • RealtimePartialTranscriptEvent: Partial transcript during speech

Response Events

  • RealtimeResponseEvent: Complete response from AI
  • StandardizedTextDeltaEvent: Streaming text deltas

Error Events

  • RealtimeErrorEvent: Errors during processing

Example Event Handling

@llm.events.subscribe
async def on_connected(event: RealtimeConnectedEvent):
    print(f"Connected to {event.provider}")

@llm.events.subscribe
async def on_transcript(event: RealtimeTranscriptEvent):
    print(f"User said: {event.text}")

@llm.events.subscribe
async def on_response(event: RealtimeResponseEvent):
    print(f"AI responded: {event.text}")

Video Support

Video Processing

Some providers support video input for multimodal interactions:
# Watch a video track (provider-specific)
await llm._watch_video_track(video_track)

# Stop watching video
await llm._stop_watching_video_track()

Configuration

Provider-Specific Settings

OpenAI Realtime

llm = Realtime(
    model="gpt-realtime",
    voice="marin",
    fps=1,  # Video frames per second
    instructions="You are a helpful assistant"
)

Gemini Live

llm = Realtime(
    model="gemini-2.5-flash-native-audio-preview-09-2025",
    config={
        "response_modalities": ["AUDIO"],
        "speech_config": {
            "voice_config": {
                "prebuilt_voice_config": {
                    "voice_name": "Leda"
                }
            },
            "language_code": "en-US"
        },
        "enable_affective_dialog": True
    }
)

Integration with Agent

Agent Configuration

When using Realtime with an Agent, separate STT, TTS, and VAD services are not needed:
agent = Agent(
    edge=getstream.Edge(),
    agent_user=User(name="AI Assistant", id="agent"),
    instructions="You're a helpful voice assistant",
    llm=openai.Realtime(),  # Realtime handles audio
    # No STT, TTS, or VAD needed
    processors=[]
)

Automatic Mode Detection

The Agent automatically detects Realtime mode and adjusts behavior accordingly:
if agent.realtime_mode:
    # Realtime mode - direct audio processing
    pass
else:
    # Traditional mode - STT → LLM → TTS
    pass

Complete Realtime Example

import asyncio
from vision_agents.core.agents import Agent
from vision_agents.plugins import openai, getstream
from vision_agents.core.edge.types import User
from vision_agents.core.events import CallSessionParticipantJoinedEvent

async def main():
    # Create Realtime LLM
    llm = openai.Realtime(
        model="gpt-realtime",
        voice="marin"
    )
    
    # Create agent with Realtime LLM
    agent = Agent(
        edge=getstream.Edge(),
        agent_user=User(name="AI Assistant", id="agent"),
        instructions="You're a helpful voice assistant. Keep responses conversational and natural.",
        llm=llm,
        processors=[]
    )
    
    # Set up event handlers
    @agent.subscribe
    async def on_participant_joined(event: CallSessionParticipantJoinedEvent):
        await agent.simple_response(f"Hello {event.participant.user.name}!")
    
    # Create and join call
    await agent.create_user()
    call = agent.edge.client.video.call("default", "realtime-demo")
    agent.edge.open_demo(call)
    
    with await agent.join(call):
        await agent.finish()
    
    await agent.close()

if __name__ == "__main__":
    asyncio.run(main())