The Realtime component provides end-to-end speech-to-speech communication, combining STT, LLM, and TTS functionality in a single, optimized interface. It delivers ultra-low latency speech processing, direct audio streaming without intermediate text conversion, and support for multiple modalities (audio, video, text).

When to Use Realtime

Use a Realtime LLM when you want the lowest latency voice interactions. The model handles speech recognition, response generation, and speech synthesis natively—no separate STT or TTS services required. Use the traditional STT → LLM → TTS pipeline when you need custom voices (e.g., Cartesia, ElevenLabs), specific transcription providers, or models that don’t support realtime audio.

Supported Providers

Realtime models are currently available through the OpenAI, Gemini, AWS Bedrock, and Qwen plugins. See the provider integration docs referenced at the end of this page for the models and options each one supports.

Basic Usage

```python
from vision_agents.plugins import openai, getstream
from vision_agents.core.agents import Agent
from vision_agents.core.edge.types import User

agent = Agent(
    edge=getstream.Edge(),
    agent_user=User(name="AI Assistant", id="agent"),
    instructions="You're a helpful voice assistant",
    llm=openai.Realtime(model="gpt-realtime", voice="marin"),
    processors=[]
)
```
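If you instead need the cascaded STT → LLM → TTS pipeline described above (for a custom voice or a specific transcription provider), the Agent is configured with separate components rather than a single Realtime model. The sketch below is illustrative only: the `deepgram` and `elevenlabs` plugins, their `STT()`/`TTS()` constructors, and the text LLM model name are assumptions, so swap in whichever providers you actually use.

```python
from vision_agents.plugins import deepgram, elevenlabs, getstream, openai
from vision_agents.core.agents import Agent
from vision_agents.core.edge.types import User

# Cascaded pipeline sketch: separate speech-to-text, text LLM, and text-to-speech
# components instead of one Realtime model. The provider plugins and model name
# below are placeholders for illustration.
agent = Agent(
    edge=getstream.Edge(),
    agent_user=User(name="AI Assistant", id="agent"),
    instructions="You're a helpful voice assistant",
    llm=openai.LLM(model="gpt-4o-mini"),  # text-only LLM (assumed model name)
    stt=deepgram.STT(),                   # transcription provider (assumed)
    tts=elevenlabs.TTS(),                 # custom voice provider (assumed)
)
```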

Methods

simple_response(text, processors=None, participant=None)

Sends a text prompt to the realtime model. The model responds with audio.
```python
await agent.llm.simple_response("What do you see in the video?")
```

simple_audio_response(pcm, participant=None)

Sends raw PCM audio data directly to the model for processing.
```python
await agent.llm.simple_audio_response(audio_pcm_data)
```
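Where the PCM bytes come from depends on your audio source. The sketch below reads frames from a WAV file with Python's standard `wave` module; the expected sample rate, channel count, and whether `simple_audio_response` takes raw bytes or a wrapper type are provider-specific, so treat the details as assumptions and check your integration docs.

```python
import wave

# Minimal sketch: read raw PCM frames from a 16-bit WAV file and hand them to
# the realtime model. Format requirements (sample rate, channels, container
# type for the pcm argument) vary by provider and are assumed here.
with wave.open("question.wav", "rb") as wav:
    audio_pcm_data = wav.readframes(wav.getnframes())

await agent.llm.simple_audio_response(audio_pcm_data)
```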

Properties

| Property | Type | Description |
| --- | --- | --- |
| `connected` | `bool` | `True` if the realtime session is active |
| `fps` | `int` | Video frames per second sent to the model (default: 1) |
| `session_id` | `str` | UUID identifying the current session |
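
As a quick illustration, these properties can be read off the Realtime instance at runtime, for example to log session details once the agent is connected; the guard-and-log pattern below is just a sketch.

```python
# Read-only inspection of the realtime session using the properties above.
if agent.llm.connected:
    print(
        f"Realtime session {agent.llm.session_id} active, "
        f"sending video at {agent.llm.fps} fps"
    )
```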

Events

The Realtime class emits events for monitoring conversations:

| Event | Description |
| --- | --- |
| `RealtimeConnectedEvent` | Connection established |
| `RealtimeDisconnectedEvent` | Connection closed |
| `RealtimeUserSpeechTranscriptionEvent` | Transcript of user speech |
| `RealtimeAgentSpeechTranscriptionEvent` | Transcript of agent speech |
| `RealtimeResponseEvent` | AI response text |
| `RealtimeAudioInputEvent` | Audio received from user |
| `RealtimeAudioOutputEvent` | Audio sent to user |
| `RealtimeErrorEvent` | Error during processing |

For example, to print transcripts of user speech:

```python
from vision_agents.core.llm.events import RealtimeUserSpeechTranscriptionEvent

@agent.llm.events.on(RealtimeUserSpeechTranscriptionEvent)
async def on_user_speech(event):
    print(f"User said: {event.text}")
```
For provider-specific parameters and configuration, see the integration docs for OpenAI, Gemini, AWS Bedrock, or Qwen.