Google’s Gemini Live API is a low-latency API that combines video analysis, transcription, text-to-speech synthesis, function calling, and more into a single streamlined pipeline. The Gemini Live plugin in the Vision Agents SDK is a native integration for realtime video and audio, with out-of-the-box support for Google’s Gemini Live models. With it, you can stream both audio and video frames to Gemini over websockets and receive responses in real time. The plugin also supports MCP and function calling, so agents can take actions on your behalf. This makes it ideal for building conversational agents, AI avatars, fitness coaches, visual accessibility assistants, remote support tools with visual guidance, interactive tutors, and much more!

Installation

Install the Gemini Live plugin with:
uv add vision-agents[gemini]
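If you manage dependencies with pip instead of uv, the equivalent install (quoting the extra so your shell doesn’t expand the brackets) should be:
pip install "vision-agents[gemini]"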

Tutorials

The Voice AI quickstart and Video AI quickstart pages have examples to get you up and running.

Example

Check out our Gemini Live example to see a practical implementation of the plugin and get inspiration for your own projects, or read on for some key details.

Initialization

The Gemini plugin for Stream exists in the form of the Realtime class:
from vision_agents.plugins import gemini

realtime = gemini.Realtime()

Parameters

These are the parameters available in the gemini.Realtime plugin:
| Name | Type | Default | Description |
| --- | --- | --- | --- |
| model | str | "gemini-2.5-flash-native-audio-preview-09-2025" | The Gemini model to use. Supports Live-enabled models only. |
| config | LiveConnectConfigDict or None | None | Configuration for the Gemini Live connection. If None, uses sensible defaults. |
| api_key | str or None | None | Your Gemini API key. If not provided, the SDK will look for it in environment variables. |
| fps | int | 1 | Number of video frames per second to send to Gemini. |
| client | genai.Client or None | None | Optional pre-configured Gemini client. If provided, used instead of creating a new one. |
| http_options | HttpOptions or None | None | HTTP options for the Gemini client connection. |
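As an illustrative sketch, you might override a few of these defaults; the parameter names come from the table above, while the specific values are placeholders:
from vision_agents.plugins import gemini

# Override a few defaults from the parameter table above. The API key can
# also come from an environment variable instead of being passed explicitly.
realtime = gemini.Realtime(
    model="gemini-2.5-flash-native-audio-preview-09-2025",
    fps=2,                          # send two video frames per second
    api_key="YOUR_GEMINI_API_KEY",  # omit to fall back to env vars
)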

Functionality

Connect

The connect() method establishes a websocket connection to Gemini Live:
await realtime.connect()
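Since connect() is a coroutine, it must run inside an event loop. A minimal standalone sketch (when you run the plugin through the Agent class, the agent presumably manages this connection for you):
import asyncio

from vision_agents.plugins import gemini

async def main():
    realtime = gemini.Realtime()
    # Open the websocket session to Gemini Live before sending any input.
    await realtime.connect()
    await realtime.simple_response("Introduce yourself briefly")

asyncio.run(main())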

Send Text Message

The simple_response() method allows you to send a text instruction to Gemini:
await realtime.simple_response("Describe what you see and say hi")

Send Audio

The simple_audio_response() method allows you to send audio data to Gemini:
await realtime.simple_audio_response(pcm_data)
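The plugin doesn’t define pcm_data for you. As a sketch, you could read raw PCM from a WAV file using only the standard library; the assumption that Gemini Live expects 16 kHz mono 16-bit PCM input is common for realtime audio APIs but should be verified against the Gemini Live docs:
import wave

# Read raw PCM frames from a WAV file. Gemini Live commonly expects
# 16 kHz, 16-bit, mono PCM for input audio (verify against the current
# Gemini Live documentation).
with wave.open("speech.wav", "rb") as wav:
    pcm_data = wav.readframes(wav.getnframes())

await realtime.simple_audio_response(pcm_data)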

Advanced: Send Realtime Input

For more control, you can use the native send_realtime_input() method, which wraps Gemini’s underlying API:
await realtime.send_realtime_input(text="Hello", media=blob)
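The media argument takes a google.genai Blob. For example, to send a single encoded video frame (jpeg_bytes here is a placeholder for frame bytes you already have):
from google.genai.types import Blob

# jpeg_bytes is a placeholder for an encoded frame from your video source.
# The mime_type tells Gemini how to decode the raw bytes.
blob = Blob(data=jpeg_bytes, mime_type="image/jpeg")
await realtime.send_realtime_input(media=blob)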

Function Calling and MCP

Gemini Live supports function calling and MCP (Model Context Protocol) tools. When using the Realtime plugin via the main Agent class, you can register tools that Gemini can call (a sketch follows the list below). Follow the instructions in the MCP tool calling guide, using the Gemini Realtime class as your LLM. The plugin automatically handles:
  • Converting your tool definitions to Gemini’s format
  • Executing function calls when Gemini requests them
  • Sending function results back to Gemini
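Purely as an illustrative sketch, a function tool is typically a plain Python function with type hints and a docstring; the register_function call below is a hypothetical name standing in for whatever registration mechanism the MCP tool calling guide specifies:
from vision_agents.plugins import gemini

realtime = gemini.Realtime()

def get_weather(city: str) -> str:
    """Return a short weather summary for the given city."""
    return f"Sunny and 22°C in {city}"

# Hypothetical registration call; see the MCP tool calling guide for the
# actual API for attaching tools to the agent or LLM.
realtime.register_function(get_weather)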

Configuration

The Gemini Live API uses LiveConnectConfigDict for configuration. You can customize various aspects of the connection:
from google.genai.types import LiveConnectConfigDict, Modality, SpeechConfigDict

config = LiveConnectConfigDict(
    response_modalities=[Modality.AUDIO],
    speech_config=SpeechConfigDict(
        language_code="en-US",
    ),
)

realtime = gemini.Realtime(config=config)

Events

The Gemini plugin emits standard Vision Agents events that you can listen to:
  • RealtimeConnectedEvent: Fired when connection is established
  • RealtimeDisconnectedEvent: Fired when connection is closed
  • RealtimeAudioOutputEvent: Fired when Gemini generates audio
  • LLMResponseChunkEvent: Fired when Gemini generates text
  • RealtimeTranscriptEvent: Fired for transcriptions
Access these events through the Agent’s event system. See the Event System guide for more details.
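As a sketch, subscribing might look like the following, given an Agent instance named agent; both the import path and the decorator-style subscription are assumptions to verify against the Event System guide:
# Both the import path and the subscription decorator below are
# assumptions; the Event System guide documents the real API.
from vision_agents.core.events import RealtimeAudioOutputEvent

@agent.events.subscribe
async def on_audio(event: RealtimeAudioOutputEvent):
    print("Gemini produced an audio chunk")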