Google’s Gemini provides native multimodal speech-to-speech over WebSocket with optional video. No separate STT/TTS services required.
Vision Agents requires a Stream account for real-time transport. Most providers offer free tiers to get started.
Gemini also provides a traditional LLM with built-in tools for search, code execution, and RAG.

Installation

```shell
uv add "vision-agents[gemini]"
```
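The plugin reads your credentials from the `GOOGLE_API_KEY` environment variable by default (see Parameters below), so export it before running. The value shown is a placeholder:

```shell
# Placeholder value; substitute your own Google API key
export GOOGLE_API_KEY="your-google-api-key"
```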

Quick Start

```python
from vision_agents.core import Agent, User
from vision_agents.plugins import gemini, getstream

agent = Agent(
    edge=getstream.Edge(),
    agent_user=User(name="Assistant", id="agent"),
    instructions="You are a helpful assistant.",
    llm=gemini.Realtime(fps=3),  # Video frames sent to the model at 3 fps
)
```

Parameters

| Name | Type | Default | Description |
| --- | --- | --- | --- |
| `model` | `str` | `"gemini-3-flash-preview"` | Gemini model |
| `fps` | `int` | `1` | Video frames per second |
| `config` | `LiveConnectConfigDict` | `None` | Optional config dict to customize session behavior |
| `api_key` | `str` | `None` | API key (defaults to `GOOGLE_API_KEY` env var) |

Voice Activity Detection

The Gemini Realtime plugin includes built-in voice activity detection (VAD) with defaults optimized for low-latency conversations. You can override these settings via the `config` parameter:

```python
from google.genai.types import (
    AutomaticActivityDetectionDict,
    EndSensitivity,
    RealtimeInputConfigDict,
    StartSensitivity,
    TurnCoverage,
)

llm = gemini.Realtime(
    config={
        "realtime_input_config": RealtimeInputConfigDict(
            turn_coverage=TurnCoverage.TURN_INCLUDES_ONLY_ACTIVITY,
            automatic_activity_detection=AutomaticActivityDetectionDict(
                start_of_speech_sensitivity=StartSensitivity.START_SENSITIVITY_HIGH,
                end_of_speech_sensitivity=EndSensitivity.END_SENSITIVITY_HIGH,
                silence_duration_ms=500,
                prefix_padding_ms=50,
            ),
        ),
    },
)
```
| Name | Type | Default | Description |
| --- | --- | --- | --- |
| `start_of_speech_sensitivity` | `StartSensitivity` | `START_SENSITIVITY_HIGH` | How quickly the model detects the start of speech |
| `end_of_speech_sensitivity` | `EndSensitivity` | `END_SENSITIVITY_HIGH` | How quickly the model detects the end of speech |
| `silence_duration_ms` | `int` | `500` | Milliseconds of silence before the model considers the turn ended |
| `prefix_padding_ms` | `int` | `50` | Milliseconds of audio to include before the detected start of speech |
Higher sensitivity values make the model react faster to speech starts and stops, which reduces latency but may increase false positives in noisy environments.
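Conversely, in a noisy room you can trade some latency for robustness by lowering both sensitivities and lengthening the silence window. A sketch using the same config keys as above; the specific values here are illustrative, not recommendations:

```python
from google.genai.types import (
    AutomaticActivityDetectionDict,
    EndSensitivity,
    RealtimeInputConfigDict,
    StartSensitivity,
)

from vision_agents.plugins import gemini

# Lower sensitivity: slower to trigger, fewer false positives from
# background noise.
llm = gemini.Realtime(
    config={
        "realtime_input_config": RealtimeInputConfigDict(
            automatic_activity_detection=AutomaticActivityDetectionDict(
                start_of_speech_sensitivity=StartSensitivity.START_SENSITIVITY_LOW,
                end_of_speech_sensitivity=EndSensitivity.END_SENSITIVITY_LOW,
                # Wait a full second of silence before ending the turn
                # (illustrative value).
                silence_duration_ms=1000,
                prefix_padding_ms=100,
            ),
        ),
    },
)
```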

VLM (Vision Language Model)

Use Gemini 3 vision models for multimodal interactions with video frames. The VLM buffers video frames, converts them to JPEG, and sends them alongside text prompts.

```python
from vision_agents.core import Agent, User
from vision_agents.plugins import gemini, getstream, deepgram, elevenlabs

agent = Agent(
    edge=getstream.Edge(),
    agent_user=User(name="Vision Agent", id="vision-agent"),
    instructions="Describe what you see in one sentence.",
    llm=gemini.VLM(model="gemini-3-flash-preview"),
    stt=deepgram.STT(),
    tts=elevenlabs.TTS(),
)
```
| Name | Type | Default | Description |
| --- | --- | --- | --- |
| `model` | `str` | `"gemini-3-flash-preview"` | Gemini vision model |
| `fps` | `int` | `1` | Video frames per second to capture |
| `frame_buffer_seconds` | `int` | `10` | Seconds of video to buffer for model input |
| `thinking_level` | `ThinkingLevel` | `None` | Thinking level for enhanced reasoning |
| `media_resolution` | `MediaResolution` | `None` | Resolution for multimodal processing |
| `api_key` | `str` | `None` | API key (defaults to `GOOGLE_API_KEY` env var) |
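Together, `fps` and `frame_buffer_seconds` determine how many JPEG frames accompany each prompt, which drives payload size and cost. A back-of-envelope sketch; the per-frame size is an assumption for illustration, not a documented figure:

```python
def buffered_frames(fps: int, frame_buffer_seconds: int) -> int:
    """Frames the VLM keeps in its rolling buffer: fps x buffer window."""
    return fps * frame_buffer_seconds

# Defaults from the table: 1 fps over a 10-second window = 10 frames
print(buffered_frames(1, 10))  # -> 10

# At 3 fps, assuming ~50 KB per JPEG frame (illustrative only),
# each prompt would carry roughly 1.5 MB of image data.
payload_kb = buffered_frames(3, 10) * 50
print(payload_kb)  # -> 1500
```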

Next Steps

- Gemini LLM: an LLM with built-in tools and RAG
- Build a Video Agent: add video processing