Vision Agents requires a Stream account
for real-time transport. Most providers offer free tiers to get started.
## Installation
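The plugin is typically installed as an extra of the main package. The package and extra names below are assumptions based on the usual Vision Agents plugin layout; check the project's README for the exact spelling.

```shell
# Assumed package/extra names -- verify against the Vision Agents README.
pip install "vision-agents[getstream,gemini]"
```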
## Quick Start
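A minimal sketch of a realtime agent wired to Stream for transport and Gemini for the model. The module paths and constructor arguments (`Agent`, `User`, `getstream.Edge`, `gemini.Realtime`) are assumptions drawn from the typical Vision Agents layout; verify them against your installed version. Running it also requires Stream and Google API credentials in the environment.

```python
import asyncio

# Assumed module layout -- check against the installed vision_agents version.
from vision_agents.core import Agent, User
from vision_agents.plugins import gemini, getstream


async def main() -> None:
    agent = Agent(
        edge=getstream.Edge(),                 # Stream provides real-time transport
        agent_user=User(name="AI assistant"),
        instructions="Answer questions about what you see.",
        llm=gemini.Realtime(fps=1),            # Gemini Realtime plugin
    )
    # Create/join a call and start the session here; the exact call-join
    # API varies by version, so consult the framework docs for that step.


if __name__ == "__main__":
    asyncio.run(main())
```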
## Parameters
| Name | Type | Default | Description |
|---|---|---|---|
| model | str | "gemini-3-flash-preview" | Gemini model |
| fps | int | 1 | Video frames per second |
| config | LiveConnectConfigDict | None | Optional config dict to customize session behavior |
| api_key | str | None | API key (defaults to GOOGLE_API_KEY env var) |
## Voice Activity Detection
The Gemini Realtime plugin includes built-in voice activity detection (VAD) with defaults optimized for low-latency conversations. You can override these settings via the `config` parameter:
| Name | Type | Default | Description |
|---|---|---|---|
| start_of_speech_sensitivity | StartSensitivity | START_SENSITIVITY_HIGH | How quickly the model detects the start of speech |
| end_of_speech_sensitivity | EndSensitivity | END_SENSITIVITY_HIGH | How quickly the model detects the end of speech |
| silence_duration_ms | int | 500 | Milliseconds of silence before the model considers the turn ended |
| prefix_padding_ms | int | 50 | Milliseconds of audio included before the detected start of speech |
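As an illustration, the overrides above can be expressed in the plain-dict form of `LiveConnectConfigDict`. The field names follow google-genai's Live API config (`realtime_input_config.automatic_activity_detection`); treat the exact nesting as an assumption to double-check against your installed SDK version.

```python
# Plain-dict LiveConnectConfigDict overriding the plugin's VAD defaults.
# Nesting follows google-genai's Live API config shape (an assumption to
# verify): realtime_input_config -> automatic_activity_detection.
config = {
    "realtime_input_config": {
        "automatic_activity_detection": {
            "start_of_speech_sensitivity": "START_SENSITIVITY_LOW",
            "end_of_speech_sensitivity": "END_SENSITIVITY_LOW",
            "silence_duration_ms": 800,  # wait longer before ending a turn
            "prefix_padding_ms": 100,    # keep more audio before speech starts
        }
    }
}

vad = config["realtime_input_config"]["automatic_activity_detection"]
print(vad["silence_duration_ms"])  # → 800
```

Lower sensitivities plus a longer silence window make the agent less eager to cut the speaker off, at the cost of slightly higher response latency.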
## VLM (Vision Language Model)
Use Gemini 3 vision models for multimodal interactions with video frames. The VLM buffers video frames, converts them to JPEG, and sends them alongside text prompts.

| Name | Type | Default | Description |
|---|---|---|---|
| model | str | "gemini-3-flash-preview" | Gemini vision model |
| fps | int | 1 | Video frames per second to capture |
| frame_buffer_seconds | int | 10 | Seconds of video to buffer for model input |
| thinking_level | ThinkingLevel | None | Thinking level for enhanced reasoning |
| media_resolution | MediaResolution | None | Resolution for multimodal processing |
| api_key | str | None | API key (defaults to GOOGLE_API_KEY env var) |
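Since the VLM captures `fps` frames per second and buffers `frame_buffer_seconds` seconds of video, the number of JPEG frames sent with each prompt is roughly their product. A quick sketch (the helper function is illustrative, not part of the plugin API):

```python
def frames_buffered(fps: int, frame_buffer_seconds: int) -> int:
    """Approximate number of frames the VLM keeps buffered: fps x seconds."""
    return fps * frame_buffer_seconds


print(frames_buffered(1, 10))  # defaults from the table → 10 frames
print(frames_buffered(2, 10))  # doubling fps doubles the payload → 20 frames
```

Raising `fps` or `frame_buffer_seconds` gives the model more temporal context but increases token usage and latency per request.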
## Next Steps

- Gemini LLM: LLM with built-in tools and RAG
- Build a Video Agent: add video processing

