Google’s Gemini provides realtime multimodal capabilities over WebSocket. With Vision Agents, developers can stream audio and video from their apps directly to Gemini and receive responses in real time. The plugin includes built-in tools for search, code execution, and RAG, and supports both LLM and Realtime models.
Vision Agents requires a Stream account for real-time transport. Most providers offer free tiers to get started.

Installation

uv add "vision-agents[gemini]"

Realtime

Native speech-to-speech with optional video over WebSocket.
from vision_agents.core import Agent, User
from vision_agents.plugins import gemini, getstream

agent = Agent(
    edge=getstream.Edge(),
    agent_user=User(name="Assistant", id="agent"),
    instructions="You are a helpful assistant.",
    llm=gemini.Realtime(fps=3),  # Send 3 video frames per second to the model
)
| Name | Type | Default | Description |
| --- | --- | --- | --- |
| model | str | "gemini-2.5-flash" | Gemini model |
| fps | int | 1 | Video frames per second |
| api_key | str | None | API key (defaults to GOOGLE_API_KEY env var) |
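
The api_key fallback in the table above can be pictured with a small sketch. The helper name resolve_api_key is hypothetical and is not part of the plugin; raising an error when no key is found is also an assumption, since the source does not say how the plugin behaves in that case.

```python
import os
from typing import Optional


def resolve_api_key(api_key: Optional[str] = None) -> str:
    """Hypothetical helper: prefer an explicit key, else use GOOGLE_API_KEY."""
    key = api_key or os.environ.get("GOOGLE_API_KEY")
    if not key:
        # Assumption: fail early rather than send an unauthenticated request.
        raise ValueError("No API key: pass api_key or set GOOGLE_API_KEY")
    return key
```

An explicit argument wins over the environment variable, which makes it easy to override the key in tests without touching the environment.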

VLM (Vision Language Model)

Use Gemini 3 vision models for multimodal interactions with video frames. The VLM buffers video frames, converts them to JPEG, and sends them alongside text prompts.
from vision_agents.core import Agent, User
from vision_agents.plugins import gemini, getstream, deepgram, elevenlabs

agent = Agent(
    edge=getstream.Edge(),
    agent_user=User(name="Vision Agent", id="vision-agent"),
    instructions="Describe what you see in one sentence.",
    llm=gemini.VLM(model="gemini-3-flash-preview"),
    stt=deepgram.STT(),
    tts=elevenlabs.TTS(),
)
| Name | Type | Default | Description |
| --- | --- | --- | --- |
| model | str | "gemini-3-flash-preview" | Gemini vision model |
| fps | int | 1 | Video frames per second to capture |
| frame_buffer_seconds | int | 10 | Seconds of video to buffer for model input |
| thinking_level | ThinkingLevel | None | Thinking level for enhanced reasoning |
| media_resolution | MediaResolution | None | Resolution for multimodal processing |
| api_key | str | None | API key (defaults to GOOGLE_API_KEY env var) |
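
To see how fps and frame_buffer_seconds interact, here is a minimal sketch of a rolling frame buffer. The FrameBuffer class is hypothetical and only illustrates the idea; the plugin's actual buffering may differ. It assumes the buffer holds fps × frame_buffer_seconds frames and discards the oldest frame once full.

```python
from collections import deque
from typing import Deque, List


class FrameBuffer:
    """Hypothetical rolling buffer holding the most recent JPEG frames."""

    def __init__(self, fps: int = 1, frame_buffer_seconds: int = 10) -> None:
        # Capacity = frames per second x seconds of history to keep.
        self._frames: Deque[bytes] = deque(maxlen=fps * frame_buffer_seconds)

    def add(self, jpeg_bytes: bytes) -> None:
        # When the deque is full, appending drops the oldest frame.
        self._frames.append(jpeg_bytes)

    def snapshot(self) -> List[bytes]:
        # Oldest-to-newest copy, ready to send alongside a text prompt.
        return list(self._frames)


buffer = FrameBuffer(fps=1, frame_buffer_seconds=10)
for i in range(15):
    buffer.add(f"frame-{i}".encode())  # placeholder bytes, not real JPEG data

frames = buffer.snapshot()
print(len(frames), frames[0], frames[-1])  # 10 b'frame-5' b'frame-14'
```

With the default values, the model sees at most the last 10 frames. Raising either parameter increases how much visual context each prompt carries, at the cost of a larger request.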

LLM

Standard chat completions. Requires separate STT/TTS.
from vision_agents.core import Agent, User
from vision_agents.plugins import gemini, getstream, deepgram, elevenlabs

agent = Agent(
    edge=getstream.Edge(),
    agent_user=User(name="Assistant", id="agent"),
    instructions="You are a helpful assistant.",
    llm=gemini.LLM("gemini-2.5-flash"),
    stt=deepgram.STT(),
    tts=elevenlabs.TTS(),
)

Built-in Tools

Gemini provides built-in tools you can enable:
llm = gemini.LLM(
    model="gemini-2.5-flash",
    tools=[
        gemini.tools.GoogleSearch(),
        gemini.tools.CodeExecution(),
        gemini.tools.FileSearch(store),  # RAG; store is a GeminiFilesearchRAG instance (see File Search)
        gemini.tools.URLContext(),
    ]
)
| Tool | Description |
| --- | --- |
| GoogleSearch | Ground responses with web data |
| CodeExecution | Run Python code |
| FileSearch | RAG over your documents |
| URLContext | Read specific web pages |

File Search (RAG)

Managed RAG with automatic chunking and retrieval:
from vision_agents.plugins import gemini

# The awaits below must run inside an async function.
store = gemini.GeminiFilesearchRAG(name="my-knowledge-base")
await store.create()
await store.add_directory("./knowledge")

llm = gemini.LLM(
    model="gemini-2.5-flash",
    tools=[gemini.tools.FileSearch(store)]
)
See the RAG guide for more details.
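
File Search handles chunking for you, so none of the following is required. As a rough illustration of what chunking means in general, this sketch splits text into overlapping fixed-size windows. The chunk_text function and its default sizes are hypothetical; the strategy Gemini uses internally is not described here and may be quite different.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 40, overlap: int = 10) -> List[str]:
    """Hypothetical chunker: fixed-size character windows that overlap."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks: List[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once this window reaches the end, to avoid a redundant tail chunk.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


sample = "".join(chr(97 + i % 26) for i in range(100))
chunks = chunk_text(sample)
print(len(chunks))  # 3
```

Overlap keeps a sentence that straddles a boundary retrievable from at least one chunk, which is the usual reason retrieval systems use it.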

Function Calling

@agent.llm.register_function(description="Get weather for a location")
async def get_weather(location: str) -> dict:
    return {"temperature": "22°C", "condition": "Sunny"}
See the Function Calling guide for details.
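
To make the decorator pattern above more concrete, here is a minimal sketch of how a registry could record a function's description and parameter types for later dispatch. FunctionRegistry and its call method are hypothetical stand-ins written for this example; they are not the Vision Agents implementation, which may derive its schema differently.

```python
import asyncio
import inspect
from typing import Any, Awaitable, Callable, Dict


def _type_name(annotation: Any) -> str:
    # Unannotated parameters are reported as "any".
    if annotation is inspect.Parameter.empty:
        return "any"
    return getattr(annotation, "__name__", str(annotation))


class FunctionRegistry:
    """Hypothetical registry, for illustration only."""

    def __init__(self) -> None:
        self.functions: Dict[str, Dict[str, Any]] = {}

    def register_function(self, description: str) -> Callable:
        def decorator(func: Callable[..., Awaitable[Any]]) -> Callable:
            params = inspect.signature(func).parameters
            self.functions[func.__name__] = {
                "description": description,
                # Map each parameter name to the name of its annotated type.
                "parameters": {n: _type_name(p.annotation) for n, p in params.items()},
                "handler": func,
            }
            return func  # Leave the original function usable as-is.

        return decorator

    async def call(self, name: str, **arguments: Any) -> Any:
        # Look up the handler by name and await it with the given arguments.
        return await self.functions[name]["handler"](**arguments)


registry = FunctionRegistry()


@registry.register_function(description="Get weather for a location")
async def get_weather(location: str) -> dict:
    return {"temperature": "22°C", "condition": "Sunny"}


result = asyncio.run(registry.call("get_weather", location="Paris"))
print(registry.functions["get_weather"]["parameters"], result)
```

The type hints and the description are what a model would need in order to decide when to call the function and with which arguments, so it helps to keep both accurate.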

Events

The Gemini plugin emits events for connection state and responses. Most developers should use the core events (LLMResponseCompletedEvent, etc.) for provider-agnostic code.
from vision_agents.plugins.gemini.events import (
    GeminiConnectedEvent,
    GeminiErrorEvent,
)

@agent.events.subscribe
async def on_gemini_connected(event: GeminiConnectedEvent):
    print(f"Connected to Gemini model: {event.model}")

@agent.events.subscribe
async def on_gemini_error(event: GeminiErrorEvent):
    print(f"Gemini error: {event.error}")
| Event | Description |
| --- | --- |
| GeminiConnectedEvent | Realtime connection established |
| GeminiErrorEvent | Error occurred |
| GeminiAudioEvent | Audio output received |
| GeminiTextEvent | Text output received |
| GeminiResponseEvent | Response chunk received |
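
The subscribe decorator above picks the event type from the handler's type hint. The sketch below shows one way such routing could work. EventBus, ConnectedEvent, and ErrorEvent are hypothetical stand-ins defined only for this example; they are not the Vision Agents classes, and the real event system may work differently.

```python
import asyncio
import inspect
from collections import defaultdict
from dataclasses import dataclass
from typing import Any, Awaitable, Callable, DefaultDict, List


@dataclass
class ConnectedEvent:  # hypothetical stand-in for GeminiConnectedEvent
    model: str


@dataclass
class ErrorEvent:  # hypothetical stand-in for GeminiErrorEvent
    error: str


Handler = Callable[[Any], Awaitable[None]]


class EventBus:
    """Hypothetical event bus that routes on the handler's type hint."""

    def __init__(self) -> None:
        self._handlers: DefaultDict[type, List[Handler]] = defaultdict(list)

    def subscribe(self, handler: Handler) -> Handler:
        # Use the annotation of the first parameter as the routing key.
        first_param = next(iter(inspect.signature(handler).parameters.values()))
        self._handlers[first_param.annotation].append(handler)
        return handler

    async def emit(self, event: Any) -> None:
        # Only handlers registered for this exact event type are called.
        for handler in self._handlers[type(event)]:
            await handler(event)


bus = EventBus()
seen: List[str] = []


@bus.subscribe
async def on_connected(event: ConnectedEvent) -> None:
    seen.append(f"connected:{event.model}")


@bus.subscribe
async def on_error(event: ErrorEvent) -> None:
    seen.append(f"error:{event.error}")


async def main() -> None:
    await bus.emit(ConnectedEvent(model="gemini-2.5-flash"))
    await bus.emit(ErrorEvent(error="timeout"))


asyncio.run(main())
print(seen)  # ['connected:gemini-2.5-flash', 'error:timeout']
```

In this sketch the type hint on the handler is what selects which events it receives, so a handler with a missing or wrong hint would never be called.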

Next Steps