# Gemini Realtime

<iframe className="w-full aspect-video rounded-xl" src="https://www.youtube.com/embed/8lA6bF2EnvA" title="Gemini Live integration" frameBorder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowFullScreen />

[Google's Gemini](https://ai.google.dev/gemini-api/docs/live) provides native multimodal speech-to-speech over WebSocket with optional video, so no separate STT or TTS services are required.

<Info>
  Vision Agents requires a [Stream](https://getstream.io/try-for-free/) account
  for real-time transport. Most providers offer free tiers to get started.
</Info>

<Tip>
  Gemini also provides a traditional [LLM](/integrations/llm/gemini) with built-in tools for search, code execution, and RAG.
</Tip>

## Installation

```sh
uv add "vision-agents[gemini]"
```

## Quick Start

```python
from vision_agents.core import Agent, User
from vision_agents.plugins import gemini, getstream

agent = Agent(
    edge=getstream.Edge(),
    agent_user=User(name="Assistant", id="agent"),
    instructions="You are a helpful assistant.",
    llm=gemini.Realtime(fps=3),  # Send 3 video frames per second to the model
)
```
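
With `fps=3`, roughly three frames per second reach the model even when the camera delivers many more. The sketch below shows one simple way such downsampling can work, assuming a fixed source frame rate. `frames_to_forward` is a hypothetical helper written for illustration; the plugin's actual sampling logic may differ.

```python
from typing import List


def frames_to_forward(source_fps: int, target_fps: int, num_frames: int) -> List[int]:
    """Return the indices of source frames to keep (hypothetical helper).

    Keeps every `source_fps // target_fps`-th frame, so a 30 fps source
    sampled at 3 fps keeps one frame out of every ten.
    """
    step = max(1, source_fps // target_fps)
    return list(range(0, num_frames, step))


# One second of a 30 fps stream, forwarded at fps=3.
print(frames_to_forward(source_fps=30, target_fps=3, num_frames=30))  # [0, 10, 20]
```

A higher `fps` gives the model more visual detail at the cost of more data sent per second.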

## Parameters

| Name      | Type                    | Default                    | Description                                        |
| --------- | ----------------------- | -------------------------- | -------------------------------------------------- |
| `model`   | `str`                   | `"gemini-3-flash-preview"` | Gemini model                                       |
| `fps`     | `int`                   | `1`                        | Video frames per second sent to the model          |
| `config`  | `LiveConnectConfigDict` | `None`                     | Optional config dict to customize session behavior |
| `api_key` | `str`                   | `None`                     | API key (defaults to `GOOGLE_API_KEY` env var)     |

## Voice Activity Detection

The Gemini Realtime plugin includes built-in voice activity detection (VAD) with defaults optimized for low-latency conversations. You can override these settings via the `config` parameter:

```python
from google.genai.types import (
    AutomaticActivityDetectionDict,
    EndSensitivity,
    RealtimeInputConfigDict,
    StartSensitivity,
    TurnCoverage,
)

llm = gemini.Realtime(
    config={
        "realtime_input_config": RealtimeInputConfigDict(
            turn_coverage=TurnCoverage.TURN_INCLUDES_ONLY_ACTIVITY,
            automatic_activity_detection=AutomaticActivityDetectionDict(
                start_of_speech_sensitivity=StartSensitivity.START_SENSITIVITY_HIGH,
                end_of_speech_sensitivity=EndSensitivity.END_SENSITIVITY_HIGH,
                silence_duration_ms=250,
                prefix_padding_ms=50,
            ),
        ),
    },
)
```

| Name                          | Type               | Default                  | Description                                                                      |
| ----------------------------- | ------------------ | ------------------------ | -------------------------------------------------------------------------------- |
| `start_of_speech_sensitivity` | `StartSensitivity` | `START_SENSITIVITY_HIGH` | How quickly the model detects the start of speech                                |
| `end_of_speech_sensitivity`   | `EndSensitivity`   | `END_SENSITIVITY_HIGH`   | How quickly the model detects the end of speech                                  |
| `silence_duration_ms`         | `int`              | `250`                    | Milliseconds of silence required before the end of speech is committed           |
| `prefix_padding_ms`           | `int`              | `50`                     | Milliseconds of detected speech required before the start of speech is committed |

<Tip>
  Higher sensitivity values make the model react faster to speech starts and stops, which reduces latency but may increase false positives in noisy environments.
</Tip>
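
To build intuition for `silence_duration_ms`, here is a toy end-of-turn detector. It is not Gemini's detector, which runs server-side and is not reproduced here. It assumes audio has already been split into fixed-length frames labeled as speech or silence; `turn_end_ms` and `frame_ms` are hypothetical names used only for this illustration.

```python
from typing import Optional, Sequence


def turn_end_ms(
    speech_flags: Sequence[bool],
    frame_ms: int = 50,
    silence_duration_ms: int = 250,
) -> Optional[int]:
    """Return the time (ms) at which the turn is considered over, or None.

    speech_flags[i] is True if frame i contains speech. The turn ends once
    speech has been heard and the silence that follows it has lasted
    silence_duration_ms.
    """
    heard_speech = False
    silent_ms = 0
    for i, is_speech in enumerate(speech_flags):
        if is_speech:
            heard_speech = True
            silent_ms = 0  # any speech resets the silence timer
        elif heard_speech:
            silent_ms += frame_ms
            if silent_ms >= silence_duration_ms:
                return (i + 1) * frame_ms
    return None


# 200 ms of speech, a 100 ms pause, 200 ms of speech, then silence.
flags = [True] * 4 + [False] * 2 + [True] * 4 + [False] * 10
print(turn_end_ms(flags, silence_duration_ms=250))  # 750
print(turn_end_ms(flags, silence_duration_ms=100))  # 300
```

With a 250 ms threshold the short pause is tolerated and the turn ends 250 ms after the last word. With a 100 ms threshold the turn ends during the pause, which responds sooner but cuts the speaker off mid-sentence.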

## VLM (Vision Language Model)

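Use Gemini 3 vision models for multimodal interactions with video frames. The VLM buffers video frames, converts them to JPEG, and sends them alongside text prompts. Unlike `gemini.Realtime`, it is not speech-to-speech, so the example below pairs it with separate STT and TTS plugins.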

```python
from vision_agents.core import Agent, User
from vision_agents.plugins import gemini, getstream, deepgram, elevenlabs

agent = Agent(
    edge=getstream.Edge(),
    agent_user=User(name="Vision Agent", id="vision-agent"),
    instructions="Describe what you see in one sentence.",
    llm=gemini.VLM(model="gemini-3-flash-preview"),
    stt=deepgram.STT(),
    tts=elevenlabs.TTS(),
)
```

| Name                   | Type              | Default                    | Description                                    |
| ---------------------- | ----------------- | -------------------------- | ---------------------------------------------- |
| `model`                | `str`             | `"gemini-3-flash-preview"` | Gemini vision model                            |
| `fps`                  | `int`             | `1`                        | Video frames per second to capture             |
| `frame_buffer_seconds` | `int`             | `10`                       | Seconds of video to buffer for model input     |
| `thinking_level`       | `ThinkingLevel`   | `None`                     | Thinking level for enhanced reasoning          |
| `media_resolution`     | `MediaResolution` | `None`                     | Resolution for multimodal processing           |
| `api_key`              | `str`             | `None`                     | API key (defaults to `GOOGLE_API_KEY` env var) |

## Next Steps

<CardGroup cols={2}>
  <Card title="Gemini LLM" icon="brain" href="/integrations/llm/gemini">
    LLM with built-in tools and RAG
  </Card>

  <Card title="Build a Video Agent" icon="video" href="/introduction/video-agents">
    Add video processing
  </Card>
</CardGroup>
