The local plugin replaces the cloud edge with your machine’s microphone, speakers, and camera. Useful for local development, desktop apps, and demos where you don’t want to round-trip through a real-time transport.
No Stream account is required for the local edge, but you still need API keys for whichever LLM, STT, and TTS plugins you use.

Installation

uv add "vision-agents[local]"
The plugin uses sounddevice for audio I/O and PyAV for video. On some Linux systems you may need to install portaudio separately.
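Before wiring up an agent, it can help to confirm that the audio backend actually loads. The probe below is a minimal sketch, not part of the plugin: `audio_backend_status` is a hypothetical helper name, and it assumes that `sounddevice` raises `OSError` on import when it cannot locate the PortAudio library.

```python
def audio_backend_status() -> str:
    """Return a short status string describing whether sounddevice is usable.

    Hypothetical helper for illustration; not provided by vision-agents.
    """
    try:
        # Importing sounddevice loads the PortAudio shared library.
        import sounddevice as sd
    except ImportError:
        return "sounddevice-missing"   # the Python package is not installed
    except OSError:
        return "portaudio-missing"     # assumed failure mode when PortAudio is absent
    except Exception:
        return "init-failed"           # PortAudio was found but could not initialize

    try:
        devices = sd.query_devices()   # one entry per audio device
    except Exception:
        return "query-failed"
    return "ok" if len(devices) > 0 else "no-devices"


if __name__ == "__main__":
    print(audio_backend_status())
```

A result of `portaudio-missing` suggests installing PortAudio through your distribution's package manager before retrying.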

Quick Start

from vision_agents.core import Agent, User
from vision_agents.plugins import deepgram, gemini
from vision_agents.plugins.local import LocalEdge
from vision_agents.plugins.local.devices import (
    select_audio_input_device,
    select_audio_output_device,
    select_video_device,
)

input_device = select_audio_input_device()
output_device = select_audio_output_device()
video_device = select_video_device()

agent = Agent(
    edge=LocalEdge(
        audio_input=input_device,
        audio_output=output_device,
        video_input=video_device,
    ),
    agent_user=User(name="Local AI", id="local-agent"),
    instructions="Keep responses short and conversational.",
    llm=gemini.LLM("gemini-3-flash-preview"),
    tts=deepgram.TTS(),
    stt=deepgram.STT(),
)
The select_* helpers prompt interactively in the terminal. For headless use, instantiate AudioInputDevice, AudioOutputDevice, and CameraDevice directly with a known device index.
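For headless use you first need to know which device index to pass. One way to find it, sketched below, is to query `sounddevice` directly, since the plugin uses it for audio I/O. The helper names `indices_with_channels` and `list_audio_devices` are hypothetical, and the constructor arguments of `AudioInputDevice` and `AudioOutputDevice` are not shown here because this page does not document them.

```python
from typing import Iterable, Mapping


def indices_with_channels(devices: Iterable[Mapping], key: str) -> list[tuple[int, str]]:
    """Return (index, name) for each device with at least one channel under `key`.

    `key` is "max_input_channels" or "max_output_channels", the field names
    used in sounddevice's device dictionaries.
    """
    return [(i, d["name"]) for i, d in enumerate(devices) if d.get(key, 0) > 0]


def list_audio_devices() -> dict:
    """Group the machine's audio devices into inputs and outputs by index."""
    import sounddevice as sd  # imported lazily so the filter above works without it

    # The position in this list is the device index sounddevice expects.
    devices = list(sd.query_devices())
    return {
        "inputs": indices_with_channels(devices, "max_input_channels"),
        "outputs": indices_with_channels(devices, "max_output_channels"),
    }


if __name__ == "__main__":
    for kind, found in list_audio_devices().items():
        for index, name in found:
            print(f"{kind[:-1]} {index}: {name}")
```

Run this once on the target machine, note the indices you want, and use them when constructing the device classes.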

Parameters

| Name | Type | Default | Description |
|---|---|---|---|
| audio_input | AudioInputDevice | | Microphone for capturing user audio. |
| audio_output | AudioOutputDevice | | Speaker for playing agent audio. |
| video_input | CameraDevice | None | Camera for capturing user video. None disables video. |
| video_width | int | 640 | Output video width in pixels. |
| video_height | int | 480 | Output video height in pixels. |
| video_fps | int | 30 | Output video frame rate. |

When video_input is set, agent video is rendered locally in a tkinter window. Subclass the device classes (AudioInputDevice, AudioOutputDevice, CameraDevice) to swap in alternative backends (e.g. GStreamer).
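The video parameters can be collected in one place and passed to `LocalEdge` as keyword arguments. The sketch below uses only the parameter names and defaults listed above. `local_edge_options` and `make_local_edge` are hypothetical helpers, and the positivity check is an added assumption rather than documented plugin behavior.

```python
def local_edge_options(camera=None, width: int = 640, height: int = 480, fps: int = 30) -> dict:
    """Build the video-related keyword arguments for LocalEdge.

    Leaving `camera` as None disables video, matching the documented default.
    """
    if width <= 0 or height <= 0 or fps <= 0:
        # Assumed sanity check; the plugin's own validation is not documented here.
        raise ValueError("video_width, video_height and video_fps must be positive")
    return {
        "video_input": camera,
        "video_width": width,
        "video_height": height,
        "video_fps": fps,
    }


def make_local_edge(audio_in, audio_out, **video):
    """Create a LocalEdge from two audio devices plus optional video settings."""
    # Imported lazily so the options helper can be used without the plugin installed.
    from vision_agents.plugins.local import LocalEdge

    return LocalEdge(audio_input=audio_in, audio_output=audio_out, **local_edge_options(**video))
```

For example, `make_local_edge(input_device, output_device)` would give an audio-only edge, while passing `camera=video_device, width=1280, height=720` would request 720p output.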

Next Steps

Build a Voice Agent: Get started with voice

Build a Video Agent: Add video processing