Vision Agents requires a Stream account for real-time transport. Most providers offer free tiers to get started.
Installation
Detection (Cloud)
| Name | Type | Default | Description |
|---|---|---|---|
| detect_objects | str or List[str] | "person" | Objects to detect (zero-shot) |
| conf_threshold | float | 0.3 | Confidence threshold |
| fps | int | 30 | Frame processing rate |
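To make the three parameters concrete, here is a minimal sketch of how they could be interpreted. It does not call the Vision Agents API: `normalize_labels`, `filter_detections`, `frame_interval`, and the shape of the detection dictionaries are hypothetical names chosen for illustration, and the plugin's actual behavior (for example, whether the threshold comparison is inclusive) may differ.

```python
from typing import Dict, List, Union


def normalize_labels(detect_objects: Union[str, List[str]] = "person") -> List[str]:
    """Accept a single label or a list of labels and always return a list."""
    if isinstance(detect_objects, str):
        return [detect_objects]
    return list(detect_objects)


def filter_detections(detections: List[Dict], conf_threshold: float = 0.3) -> List[Dict]:
    """Keep detections whose confidence is at or above the threshold (assumed inclusive)."""
    return [d for d in detections if d["confidence"] >= conf_threshold]


def frame_interval(fps: int = 30) -> float:
    """Seconds between processed frames for a given processing rate."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return 1.0 / fps
```

Raising `conf_threshold` trades recall for fewer false positives, and lowering `fps` reduces how often frames are sent for detection.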
Detection (Local)
Runs on-device without API calls. Requires HF_TOKEN for model access.
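Because local models depend on `HF_TOKEN`, it can help to fail early with a clear message when the variable is missing. The following is a small pre-flight sketch using only the standard library; `require_hf_token` is a hypothetical helper, not part of Vision Agents.

```python
import os
from typing import Mapping, Optional


def require_hf_token(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the HF_TOKEN value, or raise a clear error if it is unset or blank."""
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN", "").strip()
    if not token:
        raise RuntimeError("HF_TOKEN is not set; local models need it for model access.")
    return token
```

Calling `require_hf_token()` at startup surfaces a missing token before any model download is attempted.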
VLM (Cloud)
Visual question answering or automatic captioning.
| Name | Type | Default | Description |
|---|---|---|---|
| mode | str | "vqa" | Mode ("vqa" or "caption") |
VLM (Local)
Cloud vs Local
| | Cloud | Local |
|---|---|---|
| Use when | Simple setup, no infrastructure management | Higher throughput, own GPU infrastructure |
| Pros | No model download, no GPU required, automatic updates | No rate limits, no API costs, full control |
| Cons | Requires API key, 2 RPS rate limit (can be increased) | Requires GPU for best performance |
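If you use the cloud option, one way to stay under the 2 RPS limit is to space requests on the client side. The sketch below is a generic throttle, not a Vision Agents feature: the `Throttle` class is hypothetical, and it assumes that spacing calls at least 0.5 seconds apart satisfies the limit, which depends on how the service counts requests.

```python
import time
from typing import Callable, Optional


class Throttle:
    """Space calls so that consecutive calls start at least 1/rate seconds apart."""

    def __init__(
        self,
        rate: float = 2.0,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        if rate <= 0:
            raise ValueError("rate must be positive")
        self._min_gap = 1.0 / rate
        self._clock = clock
        self._sleep = sleep
        self._next_allowed: Optional[float] = None

    def wait(self) -> float:
        """Block until the next call is allowed; return the seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._next_allowed is not None and now < self._next_allowed:
            delay = self._next_allowed - now
            self._sleep(delay)
            now = self._next_allowed
        self._next_allowed = now + self._min_gap
        return delay
```

Call `throttle.wait()` immediately before each cloud request. The `clock` and `sleep` parameters exist only so the timing can be replaced in tests.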
Local models require HF_TOKEN for Hugging Face authentication. CUDA is recommended for best performance.
Next Steps
Build a Voice Agent
Get started with voice
Build a Video Agent
Add video processing

