Moondream

Moondream is a vision AI model that provides real-time, zero-shot object detection on video streams. With the Moondream 3 model, you can detect any object by describing it in natural language, with no training or fine-tuning required. The Moondream plugin in the Vision Agents SDK provides two object detection processors: a cloud-hosted API version and a local on-device version, so you can choose based on your deployment needs.

Installation

Install the Moondream plugin with uv:

uv add vision-agents-plugins-moondream

Choosing the Right Processor

CloudDetectionProcessor (Recommended for Most Users)

Use when: You want a simple setup with no infrastructure management
Pros: No model download, no GPU required, automatic updates
Cons: Requires API key, 2 RPS rate limit by default (can be increased)
Best for: Development, testing, low-to-medium volume applications

LocalDetectionProcessor (For Advanced Users)

Use when: You need higher throughput, have your own GPU infrastructure, or want to avoid rate limits
Pros: No rate limits, no API costs, full control over hardware
Cons: Requires GPU for best performance, model download on first use, infrastructure management
Best for: Production deployments, high-volume applications, custom infrastructure

Quick Start

Using CloudDetectionProcessor (Hosted)

The CloudDetectionProcessor uses Moondream’s hosted API. By default it has a 2 RPS (requests per second) rate limit and requires an API key. The rate limit can be adjusted by contacting the Moondream team.

from vision_agents.plugins import moondream
from vision_agents.core import Agent, User
from vision_agents.plugins import gemini, getstream

# Create a cloud processor with detection
processor = moondream.CloudDetectionProcessor(
    api_key="your-api-key",  # or set MOONDREAM_API_KEY env var
    detect_objects="person",  # or ["person", "car", "dog"] for multiple
    fps=30
)

# Use in an agent
agent = Agent(
    edge=getstream.Edge(),
    agent_user=User(name="Vision Assistant"),
    instructions="You are a helpful vision assistant.",
    llm=gemini.Realtime(fps=10),
    processors=[processor],
)

To initialize without passing in the API key, make sure the MOONDREAM_API_KEY is available as an environment variable. You can do this either by defining it in a .env file or exporting it directly in your terminal.
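
The lookup order described above (explicit argument first, then the environment) can be sketched as follows. This is a minimal illustration and not the plugin's actual implementation: `resolve_api_key` is a hypothetical helper, and only the `MOONDREAM_API_KEY` variable name comes from this page.

```python
import os
from typing import Optional


def resolve_api_key(api_key: Optional[str] = None) -> str:
    """Return the explicit key if given, else fall back to MOONDREAM_API_KEY.

    Hypothetical helper for illustration; the plugin performs its own lookup.
    """
    key = api_key or os.environ.get("MOONDREAM_API_KEY")
    if not key:
        # Fail early with an actionable message instead of at the first request.
        raise RuntimeError(
            "No Moondream API key found: pass api_key=... or set MOONDREAM_API_KEY"
        )
    return key
```

Calling a check like this before constructing the processor surfaces a missing key at startup. If the key lives in a .env file, that file may need to be loaded into the process environment first, depending on your setup.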

Using LocalDetectionProcessor (On-Device)

If you are running on your own infrastructure or using a service like DigitalOcean’s Gradient AI GPUs, you can use the LocalDetectionProcessor, which downloads the model from Hugging Face and runs it on-device.

The moondream3-preview model is gated and requires Hugging Face authentication:

Request access at https://huggingface.co/moondream/moondream3-preview
Set the HF_TOKEN environment variable: export HF_TOKEN=your_token_here
Or log in with the CLI instead: huggingface-cli login

from vision_agents.plugins import moondream
from vision_agents.core import Agent, User
from vision_agents.plugins import gemini, getstream

# Create a local processor (no API key needed)
processor = moondream.LocalDetectionProcessor(
    detect_objects=["person", "car", "dog"],
    conf_threshold=0.3,
    device="cuda",  # or "mps" / "cpu"; omit to auto-detect (CUDA, then MPS, then CPU)
    fps=30
)

# Use in an agent
agent = Agent(
    edge=getstream.Edge(),
    agent_user=User(name="Vision Assistant"),
    instructions="You are a helpful vision assistant.",
    llm=gemini.Realtime(fps=10),
    processors=[processor],
)

Detect Multiple Objects

Both processors support zero-shot detection of multiple object types simultaneously:

# Detect multiple object types with zero-shot detection
processor = moondream.CloudDetectionProcessor(
    api_key="your-api-key",
    detect_objects=["person", "car", "dog", "basketball", "laptop"],
    conf_threshold=0.3
)
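
To make the effect of `conf_threshold` concrete, here is a minimal sketch of threshold filtering. The detection shape (dicts with `label` and `confidence` keys), the `filter_detections` helper, and the inclusive comparison are all assumptions for illustration. The processors apply the threshold internally and their exact behavior may differ.

```python
from typing import Dict, List, Union

Detection = Dict[str, Union[str, float]]


def filter_detections(
    detections: List[Detection], conf_threshold: float = 0.3
) -> List[Detection]:
    """Keep detections whose confidence is at least conf_threshold (assumed inclusive)."""
    return [d for d in detections if float(d["confidence"]) >= conf_threshold]


# Hypothetical detections for a single frame.
sample = [
    {"label": "person", "confidence": 0.91},
    {"label": "dog", "confidence": 0.28},
    {"label": "laptop", "confidence": 0.30},
]
kept = filter_detections(sample, conf_threshold=0.3)
print([d["label"] for d in kept])  # ['person', 'laptop']
```

Raising the threshold trades recall for precision: fewer spurious boxes, but more missed objects.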

Configuration

CloudDetectionProcessor Parameters

| Name | Type | Default | Description |
| --- | --- | --- | --- |
| `api_key` | `str` or `None` | `None` | API key for Moondream Cloud API. If not provided, will attempt to read from `MOONDREAM_API_KEY` environment variable. |
| `detect_objects` | `str` or `List[str]` | `"person"` | Object(s) to detect using zero-shot detection. Can be any object name like “person”, “car”, “basketball”. |
| `conf_threshold` | `float` | `0.3` | Confidence threshold for detections. |
| `fps` | `int` | `30` | Frame processing rate. |
| `interval` | `int` | `0` | Processing interval in seconds. |
| `max_workers` | `int` | `10` | Thread pool size for CPU-intensive operations. |

By default, the Moondream Cloud API has a 2 RPS (requests per second) rate limit. Contact the Moondream team to request a higher limit.
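
To see how a 2 RPS limit relates to a 30 fps stream, the sketch below throttles requests by enforcing a minimum gap between them. `RequestThrottle` is a hypothetical class written for this illustration. It does not describe how the plugin itself schedules requests.

```python
import time
from typing import Callable, Optional


class RequestThrottle:
    """Allow at most max_rps calls per second by enforcing a minimum gap between sends."""

    def __init__(
        self, max_rps: float = 2.0, clock: Callable[[], float] = time.monotonic
    ) -> None:
        self._min_gap = 1.0 / max_rps
        self._clock = clock
        self._last_sent: Optional[float] = None  # no request sent yet

    def should_send(self) -> bool:
        """Return True (and record the send) if enough time has passed since the last send."""
        now = self._clock()
        if self._last_sent is None or now - self._last_sent >= self._min_gap:
            self._last_sent = now
            return True
        return False


# Simulate one second of a 30 fps stream with a fake clock.
fake_now = [0.0]
throttle = RequestThrottle(max_rps=2.0, clock=lambda: fake_now[0])
sent = 0
for frame_index in range(30):
    fake_now[0] = frame_index / 30
    if throttle.should_send():
        sent += 1
print(sent)  # 2
```

Under these assumptions only frames 0 and 15 are sent, so most frames in a 30 fps stream never reach the API at the default limit.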

LocalDetectionProcessor Parameters

| Name | Type | Default | Description |
| --- | --- | --- | --- |
| `detect_objects` | `str` or `List[str]` | `"person"` | Object(s) to detect using zero-shot detection. Can be any object name like “person”, “car”, “basketball”. |
| `conf_threshold` | `float` | `0.3` | Confidence threshold for detections. |
| `fps` | `int` | `30` | Frame processing rate. |
| `interval` | `int` | `0` | Processing interval in seconds. |
| `max_workers` | `int` | `10` | Thread pool size for CPU-intensive operations. |
| `device` | `str` or `None` | `None` | Device to run inference on (`"cuda"`, `"mps"`, or `"cpu"`). If not set, auto-detects CUDA, then MPS (Apple Silicon), then falls back to CPU. |
| `model_name` | `str` | `"moondream/moondream3-preview"` | Hugging Face model identifier. |
| `options` | `AgentOptions` or `None` | `None` | Model directory configuration. If not provided, the default is used, in which the model directory is `tempfile.gettempdir()`. |

Performance will vary depending on your hardware configuration. CUDA is recommended for best performance on NVIDIA GPUs. The model is downloaded from Hugging Face on first use.
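
The auto-detection order for `device` (CUDA, then MPS, then CPU) can be sketched as a small pure function. `pick_device` is a hypothetical helper, not part of the plugin. Using PyTorch to probe the hardware is also an assumption, so the availability flags are passed in as plain booleans.

```python
def pick_device(cuda_available: bool, mps_available: bool) -> str:
    """Return the preferred device name: CUDA first, then MPS, then CPU."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


# With PyTorch installed, the flags could come from (assumption, not confirmed by this page):
#   torch.cuda.is_available() and torch.backends.mps.is_available()
print(pick_device(cuda_available=False, mps_available=True))  # mps
```

Passing `device` explicitly skips this kind of probing and pins inference to a known device.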

Video Publishing

Both processors publish annotated video frames with bounding boxes drawn on detected objects:

processor = moondream.CloudDetectionProcessor(
    api_key="your-api-key",
    detect_objects=["person", "car"]
)

# The track will show:
# - Green bounding boxes around detected objects
# - Labels with confidence scores
# - Real-time annotation overlay

The annotated video is automatically sent to your realtime LLM, enabling it to understand what objects are present in the scene and their locations.
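
The processors draw the overlay for you. The sketch below only illustrates the arithmetic behind it, assuming detections arrive as normalized boxes with `x_min`, `y_min`, `x_max`, and `y_max` values in [0, 1]. That box format, the label format, and both helper names are assumptions made for illustration.

```python
from typing import Dict, Tuple


def to_pixel_box(
    box: Dict[str, float], frame_width: int, frame_height: int
) -> Tuple[int, int, int, int]:
    """Scale a normalized box to integer pixel corners, clamped to the frame."""

    def clamp(value: int, upper: int) -> int:
        return max(0, min(upper, value))

    x1 = clamp(round(box["x_min"] * frame_width), frame_width - 1)
    y1 = clamp(round(box["y_min"] * frame_height), frame_height - 1)
    x2 = clamp(round(box["x_max"] * frame_width), frame_width - 1)
    y2 = clamp(round(box["y_max"] * frame_height), frame_height - 1)
    return x1, y1, x2, y2


def format_label(label: str, confidence: float) -> str:
    """Build overlay text such as 'person 0.87' (hypothetical format)."""
    return f"{label} {confidence:.2f}"


box = {"x_min": 0.25, "y_min": 0.5, "x_max": 0.75, "y_max": 1.0}
print(to_pixel_box(box, frame_width=640, frame_height=480))  # (160, 240, 480, 479)
print(format_label("person", 0.87))  # person 0.87
```

Clamping keeps a box that touches the frame edge inside the valid pixel range, which matters when the corners are later passed to a drawing routine.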

Use Cases

The Moondream plugin enables a wide range of computer vision applications:

Retail Analytics: Track customer movement and product interactions
Security & Surveillance: Detect specific objects or people in real-time
Sports Analysis: Track players, balls, and equipment
Warehouse Management: Monitor inventory and equipment
Accessibility: Describe surroundings for visually impaired users
Smart Home: Detect pets, packages, or specific objects

Example: Multi-Object Detection

from vision_agents.plugins import moondream, gemini, getstream
from vision_agents.core import Agent, User

# Create processor that detects multiple objects
processor = moondream.CloudDetectionProcessor(
    api_key="your-api-key",
    detect_objects=["person", "dog", "cat", "car", "bicycle"],
    conf_threshold=0.4,
    fps=15
)

# Create agent with vision capabilities
agent = Agent(
    edge=getstream.Edge(),
    agent_user=User(name="Vision Assistant"),
    instructions="""You are a helpful vision assistant. 
    Describe what you see in the video, including the objects detected 
    and their approximate locations.""",
    llm=gemini.Realtime(fps=10),
    processors=[processor],
)

# Start the agent (await must run inside an async function)
await agent.start()
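
In a regular Python script, `await` is only valid inside a coroutine, so the start call needs an async entry point. The sketch below shows that pattern with a stand-in `StubAgent` so it runs without the SDK. The stub and the `main` function are hypothetical, and in real code you would build the `Agent` from the example above.

```python
import asyncio


class StubAgent:
    """Stand-in for the Agent built above, so this sketch runs without the SDK."""

    def __init__(self) -> None:
        self.started = False

    async def start(self) -> None:
        self.started = True


async def main() -> StubAgent:
    agent = StubAgent()  # replace with the real Agent(...) from the example
    await agent.start()
    return agent


if __name__ == "__main__":
    asyncio.run(main())
```

`asyncio.run` creates the event loop, runs `main` to completion, and closes the loop afterwards.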
