Moondream is a vision AI model that provides real-time, zero-shot object detection on video streams. The Moondream 3 model lets you detect any object simply by describing it in natural language, with no training or fine-tuning required. The Moondream plugin for the Vision Agents SDK provides two object detection processors: a cloud-hosted API version and a local on-device version, giving you flexibility based on your deployment needs.

Installation

Install the Moondream plugin with:
uv add vision-agents-plugins-moondream
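If you manage dependencies with pip instead of uv, the equivalent command (assuming the package is published under the same name on PyPI) is:

```shell
# Install the Moondream plugin into the active environment
pip install vision-agents-plugins-moondream
```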

Choosing the Right Processor

CloudDetectionProcessor (Recommended for Most Users)

  • Use when: You want a simple setup with no infrastructure management
  • Pros: No model download, no GPU required, automatic updates
  • Cons: Requires API key, 2 RPS rate limit by default (can be increased)
  • Best for: Development, testing, low-to-medium volume applications

LocalDetectionProcessor (For Advanced Users)

  • Use when: You need higher throughput, have your own GPU infrastructure, or want to avoid rate limits
  • Pros: No rate limits, no API costs, full control over hardware
  • Cons: Requires GPU for best performance, model download on first use, infrastructure management
  • Best for: Production deployments, high-volume applications, custom infrastructure

Quick Start

Using CloudDetectionProcessor (Hosted)

The CloudDetectionProcessor uses Moondream’s hosted API. By default it has a 2 RPS (requests per second) rate limit and requires an API key. The rate limit can be adjusted by contacting the Moondream team.
from vision_agents.plugins import moondream
from vision_agents.core import Agent, User
from vision_agents.plugins import gemini, getstream

# Create a cloud processor with detection
processor = moondream.CloudDetectionProcessor(
    api_key="your-api-key",  # or set MOONDREAM_API_KEY env var
    detect_objects="person",  # or ["person", "car", "dog"] for multiple
    fps=30
)

# Use in an agent
agent = Agent(
    edge=getstream.Edge(),
    agent_user=User(name="Vision Assistant"),
    instructions="You are a helpful vision assistant.",
    llm=gemini.Realtime(fps=10),
    processors=[processor],
)
To initialize without passing in the API key, make sure MOONDREAM_API_KEY is available as an environment variable, either by defining it in a .env file or exporting it in your terminal.
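For example, you can export the key in your shell session or add it to a .env file (the key value below is a placeholder):

```shell
# Option 1: export for the current shell session
export MOONDREAM_API_KEY="your-api-key"

# Option 2: persist it in a .env file in your project root
echo 'MOONDREAM_API_KEY=your-api-key' >> .env
```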

Using LocalDetectionProcessor (On-Device)

If you are running on your own infrastructure or using a service like DigitalOcean's Gradient AI GPUs, you can use the LocalDetectionProcessor, which downloads the model from HuggingFace and runs it on device.
The moondream3-preview model is gated and requires HuggingFace authentication:
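You can authenticate using standard Hugging Face tooling, for example the CLI login or the HF_TOKEN environment variable (the token below is a placeholder):

```shell
# Log in interactively with the Hugging Face CLI
huggingface-cli login

# Or set a token for non-interactive environments (CI, servers)
export HF_TOKEN="hf_your_token_here"
```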
from vision_agents.plugins import moondream
from vision_agents.core import Agent, User
from vision_agents.plugins import gemini, getstream

# Create a local processor (no API key needed)
processor = moondream.LocalDetectionProcessor(
    detect_objects=["person", "car", "dog"],
    conf_threshold=0.3,
    device="cuda",  # Auto-detects CUDA, MPS, or CPU
    fps=30
)

# Use in an agent
agent = Agent(
    edge=getstream.Edge(),
    agent_user=User(name="Vision Assistant"),
    instructions="You are a helpful vision assistant.",
    llm=gemini.Realtime(fps=10),
    processors=[processor],
)

Detect Multiple Objects

Both processors support zero-shot detection of multiple object types simultaneously:
# Detect multiple object types with zero-shot detection
processor = moondream.CloudDetectionProcessor(
    api_key="your-api-key",
    detect_objects=["person", "car", "dog", "basketball", "laptop"],
    conf_threshold=0.3
)

Configuration

CloudDetectionProcessor Parameters

| Name | Type | Default | Description |
| --- | --- | --- | --- |
| api_key | str or None | None | API key for the Moondream Cloud API. If not provided, will attempt to read from the MOONDREAM_API_KEY environment variable. |
| detect_objects | str or List[str] | "person" | Object(s) to detect using zero-shot detection. Can be any object name, such as "person", "car", or "basketball". |
| conf_threshold | float | 0.3 | Confidence threshold for detections. |
| fps | int | 30 | Frame processing rate. |
| interval | int | 0 | Processing interval in seconds. |
| max_workers | int | 10 | Thread pool size for CPU-intensive operations. |
By default, the Moondream Cloud API has a 2 RPS (requests per second) rate limit. Contact the Moondream team to request a higher limit.
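If you cannot raise the limit, you can reduce how often frames are sent by tuning the documented fps and interval parameters; a sketch (the parameter values here are illustrative, not recommendations):

```python
from vision_agents.plugins import moondream

# Process roughly one frame per second instead of every frame,
# keeping request volume well under the default 2 RPS limit
processor = moondream.CloudDetectionProcessor(
    api_key="your-api-key",
    detect_objects="person",
    interval=1,  # seconds between processed frames
)
```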

LocalDetectionProcessor Parameters

| Name | Type | Default | Description |
| --- | --- | --- | --- |
| detect_objects | str or List[str] | "person" | Object(s) to detect using zero-shot detection. Can be any object name, such as "person", "car", or "basketball". |
| conf_threshold | float | 0.3 | Confidence threshold for detections. |
| fps | int | 30 | Frame processing rate. |
| interval | int | 0 | Processing interval in seconds. |
| max_workers | int | 10 | Thread pool size for CPU-intensive operations. |
| device | str or None | None | Device to run inference on ('cuda', 'mps', or 'cpu'). Auto-detects CUDA, then MPS (Apple Silicon), then falls back to CPU. |
| model_name | str | "moondream/moondream3-preview" | Hugging Face model identifier. |
| options | AgentOptions or None | None | Model directory configuration. If not provided, defaults to tempfile.gettempdir(). |
Performance will vary depending on your hardware configuration. CUDA is recommended for best performance on NVIDIA GPUs. The model will be downloaded from HuggingFace on first use.
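The auto-detection order described above (CUDA, then MPS, then CPU) can be sketched as a small helper. This illustrates the selection logic only; it is not the plugin's actual implementation:

```python
def pick_device(cuda_available: bool, mps_available: bool) -> str:
    """Mirror the documented auto-detection order: CUDA, then MPS, then CPU."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"

# In practice the availability flags would come from PyTorch, e.g.
# torch.cuda.is_available() and torch.backends.mps.is_available()
```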

Video Publishing

Both processors publish annotated video frames with bounding boxes drawn on detected objects:
processor = moondream.CloudDetectionProcessor(
    api_key="your-api-key",
    detect_objects=["person", "car"]
)

# The track will show:
# - Green bounding boxes around detected objects
# - Labels with confidence scores
# - Real-time annotation overlay
The annotated video is automatically sent to your realtime LLM, enabling it to understand what objects are present in the scene and their locations.
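If you work with raw detections yourself, note that Moondream's detect endpoint returns bounding boxes as normalized (0 to 1) coordinates. Converting them to pixel coordinates for drawing might look like the sketch below; the detection dict shape is an assumption based on Moondream's API, not a guarantee about this plugin's output:

```python
def to_pixel_box(det: dict, width: int, height: int) -> tuple:
    """Scale a normalized (0-1) bounding box to integer pixel coordinates."""
    return (
        int(det["x_min"] * width),
        int(det["y_min"] * height),
        int(det["x_max"] * width),
        int(det["y_max"] * height),
    )

# Example: a detection covering the left half of a 640x480 frame
box = to_pixel_box({"x_min": 0.0, "y_min": 0.0, "x_max": 0.5, "y_max": 1.0}, 640, 480)
# box == (0, 0, 320, 480)
```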

Use Cases

The Moondream plugin enables a wide range of computer vision applications:
  • Retail Analytics: Track customer movement and product interactions
  • Security & Surveillance: Detect specific objects or people in real-time
  • Sports Analysis: Track players, balls, and equipment
  • Warehouse Management: Monitor inventory and equipment
  • Accessibility: Describe surroundings for visually impaired users
  • Smart Home: Detect pets, packages, or specific objects

Example: Multi-Object Detection

from vision_agents.plugins import moondream, gemini, getstream
from vision_agents.core import Agent, User

# Create processor that detects multiple objects
processor = moondream.CloudDetectionProcessor(
    api_key="your-api-key",
    detect_objects=["person", "dog", "cat", "car", "bicycle"],
    conf_threshold=0.4,
    fps=15
)

# Create agent with vision capabilities
agent = Agent(
    edge=getstream.Edge(),
    agent_user=User(name="Vision Assistant"),
    instructions="""You are a helpful vision assistant. 
    Describe what you see in the video, including the objects detected 
    and their approximate locations.""",
    llm=gemini.Realtime(fps=10),
    processors=[processor],
)

# Start the agent (run inside an async context, e.g. via asyncio.run)
await agent.start()