NVIDIA provides vision language models through its Chat Completions API with NVCF (NVIDIA Cloud Functions) asset management. The NVIDIA plugin in the Vision Agents SDK enables real-time video understanding with models like Cosmos Reason2, and provides:
  • Video understanding: Automatically buffers and forwards video frames to NVIDIA VLM models
  • Streaming responses: Real-time text responses with chunk events
  • Asset management: Automatic upload and cleanup of frame assets via NVCF

Installation

Install the NVIDIA plugin with:
uv add vision-agents[nvidia]
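
If you are not using uv, the equivalent pip command should also work (quote the extra so your shell does not expand the brackets):
pip install "vision-agents[nvidia]"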

Configuration

Set your NVIDIA API key:
export NVIDIA_API_KEY=your_nvidia_api_key
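
Alternatively, pass the key explicitly through the api_key parameter (documented in the Parameters table below); a minimal sketch:
import os
from vision_agents.plugins import nvidia

# Explicit key instead of relying on the NVIDIA_API_KEY environment variable.
llm = nvidia.VLM(api_key=os.environ["NVIDIA_API_KEY"])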

Usage

from vision_agents.plugins import nvidia, getstream, deepgram, elevenlabs
from vision_agents.core import Agent, User

llm = nvidia.VLM(
    model="nvidia/cosmos-reason2-8b",
    fps=1,                    # buffer one frame per second
    frame_buffer_seconds=10,  # keep the most recent 10 seconds of frames
)

agent = Agent(
    edge=getstream.Edge(),
    agent_user=User(name="NVIDIA Video Assistant", id="agent"),
    instructions="You're a helpful video AI assistant. Analyze the video frames and respond to user questions about what you see.",
    llm=llm,
    tts=elevenlabs.TTS(),
    stt=deepgram.STT(),
)
The VLM automatically buffers video frames and includes them when responding to user questions via STT transcripts.
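
To make the buffering concrete: with fps=1 and frame_buffer_seconds=10, at most the 10 most recent frames are attached to a request. The sketch below is illustrative only, not the plugin's internal implementation:
from collections import deque

# Illustrative ring buffer -- the plugin manages this internally.
fps = 1
frame_buffer_seconds = 10
frames = deque(maxlen=fps * frame_buffer_seconds)

for i in range(25):              # pretend 25 frames arrive over 25 seconds
    frames.append(f"frame-{i}")  # oldest frames are evicted once full

print(len(frames))  # 10 -- only the most recent 10 seconds survive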

Parameters

| Name | Type | Default | Description |
| --- | --- | --- | --- |
| model | str | "nvidia/cosmos-reason2-8b" | NVIDIA model ID to use. |
| api_key | Optional[str] | None | NVIDIA API token. If not provided, read from the NVIDIA_API_KEY environment variable. |
| fps | int | 1 | Number of video frames per second to buffer. |
| frame_buffer_seconds | int | 10 | Number of seconds of video to buffer for the model's input. |
| frame_width | int | 800 | Width of video frames to send. |
| frame_height | int | 600 | Height of video frames to send. |
| max_tokens | int | 1024 | Maximum response tokens. |
| temperature | float | 0.2 | Sampling temperature. |
| top_p | float | 0.7 | Top-p sampling parameter. |
| frames_per_second | int | 8 | Frames per second sent to video models. |
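
As a sketch of how these parameters fit together (values are arbitrary; only parameters from the table above are used):
from vision_agents.plugins import nvidia

llm = nvidia.VLM(
    model="nvidia/cosmos-reason2-8b",
    fps=2,                   # buffer two frames per second
    frame_buffer_seconds=5,  # keep the last 5 seconds (10 frames total)
    frame_width=640,         # downscale frames before upload
    frame_height=480,
    max_tokens=512,
    temperature=0.2,
)
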
Generate a response to text input with video context (using the llm instance created above):
response = await llm.simple_response("What do you see?")
print(response.text)

Events

The NVIDIA VLM plugin emits events during conversations:
from vision_agents.core.llm.events import (
    LLMResponseChunkEvent,
    LLMResponseCompletedEvent,
)
from vision_agents.plugins.nvidia.events import LLMErrorEvent

@agent.llm.events.subscribe
async def on_chunk(event: LLMResponseChunkEvent):
    print(f"Chunk: {event.delta}")

@agent.llm.events.subscribe
async def on_complete(event: LLMResponseCompletedEvent):
    print(f"Response: {event.text}")

@agent.llm.events.subscribe
async def on_error(event: LLMErrorEvent):
    print(f"Error: {event.error_message}")

Example

Check out the NVIDIA example for a complete implementation using NVIDIA VLM with Deepgram STT, ElevenLabs TTS, and Stream for real-time communication.