HuggingFace Inference is a platform that provides access to thousands of models through a unified API. It routes requests to multiple providers (Together AI, Groq, Cerebras, Replicate, Fireworks), so you can switch backends without changing code. It supports both text-only LLMs and vision language models (VLMs).
Vision Agents requires a Stream account for real-time transport. Most providers offer free tiers to get started.

Installation

uv add "vision-agents[huggingface]"

LLM

Text-only language model with streaming and function calling.
from vision_agents.core import Agent, User
from vision_agents.plugins import huggingface, getstream, deepgram

agent = Agent(
    edge=getstream.Edge(),
    agent_user=User(name="Assistant", id="agent"),
    instructions="You are a helpful assistant.",
    llm=huggingface.LLM(
        model="meta-llama/Meta-Llama-3-8B-Instruct",
        provider="fastest"
    ),
    stt=deepgram.STT(),
    tts=deepgram.TTS(),
)

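# Register a Python function that the model can invoke through function calling.
@agent.llm.register_function(description="Get weather for a location")
async def get_weather(location: str) -> dict:
    # Placeholder result; replace with a real weather lookup.
    return {"temperature": "72°F", "condition": "Sunny"}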
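
To make the function-calling flow above more concrete, here is a minimal, stand-alone sketch of a decorator-based registry. `ToolRegistry`, `describe`, and `call` are hypothetical names invented for illustration; this is not the plugin's implementation, only one plausible way a `register_function` decorator could record a function, derive a parameter description from its signature, and dispatch a call requested by the model.

```python
import asyncio
import inspect
from typing import Any, Awaitable, Callable, Dict, List


class ToolRegistry:
    """Hypothetical stand-in for an LLM function registry (not the plugin's code)."""

    def __init__(self) -> None:
        self._tools: Dict[str, Dict[str, Any]] = {}

    def register_function(self, description: str) -> Callable:
        def decorator(fn: Callable[..., Awaitable[Any]]) -> Callable[..., Awaitable[Any]]:
            # Derive a simple {parameter: type name} mapping from the signature.
            params = {
                name: getattr(p.annotation, "__name__", str(p.annotation))
                for name, p in inspect.signature(fn).parameters.items()
            }
            self._tools[fn.__name__] = {
                "fn": fn,
                "description": description,
                "parameters": params,
            }
            return fn  # Leave the original function usable as-is.

        return decorator

    def describe(self) -> List[Dict[str, Any]]:
        # The kind of summary that could be sent to a model as its tool list.
        return [
            {"name": name, "description": t["description"], "parameters": t["parameters"]}
            for name, t in self._tools.items()
        ]

    async def call(self, name: str, arguments: Dict[str, Any]) -> Any:
        # Dispatch a tool call by name with keyword arguments.
        return await self._tools[name]["fn"](**arguments)


registry = ToolRegistry()


@registry.register_function(description="Get weather for a location")
async def get_weather(location: str) -> dict:
    return {"temperature": "72°F", "condition": "Sunny"}


result = asyncio.run(registry.call("get_weather", {"location": "Boulder"}))
```

In the real agent, the LLM plugin handles registration and dispatch for you; the sketch only shows why a description and typed parameters are useful to supply.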
| Name | Type | Default | Description |
| --- | --- | --- | --- |
| model | str | - | HuggingFace model ID |
| provider | str | None | Provider ("together", "groq", "fastest", "cheapest") |
| api_key | str | None | API key (defaults to HF_TOKEN env var) |
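
The `api_key` fallback described above can be sketched as follows. `resolve_api_key` is a hypothetical helper written for illustration, and raising an error when no key is found is an assumption; the plugin's actual lookup and error handling may differ.

```python
import os
from typing import Optional


def resolve_api_key(api_key: Optional[str] = None) -> str:
    """Hypothetical helper: an explicit key wins, otherwise fall back to HF_TOKEN."""
    key = api_key or os.environ.get("HF_TOKEN")
    if not key:
        # Assumed behavior for this sketch: fail early with a clear message.
        raise RuntimeError("Set the HF_TOKEN environment variable or pass api_key")
    return key
```

In practice this means you can usually export `HF_TOKEN` once in your environment and omit `api_key` from the constructor.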

VLM

Vision language model with automatic video frame buffering. Supports models like Qwen2-VL.
from vision_agents.core import Agent, User
from vision_agents.plugins import huggingface, getstream, deepgram

agent = Agent(
    edge=getstream.Edge(),
    agent_user=User(name="Assistant", id="agent"),
    instructions="You are a visual assistant.",
    llm=huggingface.VLM(
        model="Qwen/Qwen2-VL-7B-Instruct",
        fps=1,
        frame_buffer_seconds=10,
    ),
    stt=deepgram.STT(),
    tts=deepgram.TTS(),
)
| Name | Type | Default | Description |
| --- | --- | --- | --- |
| model | str | - | HuggingFace VLM model ID |
| fps | int | 1 | Video frames per second to buffer |
| frame_buffer_seconds | int | 10 | Seconds of video to buffer |
| provider | str | None | Inference provider |
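
To show how `fps` and `frame_buffer_seconds` interact, here is a small sketch of a rolling frame buffer. It assumes the buffer holds at most `fps * frame_buffer_seconds` frames and evicts the oldest frame first; the source does not describe the plugin's internal buffering, so treat this as an illustration of the arithmetic rather than the actual implementation.

```python
from collections import deque

fps = 1                    # frames sampled per second (default from the table)
frame_buffer_seconds = 10  # seconds of video to keep (default from the table)

# Assumed capacity: 1 frame/s * 10 s = 10 frames.
capacity = fps * frame_buffer_seconds

# A deque with maxlen drops the oldest item once it is full.
buffer = deque(maxlen=capacity)

# Simulate 25 seconds of video at 1 fps; integers stand in for frames.
for frame_index in range(25):
    buffer.append(frame_index)

# Only the most recent 10 frames (indices 15 through 24) remain.
```

Under this assumption, raising either value increases how many frames are kept for each request, which likely affects latency and cost.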

Next Steps

Build a Voice Agent

Get started with voice

Build a Video Agent

Add video processing