HuggingFace Inference

HuggingFace Inference is an inference platform that provides access to thousands of models through a unified API. It routes requests to multiple providers (Together AI, Groq, Cerebras, Replicate, Fireworks), so you can switch backends without changing code. The plugin supports both text-only LLMs and vision language models (VLMs).
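Routing across providers can be pictured as choosing one backend per request under a policy. The LLM's `provider` parameter accepts names like `"groq"` as well as policies like `"fastest"` and `"cheapest"`. As a rough illustration only, the sketch below ranks candidate providers by a latency or price figure. The `ProviderStats` class, the `pick_provider` function, and every number in it are hypothetical placeholders, not HuggingFace's actual routing logic or real provider metrics.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProviderStats:
    """Hypothetical per-provider metrics a router might track."""
    name: str
    latency_ms: float              # illustrative time to first token
    usd_per_million_tokens: float  # illustrative price


def pick_provider(candidates: list, policy: str) -> str:
    """Resolve a policy ("fastest"/"cheapest") or pass an explicit name through."""
    if policy == "fastest":
        return min(candidates, key=lambda p: p.latency_ms).name
    if policy == "cheapest":
        return min(candidates, key=lambda p: p.usd_per_million_tokens).name
    if policy not in {p.name for p in candidates}:
        raise ValueError(f"unknown provider or policy: {policy!r}")
    return policy


# Made-up numbers, for illustration only.
candidates = [
    ProviderStats("together", latency_ms=300.0, usd_per_million_tokens=0.20),
    ProviderStats("groq", latency_ms=120.0, usd_per_million_tokens=0.50),
    ProviderStats("cerebras", latency_ms=150.0, usd_per_million_tokens=0.10),
]
fastest = pick_provider(candidates, "fastest")    # "groq"
cheapest = pick_provider(candidates, "cheapest")  # "cerebras"
```

The point of the design is that calling code only names a policy; which backend serves the request can change without touching that code.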

Vision Agents requires a Stream account for real-time transport. Most providers offer free tiers to get started.

For local on-device inference using open-weight models, see HuggingFace Transformers.

Installation

uv add "vision-agents[huggingface]"

LLM

Text-only language model with streaming and function calling.

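from vision_agents.core import Agent, User
from vision_agents.plugins import huggingface, getstream, deepgram

agent = Agent(
    edge=getstream.Edge(),  # Stream handles real-time transport
    agent_user=User(name="Assistant", id="agent"),
    instructions="You are a helpful assistant.",
    llm=huggingface.LLM(
        model="meta-llama/Meta-Llama-3-8B-Instruct",
        provider="fastest",  # route to the fastest available provider
    ),
    stt=deepgram.STT(),  # speech-to-text for the user's audio
    tts=deepgram.TTS(),  # text-to-speech for the agent's replies
)

# Register a tool that the LLM can call during the conversation.
@agent.llm.register_function(description="Get weather for a location")
async def get_weather(location: str) -> dict:
    # Placeholder: returns static data; replace with a real weather lookup.
    return {"temperature": "72°F", "condition": "Sunny"}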
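To make the function-calling flow concrete, here is a toy, self-contained registry: a decorator records each async tool with its description, and a dispatcher runs a tool call given a name and JSON-encoded arguments, the shape in which LLMs typically emit tool calls. This is an illustrative stand-in, not Vision Agents' implementation. The `TOOLS` dict, `register_function` here, `tool_schemas`, and `dispatch` are all hypothetical names, and a real plugin would most likely generate a full JSON schema with parameter types.

```python
import asyncio
import inspect
import json
from typing import Any, Awaitable, Callable, Dict, Tuple

# Hypothetical registry: tool name -> (description, coroutine function).
TOOLS: Dict[str, Tuple[str, Callable[..., Awaitable[Any]]]] = {}


def register_function(description: str):
    """Decorator that records an async function as a callable tool."""
    def decorator(fn):
        TOOLS[fn.__name__] = (description, fn)
        return fn  # the function itself is left unchanged
    return decorator


def tool_schemas() -> list:
    """Describe each tool by name, description, and parameter names."""
    return [
        {"name": name, "description": desc,
         "parameters": list(inspect.signature(fn).parameters)}
        for name, (desc, fn) in TOOLS.items()
    ]


async def dispatch(name: str, arguments_json: str) -> Any:
    """Run a tool call emitted by the model: a name plus JSON-encoded args."""
    _, fn = TOOLS[name]
    return await fn(**json.loads(arguments_json))


@register_function(description="Get weather for a location")
async def get_weather(location: str) -> dict:
    return {"temperature": "72°F", "condition": "Sunny"}


result = asyncio.run(dispatch("get_weather", '{"location": "Boston"}'))
# result == {"temperature": "72°F", "condition": "Sunny"}
```

The schemas tell the model which tools exist; the dispatcher is what runs when the model decides to call one.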

Name	Type	Default	Description
`model`	`str`	—	HuggingFace model ID
`provider`	`str`	`None`	Inference provider (`"together"`, `"groq"`) or routing policy (`"fastest"`, `"cheapest"`)
`api_key`	`str`	`None`	API key (defaults to `HF_TOKEN` env var)
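The `api_key` fallback described in the table amounts to "explicit argument first, then the `HF_TOKEN` environment variable." A minimal sketch of that resolution order follows. `resolve_hf_token` is a hypothetical helper written for illustration, and raising an error when neither is set is an assumption about sensible behavior, not documented plugin behavior.

```python
import os
from typing import Optional


def resolve_hf_token(api_key: Optional[str] = None) -> str:
    """Prefer an explicit key; otherwise fall back to the HF_TOKEN env var."""
    token = api_key or os.environ.get("HF_TOKEN")
    if not token:
        raise RuntimeError("No API key: pass api_key=... or set HF_TOKEN")
    return token


# An explicitly passed key always wins over the environment.
explicit = resolve_hf_token("hf_explicit_key")
```

In practice this means you can export `HF_TOKEN` once in your shell and omit `api_key` from code entirely.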

VLM

Vision language model with automatic video frame buffering and function calling. Supports models like Qwen2-VL.

from vision_agents.core import Agent, User
from vision_agents.plugins import huggingface, getstream, deepgram

agent = Agent(
    edge=getstream.Edge(),
    agent_user=User(name="Assistant", id="agent"),
    instructions="You are a visual assistant.",
    llm=huggingface.VLM(
        model="Qwen/Qwen2-VL-7B-Instruct",
        fps=1,  # buffer one video frame per second
        frame_buffer_seconds=10,  # keep the most recent 10 seconds of video
    ),
    stt=deepgram.STT(),
    tts=deepgram.TTS(),
)

Name	Type	Default	Description
`model`	`str`	—	HuggingFace VLM model ID
`fps`	`int`	`1`	Video frames per second to buffer
`frame_buffer_seconds`	`int`	`10`	Seconds of video to buffer
`provider`	`str`	`None`	Inference provider
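One way to picture how `fps` and `frame_buffer_seconds` interact: if the buffer holds `fps × frame_buffer_seconds` frames, the defaults give 1 × 10 = 10 frames, and older frames drop off as new ones arrive. The sketch below assumes exactly that sizing and oldest-first eviction via `collections.deque(maxlen=...)`. `FrameBuffer` is a hypothetical class, strings stand in for real frames, and the plugin's actual buffering may differ.

```python
from collections import deque


class FrameBuffer:
    """Hypothetical rolling window of the most recent fps * seconds frames."""

    def __init__(self, fps: int = 1, frame_buffer_seconds: int = 10):
        self.capacity = fps * frame_buffer_seconds
        self._frames = deque(maxlen=self.capacity)

    def add(self, frame) -> None:
        # Appending beyond capacity silently evicts the oldest frame.
        self._frames.append(frame)

    def snapshot(self) -> list:
        """Frames that would accompany the next VLM request, oldest first."""
        return list(self._frames)


buf = FrameBuffer(fps=1, frame_buffer_seconds=10)
for i in range(15):  # 15 seconds of video at 1 frame per second
    buf.add(f"frame-{i}")
# buf.snapshot() now holds frame-5 through frame-14 (10 frames)
```

Raising either parameter gives the model more visual context per request, at the cost of more image data sent to the provider.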

Next Steps

Build a Voice Agent

Build a Video Agent
