HuggingFace Inference provides access to thousands of models through a unified API. The HuggingFace plugin in the Vision Agents SDK supports multiple inference providers, including Together AI, Groq, Cerebras, Replicate, and Fireworks, and offers two integrations:
  1. HuggingFace LLM - Text-only language model integration with streaming responses, function calling, and multi-provider support.
  2. HuggingFace VLM - Vision language model integration with automatic video frame buffering for real-time video understanding.
These integrations are ideal for building conversational agents, visual assistants, and other AI-powered applications with open-source models such as Llama and Qwen.

Installation

Install the Stream HuggingFace plugin with:
uv add vision-agents[huggingface]

Configuration

Set your HuggingFace API token:
export HF_TOKEN=your_huggingface_token
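
You can also pass the token directly instead of relying on the environment variable, using the api_key parameter documented below. A minimal sketch, assuming the token comes from a secrets source of your choice:

import os

from vision_agents.plugins import huggingface

# Pass the token explicitly; otherwise the plugin falls back to HF_TOKEN.
# MY_SECRET_HF_TOKEN is a hypothetical variable name used for illustration.
llm = huggingface.LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    api_key=os.environ["MY_SECRET_HF_TOKEN"],
)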

HuggingFace LLM

The HuggingFace LLM plugin provides text-only language model integration with streaming responses and function calling support.

Usage

from vision_agents.plugins import huggingface, getstream, deepgram
from vision_agents.core import Agent, User

agent = Agent(
    edge=getstream.Edge(),
    agent_user=User(name="AI Assistant", id="agent"),
    instructions="You are a helpful voice assistant. Keep replies short and conversational.",
    llm=huggingface.LLM(
        model="meta-llama/Meta-Llama-3-8B-Instruct",
        provider="fastest"
    ),
    stt=deepgram.STT(),
    tts=deepgram.TTS(),
)

Parameters

Name     | Type                           | Default | Description
model    | str                            | -       | The HuggingFace model ID to use (e.g., "meta-llama/Meta-Llama-3-8B-Instruct").
api_key  | Optional[str]                  | None    | HuggingFace API token. If not provided, reads from the HF_TOKEN environment variable.
provider | Optional[str]                  | None    | Inference provider (e.g., "together", "groq", "fastest", "cheapest"). Auto-selects based on your HuggingFace settings if omitted.
client   | Optional[AsyncInferenceClient] | None    | Custom AsyncInferenceClient instance for dependency injection.

Methods

simple_response(text, processors, participant)

Generate a response to text input:
response = await llm.simple_response("Hello, how are you?")
print(response.text)

create_response(messages, input, stream)

Create a response with full control over the request:
response = await llm.create_response(
    messages=[
        {"role": "system", "content": "You are helpful."},
        {"role": "user", "content": "What's the weather?"}
    ],
    stream=True
)
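When stream=True, partial output is surfaced incrementally; a typical way to consume it is by subscribing to LLMResponseChunkEvent, as shown in the Events section below.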

Function calling

You can register functions that the model can call:
from vision_agents.plugins import huggingface

llm = huggingface.LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")

@llm.register_function()
def get_weather(city: str) -> str:
    """Get the current weather for a city."""
    return f"The weather in {city} is sunny."

response = await llm.simple_response("What's the weather in Paris?")
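
Additional tools can be registered in the same way. The sketch below adds a hypothetical second function; it assumes only the register_function decorator, type hints, and docstring pattern shown above:

@llm.register_function()
def convert_temperature(celsius: float) -> str:
    """Convert a temperature from Celsius to Fahrenheit."""
    # Hypothetical example tool, for illustration only.
    return f"{celsius * 9 / 5 + 32:.1f}°F"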

Supported providers

HuggingFace’s Inference Providers API supports multiple backends. You can specify a provider explicitly or let HuggingFace auto-select based on your account preferences:
# Auto-select provider
llm = huggingface.LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")

# Select fastest provider
llm = huggingface.LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    provider="fastest"
)

# Select cheapest provider
llm = huggingface.LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    provider="cheapest"
)

# Specify a provider explicitly
llm = huggingface.LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    provider="groq"
)
Available providers include:
  • Together AI
  • Groq
  • Cerebras
  • Replicate
  • Fireworks

HuggingFace VLM

The HuggingFace VLM plugin provides vision language model integration with automatic video frame buffering for real-time video understanding.

Usage

from vision_agents.plugins import huggingface, getstream, deepgram
from vision_agents.core import Agent, User

agent = Agent(
    edge=getstream.Edge(),
    agent_user=User(name="AI Assistant", id="agent"),
    instructions="You are a helpful visual assistant.",
    llm=huggingface.VLM(
        model="Qwen/Qwen2-VL-7B-Instruct",
        fps=1,
        frame_buffer_seconds=10,
    ),
    stt=deepgram.STT(),
    tts=deepgram.TTS(),
)

Parameters

Name                 | Type                           | Default | Description
model                | str                            | -       | The HuggingFace model ID to use (e.g., "Qwen/Qwen2-VL-7B-Instruct").
api_key              | Optional[str]                  | None    | HuggingFace API token. If not provided, reads from the HF_TOKEN environment variable.
provider             | Optional[str]                  | None    | Inference provider. Auto-selects based on your HuggingFace settings if omitted.
fps                  | int                            | 1       | Number of video frames per second to buffer.
frame_buffer_seconds | int                            | 10      | Number of seconds of video to buffer for the model’s input.
client               | Optional[AsyncInferenceClient] | None    | Custom AsyncInferenceClient instance for dependency injection.
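
Together, fps and frame_buffer_seconds control how much visual context the model receives: roughly fps * frame_buffer_seconds recent frames are kept in the buffer (an assumption based on the parameter descriptions above, so about 10 frames with the defaults). A minimal sketch trading a shorter window for a higher frame rate:

from vision_agents.plugins import huggingface

# Buffer roughly 2 * 5 = 10 of the most recent frames, covering the last
# 5 seconds of video at 2 frames per second.
vlm = huggingface.VLM(
    model="Qwen/Qwen2-VL-7B-Instruct",
    fps=2,
    frame_buffer_seconds=5,
)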

Methods

simple_response(text, processors, participant)

Generate a response to text input with video context:
response = await vlm.simple_response("What do you see?")
print(response.text)

watch_video_track(track, shared_forwarder)

Set up video forwarding and start buffering video frames:
await vlm.watch_video_track(video_track)

Events

Both LLM and VLM plugins emit events during conversations:
from vision_agents.core.llm.events import (
    LLMResponseChunkEvent,
    LLMResponseCompletedEvent,
)
from vision_agents.plugins.huggingface.events import LLMErrorEvent

@agent.llm.events.subscribe
async def on_chunk(event: LLMResponseChunkEvent):
    print(f"Chunk: {event.delta}")

@agent.llm.events.subscribe
async def on_complete(event: LLMResponseCompletedEvent):
    print(f"Response: {event.text}")

@agent.llm.events.subscribe
async def on_error(event: LLMErrorEvent):
    print(f"Error: {event.error_message}")

Example

Check out the HuggingFace example for a complete implementation using HuggingFace with Deepgram STT/TTS and Stream for real-time communication.