HuggingFace Inference provides access to thousands of models through a unified API. The HuggingFace plugin in the Vision Agents SDK supports multiple inference providers, including Together AI, Groq, Cerebras, Replicate, and Fireworks, and offers two integrations:
  1. HuggingFace LLM - Text-only language model integration with streaming responses, function calling, and multi-provider support.
  2. HuggingFace VLM - Vision language model integration with automatic video frame buffering for real-time video understanding.
These integrations are ideal for building conversational agents, visual assistants, and other AI-powered applications with open-source models such as Llama and Qwen.

Installation

Install the Stream HuggingFace plugin with:
uv add vision-agents[huggingface]

Configuration

Set your HuggingFace API token:
export HF_TOKEN=your_huggingface_token
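
You can also pass the token directly instead of relying on the environment variable, using the api_key parameter documented below. A minimal sketch, assuming the token comes from a secrets source of your choice:

import os

from vision_agents.plugins import huggingface

# Pass the token explicitly; otherwise the plugin falls back to HF_TOKEN.
# MY_SECRET_HF_TOKEN is a hypothetical variable name used for illustration.
llm = huggingface.LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    api_key=os.environ["MY_SECRET_HF_TOKEN"],
)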

HuggingFace LLM

The HuggingFace LLM plugin provides text-only language model integration with streaming responses and function calling support.

Usage

from vision_agents.plugins import huggingface, getstream, deepgram
from vision_agents.core import Agent, User

agent = Agent(
    edge=getstream.Edge(),
    agent_user=User(name="AI Assistant", id="agent"),
    instructions="You are a helpful voice assistant. Keep replies short and conversational.",
    llm=huggingface.LLM(
        model="meta-llama/Meta-Llama-3-8B-Instruct",
        provider="fastest"
    ),
    stt=deepgram.STT(),
    tts=deepgram.TTS(),
)

Parameters

Name     | Type                           | Default | Description
model    | str                            | -       | The HuggingFace model ID to use (e.g., "meta-llama/Meta-Llama-3-8B-Instruct").
api_key  | Optional[str]                  | None    | HuggingFace API token. If not provided, reads from the HF_TOKEN environment variable.
provider | Optional[str]                  | None    | Inference provider (e.g., "together", "groq", "fastest", "cheapest"). Auto-selects based on your HuggingFace settings if omitted.
client   | Optional[AsyncInferenceClient] | None    | Custom AsyncInferenceClient instance for dependency injection.

Methods

simple_response(text, processors, participant)

Generate a response to text input:
response = await llm.simple_response("Hello, how are you?")
print(response.text)

create_response(messages, input, stream)

Create a response with full control over the request:
response = await llm.create_response(
    messages=[
        {"role": "system", "content": "You are helpful."},
        {"role": "user", "content": "What's the weather?"}
    ],
    stream=True
)
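When stream=True, partial output is surfaced incrementally; a typical way to consume it is by subscribing to LLMResponseChunkEvent, as shown in the Events section below.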

Function calling

You can register functions that the model can call:
from vision_agents.plugins import huggingface

llm = huggingface.LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")

@llm.register_function()
def get_weather(city: str) -> str:
    """Get the current weather for a city."""
    return f"The weather in {city} is sunny."

response = await llm.simple_response("What's the weather in Paris?")
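
Additional tools can be registered in the same way. The sketch below adds a hypothetical second function; it assumes only the register_function decorator, type hints, and docstring pattern shown above:

@llm.register_function()
def convert_temperature(celsius: float) -> str:
    """Convert a temperature from Celsius to Fahrenheit."""
    # Hypothetical example tool, for illustration only.
    return f"{celsius * 9 / 5 + 32:.1f}°F"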

Supported providers

HuggingFace’s Inference Providers API supports multiple backends. You can specify a provider explicitly or let HuggingFace auto-select based on your account preferences:
# Auto-select provider
llm = huggingface.LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")

# Select fastest provider
llm = huggingface.LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    provider="fastest"
)

# Select cheapest provider
llm = huggingface.LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    provider="cheapest"
)

# Specify a provider explicitly
llm = huggingface.LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    provider="groq"
)
Available providers include:
  • Together AI
  • Groq
  • Cerebras
  • Replicate
  • Fireworks

HuggingFace VLM

The HuggingFace VLM plugin provides vision language model integration with automatic video frame buffering for real-time video understanding.

Usage

from vision_agents.plugins import huggingface, getstream, deepgram
from vision_agents.core import Agent, User

agent = Agent(
    edge=getstream.Edge(),
    agent_user=User(name="AI Assistant", id="agent"),
    instructions="You are a helpful visual assistant.",
    llm=huggingface.VLM(
        model="Qwen/Qwen2-VL-7B-Instruct",
        fps=1,
        frame_buffer_seconds=10,
    ),
    stt=deepgram.STT(),
    tts=deepgram.TTS(),
)

Parameters

Name                 | Type                           | Default | Description
model                | str                            | -       | The HuggingFace model ID to use (e.g., "Qwen/Qwen2-VL-7B-Instruct").
api_key              | Optional[str]                  | None    | HuggingFace API token. If not provided, reads from the HF_TOKEN environment variable.
provider             | Optional[str]                  | None    | Inference provider. Auto-selects based on your HuggingFace settings if omitted.
fps                  | int                            | 1       | Number of video frames per second to buffer.
frame_buffer_seconds | int                            | 10      | Number of seconds of video to buffer for the model’s input.
client               | Optional[AsyncInferenceClient] | None    | Custom AsyncInferenceClient instance for dependency injection.
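
Together, fps and frame_buffer_seconds control how much visual context the model receives: roughly fps * frame_buffer_seconds recent frames are kept in the buffer (an assumption based on the parameter descriptions above, so about 10 frames with the defaults). A minimal sketch trading a shorter window for a higher frame rate:

from vision_agents.plugins import huggingface

# Buffer roughly 2 * 5 = 10 of the most recent frames, covering the last
# 5 seconds of video at 2 frames per second.
vlm = huggingface.VLM(
    model="Qwen/Qwen2-VL-7B-Instruct",
    fps=2,
    frame_buffer_seconds=5,
)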

Methods

simple_response(text, processors, participant)

Generate a response to text input with video context:
response = await vlm.simple_response("What do you see?")
print(response.text)

watch_video_track(track, shared_forwarder)

Set up video forwarding and start buffering video frames:
await vlm.watch_video_track(video_track)

Events

Both LLM and VLM plugins emit events during conversations:
from vision_agents.core.llm.events import (
    LLMResponseChunkEvent,
    LLMResponseCompletedEvent,
)
from vision_agents.plugins.huggingface.events import LLMErrorEvent

@agent.llm.events.subscribe
async def on_chunk(event: LLMResponseChunkEvent):
    print(f"Chunk: {event.delta}")

@agent.llm.events.subscribe
async def on_complete(event: LLMResponseCompletedEvent):
    print(f"Response: {event.text}")

@agent.llm.events.subscribe
async def on_error(event: LLMErrorEvent):
    print(f"Error: {event.error_message}")

Example

Check out the HuggingFace example for a complete implementation using HuggingFace with Deepgram STT/TTS and Stream for real-time communication.