# HuggingFace Transformers

Run open-weight models locally on your own hardware using [HuggingFace Transformers](https://huggingface.co/docs/transformers). The plugin supports text LLMs, vision-language models, and real-time object detection, with inference running on your machine rather than through a hosted model API.

<Info>
  Vision Agents requires a [Stream](https://getstream.io/try-for-free/) account for real-time transport. Some models on HuggingFace are gated and require a [HuggingFace account](https://huggingface.co/join) and access token (`HF_TOKEN`).
</Info>

<Tip>
  For cloud-based inference via HuggingFace's Inference Providers API (no GPU required), see [HuggingFace Inference](/integrations/infrastructure/huggingface).
</Tip>

## Installation

```sh theme={null}
# Local inference (LLM, VLM, object detection)
uv add "vision-agents-plugins-huggingface[transformers]"

# With 4-bit / 8-bit quantization support (BitsAndBytes)
uv add "vision-agents-plugins-huggingface[transformers-quantized]"
```
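
Gated models need `HF_TOKEN` in the environment before their weights can be downloaded. The check below is a minimal sketch for failing early; `hf_token_configured` is a hypothetical helper, not part of the plugin.

```python
import os
from typing import Mapping


def hf_token_configured(env: Mapping[str, str] = os.environ) -> bool:
    """Return True if a non-empty HF_TOKEN is present in the environment mapping."""
    return bool(env.get("HF_TOKEN", "").strip())


if not hf_token_configured():
    # Public models still load; only gated repositories need the token.
    print("HF_TOKEN is not set; gated models may fail to download.")
```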

## Local LLM

Run text language models locally with streaming and function calling.

```python theme={null}
from vision_agents.core import Agent, User
from vision_agents.plugins import huggingface, getstream, deepgram

agent = Agent(
    edge=getstream.Edge(),
    agent_user=User(name="Assistant", id="agent"),
    instructions="You are a helpful assistant.",
    llm=huggingface.TransformersLLM(
        model="google/gemma-4-E2B-it",
    ),
    stt=deepgram.STT(),
    tts=deepgram.TTS(),
)
```

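### Function Calling

Register Python functions as tools with the `register_function` decorator so the model can call them while generating a response.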

```python theme={null}
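from vision_agents.plugins import huggingface

llm = huggingface.TransformersLLM(model="google/gemma-3-4b-it")

# Register get_weather as a tool the model can call.
@llm.register_function(description="Get current weather for a city")
async def get_weather(city: str) -> str:
    # Placeholder result; replace with a real weather lookup.
    return f"The weather in {city} is sunny."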
```

### Quantization

Reduce memory usage with 4-bit or 8-bit quantization. Requires the `[transformers-quantized]` extra.

```python theme={null}
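from vision_agents.plugins import huggingface

# Load the weights in 4-bit precision to reduce memory usage.
# This Llama model is gated on HuggingFace, so HF_TOKEN must be set.
llm = huggingface.TransformersLLM(
    model="meta-llama/Llama-3.2-3B-Instruct",
    quantization="4bit",
)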
```
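
As a rough guide to the savings, weight memory scales with the number of bits stored per parameter. The sketch below is back-of-the-envelope arithmetic only: it assumes `"none"` means 16-bit weights, uses a hypothetical 3-billion-parameter model, and ignores activations, the KV cache, and quantization overhead. `estimate_weight_memory_gb` is an illustrative helper, not a plugin function.

```python
# Bits stored per weight for each `quantization` value. "none" is assumed
# to mean 16-bit weights (float16 or bfloat16) in this sketch.
BITS_PER_WEIGHT = {"none": 16, "8bit": 8, "4bit": 4}


def estimate_weight_memory_gb(num_params: float, quantization: str = "none") -> float:
    """Approximate memory for the model weights alone, in GB (1 GB = 1e9 bytes)."""
    bits = BITS_PER_WEIGHT[quantization]
    return num_params * bits / 8 / 1e9


for mode in ("none", "8bit", "4bit"):
    # Hypothetical model with 3 billion parameters.
    print(f"{mode}: ~{estimate_weight_memory_gb(3e9, mode):.1f} GB")
```

Actual usage will be higher than these figures, so treat them as a lower bound when sizing a GPU.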

### LLM Parameters

| Name                | Type   | Default  | Description                                          |
| ------------------- | ------ | -------- | ---------------------------------------------------- |
| `model`             | `str`  | --       | HuggingFace model ID                                 |
| `device`            | `str`  | `"auto"` | `"auto"`, `"cuda"`, `"mps"`, or `"cpu"`              |
| `quantization`      | `str`  | `"none"` | `"none"`, `"4bit"`, or `"8bit"`                      |
| `torch_dtype`       | `str`  | `"auto"` | `"auto"`, `"float16"`, `"bfloat16"`, or `"float32"`  |
| `trust_remote_code` | `bool` | `False`  | Allow custom model code (needed for Qwen, Phi, etc.) |
| `max_new_tokens`    | `int`  | `512`    | Maximum tokens to generate per response              |
| `max_tool_rounds`   | `int`  | `3`      | Maximum tool-call rounds per response                |

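With `device="auto"`, the plugin picks the accelerator for you. The order it checks is not documented on this page; the sketch below assumes the common preference of CUDA, then Apple MPS, then CPU. `resolve_device` is a hypothetical helper used only to illustrate that fallback.

```python
def resolve_device(requested: str, cuda_available: bool, mps_available: bool) -> str:
    """Map a `device` setting to a concrete device name.

    Assumed preference for "auto": cuda, then mps, then cpu.
    """
    if requested != "auto":
        return requested  # explicit choices are passed through unchanged
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


# Example: a machine with Apple Silicon and no NVIDIA GPU.
print(resolve_device("auto", cuda_available=False, mps_available=True))
```

With PyTorch installed, the two flags could come from `torch.cuda.is_available()` and `torch.backends.mps.is_available()`.
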
## Local VLM

Run vision-language models that can see video frames from the call. Supports function calling.

```python theme={null}
from vision_agents.core import Agent, User
from vision_agents.plugins import huggingface, getstream, deepgram

agent = Agent(
    edge=getstream.Edge(),
    agent_user=User(name="Assistant", id="agent"),
    instructions="You are a visual assistant. Describe what you see.",
    llm=huggingface.TransformersVLM(
        model="Qwen/Qwen2-VL-2B-Instruct",
        fps=1,
        frame_buffer_seconds=10,
    ),
    stt=deepgram.STT(),
    tts=deepgram.TTS(),
)
```

### VLM Parameters

| Name                   | Type   | Default  | Description                                         |
| ---------------------- | ------ | -------- | --------------------------------------------------- |
| `model`                | `str`  | --       | HuggingFace model ID                                |
| `device`               | `str`  | `"auto"` | `"auto"`, `"cuda"`, `"mps"`, or `"cpu"`             |
| `quantization`         | `str`  | `"none"` | `"none"`, `"4bit"`, or `"8bit"`                     |
| `torch_dtype`          | `str`  | `"auto"` | `"auto"`, `"float16"`, `"bfloat16"`, or `"float32"` |
| `trust_remote_code`    | `bool` | `True`   | Allow custom model code                             |
| `fps`                  | `int`  | `1`      | Frames per second to capture from video             |
| `frame_buffer_seconds` | `int`  | `10`     | Seconds of video frames to buffer                   |
| `max_frames`           | `int`  | `4`      | Maximum frames sent per inference (evenly sampled)  |
| `max_new_tokens`       | `int`  | `512`    | Maximum tokens to generate per response             |
| `max_tool_rounds`      | `int`  | `3`      | Maximum tool-call rounds per response               |
| `do_sample`            | `bool` | `True`   | Use sampling for generation                         |

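The interaction between `fps`, `frame_buffer_seconds`, and `max_frames` can be sketched as follows. This is an illustration under stated assumptions, not the plugin's implementation: it assumes the buffer holds `fps * frame_buffer_seconds` frames and that "evenly sampled" means evenly spaced indices including the oldest and newest frame. `evenly_sample` is a hypothetical helper, and integers stand in for video frames.

```python
from collections import deque


def evenly_sample(frames: list, max_frames: int) -> list:
    """Pick at most `max_frames` items spread evenly across `frames`."""
    n = len(frames)
    if max_frames <= 0 or n == 0:
        return []
    if n <= max_frames:
        return list(frames)
    if max_frames == 1:
        return [frames[-1]]  # keep only the most recent frame
    step = (n - 1) / (max_frames - 1)
    return [frames[round(i * step)] for i in range(max_frames)]


fps, frame_buffer_seconds, max_frames = 1, 10, 4

# Assumed capacity: fps * frame_buffer_seconds frames, oldest dropped first.
buffer = deque(maxlen=fps * frame_buffer_seconds)
for frame_index in range(25):
    buffer.append(frame_index)

sampled = evenly_sample(list(buffer), max_frames)
print(sampled)
```

Under these assumptions, raising `fps` or `frame_buffer_seconds` widens the window the model can draw from, while `max_frames` bounds the cost of each inference.
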
## Object Detection

Run detection models like RT-DETRv2 on live video frames. Emits `DetectionCompletedEvent` with bounding boxes for each processed frame.

```python theme={null}
from vision_agents.core import Agent, User
from vision_agents.plugins import huggingface, getstream, deepgram

processor = huggingface.TransformersDetectionProcessor(
    model="PekingU/rtdetr_v2_r101vd",
    conf_threshold=0.5,
    fps=5,
)

agent = Agent(
    edge=getstream.Edge(),
    agent_user=User(name="Assistant", id="agent"),
    instructions="You are a visual assistant.",
    llm=...,  # placeholder: supply an LLM such as huggingface.TransformersLLM(...)
    stt=deepgram.STT(),
    tts=deepgram.TTS(),
    processors=[processor],
)

@agent.events.subscribe
async def on_detection(event: huggingface.DetectionCompletedEvent):
    for obj in event.objects:
        print(f"{obj['label']} ({obj['confidence']:.0%})")
```
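
Inside the handler, the per-frame detections can be condensed before they are logged or passed on. The sketch below assumes each entry in `event.objects` is a dict with a `label` key, as in the handler above; `summarize_detections` is a hypothetical helper, not part of the plugin.

```python
from collections import Counter


def summarize_detections(objects: list[dict]) -> str:
    """Build a short summary such as '2 person, 1 dog' from detection dicts."""
    counts = Counter(obj["label"] for obj in objects)
    # most_common() orders by count, highest first.
    return ", ".join(f"{count} {label}" for label, count in counts.most_common())


# Sample data shaped like the entries read in the handler above.
sample_objects = [
    {"label": "person", "confidence": 0.91},
    {"label": "dog", "confidence": 0.78},
    {"label": "person", "confidence": 0.66},
]
print(summarize_detections(sample_objects))
```

Calling `summarize_detections(event.objects)` in `on_detection` would produce one line per processed frame instead of one line per object.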

### Detection Parameters

| Name             | Type        | Default                      | Description                                        |
| ---------------- | ----------- | ---------------------------- | -------------------------------------------------- |
| `model`          | `str`       | `"PekingU/rtdetr_v2_r101vd"` | HuggingFace detection model ID                     |
| `conf_threshold` | `float`     | `0.5`                        | Confidence threshold (0 to 1)                      |
| `fps`            | `int`       | `10`                         | Frame processing rate                              |
| `classes`        | `list[str]` | `None`                       | Filter to specific class names (e.g. `["person"]`) |
| `device`         | `str`       | `"auto"`                     | `"auto"`, `"cuda"`, `"mps"`, or `"cpu"`            |
| `annotate`       | `bool`      | `True`                       | Draw bounding boxes on output video                |

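The effect of `conf_threshold` and `classes` can be pictured as a filter over raw detections. The sketch below is an assumption about the behavior, not the plugin's code: it keeps detections whose confidence is at or above the threshold (the plugin may use a strict comparison) and whose label is in `classes` when a list is given. `filter_detections` is a hypothetical helper.

```python
from typing import Optional


def filter_detections(
    objects: list[dict],
    conf_threshold: float = 0.5,
    classes: Optional[list[str]] = None,
) -> list[dict]:
    """Keep detections that pass the confidence and class filters."""
    allowed = set(classes) if classes is not None else None
    return [
        obj
        for obj in objects
        if obj["confidence"] >= conf_threshold
        and (allowed is None or obj["label"] in allowed)
    ]


raw = [
    {"label": "person", "confidence": 0.92},
    {"label": "person", "confidence": 0.31},
    {"label": "chair", "confidence": 0.77},
]
print(filter_detections(raw, conf_threshold=0.5, classes=["person"]))
```
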
## Next Steps

<CardGroup cols={2}>
  <Card title="HuggingFace Inference" icon="cloud" href="/integrations/infrastructure/huggingface">
    Cloud-based inference (no GPU needed)
  </Card>

  <Card title="Build a Voice Agent" icon="microphone" href="/introduction/voice-agents">
    Get started with voice
  </Card>

  <Card title="Build a Video Agent" icon="video" href="/introduction/video-agents">
    Add video processing
  </Card>

  <Card title="Video Processors" icon="eye" href="/guides/video-processors">
    Process video frames in real-time
  </Card>
</CardGroup>