Avatars consume the agent’s audio output and produce a synced video and audio feed of a virtual character. They run in passthrough mode: the avatar owns the agent’s outbound video and audio tracks, and its output never feeds back into the LLM or any video processors.
Class Hierarchy
The vision_agents.core.avatars module exports two classes:
| Class | Purpose |
|---|---|
| Avatar | Abstract base class; consumes the agent’s audio output and publishes synced video and audio. |
| AVSynchronizer | Utility that owns paired audio/video tracks and delays video frames to match the audio buffer depth, keeping lip-sync accurate. |

The built-in avatar implementations use AVSynchronizer for output, so it’s the recommended building block for custom avatars too.
Lifecycle
The agent drives the avatar through a fixed lifecycle:

- Agent.__init__ queries video_output() and calls attach_audio_input(stream), handing the avatar the inference flow’s audio output stream.
- Agent.join() calls await avatar.start(), which opens the provider connection and begins consuming the input stream.
- While running, the avatar drains input_audio_stream, forwards audio to the provider, and exposes lip-synced video via video_output() and audio via audio_output().
- Agent.close() calls await avatar.close() for teardown.

When an avatar is attached, the agent publishes avatar.audio_output() as outbound audio instead of the TTS stream directly: TTS still synthesises, and the avatar lip-syncs and republishes the result.
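
The call order in this lifecycle can be sketched with two toy classes. Both RecordingAvatar and ToyAgent are hypothetical stand-ins written for this illustration, not the library's Agent or Avatar; the use of an asyncio.Queue as the audio stream and the point at which audio_output() is queried are assumptions.

```python
import asyncio


class RecordingAvatar:
    """Hypothetical stand-in that records the lifecycle calls it receives."""

    def __init__(self):
        self.calls = []
        self.input_audio_stream = None

    def video_output(self):
        self.calls.append("video_output")
        return "video-track"  # placeholder for a real video track

    def audio_output(self):
        self.calls.append("audio_output")
        return "audio-stream"  # placeholder for a real audio stream

    def attach_audio_input(self, stream):
        self.calls.append("attach_audio_input")
        self.input_audio_stream = stream

    async def start(self):
        self.calls.append("start")

    async def close(self):
        self.calls.append("close")


class ToyAgent:
    """Hypothetical agent that mirrors the call order described above."""

    def __init__(self, avatar, tts_stream):
        self.avatar = avatar
        self.video_track = avatar.video_output()
        avatar.attach_audio_input(tts_stream)
        # Assumption: the avatar's audio is picked up here. The key point is
        # that it replaces the raw TTS stream as the published audio.
        self.outbound_audio = avatar.audio_output()

    async def join(self):
        await self.avatar.start()

    async def close(self):
        await self.avatar.close()


async def run_lifecycle():
    avatar = RecordingAvatar()
    agent = ToyAgent(avatar, tts_stream=asyncio.Queue())
    await agent.join()
    await agent.close()
    return avatar.calls
```

Running asyncio.run(run_lifecycle()) returns the recorded calls in the order the agent made them, which makes the handoff from construction to start() to close() easy to inspect.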
Abstract Methods
Subclasses must implement all four:

| Method | Description |
|---|---|
| video_output() | Return the outbound aiortc.VideoStreamTrack published to the call. |
| audio_output() | Return the outbound AudioOutputStream published to the call. |
| async start() | Open the provider connection and begin consuming input_audio_stream. |
| async close() | Tear down the provider connection and cancel any consumer tasks. |

Subclasses may additionally implement the interrupt() method to stop the in-flight utterance at the provider during barge-in.
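
As a rough illustration of what "must implement all four" means in practice, the sketch below models the contract with Python's standard abc module. AvatarContract, HalfDone, and Done are hypothetical names for this example only; the real Avatar class may enforce its contract differently, and the no-op interrupt() default is an assumption.

```python
from abc import ABC, abstractmethod


class AvatarContract(ABC):
    """Hypothetical stand-in for the Avatar base class, not the real one."""

    @abstractmethod
    def video_output(self):
        """Return the outbound video track."""

    @abstractmethod
    def audio_output(self):
        """Return the outbound audio stream."""

    @abstractmethod
    async def start(self):
        """Open the provider connection."""

    @abstractmethod
    async def close(self):
        """Tear down the provider connection."""

    def interrupt(self):
        """Optional hook; the no-op default is an assumption of this sketch."""


class HalfDone(AvatarContract):
    # Only two of the four required methods are implemented.
    def video_output(self):
        return "video"

    def audio_output(self):
        return "audio"


class Done(HalfDone):
    # Adds the remaining two, so the class becomes instantiable.
    async def start(self):
        pass

    async def close(self):
        pass


def can_instantiate(cls):
    """Return True if cls() succeeds, False if abc rejects it."""
    try:
        cls()
        return True
    except TypeError:
        return False
```

With abc, a subclass that leaves any abstract method unimplemented fails at instantiation time rather than later, when the agent first calls the missing method.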
Properties & Helpers
| Member | Description |
|---|---|
| provider_name | Class attribute identifying the provider (used in events and metrics). |
| events | EventManager for emitting avatar-specific events. |
| metrics | MetricsCollector for recording avatar metrics. |
| input_audio_stream | The agent’s audio output stream attached via attach_audio_input. Raises ValueError if accessed before attach. |
| attach_audio_input(stream) | Called by the agent to hand off its audio output stream. Override to customise how audio is consumed. |
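
The attach-before-access rule for input_audio_stream can be pictured as a guarded property. This is a minimal sketch under assumptions: AudioInputSketch is a hypothetical class, and the private attribute name and error message are invented for illustration rather than taken from the library.

```python
class AudioInputSketch:
    """Hypothetical sketch of the attach/access contract described above."""

    def __init__(self):
        self._input_audio_stream = None  # nothing attached yet

    def attach_audio_input(self, stream):
        # In the real flow the agent calls this during its own initialisation.
        self._input_audio_stream = stream

    @property
    def input_audio_stream(self):
        if self._input_audio_stream is None:
            raise ValueError("attach_audio_input() has not been called yet")
        return self._input_audio_stream
```

Failing loudly on early access surfaces ordering bugs (for example, a start() call issued before the agent has attached its stream) at the point where they happen.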
AVSynchronizer
AVSynchronizer is a utility class that solves the lip-sync problem: provider video and audio arrive on separate streams, and pushing them straight onto the outbound WebRTC tracks usually drifts. It owns a paired audio_output and video_output, delays each video frame by the current audio buffer depth, and paces frames at the configured fps (overriding aiortc’s hardcoded 30 fps).
| Member | Description |
|---|---|
| video_output | The QueuedVideoTrack to expose from Avatar.video_output(). |
| audio_output | The AudioOutputStream to expose from Avatar.audio_output(). |
| async write_video(frame) | Queue an av.VideoFrame from the provider, delayed by the current audio buffer depth. |
| async write_audio(pcm) | Write a PcmData chunk from the provider to the audio track. |
| async flush() | Discard pending video frames and flush buffered audio (use on interrupt). |
| close() | Close the underlying audio stream. |
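
The timing idea behind the synchronizer (delay each video frame by the audio buffer depth, and pace frames at the configured fps) can be expressed as a small scheduling function. This is not AVSynchronizer's implementation: AudioBufferModel, schedule_video_frame, and the 48 kHz sample rate are assumptions made for the sake of a worked example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AudioBufferModel:
    """Hypothetical model of an audio buffer that counts queued samples."""

    sample_rate: int = 48_000
    queued_samples: int = 0

    def depth_seconds(self) -> float:
        # Audio already queued will take this long to play out.
        return self.queued_samples / self.sample_rate


def schedule_video_frame(
    arrival_time: float,
    buffer: AudioBufferModel,
    last_release: Optional[float],
    fps: float,
) -> float:
    """Return the time at which a frame should be released.

    The frame is held back by the audio buffer depth so it lines up with the
    audio that arrived alongside it, and it is never released sooner than
    one frame interval (1 / fps) after the previous frame.
    """
    delayed = arrival_time + buffer.depth_seconds()
    if last_release is None:
        return delayed
    return max(delayed, last_release + 1.0 / fps)
```

For instance, with 24,000 samples queued at 48 kHz the buffer depth is 0.5 s, so a frame arriving at t = 1.0 s is released at t = 1.5 s; a second frame arriving 10 ms later at 25 fps is held until t = 1.54 s by the pacing rule.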
Building a Custom Avatar
A minimal subclass wraps an AVSynchronizer, exposes its tracks, and pumps provider frames into it from a consumer task started in start().
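
The shape of such a subclass is sketched below with stand-ins so that it runs without the library installed. StubSynchronizer mimics only the AVSynchronizer members listed above, EchoAvatar is a hypothetical avatar that would subclass Avatar in real code, and treating the input stream as an asyncio.Queue is an assumption. The "provider" here simply echoes audio and fabricates a frame label per chunk.

```python
import asyncio


class StubSynchronizer:
    """Hypothetical stand-in exposing the AVSynchronizer members listed above."""

    def __init__(self):
        self.video_output = []  # stands in for the video track
        self.audio_output = []  # stands in for the audio stream
        self.closed = False

    async def write_video(self, frame):
        self.video_output.append(frame)

    async def write_audio(self, pcm):
        self.audio_output.append(pcm)

    def close(self):
        self.closed = True


class EchoAvatar:
    """Hypothetical avatar; real code would subclass Avatar instead."""

    provider_name = "echo"

    def __init__(self):
        self._sync = StubSynchronizer()
        self._input = None
        self._task = None

    def attach_audio_input(self, stream):
        self._input = stream

    def video_output(self):
        return self._sync.video_output

    def audio_output(self):
        return self._sync.audio_output

    async def start(self):
        # Begin draining the input stream in a background consumer task.
        self._task = asyncio.create_task(self._consume())

    async def _consume(self):
        while True:
            pcm = await self._input.get()
            # A real provider would return rendered audio and video here.
            await self._sync.write_audio(pcm)
            await self._sync.write_video(f"frame-for-{pcm}")
            self._input.task_done()

    async def close(self):
        # Cancel the consumer task, then release the synchronizer.
        if self._task is not None:
            self._task.cancel()
            try:
                await self._task
            except asyncio.CancelledError:
                pass
        self._sync.close()


async def demo():
    avatar = EchoAvatar()
    stream = asyncio.Queue()
    avatar.attach_audio_input(stream)
    await avatar.start()
    for chunk in ("a", "b"):
        await stream.put(chunk)
    await stream.join()  # wait until the consumer has handled both chunks
    await avatar.close()
    return avatar.audio_output(), avatar.video_output()
```

Keeping all output behind the synchronizer means the avatar itself only has to translate provider messages into write_audio and write_video calls; the consumer task is cancelled in close() so nothing keeps reading from the stream after teardown.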
Usage
Pass an avatar to the agent at initialisation.

Available Implementations
LiveAvatar
Real-time interactive avatars by HeyGen with WebSocket lip-sync.
Anam
Anam’s avatar SDK with configurable dimensions and frame rate.
LemonSlice
LemonSlice avatars delivered over LiveKit.

