Avatars consume the agent’s audio output and produce a synced video and audio feed of a virtual character. They run in passthrough mode: the avatar owns the agent’s outbound video and audio tracks, and its output never feeds back into the LLM or any video processors.

Class Hierarchy

The vision_agents.core.avatars module exports two classes:

| Class | Purpose |
| --- | --- |
| Avatar | Abstract base class; consumes the agent’s audio output and publishes synced video and audio. |
| AVSynchronizer | Utility that owns paired audio/video tracks and delays video frames to match the audio buffer depth, keeping lip-sync accurate. |

All three built-in implementations (LiveAvatar, Anam, LemonSlice) build on AVSynchronizer for output, so it’s the recommended building block for custom avatars too.

Lifecycle

The agent drives the avatar through a fixed lifecycle:
  1. Agent.__init__ queries video_output() and calls attach_audio_input(stream), handing the avatar the inference flow’s audio output stream.
  2. Agent.join() calls await avatar.start(), which opens the provider connection and begins consuming the input stream.
  3. While running, the avatar drains input_audio_stream, forwards audio to the provider, and exposes lip-synced video via video_output() and audio via audio_output().
  4. Agent.close() calls await avatar.close() for teardown.
When an avatar is set, the agent publishes avatar.audio_output() as outbound audio instead of the TTS stream directly. TTS still synthesises the speech; the avatar lip-syncs it and republishes the audio.
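
The call order above can be traced with a small stand-in. This is only a sketch: StubAvatar and run_lifecycle are hypothetical names, the asyncio.Queue stands in for the agent's audio output stream, and nothing here imports or reproduces the real Agent or Avatar classes.

```python
from __future__ import annotations

import asyncio


class StubAvatar:
    """Hypothetical stand-in that records lifecycle calls; not the real Avatar."""

    def __init__(self) -> None:
        self.calls: list[str] = []
        self.forwarded: list[bytes] = []
        self._stream: asyncio.Queue | None = None
        self._task: asyncio.Task | None = None

    def video_output(self) -> str:
        self.calls.append("video_output")
        return "video-track"  # placeholder for a real track object

    def attach_audio_input(self, stream: asyncio.Queue) -> None:
        self.calls.append("attach_audio_input")
        self._stream = stream

    async def start(self) -> None:
        self.calls.append("start")
        self._task = asyncio.create_task(self._drain())

    async def _drain(self) -> None:
        assert self._stream is not None
        while True:
            chunk = await self._stream.get()
            # A real avatar would forward this chunk to its provider.
            self.forwarded.append(chunk)

    async def close(self) -> None:
        self.calls.append("close")
        if self._task is not None:
            self._task.cancel()
            # Wait until the consumer task has actually stopped.
            await asyncio.gather(self._task, return_exceptions=True)


async def run_lifecycle() -> StubAvatar:
    avatar = StubAvatar()
    stream: asyncio.Queue = asyncio.Queue()
    # Step 1: what Agent.__init__ is described as doing.
    avatar.video_output()
    avatar.attach_audio_input(stream)
    # Step 2: what Agent.join() is described as doing.
    await avatar.start()
    # Step 3: audio flows while the avatar is running.
    await stream.put(b"\x00\x01")
    await asyncio.sleep(0)  # yield once so the drain task can run
    # Step 4: what Agent.close() is described as doing.
    await avatar.close()
    return avatar


avatar = asyncio.run(run_lifecycle())
print(avatar.calls)
```

The point of the sketch is the ordering: the stream is attached before start() is awaited, so the consumer task always has an input to drain.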

Abstract Methods

Subclasses must implement all four:

| Method | Description |
| --- | --- |
| video_output() | Return the outbound aiortc.VideoStreamTrack published to the call. |
| audio_output() | Return the outbound AudioOutputStream published to the call. |
| async start() | Open the provider connection and begin consuming input_audio_stream. |
| async close() | Tear down the provider connection and cancel any consumer tasks. |

Subclasses may also implement an interrupt() method to stop the in-flight utterance at the provider during barge-in.
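
As a rough illustration of barge-in handling, an interrupt() implementation might tell the provider to stop and then discard media that has already been received. The sketch below rests on several assumptions: the text does not say whether interrupt() is synchronous or asynchronous (async is assumed here), StubSynchronizer is a hypothetical stand-in that models only a flush() step, and the "stop_speaking" command is invented for the example.

```python
import asyncio


class StubSynchronizer:
    """Hypothetical stand-in for a synchronizer; only models flush()."""

    def __init__(self) -> None:
        self.pending_video = ["frame-1", "frame-2"]
        self.pending_audio = [b"\x00\x00", b"\x01\x01"]

    async def flush(self) -> None:
        # Drop media that was queued for the interrupted utterance.
        self.pending_video.clear()
        self.pending_audio.clear()


class InterruptibleAvatarSketch:
    """Hypothetical avatar fragment showing one possible interrupt() shape."""

    def __init__(self) -> None:
        self.sync = StubSynchronizer()
        self.provider_commands = []  # records what would be sent to the provider

    async def interrupt(self) -> None:
        # 1. Ask the provider to stop the in-flight utterance
        #    (the command name is made up for this sketch).
        self.provider_commands.append("stop_speaking")
        # 2. Discard frames and audio already received for that utterance.
        await self.sync.flush()


async def demo() -> InterruptibleAvatarSketch:
    sketch = InterruptibleAvatarSketch()
    await sketch.interrupt()
    return sketch


sketch = asyncio.run(demo())
print(sketch.provider_commands, sketch.sync.pending_video, sketch.sync.pending_audio)
```

Stopping the provider first keeps new media for the old utterance from arriving after the local buffers have been emptied.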

Properties & Helpers

| Member | Description |
| --- | --- |
| provider_name | Class attribute identifying the provider (used in events and metrics). |
| events | EventManager for emitting avatar-specific events. |
| metrics | MetricsCollector for recording avatar metrics. |
| input_audio_stream | The agent’s audio output stream attached via attach_audio_input. Raises ValueError if accessed before attach. |
| attach_audio_input(stream) | Called by the agent to hand off its audio output stream. Override to customise how audio is consumed. |
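
The attach-then-read contract of input_audio_stream can be pictured as a guarded property. This is a sketch of the pattern only, not the library's source: AvatarBaseSketch and RecordingAvatar are hypothetical names, and the stream is an arbitrary object.

```python
class AvatarBaseSketch:
    """Hypothetical base class showing a guarded input_audio_stream property."""

    def __init__(self) -> None:
        self._input_audio_stream = None

    @property
    def input_audio_stream(self):
        # Reading before attach is a programming error, so fail loudly.
        if self._input_audio_stream is None:
            raise ValueError("input_audio_stream accessed before attach_audio_input()")
        return self._input_audio_stream

    def attach_audio_input(self, stream) -> None:
        self._input_audio_stream = stream


class RecordingAvatar(AvatarBaseSketch):
    """Hypothetical subclass that customises attach while keeping the default storage."""

    def __init__(self) -> None:
        super().__init__()
        self.attach_count = 0

    def attach_audio_input(self, stream) -> None:
        self.attach_count += 1  # custom behaviour goes here
        super().attach_audio_input(stream)


demo_avatar = RecordingAvatar()
try:
    demo_avatar.input_audio_stream
    accessed_early = True
except ValueError:
    accessed_early = False

demo_avatar.attach_audio_input("fake-stream")
print(accessed_early, demo_avatar.input_audio_stream, demo_avatar.attach_count)
```

An override that still calls the parent's attach_audio_input keeps the property usable by the rest of the class.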

AVSynchronizer

AVSynchronizer is a utility class that solves the lip-sync problem: provider video and audio arrive on separate streams, and pushing them straight onto the outbound WebRTC tracks usually drifts. It owns a paired audio_output and video_output, delays each video frame by the current audio buffer depth, and paces frames at the configured fps (overriding aiortc’s hardcoded 30 fps).
from vision_agents.core.avatars import AVSynchronizer

sync = AVSynchronizer(
    width=1920,
    height=1080,
    fps=30,
    max_queue_size=30,  # typically int(fps * buffer_seconds)
)

| Member | Description |
| --- | --- |
| video_output | The QueuedVideoTrack to expose from Avatar.video_output(). |
| audio_output | The AudioOutputStream to expose from Avatar.audio_output(). |
| async write_video(frame) | Queue an av.VideoFrame from the provider, delayed by the current audio buffer depth. |
| async write_audio(pcm) | Write a PcmData chunk from the provider to the audio track. |
| async flush() | Discard pending video frames and flush buffered audio (use on interrupt). |
| close() | Close the underlying audio stream. |
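
The arithmetic behind "delay video by the audio buffer depth" can be worked through in a few lines. This is an illustration only: how AVSynchronizer measures buffer depth internally is not described here, and the function names, the sample counts, and the 48 kHz rate are assumptions chosen for the example.

```python
def audio_buffer_depth_seconds(buffered_samples: int, sample_rate: int) -> float:
    """Seconds of audio still waiting to be played."""
    if sample_rate <= 0:
        raise ValueError("sample_rate must be positive")
    return buffered_samples / sample_rate


def video_release_time(arrival_time: float, buffered_samples: int, sample_rate: int) -> float:
    """Time at which a frame should be shown so it lines up with its audio."""
    return arrival_time + audio_buffer_depth_seconds(buffered_samples, sample_rate)


def queue_size_for(fps: int, buffer_seconds: float) -> int:
    """Queue capacity following the int(fps * buffer_seconds) rule of thumb."""
    return int(fps * buffer_seconds)


# 12,000 samples buffered at 48 kHz means audio is 0.25 s behind "now",
# so a frame arriving at t = 1.0 s is held until t = 1.25 s.
depth = audio_buffer_depth_seconds(12_000, 48_000)
release = video_release_time(1.0, 12_000, 48_000)
size = queue_size_for(30, 1.0)
print(depth, release, size)  # 0.25 1.25 30
```

With fps=30 and one second of buffering, the rule of thumb gives the max_queue_size=30 used in the constructor call above.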

Building a Custom Avatar

A minimal subclass wraps an AVSynchronizer, exposes its tracks, and pumps provider frames into it from a consumer task started in start():
import asyncio
from vision_agents.core.avatars import Avatar, AVSynchronizer
from vision_agents.core.agents.inference import AudioOutputStream
from getstream.video.rtc.track_util import PcmData
import av

class MyAvatar(Avatar):
    provider_name = "my_avatar"

    def __init__(self, width: int = 1280, height: int = 720, fps: int = 30) -> None:
        super().__init__()
        self._sync = AVSynchronizer(width=width, height=height, fps=fps)
        self._task: asyncio.Task | None = None

    def video_output(self):
        return self._sync.video_output

    def audio_output(self) -> AudioOutputStream:
        return self._sync.audio_output

    async def start(self) -> None:
        # open provider connection, then pump agent audio into it
        self._task = asyncio.create_task(self._consume(self.input_audio_stream))

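    async def close(self) -> None:
        if self._task is not None:
            self._task.cancel()
            # wait for the consumer to finish unwinding before closing the tracks
            await asyncio.gather(self._task, return_exceptions=True)
            self._task = None
        self._sync.close()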

    async def _consume(self, stream: AudioOutputStream) -> None:
        async for chunk in stream:
            # send chunk.data to the provider; for each response frame:
            #   await self._sync.write_video(frame)   # av.VideoFrame
            #   await self._sync.write_audio(pcm)     # PcmData
            ...

Usage

Pass an avatar to the agent at initialisation:
from vision_agents.core import Agent, User
from vision_agents.plugins import deepgram, gemini, getstream, liveavatar

agent = Agent(
    edge=getstream.Edge(),
    agent_user=User(name="Assistant", id="agent"),
    llm=gemini.LLM("gemini-3-flash-preview"),
    tts=deepgram.TTS(),
    stt=deepgram.STT(),
    avatar=liveavatar.Avatar(),
)

Available Implementations

LiveAvatar

Real-time interactive avatars by HeyGen with WebSocket lip-sync.

Anam

Anam’s avatar SDK with configurable dimensions and frame rate.

LemonSlice

LemonSlice avatars delivered over LiveKit.