The LLM component handles text generation and conversation logic. It supports both traditional request-response patterns and real-time streaming. The base interface provides simple_response() for generating responses from text input, includes function calling capabilities with automatic tool execution, and manages conversation context. Multiple providers are supported, including OpenAI, Anthropic, Google, and others. Some LLM implementations support real-time speech-to-speech communication, eliminating the need for separate STT/TTS components:
from vision_agents.plugins import deepgram, elevenlabs, openai

# Traditional mode: STT and TTS handle the audio pipeline
agent = Agent(
    llm=openai.LLM(model="gpt-4o-mini"),
    stt=deepgram.STT(),
    tts=elevenlabs.TTS(),
)

# Realtime mode: the model handles speech-to-speech directly
agent = Agent(
    llm=openai.Realtime()
)
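Beyond wiring it into an Agent, the base interface’s simple_response() can also be called on an LLM directly. A minimal sketch, assuming simple_response() is awaitable and that its return value exposes the generated text via a .text attribute:
import asyncio

from vision_agents.plugins import openai

async def main():
    llm = openai.LLM(model="gpt-4o-mini")
    # simple_response() is part of the base LLM interface described above;
    # the return shape (.text) is an assumption for this sketch.
    response = await llm.simple_response("Give me a one-line summary of this conversation.")
    print(response.text)

asyncio.run(main())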
Each LLM follows our philosophy of “thin wrapping”. Out of the box, developers can pass their own client into an LLM or interact with the native APIs directly, with full support for passing native method args. LLMs can be combined with other features, such as processors, to provide realtime feedback on the world around you, or used in a simple voice-only mode as shown in the previous example. Models running in non-realtime mode must be given both an STT and a TTS service: the STT service converts the user’s speech to text, which is passed to the model, and the TTS service converts the model’s response into voice output.
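For example, because the wrapping is thin, you can hand the plugin a pre-configured OpenAI SDK client instead of letting it create its own. This is only a sketch: the client keyword argument is an assumption, so check the plugin’s signature for the exact parameter name:
from openai import AsyncOpenAI
from vision_agents.plugins import openai as openai_plugin

# Assumption: the plugin accepts a pre-built SDK client via a `client` kwarg.
sdk_client = AsyncOpenAI(api_key="API-KEY", timeout=30.0)
llm = openai_plugin.LLM(
    model="gpt-4o-mini",
    client=sdk_client,  # the thin wrapper forwards calls to this native client
)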

Chat Completions API Support

Many open-source models follow the OpenAI Chat Completions API format. Whether you’re experimenting with Kimi, DeepSeek, or Mistral, they can all be accessed by changing the base API URL of the OpenAI SDK and setting an API key obtained from their respective dashboards. To support this, Vision Agents ships with both the OpenAI Responses API (used by GPT-5, and the default) and the Chat Completions API with streaming. To use either, you must have the OpenAI plugin installed in your project. Example:
from vision_agents.plugins import deepgram, elevenlabs, getstream, openai

async def create_agent(**kwargs) -> Agent:
    # Initialize the Baseten VLM
    llm = openai.ChatCompletionsVLM(
        api_key="API-KEY",
        model="qwen3vl",
        base_url="https://model-vq0nkx7w.api.baseten.co/development/sync/v1",  # Replace with your model's hosted URL
    )

    # Create an agent with video understanding capabilities
    agent = Agent(
        edge=getstream.Edge(),
        agent_user=User(name="Video Assistant", id="agent"),
        instructions="You're a helpful video AI assistant. Analyze the video frames and respond to user questions about what you see.",
        llm=llm,
        stt=deepgram.STT(),
        tts=elevenlabs.TTS(),
        processors=[],
    )
    return agent
We offer both ChatCompletionsLLM and ChatCompletionsVLM interfaces. The VLM interface will automatically forward the user’s video feed as frames to the model. The above example demonstrates this using Qwen3-VL running on Baseten.
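For text-only models, ChatCompletionsLLM is used the same way. The sketch below assumes its constructor mirrors ChatCompletionsVLM’s (api_key, model, base_url); the DeepSeek values are placeholders for illustration:
from vision_agents.plugins import openai

# Assumed to mirror ChatCompletionsVLM's constructor; swap in your provider's values.
llm = openai.ChatCompletionsLLM(
    api_key="API-KEY",
    model="deepseek-chat",  # placeholder model name
    base_url="https://api.deepseek.com/v1",  # placeholder base URL from your provider's dashboard
)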

VLM Support

Models such as Moondream, Qwen3-VL and others offer powerful APIs for visual reasoning and understanding. These models are a subset of LLMs called VLMs (vision language models). Frames from the user’s video feed are buffered and sent to the model at a specified interval. Each VLM is unique, so be sure to check each model’s docs and capabilities, but generally a VLM also requires an STT provider and, in some cases, a TTS provider to vocalise the response (some models, like Qwen Omni, have TTS built in).
from vision_agents.plugins import deepgram, elevenlabs, getstream, moondream

llm = moondream.CloudVLM(
    api_key="your-api-key",  # or set MOONDREAM_API_KEY env var
    mode="vqa",  # or "caption" for image captioning
)

agent = Agent(
    edge=getstream.Edge(),
    agent_user=User(name="Vision Assistant", id="agent"),
    llm=llm,
    tts=elevenlabs.TTS(),
    stt=deepgram.STT(),
)