simple_response() generates responses from text input, includes function calling with automatic tool execution, and manages conversation context. Multiple providers are supported, including OpenAI, Anthropic, Google, and others.
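As a minimal sketch of the text path, assuming the OpenAI plugin's import path and that simple_response() is awaitable (both unverified here, so check the plugin docs for the exact signature):

```python
import asyncio

from vision_agents.plugins import openai  # assumed import path

async def main():
    llm = openai.LLM(model="gpt-4o-mini")  # model name is an assumption
    # simple_response() takes plain text and returns the model's reply;
    # the exact shape of the return value may vary between providers.
    reply = await llm.simple_response("Give me a one-line status update.")
    print(reply.text)

asyncio.run(main())
```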
Some LLM implementations support real-time speech-to-speech communication, eliminating the need for separate STT/TTS components:
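For example, a realtime model replaces the separate STT/TTS pair entirely. A rough sketch, assuming the OpenAI plugin exposes a Realtime class and the Agent/Edge wiring shown (names unverified, check the plugin docs):

```python
from vision_agents.core import Agent        # assumed import path
from vision_agents.plugins import getstream, openai

# A realtime LLM handles speech in and speech out itself,
# so no stt= or tts= components are passed to the agent.
agent = Agent(
    edge=getstream.Edge(),   # assumed transport component
    llm=openai.Realtime(),   # class name is an assumption
)
```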
Chat Completions API Support
Many open-source models follow the OpenAI Chat Completions API format. Whether you're experimenting with Kimi, DeepSeek, or Mistral, they can all be accessed by changing the base API URL of the OpenAI SDK and setting an API key obtained from their respective dashboards. To support this, Vision Agents ships with both the OpenAI Responses API (used by GPT-5, and the default) and the Chat Completions API with streaming. To use either, you must have the OpenAI plugin installed in your project.
Example
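The sketch below points the Chat Completions interface at an OpenAI-compatible endpoint. The import path, constructor parameter names, the model id, and the Baseten URL are placeholders for illustration; consult the plugin docs for the exact signature.

```python
import os

from vision_agents.plugins import openai  # assumed import path

# Point the Chat Completions VLM at an OpenAI-compatible endpoint.
# The base_url below is a placeholder Baseten deployment URL, and the
# parameter names are assumptions that may differ in your plugin version.
llm = openai.ChatCompletionsVLM(
    model="qwen3-vl",  # placeholder model id, adjust to your deployment
    base_url="https://model-<id>.api.baseten.co/environments/production/sync/v1",
    api_key=os.environ["BASETEN_API_KEY"],
)
```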
We offer both ChatCompletionsLLM and ChatCompletionsVLM interfaces. The VLM interface will automatically forward the user's video feed as frames to the model. The example above demonstrates this using Qwen3-VL running on Baseten.
VLM Support
Models such as Moondream, Qwen 3, and others offer powerful APIs for visual reasoning and understanding. These models operate as a subset of LLM called VLM. Frames from the user's video feed are buffered and sent to the model at a specified interval. Each VLM is unique, so be sure to check the docs and capabilities of each model, but generally a VLM also requires an STT provider and, in some cases, a TTS provider to vocalise the response (some models, like Qwen Omni, have TTS built in).
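As a sketch of how the pieces fit together, the snippet below wires a VLM with an STT and a TTS provider in an agent. ChatCompletionsVLM is described above; the Agent/Edge wiring, import paths, and the deepgram/elevenlabs plugin names are assumptions for illustration, so check each plugin's docs for the real signatures.

```python
import os

from vision_agents.core import Agent  # assumed import path
from vision_agents.plugins import deepgram, elevenlabs, getstream, openai

# The framework buffers frames from the user's video feed and sends them
# to the VLM at an interval; STT transcribes the user's speech, and TTS
# vocalises the model's text response.
agent = Agent(
    edge=getstream.Edge(),  # assumed transport component
    llm=openai.ChatCompletionsVLM(
        model="qwen3-vl",   # placeholder model id
        base_url=os.environ["VLM_BASE_URL"],
        api_key=os.environ["VLM_API_KEY"],
    ),
    stt=deepgram.STT(),     # speech-to-text for the user's voice
    tts=elevenlabs.TTS(),   # text-to-speech for the reply
)
```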

