Qwen3 Realtime is a low-latency API from Alibaba that provides native audio output and built-in speech recognition over a WebSocket-based realtime connection. The Qwen Realtime plugin in the Vision Agents SDK is a native integration with out-of-the-box support for Qwen's realtime models: you can stream audio to Qwen over WebSockets and receive responses in real time. Because the model includes built-in STT and TTS, no external speech services are required. This makes it ideal for building conversational agents, AI avatars, customer service bots, interactive tutors, and much more!

Features

  • Native audio output: No TTS service needed - audio comes directly from the model
  • Built-in STT: Integrated speech-to-text using gummy-realtime-v1 - no external STT service required
  • Server-side VAD: Automatic turn detection with configurable silence thresholds
  • Video understanding: Optional video frame support for multimodal interactions
  • Real-time streaming: WebSocket-based bidirectional communication for low-latency responses
  • Interruption handling: Automatic cancellation when user starts speaking

Installation

Install the Qwen plugin with:
uv add vision-agents[qwen]
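If you manage dependencies with pip rather than uv, the equivalent install is (quoting the extra so your shell does not expand the brackets):
pip install "vision-agents[qwen]"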

Tutorials

The Voice AI quickstart and Video AI quickstart pages have examples to get you up and running.

Example

Check out our Qwen Realtime example to see a practical implementation of the plugin and get inspiration for your own projects, or read on for some key details.

Initialization

Within the Vision Agents SDK, the Qwen integration is exposed as the Realtime class:
from vision_agents.plugins import qwen

realtime = qwen.Realtime()

Parameters

These are the parameters available in the qwen.Realtime plugin:
Name | Type | Default | Description
model | str | "qwen3-omni-flash-realtime" | The Qwen Realtime model identifier.
api_key | str or None | None | DashScope API key. If not provided, reads from the DASHSCOPE_API_KEY env var.
base_url | str or None | "wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime" | WebSocket API base URL.
voice | str | "Cherry" | Voice for audio output.
fps | int | 1 | Video frames per second to send.
include_video | bool | False | Include video frames in requests.
video_width | int | 1280 | Video frame width in pixels.
video_height | int | 720 | Video frame height in pixels.
audio_transcription_model | str | "gummy-realtime-v1" | Model used for audio transcription.
vad_threshold | float | 0.1 | Voice activity detection threshold.
vad_prefix_padding_ms | int | 500 | VAD prefix padding in milliseconds.
vad_silence_duration_ms | int | 900 | VAD silence duration in milliseconds.
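For example, you can tune the voice and turn-detection behavior at construction time. The values below are illustrative, not recommendations:
from vision_agents.plugins import qwen

# Illustrative configuration; values are examples only.
realtime = qwen.Realtime(
    voice="Cherry",                # voice used for audio output
    vad_threshold=0.1,             # voice activity detection sensitivity
    vad_prefix_padding_ms=500,     # audio kept before detected speech
    vad_silence_duration_ms=1200,  # wait longer before ending the user's turn
)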

Environment variables

Set DASHSCOPE_API_KEY in your environment or .env file:
export DASHSCOPE_API_KEY=your_dashscope_api_key_here

Usage

Here’s a complete example:
from dotenv import load_dotenv
from vision_agents.core import Agent, User, cli
from vision_agents.core.agents import AgentLauncher
from vision_agents.plugins import getstream, qwen

load_dotenv()

async def create_agent(**kwargs) -> Agent:
    llm = qwen.Realtime(fps=1)

    agent = Agent(
        edge=getstream.Edge(),
        agent_user=User(name="Qwen Assistant", id="agent"),
        instructions="You are a helpful AI assistant. Be friendly and conversational.",
        llm=llm,
    )
    return agent

async def join_call(agent: Agent, call_type: str, call_id: str, **kwargs) -> None:
    await agent.create_user()
    call = await agent.create_call(call_type, call_id)

    with await agent.join(call):
        await agent.edge.open_demo(call)
        await agent.finish()

if __name__ == "__main__":
    cli(AgentLauncher(create_agent=create_agent, join_call=join_call))
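To try the example, save it as main.py and run it. The exact invocation depends on your project setup, but with uv it would typically be:
uv run python main.py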

Functionality

Connect

The connect() method establishes a WebSocket connection to Qwen Realtime:
await realtime.connect()

Send audio

The simple_audio_response() method allows you to send audio data to Qwen:
await realtime.simple_audio_response(pcm_data)
Note that Qwen Realtime does not support text input; once you join the call, simply start speaking to the agent.
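As a minimal sketch, assuming simple_audio_response() accepts raw 16-bit mono PCM bytes (the sample rate and framing here are assumptions; check the SDK's audio utilities for the exact expected format):
import numpy as np

# 100 ms of silence as 16-bit mono PCM at 16 kHz; the format
# details are assumptions, not the SDK's documented contract.
sample_rate = 16000
samples = np.zeros(int(sample_rate * 0.1), dtype=np.int16)
pcm_data = samples.tobytes()

await realtime.simple_audio_response(pcm_data)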

Watch video track

For video-enabled agents, you can watch a video track to send frames to Qwen:
await realtime.watch_video_track(track)
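To enable video, construct the plugin with the video parameters from the table above. The values shown are the documented defaults, with include_video switched on:
from vision_agents.plugins import qwen

# Video frames are captured at `fps` and sent at video_width x video_height.
realtime = qwen.Realtime(
    include_video=True,
    fps=1,
    video_width=1280,
    video_height=720,
)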

Events

The Qwen plugin emits standard Vision Agents events that you can listen to:
  • RealtimeAudioOutputEvent: Fired when Qwen generates audio
  • LLMResponseChunkEvent: Fired when Qwen generates text
  • RealtimeUserSpeechTranscriptionEvent: Fired for user speech transcriptions
  • RealtimeAgentSpeechTranscriptionEvent: Fired for agent speech transcriptions
  • LLMErrorEvent: Fired when an error occurs
Access these events through the Agent’s event system. See the Event System guide for more details.
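As a minimal sketch of listening for one of these events, assuming the decorator-based subscription pattern from the Event System guide (the import path below is an assumption; check the guide for the actual module):
# Import path is an assumption; see the Event System guide for the
# module that actually defines these event classes.
from vision_agents.core.events import RealtimeUserSpeechTranscriptionEvent

@agent.events.subscribe
async def on_user_transcript(event: RealtimeUserSpeechTranscriptionEvent):
    # React to the user's transcribed speech; the payload shape is
    # documented in the Event System guide.
    print("User transcript event:", event)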

Notes

  • The model is hosted in Singapore, so latency may vary depending on your location
  • The model does not support text input - once you join the call, simply start speaking to the agent
  • No external STT or TTS services are required - Qwen Realtime provides both natively