Features
- Native audio output: No TTS service needed - audio comes directly from the model
- Built-in STT: Integrated speech-to-text using gummy-realtime-v1 - no external STT service required
- Server-side VAD: Automatic turn detection with configurable silence thresholds
- Video understanding: Optional video frame support for multimodal interactions
- Real-time streaming: WebSocket-based bidirectional communication for low-latency responses
- Interruption handling: Automatic cancellation when user starts speaking
Installation
Install the Qwen plugin with:
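The package name below is an assumption based on the framework's plugin naming convention; check the Vision Agents docs for the published name:

```bash
pip install vision-agents-plugins-qwen
```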
Tutorials
The Voice AI quickstart and Video AI quickstart pages have examples to get you up and running.
Example
Check out our Qwen Realtime example to see a practical implementation of the plugin and get inspiration for your own projects, or read on for some key details.
Initialization
The Qwen plugin for Stream exists in the form of the Realtime class:
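A minimal initialization sketch; the `vision_agents.plugins` import path is an assumption, and every argument shown is optional (defaults are listed under Parameters below):

```python
from vision_agents.plugins import qwen  # import path is an assumption

# All arguments are optional; these mirror the defaults in the Parameters table.
llm = qwen.Realtime(
    model="qwen3-omni-flash-realtime",
    voice="Cherry",
    vad_silence_duration_ms=900,
)
```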
Parameters
These are the parameters available in the qwen.Realtime plugin:
| Name | Type | Default | Description |
|---|---|---|---|
| `model` | `str` | `"qwen3-omni-flash-realtime"` | The Qwen Realtime model identifier. |
| `api_key` | `str` or `None` | `None` | DashScope API key. If not provided, reads from the `DASHSCOPE_API_KEY` env var. |
| `base_url` | `str` or `None` | `"wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime"` | WebSocket API base URL. |
| `voice` | `str` | `"Cherry"` | Voice for audio output. |
| `fps` | `int` | `1` | Video frames per second to send. |
| `include_video` | `bool` | `False` | Include video frames in requests. |
| `video_width` | `int` | `1280` | Video frame width in pixels. |
| `video_height` | `int` | `720` | Video frame height in pixels. |
| `audio_transcription_model` | `str` | `"gummy-realtime-v1"` | Model used for audio transcription. |
| `vad_threshold` | `float` | `0.1` | Voice activity detection threshold. |
| `vad_prefix_padding_ms` | `int` | `500` | VAD prefix padding in milliseconds. |
| `vad_silence_duration_ms` | `int` | `900` | VAD silence duration in milliseconds. |
Environment variables
Set `DASHSCOPE_API_KEY` in your environment or `.env` file:
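For example, in a `.env` file:

```bash
DASHSCOPE_API_KEY=your-dashscope-api-key
```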
Usage
Here’s a complete example:
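A sketch of a minimal voice agent; everything outside `qwen.Realtime` (the `Agent`, `User`, and `getstream.Edge` names and the call-join flow) is an assumption based on the quickstarts, so adapt it to your project:

```python
import asyncio
import uuid

from vision_agents.core import Agent, User         # assumed core API from the quickstarts
from vision_agents.plugins import getstream, qwen  # import paths are assumptions


async def main():
    agent = Agent(
        edge=getstream.Edge(),                      # assumed Stream edge transport
        agent_user=User(name="Qwen Assistant"),
        instructions="You are a friendly voice assistant. Keep replies short.",
        llm=qwen.Realtime(),                        # native audio in/out; no STT/TTS plugins needed
    )
    await agent.create_user()

    # Join a call; users who join the same call can talk to the agent.
    call = agent.edge.client.video.call("default", str(uuid.uuid4()))
    with await agent.join(call):
        await agent.finish()                        # run until the session ends


if __name__ == "__main__":
    asyncio.run(main())
```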
Functionality
Connect
The connect() method establishes a WebSocket connection to Qwen Realtime:
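For example, given an instantiated `llm = qwen.Realtime()`:

```python
# Opens the realtime WebSocket session; call before exchanging audio.
await llm.connect()
```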
Send audio
The simple_audio_response() method allows you to send audio data to Qwen:
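A sketch, assuming the method accepts raw PCM audio captured from the user's track (the exact payload type is framework-specific):

```python
# pcm_data: raw PCM audio from the user's track (exact type is an assumption)
await llm.simple_audio_response(pcm_data)
```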
Qwen Realtime does not support text input. Once you join the call, simply start speaking to the agent.
Watch video track
For video-enabled agents, you can watch a video track to send frames to Qwen:
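A sketch; the `watch_video_track()` name and signature are assumptions, and frames are sampled at the configured `fps`:

```python
llm = qwen.Realtime(include_video=True, fps=1)
await llm.connect()

# Forward frames from an existing video track; method name/signature are assumptions.
await llm.watch_video_track(video_track)
```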
Events
The Qwen plugin emits standard Vision Agents events that you can listen to:
- RealtimeAudioOutputEvent: Fired when Qwen generates audio
- LLMResponseChunkEvent: Fired when Qwen generates text
- RealtimeUserSpeechTranscriptionEvent: Fired for user speech transcriptions
- RealtimeAgentSpeechTranscriptionEvent: Fired for agent speech transcriptions
- LLMErrorEvent: Fired when an error occurs
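For example, assuming the framework's decorator-based event subscription and that text events carry a `delta` field (both assumptions):

```python
from vision_agents.core.llm.events import LLMResponseChunkEvent  # import path is an assumption

@llm.events.subscribe  # decorator-based subscription is an assumption
async def on_text(event: LLMResponseChunkEvent):
    # Fired as Qwen streams text alongside its audio output.
    print(event.delta, end="", flush=True)  # `delta` field name is an assumption
```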
Notes
- The model is hosted in Singapore, so latency may vary depending on your location
- The model does not support text input - once you join the call, simply start speaking to the agent
- No external STT or TTS services are required - Qwen Realtime provides both natively

