Turn Detection identifies when a speaker has finished their conversational turn and it’s appropriate for an AI to respond. It solves a critical problem in voice AI: respond too early and you interrupt the speaker; wait too long and the conversation feels awkward.
## How It Works
Turn detection analyzes audio through a multi-stage pipeline:
- Voice Activity Detection (VAD): Detects when someone is speaking
- Audio Buffering: Collects speech segments for analysis
- AI Analysis: Examines speech patterns, content, and context to predict turn completion
- Event Emission: Fires `TurnStartedEvent` when speech begins and `TurnEndedEvent` when the turn is complete
The key insight is distinguishing between “I’m pausing to think” and “I’m done talking”—something simple silence detection can’t do.
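As a concrete sketch of the event flow above, here is how an agent might react to the two events. The event classes and the `decide_action` handler are illustrative assumptions, not the actual Vision Agents API:

```python
# Illustrative sketch of consuming turn-detection events.
# The event classes and handler here are assumptions, not the real
# Vision Agents API; they show the intended reaction to each event.
from dataclasses import dataclass

@dataclass
class TurnStartedEvent:
    """Fired when the detector decides speech has begun."""
    speaker_id: str

@dataclass
class TurnEndedEvent:
    """Fired when the detector decides the turn is complete."""
    speaker_id: str

def decide_action(event) -> str:
    # On turn start, stop any agent speech so the user is never talked over;
    # on turn end, it is now appropriate to generate a response.
    if isinstance(event, TurnStartedEvent):
        return "pause_agent_output"
    if isinstance(event, TurnEndedEvent):
        return "generate_response"
    return "ignore"

print(decide_action(TurnStartedEvent("user-1")))  # pause_agent_output
print(decide_action(TurnEndedEvent("user-1")))    # generate_response
```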
## Turn Detection vs VAD
| | VAD | Turn Detection |
|---|---|---|
| Question | "Is someone speaking?" | "Has the speaker finished?" |
| Output | Speech start/end timestamps | Turn completion signal |
| Intelligence | Simple audio analysis | Conversational context |
| Best for | Detecting presence | Knowing when to respond |
Vision Agents’ turn detection uses VAD under the hood, then applies neural models to determine turn completion.
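One way to picture this two-stage design is below. The energy-gate VAD and the stubbed turn-completion model are stand-ins for illustration only; the real pipeline uses Silero VAD and a trained neural predictor, and all names and thresholds here are assumptions:

```python
# Sketch of VAD-gated turn detection: a cheap VAD decides *whether* someone
# is speaking, and a (stubbed) turn-completion model decides *whether the
# turn is over* once speech stops. Names and thresholds are illustrative.

def vad_is_speech(frame_energy: float, threshold: float = 0.01) -> bool:
    # Stand-in for a real VAD such as Silero: energy gate on one audio frame.
    return frame_energy > threshold

def turn_complete_probability(transcript: str) -> float:
    # Stand-in for a neural turn-completion model. Trailing sentence-final
    # punctuation is weak evidence the speaker is done; a trailing
    # conjunction or filler word suggests a mid-thought pause.
    text = transcript.rstrip()
    if text.endswith(("?", ".", "!")):
        return 0.9
    if text.endswith(("and", "but", "so", "um")):
        return 0.1
    return 0.5

def is_turn_over(frame_energy: float, transcript: str, cutoff: float = 0.7) -> bool:
    # Only consult the model during silence; ongoing speech means the
    # turn continues regardless of content.
    if vad_is_speech(frame_energy):
        return False
    return turn_complete_probability(transcript) >= cutoff

print(is_turn_over(0.0, "What's the weather today?"))     # True
print(is_turn_over(0.0, "I was thinking that maybe and")) # False: thinking pause
```

This is exactly the distinction the table draws: the VAD alone can only report silence, while the model layer decides whether that silence means the turn is done.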
## Available Plugins
| Plugin | Description |
|---|---|
| Smart Turn | Combines Silero VAD, Whisper features, and neural turn completion models |
| Vogent | Neural turn detection with high-accuracy end-of-turn prediction |
For Realtime APIs (OpenAI, Gemini, AWS Bedrock, Qwen), turn detection is built-in at the model level—no separate plugin needed.
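The decision this paragraph describes can be sketched as a simple check. The provider identifier strings and the helper name are hypothetical, chosen only to illustrate the rule, not taken from the Vision Agents API:

```python
# Illustrative sketch of the "plugin or built-in?" decision.
# Provider identifiers and the helper name are assumptions for illustration.

# Realtime APIs handle turn detection at the model level.
REALTIME_PROVIDERS = {"openai-realtime", "gemini-live", "aws-bedrock", "qwen"}

def needs_turn_plugin(llm_provider: str) -> bool:
    # Pipeline-style setups need a separate plugin (Smart Turn or Vogent);
    # realtime providers do not.
    return llm_provider not in REALTIME_PROVIDERS

print(needs_turn_plugin("openai-realtime"))  # False: turn detection built in
print(needs_turn_plugin("custom-pipeline"))  # True: attach a turn plugin
```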
## Use Cases
- Voice Assistants: Respond at the right moment without interrupting
- Customer Service Bots: Natural conversation flow with customers
- Real-time Translation: Capture complete thoughts before translating
- Meeting Intelligence: Identify natural break points for summarization
- Interview Tools: AI interviewers that don’t interrupt
## Next Steps