Vogent is a turn detection system that uses neural models to predict when a speaker has finished their conversational turn, so that a voice agent knows when to respond. The Vogent plugin for the Vision Agents SDK makes this detection available to voice assistants, customer service bots, and other interactive AI applications.

Installation

Install the Vogent plugin with uv (the quotes stop shells such as zsh from expanding the square brackets):
uv add "vision-agents[vogent]"

Example

from vision_agents.core import Agent
from vision_agents.plugins import vogent
from vision_agents.core.turn_detection.events import TurnStartedEvent, TurnEndedEvent

# Create turn detection with explicit settings (these values match the defaults)
turn_detection = vogent.TurnDetection(
    buffer_in_seconds=2.0,
    confidence_threshold=0.5
)

# Use with an agent
agent = Agent(
    turn_detection=turn_detection,
    # ... other agent configuration
)

# Or use standalone (the await calls below must run inside an async function)
await turn_detection.start()

# Listen for turn events
@turn_detection.events.subscribe
async def on_turn_started(event: TurnStartedEvent):
    print(f"User {event.participant.user_id} started speaking")

@turn_detection.events.subscribe
async def on_turn_ended(event: TurnEndedEvent):
    print(f"User {event.participant.user_id} finished speaking")
    print(f"Confidence: {event.confidence}")

# Stop when finished
await turn_detection.stop()

Initialisation

The Vogent plugin is exposed via the TurnDetection class:
from vision_agents.plugins import vogent

# Default settings
turn_detection = vogent.TurnDetection()

# Custom settings: shorter buffer and a stricter confidence threshold
turn_detection = vogent.TurnDetection(
    buffer_in_seconds=1.5,
    confidence_threshold=0.7
)

# Start detection (downloads models if needed)
await turn_detection.start()

Parameters

You can customise the behaviour of Vogent through the following parameters:
| Name | Type | Default | Description |
|---|---|---|---|
| buffer_in_seconds | float | 2.0 | Duration in seconds to buffer audio before processing. |
| confidence_threshold | float | 0.5 | Probability threshold (0.0–1.0) for determining turn completion. |
| sample_rate | int | 16000 | Audio sample rate in Hz for processing (audio is resampled automatically). |
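
To make the relationship between buffer_in_seconds and sample_rate concrete, here is a small arithmetic sketch. The helper names are hypothetical, and the 16-bit mono PCM assumption used for the byte count is ours. The plugin's internal buffer format is not documented here.

```python
def buffer_size_in_samples(buffer_in_seconds: float = 2.0, sample_rate: int = 16000) -> int:
    """Number of audio samples covered by a buffer of the given duration."""
    return int(round(buffer_in_seconds * sample_rate))


def buffer_size_in_bytes(
    buffer_in_seconds: float = 2.0,
    sample_rate: int = 16000,
    bytes_per_sample: int = 2,  # assumption: mono 16-bit PCM
) -> int:
    """Approximate memory footprint of one buffer under the PCM assumption."""
    return buffer_size_in_samples(buffer_in_seconds, sample_rate) * bytes_per_sample


print(buffer_size_in_samples())     # 32000 samples with the default settings
print(buffer_size_in_samples(1.5))  # 24000 samples with a 1.5 s buffer
print(buffer_size_in_bytes())       # 64000 bytes under the 16-bit mono assumption
```

With the defaults, each buffer covers 2.0 s × 16000 Hz = 32000 samples.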

Functionality

Start and Stop

Control turn detection with the start() and stop() methods:
# Start turn detection (downloads models if needed)
await turn_detection.start()

# Check if detection is active
if turn_detection.is_active:
    print("Turn detection is active")

# Stop turn detection
await turn_detection.stop()
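
If an exception is raised between start() and stop(), detection could be left running. One way to guard against that is an async context manager, sketched below. The running helper and the FakeDetector stand-in are hypothetical and not part of the SDK. The sketch assumes only the start(), stop(), and is_active members shown above.

```python
import asyncio
from contextlib import asynccontextmanager


@asynccontextmanager
async def running(detector):
    """Start a detector and guarantee stop() runs, even if the body raises."""
    await detector.start()
    try:
        yield detector
    finally:
        await detector.stop()


class FakeDetector:
    """Hypothetical stand-in that mimics start(), stop() and is_active."""

    def __init__(self):
        self.is_active = False

    async def start(self):
        self.is_active = True

    async def stop(self):
        self.is_active = False


async def demo() -> bool:
    detector = FakeDetector()
    async with running(detector) as active:
        assert active.is_active  # detection is running inside the block
    return detector.is_active  # False: stop() ran on normal exit


async def demo_with_error() -> bool:
    detector = FakeDetector()
    try:
        async with running(detector):
            raise RuntimeError("simulated failure while detecting")
    except RuntimeError:
        pass
    return detector.is_active  # False: stop() also ran on the error path
```

In an application, the same pattern would presumably wrap a vogent.TurnDetection instance in place of the stand-in.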

Events

The plugin emits turn detection events through the Vision Agents event system:

Turn Started Event

Fired when a user begins speaking:
from vision_agents.core.turn_detection.events import TurnStartedEvent

@turn_detection.events.subscribe
async def on_turn_started(event: TurnStartedEvent):
    print(f"Turn started by {event.participant.user_id}")
    print(f"Confidence: {event.confidence}")

Turn Ended Event

Fired when a user completes their turn (based on the model’s prediction and confidence threshold):
from vision_agents.core.turn_detection.events import TurnEndedEvent

@turn_detection.events.subscribe
async def on_turn_ended(event: TurnEndedEvent):
    print(f"Turn ended by {event.participant.user_id}")
    print(f"Confidence: {event.confidence}")
    print(f"Duration: {event.duration_ms}ms")
    print(f"Trailing silence: {event.trailing_silence_ms}ms")

Event Properties

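Both TurnStartedEvent and TurnEndedEvent include the following properties. Those marked (TurnEnded) apply to TurnEndedEvent, and any property typed float|None or dict|None may be None: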
| Property | Type | Description |
|---|---|---|
| participant | Participant | Participant object with user_id and metadata. |
| confidence | float\|None | Confidence level of the turn detection (0.0–1.0). |
| trailing_silence_ms | float\|None | Milliseconds of silence after speech (TurnEnded). |
| duration_ms | float\|None | Duration of the turn in milliseconds (TurnEnded). |
| custom | dict\|None | Additional model-specific data. |
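
Because several of these properties can be None, formatting them directly in an f-string can print values such as "Nonems". The sketch below shows one defensive way to render them. FakeTurnEvent is a hypothetical stand-in that carries only the optional numeric fields, and format_optional and describe_turn are illustrative helper names, not SDK functions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeTurnEvent:
    """Hypothetical stand-in carrying the optional numeric event fields."""

    confidence: Optional[float] = None
    trailing_silence_ms: Optional[float] = None
    duration_ms: Optional[float] = None


def format_optional(value: Optional[float], unit: str = "", digits: int = 2) -> str:
    """Render a possibly missing number, using 'n/a' when it is None."""
    return "n/a" if value is None else f"{value:.{digits}f}{unit}"


def describe_turn(event) -> str:
    """Build a log line that tolerates None in any optional field."""
    return (
        f"confidence={format_optional(event.confidence)}, "
        f"duration={format_optional(event.duration_ms, 'ms', 0)}, "
        f"trailing_silence={format_optional(event.trailing_silence_ms, 'ms', 0)}"
    )


print(describe_turn(FakeTurnEvent(confidence=0.82, duration_ms=1500.0)))
# confidence=0.82, duration=1500ms, trailing_silence=n/a
```

The same describe_turn helper could be called from inside an on_turn_ended handler, since it reads only attributes listed in the table.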

How It Works

Vogent uses a neural model to analyse audio and predict turn completion. The system:
  1. Buffers incoming audio based on buffer_in_seconds
  2. Processes audio through the Vogent neural model
  3. Predicts turn completion probability
  4. Emits TurnStartedEvent when speech begins
  5. Emits TurnEndedEvent when turn completion probability exceeds confidence_threshold
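
The decision logic in the steps above can be illustrated with a simplified sketch. It is not the plugin's implementation. It assumes each buffered chunk has already been reduced to a hypothetical (is_speech, completion_probability) pair, and it yields plain tuples rather than real event objects.

```python
from typing import Iterable, Iterator, Tuple


def detect_turns(
    chunks: Iterable[Tuple[bool, float]],
    confidence_threshold: float = 0.5,
) -> Iterator[Tuple[str, float]]:
    """Yield ("turn_started" | "turn_ended", probability) for each transition.

    chunks: hypothetical per-chunk (is_speech, completion_probability) pairs.
    """
    in_turn = False
    for is_speech, probability in chunks:
        if not in_turn and is_speech:
            # Speech begins: open a new turn.
            in_turn = True
            yield ("turn_started", probability)
        elif in_turn and probability > confidence_threshold:
            # Completion probability exceeds the threshold: close the turn.
            in_turn = False
            yield ("turn_ended", probability)


chunks = [(False, 0.0), (True, 0.1), (True, 0.3), (False, 0.8), (True, 0.2), (False, 0.6)]

print([name for name, _ in detect_turns(chunks, confidence_threshold=0.5)])
# ['turn_started', 'turn_ended', 'turn_started', 'turn_ended']

print([name for name, _ in detect_turns(chunks, confidence_threshold=0.7)])
# ['turn_started', 'turn_ended', 'turn_started']
```

In this toy input, raising the threshold from 0.5 to 0.7 leaves the second turn open. A higher confidence_threshold makes the detector wait for stronger evidence before ending a turn.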

Model Downloads

On first run, the plugin downloads the following models:
  • Silero VAD: Voice activity detection model
  • Whisper Feature Extractor: Semantic feature extraction
Models are cached locally to avoid repeated downloads. The first start() call may take a few seconds while models are downloaded.
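
Since the first start() call may include a download, one option is to warm the detector up during application startup and put an upper bound on the wait. The sketch below uses asyncio.wait_for for the bound. The warm_up helper, the 120-second default, and the SlowStartDetector stand-in are all assumptions for illustration. Whether the real start() tolerates cancellation on timeout is not documented here.

```python
import asyncio
import time


async def warm_up(detector, timeout_seconds: float = 120.0) -> float:
    """Await detector.start() with a time limit and return the elapsed seconds.

    Raises asyncio.TimeoutError if start() does not finish in time.
    """
    started_at = time.perf_counter()
    await asyncio.wait_for(detector.start(), timeout=timeout_seconds)
    return time.perf_counter() - started_at


class SlowStartDetector:
    """Hypothetical stand-in whose start() simulates a model download."""

    def __init__(self, delay_seconds: float = 0.05):
        self.delay_seconds = delay_seconds
        self.is_active = False

    async def start(self):
        await asyncio.sleep(self.delay_seconds)  # pretend to download models
        self.is_active = True


async def main() -> None:
    detector = SlowStartDetector()
    elapsed = await warm_up(detector)
    print(f"Turn detection ready after {elapsed:.2f}s")
```

Logging the elapsed time on the first and later runs is a simple way to confirm that the local cache is being used.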