Skip to main content

View Simple Agent Example on GitHub

Check out the complete Simple Agent example in our GitHub repository
In this example, we build a conversational voice AI agent using OpenAI for language understanding, ElevenLabs for natural-sounding speech, and Deepgram for speech recognition. The agent joins a video call, greets the user, handles voice conversation, and can observe the camera feed. This is the best starting point for developers new to Vision Agents.
Vision Agents requires a Stream account for real-time transport. Most providers offer free tiers to get started.

What You Will Build

  • Listen to user speech and convert it to text with Deepgram STT
  • Process conversations using OpenAI GPT-4o-mini
  • Respond with natural-sounding speech via ElevenLabs TTS
  • Detect when the user has finished speaking with Smart Turn detection
  • Run on Stream’s low-latency edge network

Next Steps

AI Golf Coach

Add video processing with YOLO pose detection

Integrations

Swap in any of 25+ supported AI providers