Fast-Whisper is a high-performance implementation of OpenAI’s Whisper speech recognition model using CTranslate2. It provides significantly faster inference speeds while maintaining the same accuracy as the original Whisper model.
The Fast-Whisper plugin for Vision Agents enables real-time audio transcription with support for multiple model sizes, automatic language detection, and both CPU and GPU acceleration.
Installation
Install the Fast-Whisper plugin with:

```bash
uv add "vision-agents[fast-whisper]"
```
Example
```python
from vision_agents.core import Agent, User
from vision_agents.plugins import fast_whisper, getstream, openai, elevenlabs

# Create agent with Fast-Whisper STT
agent = Agent(
    edge=getstream.Edge(),
    agent_user=User(name="AI Assistant", id="agent"),
    instructions="You are a helpful voice assistant.",
    stt=fast_whisper.STT(model_size="base"),
    llm=openai.LLM(model="gpt-4o-mini"),
    tts=elevenlabs.TTS(),
)
```
Initialization
The Fast-Whisper plugin is exposed through the STT class:
```python
from vision_agents.plugins import fast_whisper

# Default configuration (base model, auto language detection)
stt = fast_whisper.STT()

# Custom configuration
stt = fast_whisper.STT(
    model_size="medium",
    language="en",
    device="cuda",
    compute_type="float16",
)
```
Parameters
These are the parameters available in the Fast-Whisper STT plugin:
| Name | Type | Default | Description |
|---|---|---|---|
| model_size | str | "base" | Whisper model size to use. Options: "tiny", "base", "small", "medium", "large", "large-v2", "large-v3". |
| language | str or None | None | Language code for transcription (e.g., "en", "es", "fr"). If None, language is automatically detected. |
| device | str | "cpu" | Device to run inference on. Options: "cpu", "cuda", "auto". |
| compute_type | str | "int8" | Computation precision. Options: "int8", "float16", "float32". Lower precision = faster inference, less memory. |
| sample_rate | int | 16000 | Audio sample rate in Hz. Audio is automatically resampled to this rate. |
Features
Fast Inference
Fast-Whisper uses CTranslate2 for optimized inference, providing:
- 2-4x faster transcription compared to the original Whisper implementation
- Lower memory usage through quantization support
- GPU acceleration when available
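Under the hood this is the faster-whisper library running on CTranslate2, so the same speedups can be seen by calling it directly. A minimal sketch of direct faster-whisper usage with an int8-quantized model ("sample.wav" is a placeholder file, not part of the plugin API):

```python
from faster_whisper import WhisperModel

# Load a quantized model on CPU (int8 keeps memory low and inference fast)
model = WhisperModel("base", device="cpu", compute_type="int8")

# Transcribe a local audio file; language=None triggers automatic detection
segments, info = model.transcribe("sample.wav", language=None)

print(f"Detected language: {info.language} (p={info.language_probability:.2f})")
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
```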
Model Sizes
Choose the right model size for your use case:
| Model | Parameters | Speed | Accuracy | Use Case |
|---|---|---|---|---|
| tiny | 39M | Fastest | Basic | Real-time, resource-constrained |
| base | 74M | Very fast | Good | General purpose, real-time |
| small | 244M | Fast | Better | Balanced speed and accuracy |
| medium | 769M | Moderate | High | Higher accuracy needs |
| large | 1550M | Slower | Best | Maximum accuracy |
| large-v2 | 1550M | Slower | Best | Improved multilingual support |
| large-v3 | 1550M | Slower | Best | Latest improvements |
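If the model size should vary per environment (for example, a small model locally and a larger one in production), one option is to pass it in from configuration. The WHISPER_MODEL_SIZE variable below is illustrative; the plugin does not read any environment variables itself:

```python
import os

from vision_agents.plugins import fast_whisper

# WHISPER_MODEL_SIZE is an illustrative variable name; the plugin does not
# read it on its own, so we pass the value in explicitly.
model_size = os.environ.get("WHISPER_MODEL_SIZE", "base")
stt = fast_whisper.STT(model_size=model_size)
```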
Automatic Language Detection
When language is not specified, Fast-Whisper automatically detects the spoken language:
```python
stt = fast_whisper.STT(model_size="base")  # Auto-detect language
```
For better performance with a known language, specify it explicitly:
```python
stt = fast_whisper.STT(model_size="base", language="en")  # English only
```
GPU Acceleration
Enable GPU acceleration for faster transcription:
```python
stt = fast_whisper.STT(
    model_size="medium",
    device="cuda",
    compute_type="float16",  # Use float16 for GPU
)
```
GPU acceleration requires CUDA to be installed. If CUDA is not available, the plugin will automatically fall back to CPU.
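To choose the device and precision explicitly instead of relying on the fallback, you can probe for a GPU at startup. A minimal sketch using CTranslate2's device count helper (CTranslate2 is the backend Fast-Whisper runs on; the model sizes chosen here are just examples):

```python
import ctranslate2

from vision_agents.plugins import fast_whisper

# Use GPU settings only when CTranslate2 can see at least one CUDA device.
if ctranslate2.get_cuda_device_count() > 0:
    stt = fast_whisper.STT(model_size="medium", device="cuda", compute_type="float16")
else:
    stt = fast_whisper.STT(model_size="base", device="cpu", compute_type="int8")
```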
Functionality
Process Audio
The process_audio() method processes incoming audio data and emits transcription events:
```python
from getstream.video import rtc

async with rtc.join(call, bot_user_id) as connection:

    @connection.on("audio")
    async def on_audio(pcm: PcmData, user):
        await stt.process_audio(pcm, user)
```
Events
The plugin emits standard Vision Agents STT events:
Transcript Event
Fired when a transcription is completed:
```python
from vision_agents.core.stt.events import TranscriptEvent

@stt.events.on(TranscriptEvent)
async def on_transcript(event: TranscriptEvent):
    print(f"User {event.user_id} said: {event.text}")
    print(f"Language: {event.language}")
    print(f"Confidence: {event.confidence}")
```
Error Event
Fired when an error occurs during transcription:
```python
from vision_agents.core.stt.events import STTErrorEvent

@stt.events.on(STTErrorEvent)
async def on_error(event: STTErrorEvent):
    print(f"Transcription error: {event.error_message}")
```
CPU Optimization
For CPU-only environments, use int8 quantization for best performance:
```python
stt = fast_whisper.STT(
    model_size="base",
    device="cpu",
    compute_type="int8",
)
```
GPU Optimization
For GPU environments, use float16 for optimal speed and accuracy:
```python
stt = fast_whisper.STT(
    model_size="medium",
    device="cuda",
    compute_type="float16",
)
```
Model Selection
- Real-time applications: use the tiny or base models
- Balanced use cases: use the small or medium models
- Maximum accuracy: use the large-v3 model (requires more resources); a sketch of encoding these guidelines follows below
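One way to encode these guidelines is a small mapping from deployment profile to model size. The PROFILES table and make_stt helper below are illustrative, not part of the plugin:

```python
from vision_agents.plugins import fast_whisper

# Illustrative mapping from deployment profile to Whisper model size.
PROFILES = {
    "realtime": "base",
    "balanced": "small",
    "accuracy": "large-v3",
}

def make_stt(profile: str = "realtime") -> fast_whisper.STT:
    return fast_whisper.STT(model_size=PROFILES[profile])
```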
Usage with Agent
Use Fast-Whisper as part of a complete voice agent:
```python
from vision_agents.core import Agent, User
from vision_agents.plugins import fast_whisper, getstream, openai, elevenlabs

agent = Agent(
    edge=getstream.Edge(),
    agent_user=User(name="Voice Assistant", id="agent"),
    instructions="You are a helpful voice assistant.",
    stt=fast_whisper.STT(
        model_size="base",
        language="en",
        device="cpu",
        compute_type="int8",
    ),
    llm=openai.LLM(model="gpt-4o-mini"),
    tts=elevenlabs.TTS(),
)

# Join a call
call = client.video.call("default", call_id)
await call.get_or_create(data={"created_by_id": agent.agent_user.id})

with await agent.join(call):
    await agent.finish()
```
Model Downloads
On first run, Fast-Whisper downloads the selected model from Hugging Face. Models are cached locally to avoid repeated downloads:
- tiny: ~39 MB
- base: ~74 MB
- small: ~244 MB
- medium: ~769 MB
- large: ~1.5 GB
The first initialization may take a few seconds while the model is downloaded.
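To avoid paying the download cost on the first call, the cache can be warmed ahead of time, for example in a container build step. A minimal sketch that loads the model once through the underlying faster-whisper library, which should populate the same local cache the plugin reads from:

```python
from faster_whisper import WhisperModel

# Instantiating the model downloads it (if needed) and caches it locally;
# later loads are served from disk.
WhisperModel("base", device="cpu", compute_type="int8")
```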
Supported Languages
Fast-Whisper supports 99 languages, including:
- English (en)
- Spanish (es)
- French (fr)
- German (de)
- Italian (it)
- Portuguese (pt)
- Dutch (nl)
- Russian (ru)
- Chinese (zh)
- Japanese (ja)
- Korean (ko)
- And many more…
See the Whisper documentation for the complete list.
Links