How Accurate Is Whisper in 2026?
OpenAI Whisper Large-v3: 2.7% WER on LibriSpeech, 8–12% on real-world audio. Among the best open-source ASR models available. But accuracy varies dramatically by language and audio condition.
By NovaScribe Editorial · Updated April 2026
Whisper Accuracy in One Sentence
Whisper is OpenAI's open-source speech recognition model. It matches or beats most commercial APIs on English accuracy, powers many paid tools (NovaScribe, TurboScribe, Descript), and is free to run yourself. But "Whisper" is really a family of models — size, language, and audio condition all affect accuracy significantly.
Whisper Is Not One Model: Size Matters
Whisper comes in 7 sizes from Tiny (39M parameters) to Large-v3 (1.5B). Accuracy and speed trade off dramatically. Most commercial tools use Large-v2 or Large-v3; self-hosted setups often use Medium or Small for speed.
| Model | Parameters | English WER (clean) | Speed | Use Case |
|---|---|---|---|---|
| Whisper Tiny | 39M | ~10–15% | 32× real-time | Draft, constrained devices |
| Whisper Base | 74M | ~8–12% | 16× real-time | Mobile apps |
| Whisper Small | 244M | ~6–9% | 6× real-time | Balanced |
| Whisper Medium | 769M | ~4–6% | 2× real-time | Quality focused |
| Whisper Large-v2 | 1.5B | ~3–5% | 1× real-time | Production (older) |
| Whisper Large-v3 | 1.5B | ~2.7% | 1× real-time | Production (current best) |
| Whisper Large-v3 Turbo | 809M | ~3–4% | 8× real-time | Fast production |
Real-time multipliers assume modern GPU (RTX 3090 or better). On CPU, all models run 5–20× slower. Large-v3 Turbo, released late 2024, is a distilled version of Large-v3 with most of the accuracy at 8× the speed.
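To make the speed column concrete, here is a quick sketch estimating wall-clock time for a one-hour recording, using the table's GPU multipliers (illustrative only; real throughput depends on hardware and audio):

```python
# Real-time multipliers from the table above (modern GPU, illustrative).
speeds = {
    "tiny": 32, "base": 16, "small": 6, "medium": 2,
    "large-v2": 1, "large-v3": 1, "large-v3-turbo": 8,
}

# Estimated wall-clock minutes to transcribe a 60-minute recording.
minutes = {model: 60 / multiplier for model, multiplier in speeds.items()}
```

At 1× real-time, Large-v3 needs roughly an hour for an hour of audio, while Turbo finishes the same file in about 7.5 minutes and Tiny in under 2.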
Accuracy by Audio Condition
The same Whisper Large-v3 model produces radically different results depending on audio conditions. Benchmark accuracy is not real-world accuracy.
| Audio Condition | WER | Notes |
|---|---|---|
| LibriSpeech test-clean (audiobook) | 2.7% | Benchmark baseline |
| LibriSpeech test-other (varied) | 5.2% | More realistic |
| Clean studio speech, 1 speaker | 3–5% | Podcasts, interviews |
| Conference call, 2 speakers | 7–10% | Business meetings |
| Zoom/Teams call, 3 speakers | 10–14% | Real-world meetings |
| Phone audio (8 kHz bandwidth) | 12–18% | Telephony |
| Accented English (Indian, Scottish) | 8–15% | Depending on accent strength |
| Noisy environment (cafe, street) | 15–25% | Degrades significantly |
| Far-field mic (room audio) | 18–28% | Laptop or room mic far from speakers |
Accuracy by Language
Whisper's training data is ~65% English, with the remaining 35% split across 99+ languages. Accuracy correlates strongly with training data volume per language.
| Language | Tier | WER | vs English |
|---|---|---|---|
| English | Tier 1 | 2.7–5% | Baseline |
| Spanish | Tier 1 | 3–6% | Near-parity |
| French | Tier 1 | 4–7% | Near-parity |
| German | Tier 1 | 4–8% | Slight drop |
| Italian | Tier 1 | 5–8% | Slight drop |
| Portuguese | Tier 1 | 5–8% | Slight drop |
| Dutch | Tier 1 | 5–9% | Tier 1 low end |
| Japanese | Tier 2 | 8–12% (CER) | Script complexity |
| Korean | Tier 2 | 8–12% (CER) | Script complexity |
| Russian | Tier 2 | 7–11% | Morphology complexity |
| Arabic | Tier 2 | 9–14% | Dialect challenge |
| Hindi | Tier 2 | 9–14% | Code-switching |
| Turkish | Tier 2 | 9–13% | Agglutination |
| Vietnamese | Tier 3 | 15–22% | Tonal + limited training |
| Thai | Tier 3 | 18–26% | Tonal + script |
| Low-resource (Welsh, etc.) | Tier 4 | 30%+ | Limited training data |
Tier 1: near-English parity. Tier 2: usable with editing. Tier 3: draft-quality. Tier 4: experimental. For language-specific tool comparisons, see our multilingual transcription comparison.
Whisper vs Commercial APIs
Here's how Whisper compares to commercial APIs on real-world English audio. Whisper matches or beats most of them, and the gap is narrow (~1–3% WER).
| Engine | Type | English WER (real-world) | Price |
|---|---|---|---|
| Whisper Large-v3 | Open source | ~8–12% | Free (self-hosted) |
| Google Chirp | Commercial API | ~8–11% | $0.016/min |
| AWS Transcribe | Commercial API | ~9–13% | $0.024/min |
| Azure Speech | Commercial API | ~9–12% | $1/hr |
| Deepgram Nova-2 | Commercial API | ~8–11% | $0.0043/min |
| AssemblyAI | Commercial API | ~8–12% | $0.00025/sec |
| Rev AI | Commercial API | ~10–14% | $0.25/min |
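The pricing column mixes per-second, per-minute, and per-hour rates, which makes engines hard to compare at a glance. A small sketch normalizing everything to dollars per audio hour (rates copied from the table above):

```python
# Listed prices and their billing units, taken from the comparison table.
prices = {
    "Google Chirp":    (0.016, "min"),
    "AWS Transcribe":  (0.024, "min"),
    "Azure Speech":    (1.00, "hr"),
    "Deepgram Nova-2": (0.0043, "min"),
    "AssemblyAI":      (0.00025, "sec"),
    "Rev AI":          (0.25, "min"),
}
units_per_hour = {"sec": 3600, "min": 60, "hr": 1}

# Dollars per hour of audio for each engine.
per_hour = {name: round(rate * units_per_hour[unit], 3)
            for name, (rate, unit) in prices.items()}
```

Normalized, the spread is wide: Deepgram at about $0.26/hr and Azure at $1/hr versus Rev AI at $15/hr, while self-hosted Whisper costs only your compute.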
What Whisper Can't Do
Honest limitations — what you'll hit when deploying Whisper in production.
⚠ No custom vocabulary boosting
Major weakness vs Deepgram and Google. Whisper will consistently mis-transcribe proper nouns, jargon, and technical terms.
⚠ Speaker diarization not built-in
Transcription only. Requires separate tools (pyannote, WhisperX) for speaker labels.
⚠ Real-time streaming not native
Designed for batch transcription. Streaming requires chunking workarounds, and quality drops at chunk boundaries.
⚠ Poor on mixed music + speech audio
Hallucinates lyrics when music overlays speech. Mute music tracks before transcribing.
⚠ Hallucinates on silence
Invents text during long pauses — a known issue in Large-v3. Use VAD preprocessing to skip silent sections.
⚠ Repeated tokens on loops
Can get stuck repeating the same phrase on certain audio patterns. Less frequent in v3 than v2.
⚠ Language detection errors
Misidentifies similar languages — Ukrainian as Russian, Catalan as Spanish. Specify the language explicitly for reliability.
⚠ 2 GB recommended file size limit
Very long files (>2 hours) should be chunked for stable processing.
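Two of these limits — quality drops at chunk boundaries and instability on very long files — can be mitigated by splitting audio into overlapping windows and deduplicating the overlap when stitching transcripts. A minimal sketch of the window math (the 10-minute chunk and 5-second overlap are illustrative choices, not Whisper requirements):

```python
def chunk_spans(duration_s: float, chunk_s: float = 600.0, overlap_s: float = 5.0):
    """Split a long recording into overlapping (start, end) windows, in seconds.

    The overlap gives the transcriber context on both sides of each cut, so
    words falling on a boundary appear intact in at least one window.
    """
    spans, start = [], 0.0
    while start < duration_s:
        end = min(start + chunk_s, duration_s)
        spans.append((start, end))
        if end >= duration_s:
            break
        start = end - overlap_s  # back up so adjacent windows overlap
    return spans
```

For a 25-minute file, `chunk_spans(1500.0)` yields (0, 600), (595, 1195), (1190, 1500): every boundary is covered by two windows, so no word is lost to a cut.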
Tools That Use Whisper
Many commercial transcription tools use Whisper under the hood — they're essentially Whisper plus a user interface, file management, and features like diarization or SRT export.
NovaScribe
Whisper Large-v3, $2–$20/mo, 100+ languages, SRT/VTT/TXT/DOCX export, speaker diarization.
TurboScribe
Whisper Large-v3, $10/mo unlimited, batch processing up to 50 files.
Descript
Whisper-based engine in a full video/podcast editor. $12–$24/mo depending on tier.
Fireflies.ai
Mix of Whisper + custom models for meeting transcription with CRM integration.
whisper.cpp (open source)
C++ port by Georgi Gerganov. Runs on CPU efficiently, Apple Silicon optimized.
faster-whisper (open source)
CTranslate2 reimplementation. 4× faster than the original Whisper at the same accuracy.
WhisperX (open source)
Whisper + forced alignment + diarization. Best free option with speaker labels.
Replicate / HuggingFace APIs
Pay-per-use Whisper APIs for developers who don't want to self-host.
How to Run Whisper Yourself
Whisper is MIT-licensed and free to run locally. Technical setup takes 15–60 minutes depending on your familiarity with Python.
Option 1: Official OpenAI Whisper (Python)
pip install openai-whisper
whisper audio.mp3 --model large-v3
Easiest setup, GPU recommended. CPU works but is 5–20× slower.
Option 2: faster-whisper (recommended for speed)
pip install faster-whisper
# Python: load the model and transcribe via the API
from faster_whisper import WhisperModel
model = WhisperModel("large-v3")
segments, info = model.transcribe("audio.mp3")
4× faster than official Whisper, same accuracy. Uses CTranslate2.
Option 3: whisper.cpp (no GPU needed)
git clone https://github.com/ggerganov/whisper.cpp
cd whisper.cpp && make
./models/download-ggml-model.sh large-v3
./main -m models/ggml-large-v3.bin -f audio.wav
Runs fast on CPU, especially Apple Silicon (input must be 16 kHz WAV). Best for local privacy-focused setups.
Don't want the hassle? Use NovaScribe.
Whisper Large-v3 accuracy with zero setup, from $2/mo. 100+ languages, SRT/VTT export, speaker diarization included.
Try NovaScribe Free
Frequently Asked Questions
What is Whisper's word error rate?
Whisper Large-v3 achieves ~2.7% WER on the LibriSpeech test-clean benchmark (clean audiobook audio) and 8–12% WER on real-world English audio (meetings, podcasts, calls). Accuracy drops further on noisy audio, strong accents, or languages other than English.
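For context, WER is the word-level edit distance between a reference transcript and the model's output, divided by the number of reference words: (substitutions + deletions + insertions) / N. A minimal implementation:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev_row[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev_row = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        row = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, 1):
            row[j] = min(prev_row[j] + 1,              # deletion
                         row[j - 1] + 1,               # insertion
                         prev_row[j - 1] + (r != h))   # substitution or match
        prev_row = row
    return prev_row[-1] / len(ref)
```

A 2.7% WER therefore means roughly 27 wrong, missing, or inserted words per 1,000 spoken. Note that published scores usually normalize both texts first (lowercasing, stripping punctuation), so raw comparisons can look worse than the benchmark number.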
Is Whisper better than Google Speech-to-Text?
On English audio, Whisper Large-v3 and Google Chirp are roughly equal (both 8–11% WER on real-world audio). Whisper has broader language support (99+) and is free to self-host. Google has better custom vocabulary support and native streaming. For raw transcription accuracy alone, Whisper is competitive with the best commercial APIs.
Which Whisper model is most accurate?
Whisper Large-v3 (1.5B parameters) is the current most accurate, achieving 2.7% WER on LibriSpeech clean. Large-v2 is slightly less accurate (~3–5%). The Tiny, Base, Small, and Medium models trade accuracy for speed — Tiny achieves only 10–15% WER but runs 32× real-time on a GPU.
Is Whisper accurate for Spanish?
Yes. Spanish is a Tier 1 language for Whisper with 3–6% WER on clean audio — near-parity with English. French, Italian, Portuguese, German, and Dutch perform similarly. Lower-resource languages (Vietnamese, Thai, Welsh) have significantly higher WER.
Why is Whisper sometimes wrong?
Whisper accuracy degrades with: noisy audio (+5–15% WER), strong accents (+5–10%), phone audio vs studio (+5–10%), multiple overlapping speakers (+5–10%), technical/domain vocabulary (no custom vocab support), and long silences (Whisper occasionally hallucinates text during silence).
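The degradation ranges above can be stacked into a rough back-of-envelope estimate. This is a simplistic additive sketch using the FAQ's own numbers; real effects compound and are not strictly additive:

```python
# Clean-studio baseline and per-condition WER penalties, in percent,
# taken from the degradation list above (illustrative, not a real model).
baseline = (3, 5)
penalties = {
    "noisy audio":          (5, 15),
    "strong accent":        (5, 10),
    "phone audio":          (5, 10),
    "overlapping speakers": (5, 10),
}

def estimate_wer_range(conditions):
    """Naive additive estimate of (low, high) percent WER for given conditions."""
    lo = baseline[0] + sum(penalties[c][0] for c in conditions)
    hi = baseline[1] + sum(penalties[c][1] for c in conditions)
    return lo, hi
```

For a noisy phone call, `estimate_wer_range(["noisy audio", "phone audio"])` gives (13, 30) percent, which lines up with the phone and noisy-environment rows in the audio-condition table.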
Can Whisper handle multiple speakers?
Whisper transcribes all speech but does not natively identify speakers (no diarization). For speaker labels, you need to combine Whisper with tools like pyannote-audio or use WhisperX, which adds forced alignment and diarization. Commercial tools built on Whisper (NovaScribe, TurboScribe) include diarization.
Is Whisper free to use commercially?
Yes. Whisper is released under the MIT license, which permits unrestricted commercial use. You can self-host, modify, and include it in products you sell. OpenAI also offers a paid Whisper API ($0.006/min) for those who don't want to self-host.
Does Whisper work offline?
Yes. Once the model is downloaded, Whisper runs entirely locally with no internet connection required. This makes it suitable for privacy-sensitive applications, offline environments, and air-gapped systems. Model sizes range from 39MB (Tiny) to 3GB (Large-v3).