
How Accurate Is Whisper in 2026?

OpenAI Whisper Large-v3: 2.7% WER on LibriSpeech, 8–12% on real-world audio. Among the best open-source ASR models available. But accuracy varies dramatically by language and audio condition.

By NovaScribe Editorial · Updated April 2026

Whisper Accuracy in One Sentence

2.7%
WER on benchmark
LibriSpeech clean
8–12%
Real-world English
meetings, calls, podcasts
99+
Languages supported
accuracy varies by tier
$0
MIT license
free to self-host

Whisper is OpenAI's open-source speech recognition model. It matches or beats most commercial APIs on English accuracy, powers many paid tools (NovaScribe, TurboScribe, Descript), and is free to run yourself. But "Whisper" is really a family of models — size, language, and audio condition all affect accuracy significantly.

Whisper Is Not One Model: Size Matters

Whisper comes in 7 sizes from Tiny (39M parameters) to Large-v3 (1.5B). Accuracy and speed trade off dramatically. Most commercial tools use Large-v2 or Large-v3; self-hosted setups often use Medium or Small for speed.

| Model | Parameters | English WER (clean) | Speed | Use Case |
|---|---|---|---|---|
| Whisper Tiny | 39M | ~10–15% | 32× real-time | Draft, constrained devices |
| Whisper Base | 74M | ~8–12% | 16× real-time | Mobile apps |
| Whisper Small | 244M | ~6–9% | 6× real-time | Balanced |
| Whisper Medium | 769M | ~4–6% | 2× real-time | Quality focused |
| Whisper Large-v2 | 1.5B | ~3–5% | 1× real-time | Production (older) |
| Whisper Large-v3 | 1.5B | ~2.7% | 1× real-time | Production (current best) |
| Whisper Large-v3 Turbo | 809M | ~3–4% | 8× real-time | Fast production |

Real-time multipliers assume modern GPU (RTX 3090 or better). On CPU, all models run 5–20× slower. Large-v3 Turbo, released late 2024, is a distilled version of Large-v3 with most of the accuracy at 8× the speed.
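The multipliers translate directly into wall-clock estimates. A minimal sketch, using the rough figures quoted above (the CPU slowdown factor is an illustrative assumption, not a measurement):

```python
# Estimated processing time: audio duration divided by the real-time multiplier.
# Multipliers are the GPU figures from the table above; CPU runs 5-20x slower.
def processing_minutes(audio_minutes: float, rt_multiplier: float,
                       cpu_slowdown: float = 1.0) -> float:
    return audio_minutes / rt_multiplier * cpu_slowdown

# A 60-minute recording with Large-v3 Turbo (8x real-time) on GPU:
print(processing_minutes(60, 8))      # 7.5 minutes
# The same file with Large-v3 (1x) on a CPU assumed ~10x slower than GPU:
print(processing_minutes(60, 1, 10))  # 600.0 minutes (10 hours)
```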

Accuracy by Audio Condition

Same Whisper Large-v3 model, radically different results depending on audio conditions. Benchmark accuracy is not real-world accuracy.

| Audio Condition | WER | Notes |
|---|---|---|
| LibriSpeech test-clean (audiobook) | 2.7% | Benchmark baseline |
| LibriSpeech test-other (varied) | 5.2% | More realistic |
| Clean studio speech, 1 speaker | 3–5% | Podcasts, interviews |
| Conference call, 2 speakers | 7–10% | Business meetings |
| Zoom/Teams call, 3 speakers | 10–14% | Real-world meetings |
| Phone audio (8 kHz bandwidth) | 12–18% | Telephony |
| Accented English (Indian, Scottish) | 8–15% | Depending on accent strength |
| Noisy environment (cafe, street) | 15–25% | Degrades significantly |
| Far-field mic (room audio) | 18–28% | Lapel or laptop mic in large room |

Key insight: Audio quality affects Whisper accuracy more than any other factor. Moving from a laptop mic to a $30 USB mic typically improves WER by 5–10 percentage points. See our verdict page for when this level of accuracy is sufficient for your use case.
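All of these figures use word error rate: the word-level edit distance between the reference transcript and the model's output, divided by the number of reference words. A minimal illustration of the metric itself:

```python
# Word error rate = (substitutions + deletions + insertions) / reference words,
# computed with a standard word-level edit distance. Illustrative only.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution or match
    return dp[-1][-1] / len(ref)

# One wrong word out of ten -> 10% WER:
print(wer("the quick brown fox jumps over the lazy dog today",
          "the quick brown fox jumped over the lazy dog today"))  # 0.1
```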

Accuracy by Language

Whisper's training data is ~65% English, with the remaining 35% split across 99+ languages. Accuracy correlates strongly with training data volume per language.

| Language | Tier | WER | vs English |
|---|---|---|---|
| English | Tier 1 | 2.7–5% | Baseline |
| Spanish | Tier 1 | 3–6% | Near-parity |
| French | Tier 1 | 4–7% | Near-parity |
| German | Tier 1 | 4–8% | Slight drop |
| Italian | Tier 1 | 5–8% | Slight drop |
| Portuguese | Tier 1 | 5–8% | Slight drop |
| Dutch | Tier 1 | 5–9% | Tier 1 low end |
| Japanese | Tier 2 | 8–12% (CER) | Script complexity |
| Korean | Tier 2 | 8–12% (CER) | Script complexity |
| Russian | Tier 2 | 7–11% | Morphology complexity |
| Arabic | Tier 2 | 9–14% | Dialect challenge |
| Hindi | Tier 2 | 9–14% | Code-switching |
| Turkish | Tier 2 | 9–13% | Agglutination |
| Vietnamese | Tier 3 | 15–22% | Tonal + limited training |
| Thai | Tier 3 | 18–26% | Tonal + script |
| Low-resource (Welsh, etc.) | Tier 4 | 30%+ | Limited training data |

Tier 1: near-English parity. Tier 2: usable with editing. Tier 3: draft-quality. Tier 4: experimental. For language-specific tool comparisons, see our multilingual transcription comparison.

Whisper vs Commercial APIs

How Whisper compares to commercial-only APIs on real-world English audio. Whisper matches or beats most commercial APIs — the gap is narrow (~1–3% WER).

| Engine | Type | English WER (real-world) | Price |
|---|---|---|---|
| Whisper Large-v3 | Open source | ~8–12% | Free (self-hosted) |
| Google Chirp | Commercial API | ~8–11% | $0.016/min |
| AWS Transcribe | Commercial API | ~9–13% | $0.024/min |
| Azure Speech | Commercial API | ~9–12% | $1/hr |
| Deepgram Nova-2 | Commercial API | ~8–11% | $0.0043/min |
| AssemblyAI | Commercial API | ~8–12% | $0.00025/sec |
| Rev AI | Commercial API | ~10–14% | $0.25/min |

Why Whisper wins on value: It matches commercial APIs on raw accuracy but is free to self-host. Commercial APIs' main advantages are custom vocabulary boosting, built-in speaker diarization, and real-time streaming — not raw transcription accuracy. For accuracy alone, Whisper is competitive with the best.
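To make the price gap concrete, here is the table's per-unit pricing applied to a hypothetical 100 hours of audio per month (volume discounts and free tiers ignored; the workload is an illustrative assumption):

```python
# Monthly cost of 100 hours of audio at the per-unit list prices quoted above.
MINUTES = 100 * 60  # 100 hours/month, an illustrative workload
prices_per_min = {
    "Google Chirp":    0.016,
    "AWS Transcribe":  0.024,
    "Azure Speech":    1.0 / 60,    # $1/hr converted to per-minute
    "Deepgram Nova-2": 0.0043,
    "AssemblyAI":      0.00025 * 60,  # $0.00025/sec converted to per-minute
    "Rev AI":          0.25,
}
for name, per_min in sorted(prices_per_min.items(), key=lambda kv: kv[1]):
    print(f"{name}: ${per_min * MINUTES:,.2f}/month")
# Self-hosted Whisper: $0 in licensing; you pay only for your own compute.
```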

What Whisper Can't Do

Honest limitations — what you'll hit when deploying Whisper in production.

No custom vocabulary boosting

Major weakness vs Deepgram and Google. Whisper will mis-transcribe proper nouns, jargon, and technical terms consistently.

Speaker diarization not built-in

Transcription only. Requires separate tools (pyannote, WhisperX) for speaker labels.

Real-time streaming not native

Designed for batch transcription. Streaming requires chunking workarounds, and quality drops at chunk boundaries.

Poor on music + speech mixed audio

Hallucinates lyrics when music overlays speech. Mute music tracks before transcribing.

Hallucinates on silence

Invents text during long pauses — a known issue in Large-v3. Use VAD preprocessing to skip silent sections.
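VAD preprocessing means detecting stretches with no speech energy and cutting them out before transcription. Below is a toy energy-gate sketch of the idea; production pipelines typically use a trained VAD such as Silero, which is what faster-whisper's `vad_filter=True` option wraps.

```python
import math

# Toy energy-gate VAD: flag each fixed-length frame as speech or silence by
# comparing its RMS level against a threshold. Frame length and threshold
# here are illustrative values, not tuned recommendations.
def speech_frames(samples, frame_len=400, rms_threshold=0.01):
    frames = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        rms = math.sqrt(sum(x * x for x in frame) / frame_len)
        frames.append((start, rms >= rms_threshold))
    return frames

# 1 s of silence followed by 1 s of constant "speech" at 16 kHz:
signal = [0.0] * 16000 + [0.05] * 16000
flags = [is_speech for _, is_speech in speech_frames(signal)]
print(flags.count(False), flags.count(True))  # 40 40
```

Frames flagged as silence can simply be dropped before the audio reaches Whisper, which removes the material the model would otherwise hallucinate over.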

Repeated tokens on loops

Can get stuck repeating the same phrase on certain audio patterns. Less frequent in v3 than v2.

Language detection errors

Misidentifies similar languages — Ukrainian as Russian, Catalan as Spanish. Specify language explicitly for reliability.

2GB file size recommended limit

Very long files (>2 hours) should be chunked for stable processing.
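One way to chunk safely is to overlap consecutive segments, so words at a cut point appear in both chunks and can be deduplicated afterwards. A sketch of the boundary arithmetic (the 30-minute chunk and 30-second overlap are illustrative choices, not Whisper requirements):

```python
# Split a long recording into overlapping (start, end) windows in seconds,
# so no word is lost exactly at a cut.
def plan_chunks(duration_s: float, chunk_s: float = 1800, overlap_s: float = 30):
    chunks, start = [], 0.0
    while start < duration_s:
        end = min(start + chunk_s, duration_s)
        chunks.append((start, end))
        if end >= duration_s:
            break
        start = end - overlap_s  # back up so chunks share a seam
    return chunks

# A 2.5-hour file (9000 s) becomes six ~30-minute chunks:
print(plan_chunks(9000))
```

The resulting (start, end) pairs can then be cut with a tool like ffmpeg (its `-ss`/`-to` seeking options) and transcribed chunk by chunk.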

Tools That Use Whisper

Many commercial transcription tools use Whisper under the hood — they're essentially Whisper plus a user interface, file management, and features like diarization or SRT export.

NovaScribe

Whisper Large-v3, $2–$20/mo, 100+ languages, SRT/VTT/TXT/DOCX export, speaker diarization.

TurboScribe

Whisper Large-v3, $10/mo unlimited, batch processing up to 50 files.

Descript

Whisper-based engine in a full video/podcast editor. $12–$24/mo depending on tier.

Fireflies.ai

Mix of Whisper + custom models for meeting transcription with CRM integration.

whisper.cpp (open source)

C++ port by Georgi Gerganov. Runs on CPU efficiently, Apple Silicon optimized.

faster-whisper (open source)

CTranslate2 reimplementation. 4× faster than original Whisper at same accuracy.

WhisperX (open source)

Whisper + forced alignment + diarization. Best free option with speaker labels.

Replicate / HuggingFace APIs

Pay-per-use Whisper APIs for developers who don't want to self-host.

How to Run Whisper Yourself

Whisper is MIT-licensed and free to run locally. Technical setup takes 15–60 minutes depending on your familiarity with Python.

Option 1: Official OpenAI Whisper (Python)

pip install openai-whisper
whisper audio.mp3 --model large-v3

Easiest setup, GPU recommended. CPU works but 5–20× slower.

Option 2: faster-whisper (recommended for speed)

pip install faster-whisper

# Then, in Python:
from faster_whisper import WhisperModel
segments, info = WhisperModel("large-v3").transcribe("audio.mp3")

4× faster than official Whisper, same accuracy. Uses CTranslate2.

Option 3: whisper.cpp (no GPU needed)

git clone https://github.com/ggerganov/whisper.cpp
cd whisper.cpp && make
bash models/download-ggml-model.sh large-v3
./main -m models/ggml-large-v3.bin -f audio.wav

Runs fast on CPU, especially Apple Silicon. Best for local privacy-focused setups.

Don't want the hassle? Use NovaScribe.

Whisper Large-v3 accuracy with zero setup, from $2/mo. 100+ languages, SRT/VTT export, speaker diarization included.

Try NovaScribe Free

Frequently Asked Questions

What is Whisper's word error rate?

Whisper Large-v3 achieves ~2.7% WER on the LibriSpeech test-clean benchmark (clean audiobook audio) and 8–12% WER on real-world English audio (meetings, podcasts, calls). Accuracy drops further on noisy audio, strong accents, or languages other than English.

Is Whisper better than Google Speech-to-Text?

On English audio, Whisper Large-v3 and Google Chirp are roughly equal (both 8–11% WER on real-world audio). Whisper has broader language support (99+) and is free to self-host. Google has better custom vocabulary support and native streaming. For raw transcription accuracy alone, Whisper is competitive with the best commercial APIs.

Which Whisper model is most accurate?

Whisper Large-v3 (1.5B parameters) is the current most accurate, achieving 2.7% WER on LibriSpeech clean. Large-v2 is slightly less accurate (~3–5%). The Tiny, Base, Small, and Medium models trade accuracy for speed — Tiny achieves only 10–15% WER but runs 32× real-time on a GPU.

Is Whisper accurate for Spanish?

Yes. Spanish is a Tier 1 language for Whisper with 3–6% WER on clean audio — near-parity with English. French, Italian, Portuguese, German, and Dutch perform similarly. Lower-resource languages (Vietnamese, Thai, Welsh) have significantly higher WER.

Why is Whisper sometimes wrong?

Whisper accuracy degrades with: noisy audio (+5–15% WER), strong accents (+5–10%), phone audio vs studio (+5–10%), multiple overlapping speakers (+5–10%), technical/domain vocabulary (no custom vocab support), and long silences (Whisper occasionally hallucinates text during silence).

Can Whisper handle multiple speakers?

Whisper transcribes all speech but does not natively identify speakers (no diarization). For speaker labels, you need to combine Whisper with tools like pyannote-audio or use WhisperX, which adds forced alignment and diarization. Commercial tools built on Whisper (NovaScribe, TurboScribe) include diarization.

Is Whisper free to use commercially?

Yes. Whisper is released under the MIT license, which permits unrestricted commercial use. You can self-host, modify, and include it in products you sell. OpenAI also offers a paid Whisper API ($0.006/min) for those who don't want to self-host.

Does Whisper work offline?

Yes. Once the model is downloaded, Whisper runs entirely locally with no internet connection required. This makes it suitable for privacy-sensitive applications, offline environments, and air-gapped systems. Model sizes range from 39MB (Tiny) to 3GB (Large-v3).