How Accurate Is Whisper in 2026?
OpenAI Whisper Large-v3: 2.7% WER on LibriSpeech, 8–12% on real-world audio. Among the best open-source ASR models available. But accuracy varies dramatically by language and audio condition.
By NovaScribe Editorial · Updated April 2026
Whisper Accuracy in One Sentence
Whisper is OpenAI's open-source speech recognition model. It matches or beats most commercial APIs on English accuracy, powers many paid tools (NovaScribe, TurboScribe, Descript), and is free to run yourself. But "Whisper" is really a family of models — size, language, and audio condition all affect accuracy significantly.
Whisper Is Not One Model: Size Matters
Whisper comes in 7 sizes from Tiny (39M parameters) to Large-v3 (1.5B). Accuracy and speed trade off dramatically. Most commercial tools use Large-v2 or Large-v3; self-hosted setups often use Medium or Small for speed.
| Model | Parameters | English WER (clean) | Speed | Use Case |
|---|---|---|---|---|
| Whisper Tiny | 39M | ~10–15% | 32× real-time | Draft, constrained devices |
| Whisper Base | 74M | ~8–12% | 16× real-time | Mobile apps |
| Whisper Small | 244M | ~6–9% | 6× real-time | Balanced |
| Whisper Medium | 769M | ~4–6% | 2× real-time | Quality focused |
| Whisper Large-v2 | 1.5B | ~3–5% | 1× real-time | Production (older) |
| Whisper Large-v3 | 1.5B | ~2.7% | 1× real-time | Production (current best) |
| Whisper Large-v3 Turbo | 809M | ~3–4% | 8× real-time | Fast production |
Real-time multipliers assume modern GPU (RTX 3090 or better). On CPU, all models run 5–20× slower. Large-v3 Turbo, released late 2024, is a distilled version of Large-v3 with most of the accuracy at 8× the speed.
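To make the speed column concrete, here is a quick sketch estimating wall-clock time for a one-hour recording, using the table's GPU multipliers (illustrative only; real throughput depends on hardware and audio):

```python
# Real-time multipliers from the table above (modern GPU, illustrative).
speeds = {
    "tiny": 32, "base": 16, "small": 6, "medium": 2,
    "large-v2": 1, "large-v3": 1, "large-v3-turbo": 8,
}

# Estimated wall-clock minutes to transcribe a 60-minute recording.
minutes = {model: 60 / multiplier for model, multiplier in speeds.items()}
```

At 1× real-time, Large-v3 needs roughly an hour for an hour of audio, while Turbo finishes the same file in about 7.5 minutes and Tiny in under 2.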
Accuracy by Audio Condition
The same Whisper Large-v3 model produces radically different results depending on audio conditions. Benchmark accuracy is not real-world accuracy.
| Audio Condition | WER | Notes |
|---|---|---|
| LibriSpeech test-clean (audiobook) | 2.7% | Benchmark baseline |
| LibriSpeech test-other (varied) | 5.2% | More realistic |
| Clean studio speech, 1 speaker | 3–5% | Podcasts, interviews |
| Conference call, 2 speakers | 7–10% | Business meetings |
| Zoom/Teams call, 3 speakers | 10–14% | Real-world meetings |
| Phone audio (8 kHz bandwidth) | 12–18% | Telephony |
| Accented English (Indian, Scottish) | 8–15% | Depending on accent strength |
| Noisy environment (cafe, street) | 15–25% | Degrades significantly |
| Far-field mic (room audio) | 18–28% | Laptop or room mic far from speakers |
Accuracy by Language
Whisper's training data is ~65% English, with the remaining 35% split across 99+ languages. Accuracy correlates strongly with training data volume per language.
| Language | Tier | WER | vs English |
|---|---|---|---|
| English | Tier 1 | 2.7–5% | Baseline |
| Spanish | Tier 1 | 3–6% | Near-parity |
| French | Tier 1 | 4–7% | Near-parity |
| German | Tier 1 | 4–8% | Slight drop |
| Italian | Tier 1 | 5–8% | Slight drop |
| Portuguese | Tier 1 | 5–8% | Slight drop |
| Dutch | Tier 1 | 5–9% | Tier 1 low end |
| Japanese | Tier 2 | 8–12% (CER) | Script complexity |
| Korean | Tier 2 | 8–12% (CER) | Script complexity |
| Russian | Tier 2 | 7–11% | Morphology complexity |
| Arabic | Tier 2 | 9–14% | Dialect challenge |
| Hindi | Tier 2 | 9–14% | Code-switching |
| Turkish | Tier 2 | 9–13% | Agglutination |
| Vietnamese | Tier 3 | 15–22% | Tonal + limited training |
| Thai | Tier 3 | 18–26% | Tonal + script |
| Low-resource (Welsh, etc.) | Tier 4 | 30%+ | Limited training data |
Tier 1: near-English parity. Tier 2: usable with editing. Tier 3: draft-quality. Tier 4: experimental. For language-specific tool comparisons, see our multilingual transcription comparison.
Whisper vs Commercial APIs
Here's how Whisper compares to commercial APIs on real-world English audio. Whisper matches or beats most of them, and the gap is narrow (~1–3% WER).
| Engine | Type | English WER (real-world) | Price |
|---|---|---|---|
| Whisper Large-v3 | Open source | ~8–12% | Free (self-hosted) |
| Google Chirp | Commercial API | ~8–11% | $0.016/min |
| AWS Transcribe | Commercial API | ~9–13% | $0.024/min |
| Azure Speech | Commercial API | ~9–12% | $1/hr |
| Deepgram Nova-2 | Commercial API | ~8–11% | $0.0043/min |
| AssemblyAI | Commercial API | ~8–12% | $0.00025/sec |
| Rev AI | Commercial API | ~10–14% | $0.25/min |
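The pricing column mixes per-second, per-minute, and per-hour rates, which makes engines hard to compare at a glance. A small sketch normalizing everything to dollars per audio hour (rates copied from the table above):

```python
# Listed prices and their billing units, taken from the comparison table.
prices = {
    "Google Chirp":    (0.016, "min"),
    "AWS Transcribe":  (0.024, "min"),
    "Azure Speech":    (1.00, "hr"),
    "Deepgram Nova-2": (0.0043, "min"),
    "AssemblyAI":      (0.00025, "sec"),
    "Rev AI":          (0.25, "min"),
}
units_per_hour = {"sec": 3600, "min": 60, "hr": 1}

# Dollars per hour of audio for each engine.
per_hour = {name: round(rate * units_per_hour[unit], 3)
            for name, (rate, unit) in prices.items()}
```

Normalized, the spread is wide: Deepgram at about $0.26/hr and Azure at $1/hr versus Rev AI at $15/hr, while self-hosted Whisper costs only your compute.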
What Whisper Can't Do
Honest limitations — what you'll hit when deploying Whisper in production.
⚠ No custom vocabulary boosting
Major weakness vs Deepgram and Google. Whisper will consistently mis-transcribe proper nouns, jargon, and technical terms.
⚠ Speaker diarization not built-in
Transcription only. Requires separate tools (pyannote, WhisperX) for speaker labels.
⚠ Real-time streaming not native
Designed for batch transcription. Streaming requires chunking workarounds, and quality drops at chunk boundaries.
⚠ Poor on mixed music + speech audio
Hallucinates lyrics when music overlays speech. Mute music tracks before transcribing.
⚠ Hallucinates on silence
Invents text during long pauses — a known issue in Large-v3. Use VAD preprocessing to skip silent sections.
⚠ Repeated tokens on loops
Can get stuck repeating the same phrase on certain audio patterns. Less frequent in v3 than v2.
⚠ Language detection errors
Misidentifies similar languages — Ukrainian as Russian, Catalan as Spanish. Specify the language explicitly for reliability.
⚠ 2 GB recommended file size limit
Very long files (>2 hours) should be chunked for stable processing.
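Two of these limits — quality drops at chunk boundaries and instability on very long files — can be mitigated by splitting audio into overlapping windows and deduplicating the overlap when stitching transcripts. A minimal sketch of the window math (the 10-minute chunk and 5-second overlap are illustrative choices, not Whisper requirements):

```python
def chunk_spans(duration_s: float, chunk_s: float = 600.0, overlap_s: float = 5.0):
    """Split a long recording into overlapping (start, end) windows, in seconds.

    The overlap gives the transcriber context on both sides of each cut, so
    words falling on a boundary appear intact in at least one window.
    """
    spans, start = [], 0.0
    while start < duration_s:
        end = min(start + chunk_s, duration_s)
        spans.append((start, end))
        if end >= duration_s:
            break
        start = end - overlap_s  # back up so adjacent windows overlap
    return spans
```

For a 25-minute file, `chunk_spans(1500.0)` yields (0, 600), (595, 1195), (1190, 1500): every boundary is covered by two windows, so no word is lost to a cut.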
Tools That Use Whisper
Many commercial transcription tools use Whisper under the hood — they're essentially Whisper plus a user interface, file management, and features like diarization or SRT export.
NovaScribe
Whisper Large-v3, $2–$20/mo, 100+ languages, SRT/VTT/TXT/DOCX export, speaker diarization.
TurboScribe
Whisper Large-v3, $10/mo unlimited, batch processing up to 50 files.
Descript
Whisper-based engine in a full video/podcast editor. $12–$24/mo depending on tier.
Fireflies.ai
Mix of Whisper + custom models for meeting transcription with CRM integration.
whisper.cpp (open source)
C++ port by Georgi Gerganov. Runs on CPU efficiently, Apple Silicon optimized.
faster-whisper (open source)
CTranslate2 reimplementation. 4× faster than the original Whisper at the same accuracy.
WhisperX (open source)
Whisper + forced alignment + diarization. Best free option with speaker labels.
Replicate / HuggingFace APIs
Pay-per-use Whisper APIs for developers who don't want to self-host.
How to Run Whisper Yourself
Whisper is MIT-licensed and free to run locally. Technical setup takes 15–60 minutes depending on your familiarity with Python.
Option 1: Official OpenAI Whisper (Python)
pip install openai-whisper
whisper audio.mp3 --model large-v3
Easiest setup, GPU recommended. CPU works but is 5–20× slower.
Option 2: faster-whisper (recommended for speed)
pip install faster-whisper
# Python: load the model and transcribe via the API
from faster_whisper import WhisperModel
model = WhisperModel("large-v3")
segments, info = model.transcribe("audio.mp3")
4× faster than official Whisper, same accuracy. Uses CTranslate2.
Option 3: whisper.cpp (no GPU needed)
git clone https://github.com/ggerganov/whisper.cpp
cd whisper.cpp && make
./models/download-ggml-model.sh large-v3
./main -m models/ggml-large-v3.bin -f audio.wav
Runs fast on CPU, especially Apple Silicon (input must be 16 kHz WAV). Best for local privacy-focused setups.
Don't want the hassle? Use NovaScribe.
Whisper Large-v3 accuracy with zero setup, from $2/mo. 100+ languages, SRT/VTT export, speaker diarization included.
Try NovaScribe Free
Frequently Asked Questions
What is Whisper's word error rate?
Whisper Large-v3 achieves ~2.7% WER on the LibriSpeech test-clean benchmark (clean audiobook audio) and 8–12% WER on real-world English audio (meetings, podcasts, calls). Accuracy drops further on noisy audio, strong accents, or languages other than English.
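For context, WER is the word-level edit distance between a reference transcript and the model's output, divided by the number of reference words: (substitutions + deletions + insertions) / N. A minimal implementation:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev_row[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev_row = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        row = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, 1):
            row[j] = min(prev_row[j] + 1,              # deletion
                         row[j - 1] + 1,               # insertion
                         prev_row[j - 1] + (r != h))   # substitution or match
        prev_row = row
    return prev_row[-1] / len(ref)
```

A 2.7% WER therefore means roughly 27 wrong, missing, or inserted words per 1,000 spoken. Note that published scores usually normalize both texts first (lowercasing, stripping punctuation), so raw comparisons can look worse than the benchmark number.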
Is Whisper better than Google Speech-to-Text?
On English audio, Whisper Large-v3 and Google Chirp are roughly equal (both 8–11% WER on real-world audio). Whisper has broader language support (99+) and is free to self-host. Google has better custom vocabulary support and native streaming. For raw transcription accuracy alone, Whisper is competitive with the best commercial APIs.
Which Whisper model is most accurate?
Whisper Large-v3 (1.5B parameters) is the current most accurate, achieving 2.7% WER on LibriSpeech clean. Large-v2 is slightly less accurate (~3–5%). The Tiny, Base, Small, and Medium models trade accuracy for speed — Tiny achieves only 10–15% WER but runs 32× real-time on a GPU.
Is Whisper accurate for Spanish?
Yes. Spanish is a Tier 1 language for Whisper with 3–6% WER on clean audio — near-parity with English. French, Italian, Portuguese, German, and Dutch perform similarly. Lower-resource languages (Vietnamese, Thai, Welsh) have significantly higher WER.
Why is Whisper sometimes wrong?
Whisper accuracy degrades with: noisy audio (+5–15% WER), strong accents (+5–10%), phone audio vs studio (+5–10%), multiple overlapping speakers (+5–10%), technical/domain vocabulary (no custom vocab support), and long silences (Whisper occasionally hallucinates text during silence).
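The degradation ranges above can be stacked into a rough back-of-envelope estimate. This is a simplistic additive sketch using the FAQ's own numbers; real effects compound and are not strictly additive:

```python
# Clean-studio baseline and per-condition WER penalties, in percent,
# taken from the degradation list above (illustrative, not a real model).
baseline = (3, 5)
penalties = {
    "noisy audio":          (5, 15),
    "strong accent":        (5, 10),
    "phone audio":          (5, 10),
    "overlapping speakers": (5, 10),
}

def estimate_wer_range(conditions):
    """Naive additive estimate of (low, high) percent WER for given conditions."""
    lo = baseline[0] + sum(penalties[c][0] for c in conditions)
    hi = baseline[1] + sum(penalties[c][1] for c in conditions)
    return lo, hi
```

For a noisy phone call, `estimate_wer_range(["noisy audio", "phone audio"])` gives (13, 30) percent, which lines up with the phone and noisy-environment rows in the audio-condition table.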
Can Whisper handle multiple speakers?
Whisper transcribes all speech but does not natively identify speakers (no diarization). For speaker labels, you need to combine Whisper with tools like pyannote-audio or use WhisperX, which adds forced alignment and diarization. Commercial tools built on Whisper (NovaScribe, TurboScribe) include diarization.
Is Whisper free to use commercially?
Yes. Whisper is released under the MIT license, which permits unrestricted commercial use. You can self-host, modify, and include it in products you sell. OpenAI also offers a paid Whisper API ($0.006/min) for those who don't want to self-host.
Does Whisper work offline?
Yes. Once the model is downloaded, Whisper runs entirely locally with no internet connection required. This makes it suitable for privacy-sensitive applications, offline environments, and air-gapped systems. Model sizes range from 39MB (Tiny) to 3GB (Large-v3).