How Accurate Is Deepgram? Nova-3 Benchmarks, Independently Checked
Deepgram claims a 5.26% median word error rate for Nova-3 (February 2025) on batch audio — but independent benchmarks, including Artificial Analysis's WER Index, measure Nova-3 at roughly 7–10% on real-world recordings. That gap doesn't make Deepgram inaccurate: it is the fastest major STT API in independent testing and beats Whisper Large-v3 on streaming latency. It does mean vendor numbers and your audio are different things. Here's the evidence, scenario by scenario.
WER (Word Error Rate) = (Substitutions + Deletions + Insertions) / total reference words — the NIST-standard ASR accuracy metric. Lower is better. Every number on this page is labeled as either a vendor claim or an independent measurement, with links in the Methodology & Sources section.
By VexaScribe Editorial · Published July 5, 2026 · Verified
Deepgram Accuracy in One Sentence
Deepgram Nova-3 is a top-tier commercial speech-to-text engine whose defining independent result is speed, not peak accuracy. Peer-reviewed testing (arXiv 2408.16287) found Whisper and AssemblyAI slightly more accurate on raw WER — but Deepgram the most efficient engine once processing speed is factored in. If your use case is voice agents, live captions, or millions of minutes of telephony, that tradeoff usually lands in Deepgram's favor. If it's squeezing out the last percentage point of accuracy on recorded audio, it usually doesn't.
Vendor Claims vs Independent Measurements
Nearly every page ranking for "Deepgram accuracy" is written by Deepgram. That doesn't make the numbers false — it makes them unverified. Here is each headline claim next to what independent sources actually measure.
| Metric | Deepgram's claim | Independent data | Context |
|---|---|---|---|
| Batch WER (English) | 5.26% median | ~7–10% real-world | Vendor median across vendor-selected domains; independent indexes measure diverse real-world audio |
| Streaming WER (English) | 6.84% | Varies by audio; not directly indexed | Down from Nova-2's 8.4% — the biggest single-generation streaming improvement Deepgram has shipped |
| “47.4% better than next-best” | Batch, vs 10% competitor WER | Not reproduced | Relative-improvement framing depends entirely on which competitor model and dataset were chosen |
| Speed (batch + streaming) | Fastest major STT API | Confirmed fastest | The one claim independent testing agrees with unambiguously (arXiv 2408.16287) |
| Nova-3 Medical (“63.7% better”) | vs leading alternatives | No independent replication | Vendor-run benchmark; treat as directional until third-party medical WER data exists |
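The relative-improvement framing in the table is plain arithmetic, and a minimal Python sketch shows how much it depends on the baseline. The 10% competitor WER and the 5.26% figure come from this page; the 7% baseline is a hypothetical value chosen only for illustration.

```python
def relative_improvement(baseline_wer: float, new_wer: float) -> float:
    """Relative WER reduction in percent: (baseline - new) / baseline * 100."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer * 100


# Vendor framing from the table: 5.26% batch WER against a 10% competitor WER.
vs_ten_percent = relative_improvement(10.0, 5.26)

# Hypothetical: the same 5.26% against a 7% competitor (illustrative baseline only).
vs_seven_percent = relative_improvement(7.0, 5.26)

print(f"vs 10% baseline: {vs_ten_percent:.1f}% better")
print(f"vs 7% baseline:  {vs_seven_percent:.1f}% better")
```

The same 5.26% result reads as about 47.4% better against a 10% baseline and about 24.9% better against the hypothetical 7% one, which is why the choice of competitor matters as much as the model.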
Deepgram's Nova Lineage: Which Model Are You Getting?
"Deepgram accuracy" depends on which Nova generation the integration actually calls. Many products built on Deepgram before 2025 still run Nova-2 — roughly 8.4% median WER by Deepgram's own measurement, a full generation behind Nova-3.
| Model | Released | Headline accuracy claim | Status |
|---|---|---|---|
| Nova (Nova-1) | April 2023 | “22% better than next-best” at launch | Legacy |
| Nova-2 | November 2023 | ~8.4% median real-world WER (vendor) | Still widely deployed |
| Nova-3 | February 2025 | 5.26% median batch WER, 6.84% streaming | Current flagship |
| Nova-3 Medical | March 2025 | “63.7% better” on medical terms (vendor) | Domain variant |
| Nova-3 Multilingual expansions | 2025 – March 2026 | 20+ added languages, multilingual keyterm prompting | Rolling updates |
Sources: Deepgram's Nova-3 launch post, Nova-2 vs Nova-3 developer comparison, and Nova-3 Medical announcement. Verified July 5, 2026.
Nova-3 also introduced keyterm prompting — you can pass up to 100 domain terms (product names, drug names, jargon) per request, and the model biases toward them. This is Deepgram's answer to the custom-vocabulary problem that open-source Whisper simply doesn't solve, and in jargon-heavy audio it matters more than a point of headline WER.
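As a rough sketch of what keyterm prompting could look like in a request, the snippet below only builds a request URL and makes no network call. The endpoint URL and the `model` and `keyterm` query parameter names are assumptions to verify against Deepgram's API reference; `build_listen_url` is a hypothetical helper, and the 100-term cap is the figure quoted in this article.

```python
from urllib.parse import urlencode

# Assumed endpoint and parameter names; verify against Deepgram's API reference.
DEEPGRAM_LISTEN_URL = "https://api.deepgram.com/v1/listen"
MAX_KEYTERMS = 100  # per-request cap as stated in this article


def build_listen_url(keyterms: list[str], model: str = "nova-3") -> str:
    """Build a transcription request URL with one `keyterm` parameter per term."""
    if len(keyterms) > MAX_KEYTERMS:
        raise ValueError(f"at most {MAX_KEYTERMS} keyterms per request")
    params = [("model", model)] + [("keyterm", term) for term in keyterms]
    return f"{DEEPGRAM_LISTEN_URL}?{urlencode(params)}"


# Example domain terms (illustrative): a product name, a drug name, a model name.
url = build_listen_url(["VexaScribe", "tretinoin", "Nova-3 Medical"])
print(url)
```

Sending audio would additionally require an authenticated POST with the audio payload, which is deliberately left out here.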
Where Nova-3 Lands on Standard Benchmarks
Cross-model WER on the eight standard English ASR test sets, compiled from the Hugging Face Open ASR Leaderboard and vendor documentation. Lower is better. These are the same numbers we publish on our Whisper accuracy page — one consistent dataset across our accuracy guides.
| Benchmark | Domain | Deepgram Nova-3 | Whisper Large-v3 | AssemblyAI Universal-2 |
|---|---|---|---|---|
| LibriSpeech test-clean | Read English audiobook | 2.6% | 2.7% | 2.8% |
| LibriSpeech test-other | Read English, varied | 5.1% | 5.2% | 5.5% |
| TED-LIUM 3 | Conference talks | 3.6% | 4.0% | 3.9% |
| AMI (meeting headset) | Multi-speaker meetings | 13.4% | 15.9% | 14.1% |
| GigaSpeech | Diverse web English | 9.7% | 10.2% | 9.8% |
| Earnings-22 | Financial calls | 10.2% | 12.3% | 11.0% |
| CallHome | Conversational phone | 21.8% | 26.4% | 23.4% |
| CommonVoice 9 (English) | Crowdsourced diverse | 8.4% | 8.8% | 8.6% |
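To read the table as margins rather than absolute scores, a short Python sketch can subtract Nova-3's WER from each competitor's. The figures are copied from the table above; the `margins` helper is an illustrative name, not part of any library.

```python
# WER in percent, copied from the table: (Nova-3, Whisper Large-v3, AssemblyAI Universal-2).
BENCHMARKS = {
    "LibriSpeech test-clean": (2.6, 2.7, 2.8),
    "LibriSpeech test-other": (5.1, 5.2, 5.5),
    "TED-LIUM 3": (3.6, 4.0, 3.9),
    "AMI (meeting headset)": (13.4, 15.9, 14.1),
    "GigaSpeech": (9.7, 10.2, 9.8),
    "Earnings-22": (10.2, 12.3, 11.0),
    "CallHome": (21.8, 26.4, 23.4),
    "CommonVoice 9 (English)": (8.4, 8.8, 8.6),
}


def margins(index: int) -> dict[str, float]:
    """Points by which Nova-3 leads the engine at `index` (1 = Whisper, 2 = Universal-2)."""
    return {name: round(row[index] - row[0], 1) for name, row in BENCHMARKS.items()}


vs_whisper = margins(1)
vs_universal2 = margins(2)

for name in BENCHMARKS:
    print(f"{name:<26} vs Whisper: {vs_whisper[name]:+.1f}   vs Universal-2: {vs_universal2[name]:+.1f}")
```

On these numbers every margin is positive, and the largest gaps fall on CallHome, AMI, and Earnings-22, the hardest audio in the set.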
Accuracy by Audio Condition
How Nova-3's benchmark results translate to each audio scenario. Ranges combine leaderboard data with the independent real-world spread (~7–10% typical).
| Audio Condition | Expected WER | Notes |
|---|---|---|
| Clean studio speech, 1 speaker | 3–5% | Podcasts, dictation, voiceover |
| Conference talks, prepared speech | 3–4% | TED-LIUM-like audio |
| Conference call, 2 speakers | 7–10% | Business calls, good mics |
| Multi-speaker meetings (headset) | 12–15% | AMI benchmark: 13.4% |
| Financial/jargon-heavy calls | 9–12% | Earnings-22: 10.2%; keyterm prompting reduces jargon misses |
| Conversational phone (8 kHz) | 18–24% | CallHome: 21.8% — hardest common scenario |
| Accented English | 9–15% | Non-native speech degrades all engines (arXiv 2503.06924) |
| Noisy / far-field audio | 15–25%+ | Degrades sharply; mic quality dominates |
The Speed–Accuracy Tradeoff: Deepgram's Actual Win
The most rigorous independent evaluation of commercial STT engines to date — "Measuring the Accuracy of Automatic Speech Recognition Solutions" (arXiv 2408.16287) — reached a two-part conclusion that Deepgram's marketing understandably doesn't quote in full:
⚠ On raw accuracy
Whisper and AssemblyAI achieved the highest transcription accuracy in the study. Deepgram trailed the leaders on WER across the tested audio.
✓ On efficiency
Deepgram's processing speed made it the most efficient system once speed and accuracy were considered together — the best transcription-per-second of any engine tested.
This is why "how accurate is Deepgram" has a two-part answer. For a voice agent that must respond inside a ~500 ms latency budget, an engine that is 1 WER point worse but returns results in a fraction of the time is the better choice in practice, because the alternative engines can't operate in that window at all. For transcribing recorded interviews where nobody is waiting, speed is irrelevant and the raw-WER leaders win. Deepgram prices this positioning aggressively too: at $0.0043/min for Nova-3 pay-as-you-go, it undercuts AssemblyAI (~$0.006/min) and every major cloud vendor (Google ~$0.016/min, AWS ~$0.024/min).
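To put the per-minute prices in volume terms, here is a small Python sketch. The rates are the approximate figures quoted above; the one-million-minute volume is a hypothetical workload, and real bills depend on plan tiers and discounts that this ignores.

```python
# Approximate pay-as-you-go rates in USD per minute, as quoted in this article.
PRICE_PER_MINUTE_USD = {
    "Deepgram Nova-3": 0.0043,
    "AssemblyAI": 0.006,
    "Google": 0.016,
    "AWS": 0.024,
}


def cost_for_volume(minutes: int) -> dict[str, float]:
    """List-price cost in USD for transcribing `minutes` of audio with each vendor."""
    return {vendor: round(rate * minutes, 2) for vendor, rate in PRICE_PER_MINUTE_USD.items()}


# Hypothetical workload: one million minutes of audio.
costs = cost_for_volume(1_000_000)
for vendor, usd in sorted(costs.items(), key=lambda item: item[1]):
    print(f"{vendor:<16} ${usd:>10,.2f}")
```

At that hypothetical volume the listed rates work out to roughly $4,300 for Nova-3 against $6,000, $16,000, and $24,000 for the others.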
Deepgram vs Whisper vs AssemblyAI
The three engines most evaluations shortlist, on the axes that actually differ. Real-world WER from independent indexes; prices from vendor pricing pages, verified July 5, 2026.
| Engine | English WER (real-world) | Speed | Price | Best for |
|---|---|---|---|---|
| Deepgram Nova-3 | ~7–10% | Fastest batch + sub-300 ms streaming | $0.0043/min | Voice agents, high-volume, telephony |
| Deepgram Nova-2 | ~8–12% | Same infrastructure | $0.0036/min | Cost-sensitive existing integrations |
| AssemblyAI Universal-3 Pro | 2.3% (AgentTalk, AA-WER v2.0) | Streaming variant available | See vendor pricing | Max accuracy, entity-heavy audio |
| AssemblyAI Universal-2 | ~7–10% | Slower batch than Deepgram | $0.006/min | 99-language commercial API |
| Whisper Large-v3 | ~8–12% | 1× real-time self-hosted (GPU) | Free (MIT, self-hosted) | Self-hosting, multilingual, budget |
| Whisper Large-v3-turbo | ~9–13% | 8× real-time self-hosted | Free (MIT, self-hosted) | Fast self-hosted pipelines |
AssemblyAI's Universal-3 Pro (February 2026) measured 2.3% WER on the AgentTalk subset of Artificial Analysis's AA-WER v2.0 index — a newer benchmark not directly comparable to the real-world ranges in this column. Full treatment on our AssemblyAI accuracy page.
When Deepgram Is the Right Choice — and When It Isn't
Choose Deepgram when:
- You're building voice agents or live captions — sub-300 ms streaming latency is the category benchmark
- You process high volumes on a budget — $0.0043/min is the lowest major-API rate
- Your audio is jargon-heavy — keyterm prompting (up to 100 terms) fixes what generic models miss
- You transcribe telephony at scale — Nova-3's biggest benchmark margins are on phone-quality audio
Look elsewhere when:
- You need maximum raw accuracy on recordings — independent testing puts Whisper and AssemblyAI ahead on WER
- You want free self-hosting — Whisper Large-v3 is MIT-licensed and competitive within 1–3 points
- You need broad multilingual coverage — Whisper covers 99+ languages out of the box; Nova-3's list is growing but shorter
- You don't write code — Deepgram is an API. There is no upload-a-file consumer product
Want the accuracy without the API integration?
VexaScribe gives you Whisper Large-v3 accuracy through a simple upload interface — no code, from $2/mo. 100+ languages, speaker diarization, SRT/VTT/DOCX export.
Methodology & Sources
What WER actually measures
WER = (Substitutions + Deletions + Insertions) / Words in reference transcript
A WER of 5% means about 5 word errors per 100 reference words, so roughly 95 of 100 are transcribed correctly. WER comparisons are only valid when the same audio and the same text normalization are used for every engine, which is why this page separates vendor-published numbers from cross-engine index numbers throughout.
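The formula can be sketched in a few lines of Python using word-level edit distance. This is a minimal illustration, not the scoring code of any index cited here: the `normalize` function is one simple normalization choice (lowercasing and stripping punctuation), and the two sentences are made-up examples.

```python
import re


def normalize(text: str) -> list[str]:
    """Lowercase, strip punctuation, and split into words (one simple normalization choice)."""
    return re.sub(r"[^\w\s']", "", text.lower()).split()


def wer(reference: list[str], hypothesis: list[str]) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the reference words seen so far and hypothesis[:j]
    prev = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # substitution, or a match when cost is 0
            ))
        prev = curr
    return prev[-1] / len(reference)


# Made-up example sentences.
reference = "Hello, the meeting starts at noon."
hypothesis = "hello the meeting starts at new"

raw = wer(reference.split(), hypothesis.split())
normalized = wer(normalize(reference), normalize(hypothesis))
print(f"raw WER: {raw:.1%}   normalized WER: {normalized:.1%}")
```

The same transcript scores two errors in six words before normalization and one in six after it, which is the effect that makes cross-engine comparisons meaningless unless every engine is scored the same way.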
Sources
- Deepgram Nova-3 launch: Introducing Nova-3 (February 2025) — source of the 5.26% batch / 6.84% streaming claims and the 47.4% / 54.3% relative-improvement framing.
- Nova-2 vs Nova-3 developer comparison: deepgram.com/learn — Nova-2's ~8.4% median real-world WER.
- Artificial Analysis WER Index: artificialanalysis.ai/speech-to-text — independent cross-engine WER, speed, and price measurement. Its AA-WER v2 index weights 50% AA-AgentTalk (conversational), 25% VoxPopuli (accented speech), 25% Earnings-22 (financial calls) — deliberately harder audio than vendor demo sets.
- Peer-reviewed evaluation: Measuring the Accuracy of Automatic Speech Recognition Solutions (arXiv 2408.16287) — the accuracy-vs-efficiency finding cited throughout this page.
- Non-native English study: ASR for Non-Native English: Accuracy and Disfluency Handling (arXiv 2503.06924) — accent degradation data.
- Hugging Face Open ASR Leaderboard: huggingface.co/spaces/hf-audio/open_asr_leaderboard — benchmark composite reference.
- Deepgram pricing: deepgram.com/pricing — $0.0043/min Nova-3 pay-as-you-go rate, checked on the verification date.
- Nova-3 Medical: announcement post (March 2025) — vendor-run medical benchmark.
Verification and update window
Published and verified July 5, 2026. Model versions tracked: Deepgram Nova-3 (February 2025), Nova-3 Medical (March 2025), Nova-2 (November 2023), Whisper Large-v3 (September 2023), AssemblyAI Universal-2 (October 2024) and Universal-3 Pro (February 2026). Vendor claims, pricing, and benchmark numbers were cross-checked against the linked sources on the verification date. Where a claim has no independent replication, the page says so explicitly.
Frequently Asked Questions
What word error rate (WER) does Deepgram Nova-3 actually achieve?
Deepgram claims a 5.26% median WER for Nova-3 batch transcription and 6.84% for streaming (launch post, February 2025). Independent measurements, including Artificial Analysis's WER Index, put Nova-3 at roughly 7–10% on diverse real-world audio. Both numbers are real: the vendor figure is a median across vendor-selected test domains, while independent indexes apply one text normalization to every engine across harder, more varied audio. On specific benchmarks Nova-3 scores 2.6% on LibriSpeech test-clean, 13.4% on AMI meetings, and 21.8% on CallHome conversational phone audio.
Is Deepgram more accurate than Whisper?
On standard English benchmarks, Deepgram Nova-3 leads Whisper Large-v3 on all eight Open ASR Leaderboard test sets, with its biggest margins on hard audio: AMI meetings (13.4% vs 15.9%), Earnings-22 financial calls (10.2% vs 12.3%), and CallHome phone audio (21.8% vs 26.4%). However, peer-reviewed testing on other real-world audio (arXiv 2408.16287) found Whisper and AssemblyAI slightly ahead on raw accuracy. The honest summary: on most audio they are within 1–3 percentage points of each other, with the gap widening to about 4.6 points on CallHome; Deepgram is decisively faster, while Whisper is free to self-host and covers 99+ languages.
Is Deepgram more accurate than AssemblyAI?
They trade places depending on the test. On the Open ASR Leaderboard composite, Nova-3 edges Universal-2 on all eight test sets by 0.1–1.6 percentage points. In the arXiv 2408.16287 evaluation, AssemblyAI ranked among the most accurate engines while Deepgram ranked fastest. AssemblyAI's newer Universal-3 Pro (February 2026) measured 2.3% WER on Artificial Analysis's AgentTalk benchmark. Practical rule: for maximum accuracy on recorded audio, AssemblyAI's newest model has the edge; for streaming latency and cost per minute, Deepgram wins.
How much more accurate is Nova-3 than Nova-2?
By Deepgram's own published numbers: streaming WER dropped from 8.4% (Nova-2) to 6.84% (Nova-3) — an 18.6% relative improvement — and batch median WER reached 5.26%. Nova-3 also added keyterm prompting (up to 100 custom terms per request) and expanded multilingual support through 2025–2026. Many products integrated before 2025 still call Nova-2, so if a tool 'powered by Deepgram' seems less accurate than these numbers, check which model generation it actually uses.
How accurate is Deepgram on phone calls?
Phone audio is the hardest common scenario for every STT engine. Nova-3 scores 21.8% WER on the CallHome conversational telephone benchmark, the best result among major engines (Whisper Large-v3: 26.4%, AssemblyAI Universal-2: 23.4%), but that still leaves roughly 1 word in 5 wrong. The 8 kHz sampling rate, crosstalk, and casual speech of phone audio roughly triple typical WER. Deepgram's strongest relative margins are on telephony, which is why call-center and voice-agent platforms disproportionately build on it.
Does Deepgram's keyterm prompting actually improve accuracy?
Yes, for the vocabulary it targets. Keyterm prompting (introduced with Nova-3) lets you pass up to 100 domain-specific terms — product names, people, jargon — per request, and the model biases recognition toward them. It doesn't change headline WER on generic benchmarks, but on jargon-dense audio it prevents exactly the errors that matter most in practice: misrendered names, drug names, and technical terms. Whisper has no equivalent feature; this is one of the strongest practical reasons to choose a commercial API over self-hosted open source.
Why do Deepgram's published numbers differ from independent benchmarks?
Three mechanical reasons: (1) dataset selection — vendors benchmark on domains their model was tuned for; (2) median vs pooled reporting — a median across test sets hides the worst domains, and phone audio can run 3–4× the median; (3) text normalization — how punctuation, casing, and numerals are handled before scoring can swing WER by 1–3 points. Independent indexes like Artificial Analysis and the Hugging Face Open ASR Leaderboard apply identical normalization to every engine, which makes their absolute numbers higher but their comparisons fairer.
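The median-versus-pooled point can be made concrete with a short Python sketch. The error and word counts below are hypothetical, chosen only to show the mechanism; they are not taken from any vendor or index.

```python
from statistics import median

# Hypothetical (errors, reference_words) per test domain, for illustration only.
DOMAINS = {
    "clean read speech": (30, 1000),
    "business calls": (80, 1000),
    "phone conversations": (800, 4000),
}

# Per-domain WER, then the two ways of summarizing it.
per_domain_wer = {name: errors / words for name, (errors, words) in DOMAINS.items()}
median_wer = median(per_domain_wer.values())
pooled_wer = (
    sum(errors for errors, _ in DOMAINS.values())
    / sum(words for _, words in DOMAINS.values())
)

print(f"median of per-domain WER: {median_wer:.1%}")
print(f"pooled WER over all words: {pooled_wer:.1%}")
```

With these made-up counts the median is 8% while the pooled figure is just over 15%, because the pooled number weights each domain by how many words it contains and the median ignores the hardest domain entirely.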
How fast is Deepgram compared to other transcription APIs?
Fastest among major engines — this is Deepgram's one claim that independent testing confirms without qualification. The arXiv 2408.16287 evaluation found Deepgram the most efficient system when speed and accuracy are considered together, and its streaming mode targets sub-300 ms latency, which is why it dominates voice-agent infrastructure. At $0.0043/min for Nova-3 pay-as-you-go it is also the cheapest major commercial API per minute (AssemblyAI ~$0.006/min, Google ~$0.016/min, AWS ~$0.024/min).