
Verified July 2026

How Accurate Is Deepgram? Nova-3 Benchmarks, Independently Checked

Deepgram claims a 5.26% median word error rate for Nova-3 (February 2025) on batch audio — but independent benchmarks, including Artificial Analysis's WER Index, measure Nova-3 at roughly 7–10% on real-world recordings. That gap doesn't make Deepgram inaccurate: it is the fastest major STT API in independent testing and beats Whisper Large-v3 on streaming latency. It does mean vendor numbers and your audio are different things. Here's the evidence, scenario by scenario.

WER (Word Error Rate) = (Substitutions + Deletions + Insertions) / total reference words — the NIST-standard ASR accuracy metric. Lower is better. Every number on this page is labeled as either a vendor claim or an independent measurement, with links in the Methodology & Sources section.

By VexaScribe Editorial · Published July 5, 2026 · Verified July 5, 2026


Deepgram Accuracy in One Sentence

5.26%: vendor-claimed WER (Nova-3 batch, median)

7–10%: independent real-world WER (meetings, calls, mixed audio)

6.84%: vendor-claimed streaming WER (vs 8.4% on Nova-2)

Speed, independently measured: fastest major STT API

Deepgram Nova-3 is a top-tier commercial speech-to-text engine whose defining independent result is speed, not peak accuracy. Peer-reviewed testing (arXiv 2408.16287) found Whisper and AssemblyAI slightly more accurate on raw WER — but Deepgram the most efficient engine once processing speed is factored in. If your use case is voice agents, live captions, or millions of minutes of telephony, that tradeoff usually lands in Deepgram's favor. If it's squeezing out the last percentage point of accuracy on recorded audio, it usually doesn't.

Vendor Claims vs Independent Measurements

Nearly every page ranking for "Deepgram accuracy" is written by Deepgram. That doesn't make the numbers false — it makes them unverified. Here is each headline claim next to what independent sources actually measure.

Metric	Deepgram's claim	Independent data	Context
Batch WER (English)	5.26% median	~7–10% real-world	Vendor median across vendor-selected domains; independent indexes measure diverse real-world audio
Streaming WER (English)	6.84%	Varies by audio; not directly indexed	Down from Nova-2's 8.4% — the biggest single-generation streaming improvement Deepgram has shipped
“47.4% better than next-best”	Batch, vs 10% competitor WER	Not reproduced	Relative-improvement framing depends entirely on which competitor model and dataset were chosen
Speed (batch + streaming)	Fastest major STT API	Confirmed fastest	The one claim independent testing agrees with unambiguously (arXiv 2408.16287)
Nova-3 Medical (“63.7% better”)	vs leading alternatives	No independent replication	Vendor-run benchmark; treat as directional until third-party medical WER data exists

Why vendor WER runs lower: three mechanical reasons, none of them fraud. (1) Dataset selection: vendors benchmark on domains where their model was tuned. (2) Median vs pooled: a median across test sets hides the worst domains (phone audio can run 3–4× the median). (3) Text normalization: how you handle punctuation, numerals, and casing before scoring can swing WER by 1–3 points. Independent indexes apply one normalization to every engine, which is why their numbers run higher and are more comparable.
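The median-vs-pooled effect is easy to reproduce with arithmetic. The sketch below uses the Nova-3 leaderboard figures from this page's benchmark table (not Deepgram's own test sets) together with hypothetical word counts per test set, which are invented here purely to illustrate a workload skewed toward meetings and phone calls.

```python
from statistics import median

# Nova-3 WER (%) per test set, copied from this page's benchmark table.
WER_BY_SET = {
    "LibriSpeech test-clean": 2.6,
    "LibriSpeech test-other": 5.1,
    "TED-LIUM 3": 3.6,
    "AMI": 13.4,
    "GigaSpeech": 9.7,
    "Earnings-22": 10.2,
    "CallHome": 21.8,
    "CommonVoice 9": 8.4,
}

# HYPOTHETICAL reference word counts (illustrative only, not real set sizes).
WORDS_BY_SET = {name: 10_000 for name in WER_BY_SET}
WORDS_BY_SET["AMI"] = 30_000
WORDS_BY_SET["CallHome"] = 40_000


def median_wer(wer_by_set):
    """Median of per-set WERs: every test set counts once, whatever its size."""
    return median(wer_by_set.values())


def pooled_wer(wer_by_set, words_by_set):
    """Total errors over total reference words across all sets."""
    total_errors = sum(wer_by_set[n] / 100 * words_by_set[n] for n in wer_by_set)
    total_words = sum(words_by_set[n] for n in wer_by_set)
    return 100 * total_errors / total_words


print(f"median: {median_wer(WER_BY_SET):.2f}%")
print(f"pooled: {pooled_wer(WER_BY_SET, WORDS_BY_SET):.2f}%")
```

With these assumed word counts the pooled figure comes out several points above the median, because the hard sets carry more of the words. A different mix would move the pooled number but leave the median untouched.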

Deepgram's Nova Lineage: Which Model Are You Getting?

"Deepgram accuracy" depends on which Nova generation the integration actually calls. Many products built on Deepgram before 2025 still run Nova-2 — roughly 8.4% median WER by Deepgram's own measurement, a full generation behind Nova-3.

Model	Released	Headline accuracy claim	Status
Nova (Nova-1)	April 2023	“22% better than next-best” at launch	Legacy
Nova-2	November 2023	~8.4% median real-world WER (vendor)	Still widely deployed
Nova-3	February 2025	5.26% median batch WER, 6.84% streaming	Current flagship
Nova-3 Medical	March 2025	“63.7% better” on medical terms (vendor)	Domain variant
Nova-3 Multilingual expansions	2025 – March 2026	20+ added languages, multilingual keyterm prompting	Rolling updates

Sources: Deepgram's Nova-3 launch post, Nova-2 vs Nova-3 developer comparison, and Nova-3 Medical announcement. Verified July 5, 2026.

Nova-3 also introduced keyterm prompting — you can pass up to 100 domain terms (product names, drug names, jargon) per request, and the model biases toward them. This is Deepgram's answer to the custom-vocabulary problem that open-source Whisper simply doesn't solve, and in jargon-heavy audio it matters more than a point of headline WER.
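As a rough illustration of what passing key terms looks like, here is a minimal sketch that only builds a request URL and makes no network call. The endpoint and the repeated `keyterm` query parameter reflect our reading of Deepgram's pre-recorded API and should be treated as assumptions to confirm against the current API reference. `build_listen_url` is a hypothetical helper, and the 100-term cap is the limit stated above.

```python
from urllib.parse import urlencode

# ASSUMPTION: endpoint and parameter names; verify against Deepgram's API docs.
DEEPGRAM_LISTEN_URL = "https://api.deepgram.com/v1/listen"
MAX_KEYTERMS = 100  # limit as stated on this page


def build_listen_url(keyterms, model="nova-3"):
    """Return a listen URL with one `keyterm` parameter per domain term."""
    terms = [t.strip() for t in keyterms if t.strip()]
    if len(terms) > MAX_KEYTERMS:
        raise ValueError(f"at most {MAX_KEYTERMS} key terms per request, got {len(terms)}")
    params = [("model", model)] + [("keyterm", t) for t in terms]
    return f"{DEEPGRAM_LISTEN_URL}?{urlencode(params)}"


print(build_listen_url(["VexaScribe", "metoprolol succinate"]))
```

The audio itself would be sent in the request body with an API key in the authorization header. That part is omitted here because it depends on your HTTP client and account setup.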

Where Nova-3 Lands on Standard Benchmarks

Cross-model WER on the eight standard English ASR test sets, compiled from the Hugging Face Open ASR Leaderboard and vendor documentation. Lower is better. These are the same numbers we publish on our Whisper accuracy page — one consistent dataset across our accuracy guides.

Benchmark	Domain	Deepgram Nova-3	Whisper Large-v3	AssemblyAI Universal-2
LibriSpeech test-clean	Read English audiobook	2.6%	2.7%	2.8%
LibriSpeech test-other	Read English, varied	5.1%	5.2%	5.5%
TED-LIUM 3	Conference talks	3.6%	4.0%	3.9%
AMI (meeting headset)	Multi-speaker meetings	13.4%	15.9%	14.1%
GigaSpeech	Diverse web English	9.7%	10.2%	9.8%
Earnings-22	Financial calls	10.2%	12.3%	11.0%
CallHome	Conversational phone	21.8%	26.4%	23.4%
CommonVoice 9 (English)	Crowdsourced diverse	8.4%	8.8%	8.6%

Takeaway: Nova-3 leads or ties on every one of the eight benchmarks — its strongest margins are exactly where audio gets hard: meetings (13.4% vs Whisper's 15.9% on AMI), financial calls (10.2% vs 12.3% on Earnings-22), and phone conversation (21.8% vs 26.4% on CallHome). But notice the absolute numbers: even the best engine misses roughly 1 word in 5 on conversational phone audio. No engine's marketing page tells you that.
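The margins in this takeaway can be recomputed directly from the table. A small sketch, using only the figures published above (the function name `margins` is ours):

```python
# (benchmark, Nova-3, Whisper Large-v3, AssemblyAI Universal-2), WER in %.
ROWS = [
    ("LibriSpeech test-clean", 2.6, 2.7, 2.8),
    ("LibriSpeech test-other", 5.1, 5.2, 5.5),
    ("TED-LIUM 3", 3.6, 4.0, 3.9),
    ("AMI", 13.4, 15.9, 14.1),
    ("GigaSpeech", 9.7, 10.2, 9.8),
    ("Earnings-22", 10.2, 12.3, 11.0),
    ("CallHome", 21.8, 26.4, 23.4),
    ("CommonVoice 9", 8.4, 8.8, 8.6),
]


def margins(rows):
    """Nova-3's lead in WER points over Whisper and over the better rival."""
    result = []
    for name, nova, whisper, assembly in rows:
        vs_whisper = round(whisper - nova, 1)
        vs_best_rival = round(min(whisper, assembly) - nova, 1)
        result.append((name, vs_whisper, vs_best_rival))
    # Largest lead over Whisper first.
    return sorted(result, key=lambda row: row[1], reverse=True)


for name, vs_whisper, vs_best_rival in margins(ROWS):
    print(f"{name:24s} vs Whisper: {vs_whisper:+.1f}  vs best rival: {vs_best_rival:+.1f}")
```

Sorting by the lead over Whisper puts CallHome, AMI, and Earnings-22 at the top, which is the "hard audio" pattern described above. The lead over the better of the two rivals is much thinner on read speech.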

Accuracy by Audio Condition

What Nova-3's benchmark results translate to per audio scenario. Ranges combine leaderboard data with the independent real-world spread (~7–10% typical).

Audio Condition	Expected WER	Notes
Clean studio speech, 1 speaker	3–5%	Podcasts, dictation, voiceover
Conference talks, prepared speech	3–4%	TED-LIUM-like audio
Conference call, 2 speakers	7–10%	Business calls, good mics
Multi-speaker meetings (headset)	12–15%	AMI benchmark: 13.4%
Financial/jargon-heavy calls	9–12%	Earnings-22: 10.2%; keyterm prompting reduces jargon misses
Conversational phone (8 kHz)	18–24%	CallHome: 21.8% — hardest common scenario
Accented English	9–15%	Non-native speech degrades all engines (arXiv 2503.06924)
Noisy / far-field audio	15–25%+	Degrades sharply; mic quality dominates

Reading this honestly: Deepgram's 5.26% claim is real for clean, prepared speech. Multi-speaker meetings run 2–3× that. Conversational phone audio runs 4× that — on every engine, not just Deepgram. If a vendor quotes you one accuracy number without naming the audio condition, the number is marketing. See our verdict on when AI accuracy is enough.
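To make these ranges concrete, it helps to translate a WER range into an expected number of wrong words for a given recording. The sketch below assumes a speaking rate of 150 words per minute, which is our own rough assumption rather than a figure from any source cited here; the WER ranges are the ones in the table above.

```python
ASSUMED_WORDS_PER_MINUTE = 150  # ASSUMPTION: rough conversational speaking rate


def expected_errors(minutes, wer_low_pct, wer_high_pct,
                    words_per_minute=ASSUMED_WORDS_PER_MINUTE):
    """Return (low, high) expected word errors for a recording of this length."""
    words = minutes * words_per_minute
    return round(words * wer_low_pct / 100), round(words * wer_high_pct / 100)


# WER ranges (%) taken from the audio-condition table above.
CONDITIONS = {
    "Clean studio speech": (3, 5),
    "Multi-speaker meeting": (12, 15),
    "Conversational phone": (18, 24),
}

for label, (low, high) in CONDITIONS.items():
    lo_err, hi_err = expected_errors(30, low, high)
    print(f"{label}: {lo_err} to {hi_err} word errors in a 30-minute recording")
```

Seen this way, the question for a given workflow is less "what is the WER" and more "how many corrections per recording can we tolerate".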

The Speed–Accuracy Tradeoff: Deepgram's Actual Win

The most rigorous independent evaluation of commercial STT engines to date — "Measuring the Accuracy of Automatic Speech Recognition Solutions" (arXiv 2408.16287) — reached a two-part conclusion that Deepgram's marketing understandably doesn't quote in full:

⚠ On raw accuracy

Whisper and AssemblyAI achieved the highest transcription accuracy in the study. Deepgram trailed the leaders on WER across the tested audio.

✓ On efficiency

Deepgram's processing speed made it the most efficient system once speed and accuracy were considered together — the best transcription-per-second of any engine tested.

This is why "how accurate is Deepgram" has a two-part answer. For a voice agent that must respond inside a ~500 ms latency budget, an engine that is 1 WER point worse but returns results in a fraction of the time is the better choice in practice, because slower engines cannot operate in that window at all. For transcribing recorded interviews where nobody is waiting, speed is irrelevant and the raw-WER leaders win. Deepgram prices this positioning aggressively too: at $0.0043/min for Nova-3 pay-as-you-go, it undercuts AssemblyAI (~$0.006/min) and every major cloud vendor (Google ~$0.016/min, AWS ~$0.024/min).
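The per-minute prices quoted above only become meaningful at volume. A minimal sketch of the cost arithmetic, using those list prices as given (the competitor prices are approximate on this page, and volume discounts are ignored):

```python
from decimal import Decimal

# Per-minute list prices in USD, as quoted on this page.
PRICE_PER_MINUTE = {
    "Deepgram Nova-3": Decimal("0.0043"),
    "AssemblyAI": Decimal("0.006"),
    "Google": Decimal("0.016"),
    "AWS": Decimal("0.024"),
}


def cost_by_vendor(minutes):
    """Total cost in USD per vendor for the given number of audio minutes."""
    return {name: price * minutes for name, price in PRICE_PER_MINUTE.items()}


for name, cost in cost_by_vendor(1_000_000).items():
    print(f"{name}: ${cost:,.2f} per million minutes")
```

`Decimal` is used instead of floats so that small per-minute prices multiply out without rounding noise.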

Deepgram vs Whisper vs AssemblyAI

The three engines most evaluations shortlist, on the axes that actually differ. Real-world WER from independent indexes; prices from vendor pricing pages, verified July 5, 2026.

Engine	English WER (real-world)	Speed	Price	Best for
Deepgram Nova-3	~7–10%	Fastest batch + sub-300 ms streaming	$0.0043/min	Voice agents, high-volume, telephony
Deepgram Nova-2	~8–12%	Same infrastructure	$0.0036/min	Cost-sensitive existing integrations
AssemblyAI Universal-3 Pro	2.3% (AgentTalk, AA-WER v2.0)	Streaming variant available	See vendor pricing	Max accuracy, entity-heavy audio
AssemblyAI Universal-2	~7–10%	Slower batch than Deepgram	$0.006/min	99-language commercial API
Whisper Large-v3	~8–12%	1× real-time self-hosted (GPU)	Free (MIT, self-hosted)	Self-hosting, multilingual, budget
Whisper Large-v3-turbo	~9–13%	8× real-time self-hosted	Free (MIT, self-hosted)	Fast self-hosted pipelines

AssemblyAI's Universal-3 Pro (February 2026) measured 2.3% WER on the AgentTalk subset of Artificial Analysis's AA-WER v2.0 index — a newer benchmark not directly comparable to the real-world ranges in this column. Full treatment on our AssemblyAI accuracy page.

When Deepgram Is the Right Choice — and When It Isn't

Choose Deepgram when:

You're building voice agents or live captions — sub-300 ms streaming latency is the category benchmark
You process high volumes on a budget — $0.0043/min is the lowest major-API rate
Your audio is jargon-heavy — keyterm prompting (up to 100 terms) fixes what generic models miss
You transcribe telephony at scale — Nova-3's biggest benchmark margins are on phone-quality audio

Look elsewhere when:

You need maximum raw accuracy on recordings — independent testing puts Whisper and AssemblyAI ahead on WER
You want free self-hosting — Whisper Large-v3 is MIT-licensed and competitive within 1–3 points
You need broad multilingual coverage — Whisper covers 99+ languages out of the box; Nova-3's list is growing but shorter
You don't write code — Deepgram is an API. There is no upload-a-file consumer product

Want the accuracy without the API integration?

VexaScribe gives you Whisper Large-v3 accuracy through a simple upload interface — no code, from $2/mo. 100+ languages, speaker diarization, SRT/VTT/DOCX export.


Related Guides

How Accurate Is Whisper?

The same treatment for OpenAI's open-source model — 2.7% benchmark WER, 8–12% real-world, by language and audio condition.

How Accurate Is AssemblyAI?

Universal-3 Pro vs Universal-2 — vendor claims vs Artificial Analysis measurements, entity errors, diarization accuracy.

Best Transcription APIs for Developers

Deepgram, AssemblyAI, Whisper API, Speechmatics — benchmarked for latency, accuracy, and pricing.

Is AI Transcription Accurate Enough?

When 90–95% word accuracy is sufficient — and when it absolutely isn't.

What Is ASR?

Automatic speech recognition explained — how engines like Nova-3 actually work.

What Is Speaker Diarization?

The 'who spoke when' problem — DER benchmarks, pyannote 3.1, commercial APIs compared.

Phone Call Transcription

Why 8 kHz audio is the hardest common scenario — and how to get usable transcripts from it.

AI Transcription

How modern AI transcription works, what it costs, and what accuracy to expect.

Methodology & Sources

What WER actually measures

WER = (Substitutions + Deletions + Insertions) / Words in reference transcript

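A WER of 5% means roughly 5 word errors (substituted, dropped, or inserted words) for every 100 words in the reference transcript. Because insertions count as errors, WER is not simply 100% minus the share of correct words, and it can exceed 100% on very poor output. WER comparisons are only valid when the same audio and the same text normalization are used for every engine, which is why this page separates vendor-published numbers from cross-engine index numbers throughout.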
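For readers who want to score their own audio, the formula above can be sketched as a word-level edit distance. This is a minimal illustration, not the scoring code of any index cited here. The `normalize` step (lowercasing and stripping punctuation) is one simple assumed normalization, and the example sentences are invented.

```python
import re


def normalize(text):
    """Lowercase, replace punctuation with spaces, split into words (assumed rules)."""
    return re.sub(r"[^\w\s']", " ", text.lower()).split()


def wer(reference_words, hypothesis_words):
    """(Substitutions + Deletions + Insertions) / reference words, via edit distance."""
    if not reference_words:
        raise ValueError("reference transcript must contain at least one word")
    # prev[j] = edits needed to turn the reference words seen so far into hypothesis[:j]
    prev = list(range(len(hypothesis_words) + 1))
    for i, ref_word in enumerate(reference_words, start=1):
        curr = [i] + [0] * len(hypothesis_words)
        for j, hyp_word in enumerate(hypothesis_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from output
                curr[j - 1] + 1,     # insertion: extra word in output
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(reference_words)


reference = "Dr. Smith prescribed 20 mg, twice daily."
hypothesis = "dr smith prescribed 20 mg twice daily"

raw = wer(reference.split(), hypothesis.split())
normalized = wer(normalize(reference), normalize(hypothesis))
print(f"raw WER: {raw:.1%}, normalized WER: {normalized:.1%}")
```

The same transcript scores very differently before and after normalization, even though every spoken word was recognized. That is the reason cross-engine comparisons need one shared normalization.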

Sources

Deepgram Nova-3 launch: Introducing Nova-3 (February 2025) — source of the 5.26% batch / 6.84% streaming claims and the 47.4% / 54.3% relative-improvement framing.
Nova-2 vs Nova-3 developer comparison: deepgram.com/learn — Nova-2's ~8.4% median real-world WER.
Artificial Analysis WER Index: artificialanalysis.ai/speech-to-text — independent cross-engine WER, speed, and price measurement. Its AA-WER v2 index weights 50% AA-AgentTalk (conversational), 25% VoxPopuli (accented speech), 25% Earnings-22 (financial calls) — deliberately harder audio than vendor demo sets.
Peer-reviewed evaluation: Measuring the Accuracy of Automatic Speech Recognition Solutions (arXiv 2408.16287) — the accuracy-vs-efficiency finding cited throughout this page.
Non-native English study: ASR for Non-Native English: Accuracy and Disfluency Handling (arXiv 2503.06924) — accent degradation data.
Hugging Face Open ASR Leaderboard: huggingface.co/spaces/hf-audio/open_asr_leaderboard — benchmark composite reference.
Deepgram pricing: deepgram.com/pricing — $0.0043/min Nova-3 pay-as-you-go rate, checked on the verification date.
Nova-3 Medical: announcement post (March 2025) — vendor-run medical benchmark.
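The AA-WER v2 weighting listed in the sources above can be read as a weighted average of three per-dataset scores. The sketch below assumes the index is a plain weighted mean of per-dataset WER, which is our interpretation rather than a documented formula, and the component scores are hypothetical values chosen only to show the arithmetic.

```python
# Weights as described in the Artificial Analysis source entry above.
AA_WER_V2_WEIGHTS = {
    "AA-AgentTalk": 0.50,
    "VoxPopuli": 0.25,
    "Earnings-22": 0.25,
}


def weighted_index(component_wer_pct, weights=AA_WER_V2_WEIGHTS):
    """Weighted mean of per-dataset WER (%). ASSUMPTION: simple weighted average."""
    if set(component_wer_pct) != set(weights):
        raise ValueError("component scores must cover exactly the weighted datasets")
    return sum(weights[name] * component_wer_pct[name] for name in weights)


# HYPOTHETICAL per-dataset scores for an imaginary engine (not measured data).
example_scores = {"AA-AgentTalk": 6.0, "VoxPopuli": 8.0, "Earnings-22": 10.0}
print(f"index: {weighted_index(example_scores):.2f}%")
```

Because half the weight sits on conversational audio, an engine's score on that one component moves the index twice as much as either of the others.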

Verification and update window

Published and verified July 5, 2026. Model versions tracked: Deepgram Nova-3 (February 2025), Nova-3 Medical (March 2025), Nova-2 (November 2023), Whisper Large-v3 (November 2023), AssemblyAI Universal-2 (October 2024) and Universal-3 Pro (February 2026). Vendor claims, pricing, and benchmark numbers were cross-checked against the linked sources on the verification date. Where a claim has no independent replication, the page says so explicitly.

Frequently Asked Questions

What word error rate (WER) does Deepgram Nova-3 actually achieve?

Deepgram claims a 5.26% median WER for Nova-3 batch transcription and 6.84% for streaming (launch post, February 2025). Independent measurements, including Artificial Analysis's WER Index, put Nova-3 at roughly 7–10% on diverse real-world audio. Both numbers are real: the vendor figure is a median across vendor-selected test domains, while independent indexes apply one text normalization to every engine across harder, more varied audio. On specific benchmarks Nova-3 scores 2.6% on LibriSpeech test-clean, 13.4% on AMI meetings, and 21.8% on CallHome conversational phone audio.

Is Deepgram more accurate than Whisper?

On standard English benchmarks, Deepgram Nova-3 leads or ties Whisper Large-v3 on all eight Open ASR Leaderboard test sets — with its biggest margins on hard audio: AMI meetings (13.4% vs 15.9%), Earnings-22 financial calls (10.2% vs 12.3%), and CallHome phone audio (21.8% vs 26.4%). However, peer-reviewed testing on other real-world audio (arXiv 2408.16287) found Whisper and AssemblyAI slightly ahead on raw accuracy. The honest summary: they are within 1–3 percentage points of each other; Deepgram is decisively faster, Whisper is free to self-host and covers 99+ languages.

Is Deepgram more accurate than AssemblyAI?

They trade places depending on the test. On the Open ASR Leaderboard composite, Nova-3 edges Universal-2 on all eight datasets by 0.1–1.6 percentage points. In the arXiv 2408.16287 evaluation, AssemblyAI ranked among the most accurate engines while Deepgram ranked fastest. AssemblyAI's newer Universal-3 Pro (February 2026) measured 2.3% WER on Artificial Analysis's AgentTalk benchmark. Practical rule: for maximum accuracy on recorded audio, AssemblyAI's newest model has the edge; for streaming latency and cost per minute, Deepgram wins.

How much more accurate is Nova-3 than Nova-2?

By Deepgram's own published numbers: streaming WER dropped from 8.4% (Nova-2) to 6.84% (Nova-3) — an 18.6% relative improvement — and batch median WER reached 5.26%. Nova-3 also added keyterm prompting (up to 100 custom terms per request) and expanded multilingual support through 2025–2026. Many products integrated before 2025 still call Nova-2, so if a tool 'powered by Deepgram' seems less accurate than these numbers, check which model generation it actually uses.
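Relative-improvement figures like these are simple to check, and they can also be inverted to recover the baseline a "X% better" claim implies. A short sketch using only numbers quoted on this page (the function names are ours):

```python
def relative_improvement_pct(old_wer, new_wer):
    """Percentage reduction in WER going from old_wer to new_wer."""
    return 100 * (old_wer - new_wer) / old_wer


def implied_baseline_wer(new_wer, improvement_pct):
    """Baseline WER implied by a claim of 'improvement_pct% better' at new_wer."""
    return new_wer / (1 - improvement_pct / 100)


# Nova-2 streaming (8.4%) to Nova-3 streaming (6.84%).
print(f"{relative_improvement_pct(8.4, 6.84):.1f}% relative improvement")

# "47.4% better than next-best" at 5.26% batch WER implies this competitor WER.
print(f"{implied_baseline_wer(5.26, 47.4):.1f}% implied competitor WER")
```

The second calculation shows why relative framing deserves scrutiny: a large percentage gain can correspond to a gap of only a few absolute points, and it depends entirely on the baseline that was chosen.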

How accurate is Deepgram on phone calls?

Phone audio is the hardest common scenario for every STT engine. Nova-3 scores 21.8% WER on the CallHome conversational telephone benchmark, the best result among major engines (Whisper Large-v3: 26.4%, AssemblyAI Universal-2: 23.4%), but still roughly 1 word in 5 wrong. The 8 kHz sampling rate (about 4 kHz of usable audio bandwidth), crosstalk, and casual speech of phone audio roughly triple typical WER. Deepgram's strongest relative margins are on telephony, which is why call-center and voice-agent platforms disproportionately build on it.

Does Deepgram's keyterm prompting actually improve accuracy?

Yes, for the vocabulary it targets. Keyterm prompting (introduced with Nova-3) lets you pass up to 100 domain-specific terms per request (product names, people, jargon), and the model biases recognition toward them. It doesn't change headline WER on generic benchmarks, but on jargon-dense audio it prevents exactly the errors that matter most in practice: misrendered names, drug names, and technical terms. Open-source Whisper has no dedicated equivalent; its free-text initial prompt can nudge spelling but is not a structured term list. This is one of the strongest practical reasons to choose a commercial API over self-hosted open source.

Why do Deepgram's published numbers differ from independent benchmarks?

Three mechanical reasons: (1) dataset selection — vendors benchmark on domains their model was tuned for; (2) median vs pooled reporting — a median across test sets hides the worst domains, and phone audio can run 3–4× the median; (3) text normalization — how punctuation, casing, and numerals are handled before scoring can swing WER by 1–3 points. Independent indexes like Artificial Analysis and the Hugging Face Open ASR Leaderboard apply identical normalization to every engine, which makes their absolute numbers higher but their comparisons fairer.

How fast is Deepgram compared to other transcription APIs?

Fastest among major engines — this is Deepgram's one claim that independent testing confirms without qualification. The arXiv 2408.16287 evaluation found Deepgram the most efficient system when speed and accuracy are considered together, and its streaming mode targets sub-300 ms latency, which is why it dominates voice-agent infrastructure. At $0.0043/min for Nova-3 pay-as-you-go it is also the cheapest major commercial API per minute (AssemblyAI ~$0.006/min, Google ~$0.016/min, AWS ~$0.024/min).