Verified July 2026

How Accurate Is Deepgram? Nova-3 Benchmarks, Independently Checked

Deepgram claims a 5.26% median word error rate for Nova-3 (February 2025) on batch audio — but independent benchmarks, including Artificial Analysis's WER Index, measure Nova-3 at roughly 7–10% on real-world recordings. That gap doesn't make Deepgram inaccurate: it is the fastest major STT API in independent testing and beats Whisper Large-v3 on streaming latency. It does mean vendor numbers and your audio are different things. Here's the evidence, scenario by scenario.

WER (Word Error Rate) = (Substitutions + Deletions + Insertions) / total reference words — the NIST-standard ASR accuracy metric. Lower is better. Every number on this page is labeled as either a vendor claim or an independent measurement, with links in the Methodology & Sources section.

By VexaScribe Editorial · Published July 5, 2026 · Verified

Deepgram Accuracy in One Sentence

  • 5.26%: vendor-claimed WER (Nova-3 batch, median)
  • 7–10%: independent real-world WER (meetings, calls, mixed audio)
  • 6.84%: streaming WER claim (vs 8.4% on Nova-2)
  • #1: speed, independently measured (fastest major STT API)

Deepgram Nova-3 is a top-tier commercial speech-to-text engine whose defining independent result is speed, not peak accuracy. Peer-reviewed testing (arXiv 2408.16287) found Whisper and AssemblyAI slightly more accurate on raw WER — but Deepgram the most efficient engine once processing speed is factored in. If your use case is voice agents, live captions, or millions of minutes of telephony, that tradeoff usually lands in Deepgram's favor. If it's squeezing out the last percentage point of accuracy on recorded audio, it usually doesn't.

Vendor Claims vs Independent Measurements

Nearly every page ranking for "Deepgram accuracy" is written by Deepgram. That doesn't make the numbers false — it makes them unverified. Here is each headline claim next to what independent sources actually measure.

| Metric | Deepgram's claim | Independent data | Context |
| --- | --- | --- | --- |
| Batch WER (English) | 5.26% median | ~7–10% real-world | Vendor median across vendor-selected domains; independent indexes measure diverse real-world audio |
| Streaming WER (English) | 6.84% | Varies by audio; not directly indexed | Down from Nova-2's 8.4%, the biggest single-generation streaming improvement Deepgram has shipped |
| “47.4% better than next-best” | Batch, vs 10% competitor WER | Not reproduced | Relative-improvement framing depends entirely on which competitor model and dataset were chosen |
| Speed (batch + streaming) | Fastest major STT API | Confirmed fastest | The one claim independent testing agrees with unambiguously (arXiv 2408.16287) |
| Nova-3 Medical (“63.7% better”) | vs leading alternatives | No independent replication | Vendor-run benchmark; treat as directional until third-party medical WER data exists |

Why vendor WER runs lower: three mechanical reasons, none of them fraud. (1) Dataset selection: vendors benchmark on domains where their model was tuned. (2) Median vs pooled: a median across test sets hides the worst domains (phone audio can run 3–4× the median). (3) Text normalization: how you handle punctuation, numerals, and casing before scoring can swing WER by 1–3 points. Independent indexes apply one normalization to every engine and test harder, more varied audio, which is why their numbers run higher and are more comparable.
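
To make the median-versus-pooled point concrete, here is a minimal sketch. The per-domain word and error counts are invented for illustration; they are not taken from Deepgram's benchmark or from any independent index.

```python
from statistics import median

# Hypothetical per-domain results: (reference words, word errors).
# Illustrative numbers only, not measured data.
domains = {
    "audiobooks": (10_000, 300),
    "podcasts": (10_000, 450),
    "meetings": (10_000, 550),
    "video": (10_000, 700),
    "phone": (10_000, 2_000),
}

per_domain_wer = {name: errors / words for name, (words, errors) in domains.items()}

# Median of the per-domain WERs: the single worst domain barely moves it.
median_wer = median(per_domain_wer.values())

# Pooled WER: total errors over total words, so hard domains count in full.
total_words = sum(words for words, _ in domains.values())
total_errors = sum(errors for _, errors in domains.values())
pooled_wer = total_errors / total_words

print(f"median WER: {median_wer:.1%}")   # 5.5%
print(f"pooled WER: {pooled_wer:.1%}")   # 8.0%
print(f"worst domain vs median: {max(per_domain_wer.values()) / median_wer:.1f}x")  # 3.6x
```

With these made-up counts the same engine can honestly be described as "5.5% median" or "8.0% pooled", and the phone domain alone sits well above either figure.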

Deepgram's Nova Lineage: Which Model Are You Getting?

"Deepgram accuracy" depends on which Nova generation the integration actually calls. Many products built on Deepgram before 2025 still run Nova-2 — roughly 8.4% median WER by Deepgram's own measurement, a full generation behind Nova-3.

| Model | Released | Headline accuracy claim | Status |
| --- | --- | --- | --- |
| Nova (Nova-1) | April 2023 | “22% better than next-best” at launch | Legacy |
| Nova-2 | November 2023 | ~8.4% median real-world WER (vendor) | Still widely deployed |
| Nova-3 | February 2025 | 5.26% median batch WER, 6.84% streaming | Current flagship |
| Nova-3 Medical | March 2025 | “63.7% better” on medical terms (vendor) | Domain variant |
| Nova-3 Multilingual expansions | 2025 – March 2026 | 20+ added languages, multilingual keyterm prompting | Rolling updates |

Sources: Deepgram's Nova-3 launch post, Nova-2 vs Nova-3 developer comparison, and Nova-3 Medical announcement. Verified July 5, 2026.

Nova-3 also introduced keyterm prompting — you can pass up to 100 domain terms (product names, drug names, jargon) per request, and the model biases toward them. This is Deepgram's answer to the custom-vocabulary problem that open-source Whisper simply doesn't solve, and in jargon-heavy audio it matters more than a point of headline WER.
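
As a rough sketch of what passing keyterms could look like, the helper below only builds a request URL and sends nothing. The endpoint and the repeated `keyterm` query parameter reflect our reading of Deepgram's public API documentation and should be verified against the current docs; `build_listen_url` is a hypothetical helper name, and the 100-term cap is the figure quoted in this article.

```python
from urllib.parse import urlencode

# Assumptions: endpoint and `keyterm` parameter name as we understand
# Deepgram's public docs. Verify both before relying on this.
DEEPGRAM_LISTEN_URL = "https://api.deepgram.com/v1/listen"
MAX_KEYTERMS = 100  # limit quoted in this article


def build_listen_url(keyterms, model="nova-3"):
    """Return a transcription URL with one `keyterm` parameter per term."""
    terms = [t.strip() for t in keyterms if t.strip()]
    if len(terms) > MAX_KEYTERMS:
        raise ValueError(f"{len(terms)} keyterms given; limit is {MAX_KEYTERMS}")
    params = [("model", model)] + [("keyterm", t) for t in terms]
    return f"{DEEPGRAM_LISTEN_URL}?{urlencode(params)}"


url = build_listen_url(["tretinoin", "VexaScribe", "Nova-3"])
print(url)
```

The audio itself would then be sent to that URL in an authenticated POST request; the exact headers and body format are left to the vendor documentation rather than guessed at here.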

Where Nova-3 Lands on Standard Benchmarks

Cross-model WER on the eight standard English ASR test sets, compiled from the Hugging Face Open ASR Leaderboard and vendor documentation. Lower is better. These are the same numbers we publish on our Whisper accuracy page — one consistent dataset across our accuracy guides.

| Benchmark | Domain | Deepgram Nova-3 | Whisper Large-v3 | AssemblyAI Universal-2 |
| --- | --- | --- | --- | --- |
| LibriSpeech test-clean | Read English audiobook | 2.6% | 2.7% | 2.8% |
| LibriSpeech test-other | Read English, varied | 5.1% | 5.2% | 5.5% |
| TED-LIUM 3 | Conference talks | 3.6% | 4.0% | 3.9% |
| AMI (meeting headset) | Multi-speaker meetings | 13.4% | 15.9% | 14.1% |
| GigaSpeech | Diverse web English | 9.7% | 10.2% | 9.8% |
| Earnings-22 | Financial calls | 10.2% | 12.3% | 11.0% |
| CallHome | Conversational phone | 21.8% | 26.4% | 23.4% |
| CommonVoice 9 (English) | Crowdsourced diverse | 8.4% | 8.8% | 8.6% |

Takeaway: Nova-3 leads or ties on every one of the eight benchmarks, and its strongest margins are exactly where audio gets hard: meetings (13.4% vs Whisper's 15.9% on AMI), financial calls (10.2% vs 12.3% on Earnings-22), and phone conversation (21.8% vs 26.4% on CallHome). But notice the absolute numbers: even the best engine misses roughly 1 word in 5 on conversational phone audio. No engine's marketing page tells you that.
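
The margins behind that takeaway can be recomputed directly from the table. The sketch below copies the table's WER figures and prints each rival's gap to Nova-3 in percentage points; the `margins` helper is ours, purely for illustration.

```python
# WER (%) per benchmark, copied from the table above:
# (Deepgram Nova-3, Whisper Large-v3, AssemblyAI Universal-2)
wer = {
    "LibriSpeech test-clean": (2.6, 2.7, 2.8),
    "LibriSpeech test-other": (5.1, 5.2, 5.5),
    "TED-LIUM 3": (3.6, 4.0, 3.9),
    "AMI": (13.4, 15.9, 14.1),
    "GigaSpeech": (9.7, 10.2, 9.8),
    "Earnings-22": (10.2, 12.3, 11.0),
    "CallHome": (21.8, 26.4, 23.4),
    "CommonVoice 9": (8.4, 8.8, 8.6),
}


def margins(rows, rival_index):
    """Absolute WER gap (rival minus Nova-3), in percentage points."""
    return {name: round(vals[rival_index] - vals[0], 1) for name, vals in rows.items()}


vs_whisper = margins(wer, 1)
vs_assemblyai = margins(wer, 2)

for name in wer:
    print(f"{name:<24} Whisper +{vs_whisper[name]:.1f}   AssemblyAI +{vs_assemblyai[name]:.1f}")
```

On these figures the gap to Whisper ranges from 0.1 points (LibriSpeech) to 4.6 points (CallHome), and the gap to AssemblyAI Universal-2 from 0.1 to 1.6 points.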

Accuracy by Audio Condition

What Nova-3's benchmark results translate to per audio scenario. Ranges combine leaderboard data with the independent real-world spread (~7–10% typical).

| Audio Condition | Expected WER | Notes |
| --- | --- | --- |
| Clean studio speech, 1 speaker | 3–5% | Podcasts, dictation, voiceover |
| Conference talks, prepared speech | 3–4% | TED-LIUM-like audio |
| Conference call, 2 speakers | 7–10% | Business calls, good mics |
| Multi-speaker meetings (headset) | 12–15% | AMI benchmark: 13.4% |
| Financial/jargon-heavy calls | 9–12% | Earnings-22: 10.2%; keyterm prompting reduces jargon misses |
| Conversational phone (8 kHz) | 18–24% | CallHome: 21.8%; hardest common scenario |
| Accented English | 9–15% | Non-native speech degrades all engines (arXiv 2503.06924) |
| Noisy / far-field audio | 15–25%+ | Degrades sharply; mic quality dominates |

Reading this honestly: Deepgram's 5.26% claim is real for clean, prepared speech. Multi-speaker meetings run 2–3× that. Conversational phone audio runs 4× that, on every engine, not just Deepgram. If a vendor quotes you one accuracy number without naming the audio condition, the number is marketing. See our verdict on when AI accuracy is enough.
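
To turn those ranges into something you can feel, the sketch below converts an expected WER range into a rough count of wrong words for a recording of a given length. The WER ranges are copied from the table above; the 150 words-per-minute speaking rate is our assumption, not a figure from any source cited here.

```python
# Expected-WER ranges (as fractions), copied from the table above.
expected_wer = {
    "clean studio speech": (0.03, 0.05),
    "two-speaker conference call": (0.07, 0.10),
    "multi-speaker meeting": (0.12, 0.15),
    "conversational phone (8 kHz)": (0.18, 0.24),
}

WORDS_PER_MINUTE = 150  # assumed average speaking rate


def expected_errors(condition, minutes):
    """Return the (low, high) number of word errors to expect for a recording."""
    low, high = expected_wer[condition]
    words = minutes * WORDS_PER_MINUTE
    return round(words * low), round(words * high)


for condition in expected_wer:
    low, high = expected_errors(condition, minutes=30)
    print(f"{condition}: about {low}-{high} wrong words in 30 minutes")
```

Under that assumption, a 30-minute phone call (about 4,500 words) would carry roughly 810 to 1,080 word errors, against 135 to 225 for clean studio speech of the same length.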

The Speed–Accuracy Tradeoff: Deepgram's Actual Win

The most rigorous independent evaluation of commercial STT engines to date — "Measuring the Accuracy of Automatic Speech Recognition Solutions" (arXiv 2408.16287) — reached a two-part conclusion that Deepgram's marketing understandably doesn't quote in full:

On raw accuracy

Whisper and AssemblyAI achieved the highest transcription accuracy in the study. Deepgram trailed the leaders on WER across the tested audio.

On efficiency

Deepgram's processing speed made it the most efficient system once speed and accuracy were considered together — the best transcription-per-second of any engine tested.

This is why "how accurate is Deepgram" has a two-part answer. For a voice agent that must respond inside a ~500 ms latency budget, an engine that is 1 WER point worse but returns results in a fraction of the time is the better choice in practice, because slower engines cannot operate in that window at all. For transcribing recorded interviews where nobody is waiting, speed is irrelevant and the raw-WER leaders win. Deepgram prices this positioning aggressively too: at $0.0043/min for Nova-3 pay-as-you-go, it undercuts AssemblyAI (~$0.006/min) and the major cloud vendors (Google ~$0.016/min, AWS ~$0.024/min).
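
For a sense of scale on the pricing, here is a small cost sketch using the per-minute rates quoted in this article. The one-million-minute monthly volume is hypothetical, and the calculation assumes a flat rate with no volume discounts or minimums.

```python
# Per-minute prices (USD) as quoted in this article; check current vendor pricing.
price_per_minute = {
    "Deepgram Nova-3": 0.0043,
    "AssemblyAI": 0.006,
    "Google": 0.016,
    "AWS": 0.024,
}


def monthly_cost(minutes, rate):
    """Cost in USD for a number of audio minutes at a flat per-minute rate."""
    return round(minutes * rate, 2)


minutes = 1_000_000  # hypothetical monthly volume
for vendor, rate in price_per_minute.items():
    print(f"{vendor:<16} ${monthly_cost(minutes, rate):>10,.2f}")
```

At that volume the quoted rates work out to $4,300 for Deepgram Nova-3, $6,000 for AssemblyAI, $16,000 for Google, and $24,000 for AWS.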

Deepgram vs Whisper vs AssemblyAI

The three engines most evaluations shortlist, on the axes that actually differ. Real-world WER from independent indexes; prices from vendor pricing pages, verified July 5, 2026.

| Engine | English WER (real-world) | Speed | Price | Best for |
| --- | --- | --- | --- | --- |
| Deepgram Nova-3 | ~7–10% | Fastest batch + sub-300 ms streaming | $0.0043/min | Voice agents, high-volume, telephony |
| Deepgram Nova-2 | ~8–12% | Same infrastructure | $0.0036/min | Cost-sensitive existing integrations |
| AssemblyAI Universal-3 Pro | 2.3% (AgentTalk, AA-WER v2.0) | Streaming variant available | See vendor pricing | Max accuracy, entity-heavy audio |
| AssemblyAI Universal-2 | ~7–10% | Slower batch than Deepgram | $0.006/min | 99-language commercial API |
| Whisper Large-v3 | ~8–12% | 1× real-time self-hosted (GPU) | Free (MIT, self-hosted) | Self-hosting, multilingual, budget |
| Whisper Large-v3-turbo | ~9–13% | 8× real-time self-hosted | Free (MIT, self-hosted) | Fast self-hosted pipelines |

AssemblyAI's Universal-3 Pro (February 2026) measured 2.3% WER on the AgentTalk subset of Artificial Analysis's AA-WER v2.0 index — a newer benchmark not directly comparable to the real-world ranges in this column. Full treatment on our AssemblyAI accuracy page.

When Deepgram Is the Right Choice — and When It Isn't

Choose Deepgram when:

  • You're building voice agents or live captions — sub-300 ms streaming latency is the category benchmark
  • You process high volumes on a budget — $0.0043/min is the lowest major-API rate
  • Your audio is jargon-heavy — keyterm prompting (up to 100 terms) fixes what generic models miss
  • You transcribe telephony at scale — Nova-3's biggest benchmark margins are on phone-quality audio

Look elsewhere when:

  • You need maximum raw accuracy on recordings — independent testing puts Whisper and AssemblyAI ahead on WER
  • You want free self-hosting — Whisper Large-v3 is MIT-licensed and competitive within 1–3 points
  • You need broad multilingual coverage — Whisper covers 99+ languages out of the box; Nova-3's list is growing but shorter
  • You don't write code — Deepgram is an API. There is no upload-a-file consumer product

Want the accuracy without the API integration?

VexaScribe gives you Whisper Large-v3 accuracy through a simple upload interface — no code, from $2/mo. 100+ languages, speaker diarization, SRT/VTT/DOCX export.

Methodology & Sources

What WER actually measures

WER = (Substitutions + Deletions + Insertions) / Words in reference transcript

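A WER of 5% means about 5 word errors (substitutions, deletions, or insertions) for every 100 words in the reference transcript. Because insertions count as errors, WER is not simply 100% minus the share of words transcribed correctly, and it can exceed 100% on very poor output. WER comparisons are only valid when the same audio and the same text normalization are used for every engine, which is why this page separates vendor-published numbers from cross-engine index numbers throughout.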
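
The formula and the normalization caveat can both be seen in a few lines. This is a minimal sketch: `wer` computes word-level edit distance over the reference length, and `normalize` is one simple scheme we chose for illustration (lowercase, strip punctuation), not the normalizer used by any particular leaderboard or vendor.

```python
import re


def normalize(text):
    """Lowercase and strip punctuation: one simple, assumed normalization."""
    return re.sub(r"[^\w\s']", "", text.lower()).split()


def wer(ref, hyp):
    """Word error rate: word-level edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference transcript is empty")
    # prev[j] = edit distance between the ref words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = curr
    return prev[-1] / len(ref)


reference = "Dr. Patel prescribed tretinoin, twice daily."
hypothesis = "dr patel prescribed tretinoin twice daily"

raw = wer(reference.split(), hypothesis.split())
normalized = wer(normalize(reference), normalize(hypothesis))
print(f"raw WER: {raw:.0%}, normalized WER: {normalized:.0%}")  # raw WER: 67%, normalized WER: 0%
```

The hypothesis has every word right, yet scores 67% WER before normalization because casing and punctuation differ on four of six tokens. That is an extreme case, but it shows why numbers scored under different normalization rules cannot be compared.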

Sources

Verification and update window

Published and verified July 5, 2026. Model versions tracked: Deepgram Nova-3 (February 2025), Nova-3 Medical (March 2025), Nova-2 (November 2023), Whisper Large-v3 (November 2023), AssemblyAI Universal-2 (October 2024) and Universal-3 Pro (February 2026). Vendor claims, pricing, and benchmark numbers were cross-checked against the linked sources on the verification date. Where a claim has no independent replication, the page says so explicitly.

Frequently Asked Questions

What word error rate (WER) does Deepgram Nova-3 actually achieve?

Deepgram claims a 5.26% median WER for Nova-3 batch transcription and 6.84% for streaming (launch post, February 2025). Independent measurements, including Artificial Analysis's WER Index, put Nova-3 at roughly 7–10% on diverse real-world audio. Both numbers are real: the vendor figure is a median across vendor-selected test domains, while independent indexes apply one text normalization to every engine across harder, more varied audio. On specific benchmarks Nova-3 scores 2.6% on LibriSpeech test-clean, 13.4% on AMI meetings, and 21.8% on CallHome conversational phone audio.

Is Deepgram more accurate than Whisper?

On standard English benchmarks, Deepgram Nova-3 leads or ties Whisper Large-v3 on all eight Open ASR Leaderboard test sets — with its biggest margins on hard audio: AMI meetings (13.4% vs 15.9%), Earnings-22 financial calls (10.2% vs 12.3%), and CallHome phone audio (21.8% vs 26.4%). However, peer-reviewed testing on other real-world audio (arXiv 2408.16287) found Whisper and AssemblyAI slightly ahead on raw accuracy. The honest summary: they are within 1–3 percentage points of each other; Deepgram is decisively faster, Whisper is free to self-host and covers 99+ languages.

Is Deepgram more accurate than AssemblyAI?

They trade places depending on the test. On the eight Open ASR Leaderboard test sets listed above, Nova-3 edges Universal-2 on every dataset, by 0.1–1.6 percentage points. In the arXiv 2408.16287 evaluation, AssemblyAI ranked among the most accurate engines while Deepgram ranked fastest. AssemblyAI's newer Universal-3 Pro (February 2026) measured 2.3% WER on Artificial Analysis's AgentTalk benchmark. Practical rule: for maximum accuracy on recorded audio, AssemblyAI's newest model has the edge; for streaming latency and cost per minute, Deepgram wins.

How much more accurate is Nova-3 than Nova-2?

By Deepgram's own published numbers: streaming WER dropped from 8.4% (Nova-2) to 6.84% (Nova-3) — an 18.6% relative improvement — and batch median WER reached 5.26%. Nova-3 also added keyterm prompting (up to 100 custom terms per request) and expanded multilingual support through 2025–2026. Many products integrated before 2025 still call Nova-2, so if a tool 'powered by Deepgram' seems less accurate than these numbers, check which model generation it actually uses.
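
The relative-improvement arithmetic is easy to check. The sketch below reproduces the 18.6% figure from the vendor-published streaming numbers, and shows that the "47.4% better" headline follows from the same formula applied to a 10% competitor WER.

```python
def relative_improvement(old_wer, new_wer):
    """Fractional reduction in WER going from old_wer to new_wer."""
    return (old_wer - new_wer) / old_wer


# Figures quoted in this article (vendor-published).
print(f"Nova-2 -> Nova-3 streaming: {relative_improvement(8.4, 6.84):.1%}")     # 18.6%
print(f"10% competitor -> 5.26% batch: {relative_improvement(10.0, 5.26):.1%}")  # 47.4%
```

Relative figures sound larger than the underlying gaps: 18.6% relative is 1.56 percentage points absolute, and 47.4% relative is 4.74 points.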

How accurate is Deepgram on phone calls?

Phone audio is the hardest common scenario for every STT engine. Nova-3 scores 21.8% WER on the CallHome conversational telephone benchmark — the best result among major engines (Whisper Large-v3: 26.4%, AssemblyAI Universal-2: 23.4%), but still roughly 1 word in 5 wrong. The 8 kHz bandwidth, crosstalk, and casual speech of phone audio triple typical WER. Deepgram's strongest relative margins are on telephony, which is why call-center and voice-agent platforms disproportionately build on it.

Does Deepgram's keyterm prompting actually improve accuracy?

Yes, for the vocabulary it targets. Keyterm prompting (introduced with Nova-3) lets you pass up to 100 domain-specific terms — product names, people, jargon — per request, and the model biases recognition toward them. It doesn't change headline WER on generic benchmarks, but on jargon-dense audio it prevents exactly the errors that matter most in practice: misrendered names, drug names, and technical terms. Whisper has no equivalent feature; this is one of the strongest practical reasons to choose a commercial API over self-hosted open source.

Why do Deepgram's published numbers differ from independent benchmarks?

Three mechanical reasons: (1) dataset selection — vendors benchmark on domains their model was tuned for; (2) median vs pooled reporting — a median across test sets hides the worst domains, and phone audio can run 3–4× the median; (3) text normalization — how punctuation, casing, and numerals are handled before scoring can swing WER by 1–3 points. Independent indexes like Artificial Analysis and the Hugging Face Open ASR Leaderboard apply identical normalization to every engine, which makes their absolute numbers higher but their comparisons fairer.

How fast is Deepgram compared to other transcription APIs?

Fastest among major engines — this is Deepgram's one claim that independent testing confirms without qualification. The arXiv 2408.16287 evaluation found Deepgram the most efficient system when speed and accuracy are considered together, and its streaming mode targets sub-300 ms latency, which is why it dominates voice-agent infrastructure. At $0.0043/min for Nova-3 pay-as-you-go it is also the cheapest major commercial API per minute (AssemblyAI ~$0.006/min, Google ~$0.016/min, AWS ~$0.024/min).