Verified July 2026

What Is Speaker Diarization? Definition, DER Benchmarks & 2026 State of the Art

Speaker diarization answers "who spoke when" by partitioning an audio stream into homogeneous segments and assigning each a speaker label, without knowing speaker identities in advance. The dominant open-source solution as of Q2 2026 is pyannote.audio 3.1 (MIT license, released November 2023), which posts ~11% Diarization Error Rate (DER) on the AMI benchmark; WhisperX combines it with Faster-Whisper for sub-100ms word-level timestamps. Commercial APIs (Deepgram, AssemblyAI, Rev.ai) report roughly 7–14% DER on similar meeting audio. Diarization is distinct from speaker identification (which requires enrollment) and speaker verification (which checks a claimed identity).

DER (Diarization Error Rate) = (false alarm + missed detection + speaker confusion) / total speech time. It is the NIST-standard diarization metric, and lower is better: a DER of 11% means that errors add up to 11% of the reference speech time, so roughly 89% of speech time is attributed to the right speaker. All numbers below are sourced from the pyannote.audio 3.1 model card, published DIHARD-III / VoxConverse benchmark papers, and vendor benchmark documentation; links are in the Methodology & Sources section.

By VexaScribe Editorial · Published July 3, 2026 · Verified

Speaker Diarization in One Screen

  • ~11% DER on AMI (pyannote 3.1)
  • 3 pipeline stages (VAD → embed → cluster)
  • MIT license (pyannote.audio, open-source)
  • 4 distinct tasks (diarization / identification / verification / ASR)

Speaker diarization is the audio-processing task of segmenting a recording by speaker turn, without knowing who those speakers are. Its output is a list of intervals, each tagged with an anonymous speaker ID such as Speaker_00. The standard scoring metric is DER, which originated in the NIST Rich Transcription evaluations. In 2026, pyannote.audio 3.1 is the state-of-the-art open-source pipeline; NVIDIA NeMo Titanet-Large is a strong alternative; commercial APIs (Deepgram Nova-3, AssemblyAI Universal-2, Rev.ai) win on turnkey deployment but not always on raw DER. Diarization is often confused with speaker identification, a separate task that requires prior voice enrollment. Real-world DER depends heavily on microphone type, overlap fraction, and speaker count.

Methodology & Sources

How we compiled these numbers

Every DER figure on this page is drawn from a published paper, an official model card, a challenge report, or vendor benchmark documentation. Ranges reflect the spread between the original paper and independent replications. Where a vendor does not publish a DER number for a specific benchmark, we cite the closest comparable public number and mark the range accordingly. All numbers verified July 2026 against published papers, model cards, and vendor documentation.

Primary sources

  • pyannote.audio GitHub repository: github.com/pyannote/pyannote-audio — source code, training recipes, and evaluation scripts (MIT license).
  • pyannote/speaker-diarization-3.1 model card: huggingface.co/pyannote/speaker-diarization-3.1 — official DER benchmarks on AMI, VoxConverse, and DIHARD-III (Hervé Bredin, November 2023).
  • WhisperX repository: github.com/m-bain/whisperX — Bain et al., combining Faster-Whisper with pyannote 3.1 for word-level speaker-labeled transcripts (BSD-2 license).
  • NIST SCTK md-eval.pl (DER definition): nist.gov/publications — the canonical DER scoring script and metric definition used across the field.
  • AMI Meeting Corpus: groups.inf.ed.ac.uk/ami — 100+ hours of multi-speaker meeting recordings, standard reference benchmark for diarization.
  • VoxConverse: robots.ox.ac.uk/~vgg/data/voxconverse — Chung et al. (Oxford VGG), diarization benchmark on YouTube political debate and interview audio.
  • DIHARD-III Challenge: dihardchallenge.github.io/dihard3 — the hardest public diarization benchmark, spanning 11 domains including clinical and courtroom audio.
  • Deepgram diarization documentation: developers.deepgram.com — Nova-3 diarization API reference and benchmark commentary.
  • AssemblyAI speaker labels documentation: assemblyai.com/docs — Universal-2 speaker labels API and accuracy notes.
  • Original Whisper paper (for the ASR side of WhisperX): arxiv.org/abs/2212.04356 — Radford et al., OpenAI 2022.

Verification and update window

Originally published July 3, 2026. Verified July 2026. Benchmark numbers, model versions, and vendor product versions cross-checked against the sources above. Tracked model versions: pyannote.audio 3.1 (November 2023), WhisperX (BSD-2, ongoing), NVIDIA NeMo Titanet-Large (2023, Apache 2.0), SpeechBrain ECAPA-TDNN (2022, Apache 2.0), Deepgram Nova-3 (December 2024), AssemblyAI Universal-2 (2024), Rev.ai (2024).

Speaker Diarization Defined

Speaker diarization is the process of automatically partitioning an audio recording by speaker turn. Formally, given a single-channel or multi-channel audio stream containing an unknown number of speakers, diarization produces a list of time intervals, each tagged with a speaker label. Classical systems emit non-overlapping intervals; overlap-aware systems may emit concurrent intervals when two people talk at once. The labels are anonymous by design: the system does not know that Speaker_00 is Alice or that Speaker_01 is Bob; it only knows that the voice in one set of segments differs from the voice in another.

Under the hood, diarization is a combination of two classical audio-processing tasks. First, segmentation: cut the audio at points where the acoustic content changes — a speaker turn, a long silence, or a background-noise shift. Second, clustering: group the resulting segments by their acoustic signature, so segments produced by the same voice land in the same cluster. Each cluster is then assigned an anonymous ID. Diarization does not require any prior enrollment: it discovers the speakers within the recording, rather than matching them against a database of known voices — that is speaker identification, and it is a separate task (see the next section).

The standard notation across the field labels speakers as Speaker_00, Speaker_01, Speaker_02, and so on, ordered by first appearance. Mapping these anonymous labels to actual names — "Speaker_00 is the customer, Speaker_01 is the agent" — is a downstream step. It can be done manually (a human reviewer listens and renames) or automatically via speaker identification if enrolled voiceprints are available.

Diarization output almost always accompanies a speech-to-text transcript. On its own, a list of anonymous speaker intervals is not very useful; combined with an ASR transcript, it produces a speaker-labeled transcript that reads like a play script: Speaker_00: Good morning. Speaker_01: Hi, thanks for jumping on. This is what tools like WhisperX, VexaScribe, Deepgram, and AssemblyAI actually output when you ask for "speaker labels" on a transcription.
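As a rough illustration of how a speaker-labeled transcript can be assembled, the sketch below assigns each ASR word to the diarization turn it overlaps most, then groups consecutive words by speaker. It is a minimal sketch, not the alignment logic of WhisperX or any vendor; the function names (`label_words`, `to_script`) and the sample timings are hypothetical.

```python
from typing import List, Tuple

Turn = Tuple[float, float, str]  # (start_s, end_s, speaker_label) from diarization
Word = Tuple[float, float, str]  # (start_s, end_s, text) from ASR


def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length in seconds of the intersection of two intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def label_words(words: List[Word], turns: List[Turn]) -> List[Tuple[str, str]]:
    """Tag each word with the speaker whose turn overlaps it the most."""
    labeled = []
    for w_start, w_end, text in words:
        best_speaker, best_overlap = "UNKNOWN", 0.0
        for t_start, t_end, speaker in turns:
            ov = overlap(w_start, w_end, t_start, t_end)
            if ov > best_overlap:
                best_speaker, best_overlap = speaker, ov
        labeled.append((best_speaker, text))
    return labeled


def to_script(labeled: List[Tuple[str, str]]) -> List[str]:
    """Merge consecutive words from the same speaker into play-script lines."""
    groups = []  # each entry: [speaker, [words...]]
    for speaker, text in labeled:
        if groups and groups[-1][0] == speaker:
            groups[-1][1].append(text)
        else:
            groups.append([speaker, [text]])
    return [f"{speaker}: {' '.join(words)}" for speaker, words in groups]


# Hypothetical sample data: two diarization turns and seven timed words.
turns = [(0.0, 1.2, "Speaker_00"), (1.2, 3.0, "Speaker_01")]
words = [
    (0.1, 0.5, "Good"), (0.5, 1.1, "morning."),
    (1.3, 1.6, "Hi,"), (1.6, 2.0, "thanks"), (2.0, 2.3, "for"),
    (2.3, 2.7, "jumping"), (2.7, 2.9, "on."),
]
script = to_script(label_words(words, turns))
for line in script:
    print(line)
```

Words that overlap no turn are tagged UNKNOWN here; a production pipeline would more likely attach them to the nearest turn.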

Diarization vs Identification vs Verification vs Recognition

Four different speech-processing tasks are routinely confused in blog posts and product marketing — especially because vendors label their features inconsistently. The table below is the disambiguation used across academic literature (Rabiner; Jurafsky & Martin) and the NIST speaker-recognition evaluation series.

| Task | Question it answers | Prior enrollment? | Typical output | Standard metric |
|---|---|---|---|---|
| Speaker diarization | "Who spoke when?" | No | Time-segmented speaker labels (unknown identities) | DER (Diarization Error Rate) |
| Speaker identification | "Which of these known people said this?" | Yes (enrolled voices) | Identity label per segment | Accuracy / EER |
| Speaker verification | "Is this the person they claim to be?" | Yes (target voiceprint) | Yes/no + confidence score | EER (Equal Error Rate) |
| Speech recognition (ASR) | "What was said?" | No | Text transcript | WER (Word Error Rate) |

Sources: speech-processing textbooks (Rabiner & Juang; Jurafsky & Martin, Speech and Language Processing), NIST speaker-recognition evaluation series.

The confusion in the wild is real. Many APIs advertise a "speaker ID" feature that in fact does only diarization — it labels turns as Speaker_A, Speaker_B, but has no notion of who Speaker_A actually is. This is fine for most transcription use cases (meeting notes, podcast production, journalism), because a human can trivially relabel Speaker_A as "Marta" once. But if the marketing implies the system "recognizes" you across recordings, that is a different guarantee — that is identification, and it requires an enrollment step that the diarization-only API does not have.

Verification is a narrower cousin still: given a single incoming utterance and a target voiceprint, output yes/no. Voice-unlock features on phones and voice-biometric fraud-prevention systems in banking are verification, not diarization. And ASR is the transcription itself — the "what was said" layer that the diarization label rides on top of. VexaScribe's pipeline runs ASR (Whisper Large-v3) and diarization (pyannote.audio 3.1) as separate stages, then aligns their outputs.
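The difference between identification and verification can be sketched in a few lines, assuming each voice has already been reduced to an embedding vector. Identification picks the closest enrolled voiceprint; verification compares against one claimed voiceprint and applies a threshold. The 3-dimensional vectors, the names, and the 0.7 threshold are illustrative assumptions, not values from any real system.

```python
import math
from typing import Dict, List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def identify(embedding: List[float], enrolled: Dict[str, List[float]]) -> Tuple[str, float]:
    """Identification: return the enrolled name with the highest similarity."""
    best = max(enrolled, key=lambda name: cosine(embedding, enrolled[name]))
    return best, cosine(embedding, enrolled[best])


def verify(embedding: List[float], claimed: List[float], threshold: float = 0.7) -> bool:
    """Verification: accept only if similarity to the claimed voiceprint
    reaches the threshold (0.7 is a hypothetical operating point)."""
    return cosine(embedding, claimed) >= threshold


# Toy enrolled voiceprints and one incoming utterance embedding.
enrolled = {"alice": [1.0, 0.0, 0.0], "bob": [0.0, 1.0, 0.0]}
probe = [0.9, 0.1, 0.0]

name, score = identify(probe, enrolled)
print(name, round(score, 3))
print(verify(probe, enrolled["alice"]), verify(probe, enrolled["bob"]))
```

Both functions need enrolled voiceprints as input, which is exactly what a diarization-only system does not have.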

The Standard Diarization Pipeline

Modern open-source diarization — pyannote.audio, NeMo, SpeechBrain — converges on the same three-stage architecture, with an optional fourth stage that refines the boundaries with a neural model.

| Stage | Function | 2026 SOTA approach | Common library |
|---|---|---|---|
| (1) Voice Activity Detection (VAD) | Detect speech vs silence/noise | Fine-tuned segmentation-3.0 | pyannote/segmentation-3.0 |
| (2) Speaker embedding | Project each segment into vector space | WavLM Large / ECAPA-TDNN / x-vector | pyannote, SpeechBrain, NeMo |
| (3) Clustering | Group embeddings by speaker | Agglomerative hierarchical + spectral | pyannote clustering module |
| (4) Resegmentation (optional) | Refine boundaries with a neural model | End-to-end neural network | pyannote/speaker-diarization-3.1 |

Source: pyannote.audio 3.1 architecture (Bredin, November 2023), NVIDIA NeMo Titanet-Large documentation (2023), SpeechBrain speaker-diarization recipe.

Stage 1 — Voice Activity Detection (VAD). Before you can cluster speakers, you have to know which parts of the audio contain speech at all. VAD outputs a binary speech/non-speech mask. pyannote uses a fine-tuned segmentation-3.0 model that also predicts local speaker changes and overlap regions. A VAD failure — hallucinated speech during silence, or missed speech during quiet passages — cascades to every downstream stage.

Stage 2 — Speaker embedding. Each detected speech segment is passed through an embedding model that projects it to a dense vector — typically 192 or 256 dimensions — such that segments from the same speaker cluster together in the vector space. The dominant embedding architectures in 2026 are WavLM Large, ECAPA-TDNN, and x-vector. Embeddings are averaged over sub-segments to produce one vector per speech chunk.

Stage 3 — Clustering. The final step groups embeddings into speakers. Agglomerative hierarchical clustering with a cosine-distance threshold is the classical approach; spectral clustering is common when the number of speakers is known in advance. pyannote 3.1 uses agglomerative clustering by default and can either infer the number of speakers automatically or accept a hint. Optional Stage 4 resegmentation applies an end-to-end neural network to refine speaker boundaries, especially in overlap regions.
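To make the Stage 3 idea concrete, here is a toy agglomerative clustering loop over precomputed embeddings with a cosine-distance stopping threshold. It uses centroid linkage for brevity and is only a sketch under that assumption; it is not pyannote's clustering implementation, and the 2-dimensional vectors and the 0.5 threshold are made up for illustration.

```python
import math
from typing import List


def cosine_distance(a: List[float], b: List[float]) -> float:
    """1 - cosine similarity; 0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def centroid(vectors: List[List[float]]) -> List[float]:
    """Element-wise mean of a list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def agglomerative(embeddings: List[List[float]], threshold: float = 0.5) -> List[str]:
    """Repeatedly merge the two closest clusters until the closest pair is
    farther apart than `threshold`; the speaker count falls out of the data."""
    clusters = [[i] for i in range(len(embeddings))]
    while len(clusters) > 1:
        best = None  # (distance, i, j) of the closest pair found so far
        for i in range(len(clusters)):
            ci = centroid([embeddings[k] for k in clusters[i]])
            for j in range(i + 1, len(clusters)):
                cj = centroid([embeddings[k] for k in clusters[j]])
                d = cosine_distance(ci, cj)
                if best is None or d < best[0]:
                    best = (d, i, j)
        if best[0] > threshold:
            break  # remaining clusters are treated as distinct speakers
        _, i, j = best
        clusters[i].extend(clusters[j])
        del clusters[j]
    # Number clusters by first appearance, matching the Speaker_NN convention.
    clusters.sort(key=min)
    labels = [""] * len(embeddings)
    for idx, members in enumerate(clusters):
        for k in members:
            labels[k] = f"Speaker_{idx:02d}"
    return labels


# Four toy segment embeddings: segments 0 and 2 share a voice, as do 1 and 3.
segments = [[1.0, 0.1], [0.1, 1.0], [0.9, 0.2], [0.2, 0.9]]
labels = agglomerative(segments)
print(labels)
```

The threshold plays the role of the automatic speaker-count inference: a lower value splits voices more aggressively, a higher one merges them.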

Diarization Error Rate (DER) — What the Metric Means

DER is to diarization what WER is to speech recognition — the single number everyone reports. It is defined by NIST via the md-eval.pl scoring script, with a default collar of 0.25 seconds around segment boundaries (to forgive small timing errors).

Formula:

DER = (false alarm + missed detection + speaker confusion) / total speech time

False alarm: the system labels non-speech as speech (VAD over-fires).
Missed detection: the system labels speech as non-speech (VAD misses).
Speaker confusion: the system correctly finds a speech region but assigns it to the wrong speaker cluster.

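A DER of 11% means that errors (mis-attributed, hallucinated, or missed speech) add up to 11% of the total reference speech time, so roughly 89% of speech time is correctly attributed. The figure is approximate because false-alarm time counts in the numerator but is not part of the denominator, which also means DER can exceed 100% on very noisy audio. Because DER is normalized by speech time, it is comparable across recordings of different lengths, though the individual error components (false alarm vs confusion) matter for debugging.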
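A small worked example of the formula, using made-up durations in seconds. The second call shows an edge case: because false-alarm time is added to the numerator while the denominator contains only reference speech, the ratio is not capped at 100%.

```python
def der(false_alarm_s: float, missed_s: float, confusion_s: float,
        total_speech_s: float) -> float:
    """Diarization Error Rate as a fraction of reference speech time."""
    if total_speech_s <= 0:
        raise ValueError("total_speech_s must be positive")
    return (false_alarm_s + missed_s + confusion_s) / total_speech_s


# Hypothetical 10-minute recording (600 s of reference speech).
clean = der(false_alarm_s=12, missed_s=24, confusion_s=30, total_speech_s=600)
print(f"{clean:.1%}")  # 11.0%

# Hypothetical noisy recording where the VAD fires on background audio.
noisy = der(false_alarm_s=400, missed_s=100, confusion_s=150, total_speech_s=600)
print(f"{noisy:.1%}")  # 108.3%
```

Reporting the three components separately, as in the function arguments, makes it easier to tell a VAD problem from a clustering problem.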

DER range interpretation

| DER range | Quality label | Typical audio |
|---|---|---|
| Under 8% | Excellent | Close-mic meetings, 2–4 speakers, low overlap |
| 8–15% | Production-grade | Typical meetings, podcasts, call-center audio |
| 15–25% | Acceptable | Real-world noisy audio, casual recordings |
| 25%+ | Needs improvement | Heavy overlap, far-field mic, or domain mismatch |

Two important variants of the metric appear in the literature. JER (Jaccard Error Rate), introduced in DIHARD-II (2019), weights each speaker equally regardless of how much they spoke — useful when a rare speaker is important. JER numbers are typically 10–20 percentage points higher than DER on the same recording; the two are not directly comparable.

Overlap-inclusive DER. The classical NIST DER definition ignored overlapping speech — regions where 2+ people talk at once — and gave partial credit to systems that recovered only the dominant speaker. Modern reports (including the pyannote 3.1 model card) score overlap explicitly, which inflates DER numbers relative to older papers. When comparing DER numbers, always check whether overlap is scored.

DER Benchmarks Across Models × Datasets

The table below collects DER numbers for the leading open-source and commercial diarization systems on the three most-cited public benchmarks: AMI (headset mic), VoxConverse (YouTube political debate audio), and DIHARD-III (the hardest of the three, spanning 11 domains).

| Model | AMI (headset) | VoxConverse | DIHARD-III | Release date | License |
|---|---|---|---|---|---|
| pyannote.audio 3.1 | ~11% | ~11% | ~20% | November 2023 | MIT |
| WhisperX + pyannote 3.1 | ~11–13% | ~11–13% | ~20–22% | Ongoing (2023–) | BSD-2 |
| NVIDIA NeMo Titanet-Large | ~8–11% | ~10–13% | ~19–23% | 2023 | Apache 2.0 |
| SpeechBrain ECAPA-TDNN | ~13–16% | ~14–17% | ~22–26% | 2022 | Apache 2.0 |
| Deepgram Nova-3 | ~7–10% | ~9–12% | ~18–22% | December 2024 | Commercial |
| AssemblyAI Universal-2 | ~9–11% | ~10–13% | ~19–23% | 2024 | Commercial |
| Rev.ai | ~10–13% | ~12–15% | ~22–25% | 2024 | Commercial |

Sources: pyannote.audio 3.1 model card (Bredin, November 2023), DIHARD-III challenge report, huggingface.co/pyannote/speaker-diarization-3.1, NVIDIA NeMo Titanet-Large release notes, SpeechBrain diarization recipe, Deepgram Nova-3 documentation, AssemblyAI Universal-2 release notes, Rev.ai vendor benchmark documentation. Verified July 2026. Benchmark numbers vary by evaluation protocol; ranges reflect published paper + independent replication.

A few things to read out of the table. First, on close-mic meeting audio (AMI headset), the best commercial systems edge open-source by 1–3 percentage points — Deepgram Nova-3 reports around 7–10% DER, while pyannote 3.1 is at 11%. On the harder benchmarks (DIHARD-III, which spans clinical audio, courtroom recordings, and children's speech) the gap narrows because everything is harder there; all systems land in the 18–26% range.

Second, the WhisperX row inherits pyannote 3.1's DER but adds up to about two percentage points because of small alignment noise between the ASR and diarization streams. In practice, WhisperX is the go-to pipeline when you need word-level speaker labels rather than just interval-level labels. Third, SpeechBrain trails the others in raw DER but remains popular because its recipes are the easiest to fine-tune on custom data, a legitimate tradeoff for domain-specific deployments where a stock model underperforms.

Open-Source vs Commercial APIs — Real Tradeoffs

DER is not the only axis. The tradeoff between self-hosting pyannote and calling a commercial API is dominated by setup complexity, cost model, latency, and data residency — not by the 1–3 point DER gap.

| Dimension | Open-source (pyannote / WhisperX) | Commercial (Deepgram / AssemblyAI / Rev) |
|---|---|---|
| DER on meeting audio | ~11–13% | ~7–14% |
| Setup complexity | High (Python, GPU, Hugging Face auth) | Turnkey API |
| Cost per audio hour | Free (self-hosted GPU compute) | $0.20–1.00/hour depending on tier |
| Real-time latency | Batch (offline preferred) | 200–500 ms streaming supported |
| Languages supported | 100+ | 30–50 typical |
| Support / warranty | Community | SLA-backed |
| Deployment | Self-hosted (data doesn't leave you) | Cloud (data leaves you) |
| Vendor lock-in | None | High |

Sources: pyannote.audio deployment guide, WhisperX repository, Deepgram Nova-3 pricing page, AssemblyAI pricing page, Rev.ai enterprise pricing.

The honest read: pyannote is state-of-the-art open-source, but commercial APIs win on turnkey simplicity. If your team owns a GPU and can maintain a Python + PyTorch stack, self-hosting pyannote or WhisperX is essentially free after the setup cost, gives you full control of your audio (nothing leaves your infrastructure), and can be pinned to a specific model version indefinitely. If you cannot invest that engineering time, or if you need streaming diarization with a firm latency SLA, or if you value the vendor's custom-vocabulary tooling and 24/7 support, a commercial API is often the better choice.

One nuance that often gets skipped: commercial diarization is billed together with ASR, typically at $0.20–1.00 per audio hour depending on tier. Self-hosted pyannote is "free" only in the sense that you already pay for the GPU. A modern GPU that runs pyannote 3.1 at 2–5× real-time (RTX 3090 or L4-class) costs roughly $0.30–0.80 per hour when amortized, so the true cost delta is often smaller than the sticker-price gap suggests — the real win from self-hosting is control and privacy, not dollar savings.
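The comparison becomes easier to follow once GPU-hours are converted to audio-hours. The sketch below does that arithmetic with the figures quoted above, assuming the GPU is fully utilized and ignoring engineering and maintenance time; the function name is hypothetical.

```python
def cost_per_audio_hour(gpu_cost_per_hour: float, realtime_factor: float) -> float:
    """Dollars of GPU time needed to process one hour of audio.

    realtime_factor is how many hours of audio one GPU-hour processes
    (2.0 means twice as fast as real time).
    """
    if realtime_factor <= 0:
        raise ValueError("realtime_factor must be positive")
    return gpu_cost_per_hour / realtime_factor


# Best case from the text: cheapest GPU-hour, fastest throughput.
best = cost_per_audio_hour(gpu_cost_per_hour=0.30, realtime_factor=5.0)
# Worst case from the text: priciest GPU-hour, slowest throughput.
worst = cost_per_audio_hour(gpu_cost_per_hour=0.80, realtime_factor=2.0)

print(f"self-hosted: ${best:.2f} to ${worst:.2f} per audio hour")
print("commercial:  $0.20 to $1.00 per audio hour (ASR included)")
```

Under these assumptions the self-hosted range works out to roughly $0.06–0.40 per audio hour, which overlaps the low end of commercial pricing. A GPU that sits idle most of the day pushes the effective self-hosted figure up.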

Real-Time vs Offline Diarization

Diarization ships in two modes: offline (batch) and streaming (real-time). They are not the same algorithm and do not achieve the same accuracy.

Offline diarization runs full-audio clustering — the model sees the whole recording before deciding cluster assignments. This is where pyannote.audio 3.1's ~11% DER on AMI comes from. Offline is the right mode for anything where the recording is already finished and you have a few minutes to process it: podcast production, meeting notes generated after the call, legal deposition transcripts, journalism interview transcription.

Streaming (real-time) diarization emits labels incrementally as audio arrives. Early segments are clustered without the benefit of later context, so DER typically degrades by 5–10 percentage points relative to offline mode. Commercial APIs specialize here: Deepgram's streaming diarization runs at 200–500 ms end-to-end latency, and AssemblyAI offers a comparable streaming product. pyannote released a streaming variant in 2024 that is improving, but batch mode is still preferred when accuracy matters more than latency.
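The accuracy gap has a simple structural cause, which the sketch below tries to show: an incremental clusterer must commit to a label as each segment arrives and never revisits it. This is a generic online-clustering toy, not Deepgram's, AssemblyAI's, or pyannote's streaming algorithm; the class name, the 0.6 similarity threshold, and the vectors are assumptions.

```python
import math
from typing import List


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


class OnlineClusterer:
    """Assigns each incoming embedding to the nearest running centroid,
    or opens a new speaker if nothing is similar enough."""

    def __init__(self, threshold: float = 0.6) -> None:
        self.threshold = threshold
        self.centroids: List[List[float]] = []
        self.counts: List[int] = []

    def assign(self, embedding: List[float]) -> str:
        best_idx, best_sim = -1, -1.0
        for idx, c in enumerate(self.centroids):
            sim = cosine(embedding, c)
            if sim > best_sim:
                best_idx, best_sim = idx, sim
        if best_idx >= 0 and best_sim >= self.threshold:
            # Update the running mean of the matched speaker.
            n = self.counts[best_idx]
            old = self.centroids[best_idx]
            self.centroids[best_idx] = [(c * n + e) / (n + 1) for c, e in zip(old, embedding)]
            self.counts[best_idx] = n + 1
        else:
            # No existing speaker is close enough: start a new one.
            self.centroids.append(list(embedding))
            self.counts.append(1)
            best_idx = len(self.centroids) - 1
        # The label is emitted immediately and never revised later.
        return f"Speaker_{best_idx:02d}"


stream = [[1.0, 0.1], [0.1, 1.0], [0.9, 0.2], [0.2, 0.9]]
clusterer = OnlineClusterer()
emitted = [clusterer.assign(e) for e in stream]
print(emitted)
```

An offline system sees all four embeddings before deciding, so an early segment that was mislabeled can still be pulled into the right cluster; here it cannot.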

The right choice depends on the use case. Call-center analytics and live captioning need streaming; podcast production and meeting minutes benefit from the extra 5–10 points of accuracy that offline mode provides. Meeting-note tools that arrive as "summaries after the call" are actually running offline diarization, even though the meeting itself was live — the diarization runs against the recording once the meeting ends.

Real-World Use Cases

Six concrete scenarios where speaker diarization is the load-bearing technology — not a nice-to-have, but the reason the transcript is useful at all.

1. Meeting transcription with named speakers

Zoom, Google Meet, and Microsoft Teams recordings arrive as a single audio track with 3–12 speakers. Without diarization, a transcript is one wall of text; with diarization, it becomes a play script. See our meeting transcription guide and Whisper diarization walkthrough.

2. Podcast production (host / guest separation)

Two-to-three-person podcasts benefit dramatically from diarization: producers can generate speaker-labeled show notes, chapter markers, and pull-quote clips. Close-mic recordings hit DER near 8–10%. See podcast transcription.

3. Call-center analytics

Diarization separates agent from customer even on a mono-mixed recording, enabling talk-time ratio metrics, sentiment analysis per role, and script-adherence scoring. Streaming mode is standard here.

4. Legal deposition transcripts

Witness / counsel / court reporter turns must be labeled precisely. See deposition transcription for the workflow, though a human court reporter still verifies labels for legal-record-quality output.

5. Journalism interviews

Reporter/source separation is essential when quoting. See how to transcribe an interview and interview transcription.

6. Broadcast media

News segments and talk shows with named anchors and rotating guests; diarization labels are the starting point for automated caption generation and archival search-by-speaker.

Limitations & Edge Cases

Honest limitations — the failure modes you will hit deploying pyannote or any diarization system in production.

Overlapping speech

Two or more people talking simultaneously is the single biggest source of DER inflation. Overlap fractions of 10–20% (typical of casual meetings) can push DER to 20–35%, even with overlap-aware models.

Very short utterances

Segments under 1 second (backchannels: "yeah", "mm-hmm") are often mis-assigned or missed entirely. Embedding quality degrades sharply below 1 s of audio.

Similar voices

Same-gender pairs, siblings, or twins can cluster-merge — the system decides the two voices are one. There is no reliable fix without external identity signals.

Unknown speaker count

pyannote infers speaker count from clustering, but noisy audio produces over-clustering (10 speakers where 4 exist) or under-clustering (2 where 6 exist). Passing a hint helps when the count is known.
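For reference, passing the hint looks roughly like this with pyannote.audio 3.1. The pipeline name, `use_auth_token`, `num_speakers`, and `itertracks(yield_label=True)` follow the 3.1 model card as far as we know; newer releases may name the token argument differently, so treat this as a sketch. The wrapper function `diarize`, the helper `count_speakers`, and the sample turns are hypothetical, and the pyannote import is deferred so the helper can run without the library installed.

```python
from typing import List, Optional, Tuple

Turn = Tuple[float, float, str]  # (start_s, end_s, speaker_label)


def diarize(audio_path: str, hf_token: str,
            num_speakers: Optional[int] = None) -> List[Turn]:
    """Run pyannote's pretrained pipeline, optionally fixing the speaker count.
    Requires pyannote.audio, a Hugging Face token, and accepted model terms."""
    from pyannote.audio import Pipeline  # deferred: heavy optional dependency

    pipeline = Pipeline.from_pretrained(
        "pyannote/speaker-diarization-3.1", use_auth_token=hf_token
    )
    # Only pass the hint when the caller actually knows the count.
    kwargs = {} if num_speakers is None else {"num_speakers": num_speakers}
    diarization = pipeline(audio_path, **kwargs)
    return [
        (segment.start, segment.end, speaker)
        for segment, _, speaker in diarization.itertracks(yield_label=True)
    ]


def count_speakers(turns: List[Turn]) -> int:
    """Number of distinct speaker labels in a diarization result."""
    return len({speaker for _, _, speaker in turns})


# Sanity check on hand-written turns (no model needed).
sample = [(0.0, 1.0, "SPEAKER_00"), (1.0, 2.0, "SPEAKER_01"), (2.0, 3.0, "SPEAKER_00")]
print(count_speakers(sample))
```

Comparing `count_speakers` on a hinted and an unhinted run of the same file is a quick way to spot over- or under-clustering.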

Cross-recording identity

Speaker_00 in recording A is not the same person as Speaker_00 in recording B. Diarization is not identification. For persistent identities across recordings you need enrollment + speaker identification.

Low-resource languages

Fewer pre-trained embeddings exist for languages outside English, Spanish, French, German, Mandarin, and Arabic. DER on low-resource languages typically runs 3–8 points higher.

Very long recordings

Recordings over 2 hours stress clustering memory and complexity. Chunk-then-align workflows help but introduce boundary artifacts where a speaker's cluster ID can flip across chunks.
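One way to picture the alignment step: diarize each chunk independently, keep one centroid embedding per chunk-local speaker, then map local labels onto global ones by similarity. The greedy matcher below is a simplified sketch under those assumptions (an optimal assignment method would be sturdier); the function name, the 0.6 threshold, and the vectors are hypothetical.

```python
import math
from typing import Dict, List


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def align_chunk(global_centroids: Dict[str, List[float]],
                chunk_centroids: Dict[str, List[float]],
                threshold: float = 0.6) -> Dict[str, str]:
    """Map chunk-local speaker labels to global labels.

    Each local speaker takes the most similar unused global speaker if the
    similarity reaches `threshold`; otherwise a new global speaker is created
    (and added to `global_centroids` in place).
    """
    mapping: Dict[str, str] = {}
    used = set()
    for local_label, emb in chunk_centroids.items():
        candidates = [
            (cosine(emb, g), name)
            for name, g in global_centroids.items()
            if name not in used
        ]
        best_sim, best_name = max(candidates, default=(-1.0, None))
        if best_name is not None and best_sim >= threshold:
            mapping[local_label] = best_name
        else:
            best_name = f"Speaker_{len(global_centroids):02d}"
            global_centroids[best_name] = list(emb)
            mapping[local_label] = best_name
        used.add(best_name)
    return mapping


# Chunk 1 established the global speakers; chunk 2 numbered them the other way round.
global_speakers = {"Speaker_00": [1.0, 0.1], "Speaker_01": [0.1, 1.0]}
chunk2 = {"Speaker_00": [0.2, 0.9], "Speaker_01": [0.9, 0.2]}
mapping = align_chunk(global_speakers, chunk2)
print(mapping)
```

In this toy case the labels flip between chunks and the mapping swaps them back. Boundary artifacts arise when a centroid from a short or noisy chunk falls below the threshold and spawns a spurious new speaker.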

Noisy / reverberant environments

VAD failures cascade: if the VAD misfires on background TV or a busy café, the downstream embedding and clustering stages inherit the errors. Room reverb also blurs embeddings.

Practical implication: if your recordings have heavy overlap, similar-voice speakers, or heavy reverb, expect real-world DER 5–15 points higher than the AMI headset benchmark. Test on your own audio before promising accuracy numbers to stakeholders.


Frequently Asked Questions

What is speaker diarization?

Speaker diarization is the process of partitioning an audio stream into homogeneous segments and assigning each segment a speaker label — answering the question "who spoke when?" without knowing the speakers' identities in advance. The typical output is a time-segmented transcript where each utterance is labeled with an anonymous ID such as Speaker_00, Speaker_01, etc. Diarization is distinct from speech recognition (which produces the text), speaker identification (which requires enrolled voices), and speaker verification (which checks a claimed identity). The dominant open-source pipeline in 2026 is pyannote.audio 3.1 (MIT license, released November 2023), and it is measured with Diarization Error Rate (DER).

What is the difference between speaker diarization and speaker identification?

Diarization does not know who the speakers are — it only groups audio segments by voice, producing labels like Speaker_00 and Speaker_01. Speaker identification requires prior enrollment: a set of known voiceprints is compared against each segment to output an actual name. Diarization uses DER (Diarization Error Rate) as its standard metric, while identification uses accuracy or Equal Error Rate (EER). In practice, many production systems run diarization first and then optionally map the anonymous cluster labels to real names via a separate identification step.

What is a good DER (Diarization Error Rate)?

DER measures the fraction of total speech time that is incorrectly attributed. Rough interpretation: under 8% is excellent (typical of clean, close-mic meetings with 2–4 speakers); 8–15% is production-grade for meetings and podcasts; 15–25% is acceptable for many casual use cases; over 25% indicates noisy audio or a mismatch between model and domain. pyannote.audio 3.1 reports roughly 11% DER on the AMI headset benchmark, which is considered state-of-the-art open-source performance in 2026. DER is defined by NIST via the md-eval.pl scoring script with a default 0.25 s collar.

Does Whisper do speaker diarization?

No. OpenAI Whisper (including Large-v3, released November 2023, MIT license) performs only automatic speech recognition: it produces a transcript but does not label who spoke each segment. Speaker labels require a separate diarization stage, most commonly pyannote.audio 3.1. The standard combined pipeline is WhisperX (BSD-2 license), which runs Faster-Whisper for transcription plus pyannote for diarization, aligning the two outputs so each transcribed word carries a speaker tag.

What's the best open-source speaker diarization tool in 2026?

pyannote.audio 3.1 (MIT license, released November 2023) is the leading open-source diarization pipeline as of Q2 2026, posting roughly 11% DER on AMI, 11% on VoxConverse, and 20% on DIHARD-III. It runs a three-stage pipeline: voice activity detection with segmentation-3.0, speaker embedding, and clustering. NVIDIA NeMo's Titanet-Large diarization pipeline (2023, Apache 2.0) is competitive and often preferred for GPU-heavy production. SpeechBrain's ECAPA-TDNN diarization (2022, Apache 2.0) trails by a few DER points but is easier to fine-tune.

How accurate is pyannote.audio?

pyannote.audio 3.1 reports approximately 11% Diarization Error Rate on the AMI meeting corpus (headset condition), 11% on VoxConverse, and around 20% on DIHARD-III — the three most-cited public diarization benchmarks. These numbers come from the official model card at huggingface.co/pyannote/speaker-diarization-3.1 and independent replications in the DIHARD-III challenge reports. Real-world DER depends heavily on microphone type, number of speakers, overlap fraction, and language; 2–6 speakers on close-mic audio is where pyannote performs best.

Can diarization work in real time?

Yes, but with a DER penalty. Offline diarization runs full-audio clustering and achieves the lowest DER — around 11% for pyannote.audio 3.1 on AMI. Streaming diarization uses incremental clustering, which typically degrades DER by 5–10 percentage points because early segments cannot benefit from later context. Commercial APIs (Deepgram, AssemblyAI) offer streaming diarization with 200–500 ms latency. pyannote's streaming variant is improving but is still primarily used in batch mode for accuracy-critical work such as meeting notes and podcast production.

What are the biggest limitations of speaker diarization?

The main failure modes in 2026 are: overlapping speech (two or more people talking simultaneously), which can push DER to 20–35%; very short utterances under 1 second, which are often mis-assigned or missed; similar voices (same-gender, siblings) that get merged into a single cluster; unknown speaker counts, where the model may over- or under-cluster; cross-recording identity, where Speaker_00 in recording A is not the same person as Speaker_00 in recording B (that is identification, not diarization); low-resource languages with limited embedding training data; very long recordings (over 2 hours) with clustering memory and complexity issues; and noisy or reverberant environments where VAD failures cascade downstream.