Whisper Speaker Diarization — How to Add Speaker Labels to Whisper (2026)

OpenAI's Whisper transcribes speech but doesn't label who is speaking. To get speaker labels you bolt on a diarization model (pyannote.audio or NVIDIA NeMo) and align it to Whisper's word timestamps. Below: working code for the three main open-source paths (WhisperX, whisper-diarization, raw pyannote), an honest comparison, real failure modes, production deployment realities, and self-host vs managed cost math.

Does Whisper Do Speaker Diarization?

No. The open-source Whisper model outputs text + segment timestamps, not speaker IDs. The OpenAI whisper-1 hosted API doesn't either.

Whisper is a sequence-to-sequence ASR + translation model. Diarization (speaker embedding + clustering over time) is a different task that needs a separate model family. To get speaker-labeled transcripts you combine Whisper with a diarization model in a multi-stage pipeline.

2026 update: OpenAI released gpt-4o-transcribe-diarize, a separate model on the same audio endpoint that returns speaker-labeled output via a diarized_json format. More on that below.

Reference: openai/whisper Discussion #264 is where Whisper users get pointed for third-party diarization solutions.

The Standard Whisper + Diarization Pipeline

Every working open-source solution follows the same five-stage pipeline. Understanding this lets you debug when speaker labels come out wrong.

audio file

↓

1. VAD (Silero or pyannote)

strip silence → reduces Whisper hallucinations

↓

2. ASR (Whisper or faster-whisper)

segment-level transcription with rough timestamps

↓

3. Forced alignment (wav2vec2 or ctc-forced-aligner)

convert segment timestamps → word-level timestamps

↓

4. Diarization (pyannote.audio or NVIDIA NeMo)

cluster speaker embeddings → speaker turns

↓

5. Assignment

map each word's midpoint timestamp to a speaker turn

↓

speaker-labeled transcript

Why all five stages matter: skipping forced alignment is a common reason quick laptop tutorials produce unreliable speaker labels. Whisper's native segment timestamps are too coarse to attribute words to speakers reliably, because a single segment can span a speaker change. You need word-level timing first.
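
To make the point about coarse timestamps concrete, here is a minimal sketch with made-up timings (the segment, words, and speaker turns are hypothetical, not output from any real model). It assigns a speaker by midpoint twice: once using only the segment's start and end, and once per word.

```python
# Hypothetical data: one Whisper segment that spans a speaker change at 3.0 s.
turns = [(0.0, 3.0, "SPEAKER_00"), (3.0, 6.0, "SPEAKER_01")]
segment = {
    "start": 0.0,
    "end": 6.0,
    "words": [
        {"word": "How", "start": 0.2, "end": 0.5},
        {"word": "are", "start": 0.6, "end": 0.8},
        {"word": "you?", "start": 0.9, "end": 1.4},
        {"word": "Fine,", "start": 3.2, "end": 3.6},
        {"word": "thanks.", "start": 3.7, "end": 4.3},
    ],
}


def speaker_at(t, turns):
    """Return the speaker whose turn contains time t, or "UNKNOWN"."""
    for start, end, speaker in turns:
        if start <= t < end:
            return speaker
    return "UNKNOWN"


def midpoint(item):
    return (item["start"] + item["end"]) / 2


# Segment-level: every word inherits the label found at the segment midpoint.
segment_label = speaker_at(midpoint(segment), turns)
coarse = [(w["word"], segment_label) for w in segment["words"]]

# Word-level: each word gets the label found at its own midpoint.
fine = [(w["word"], speaker_at(midpoint(w), turns)) for w in segment["words"]]

print(coarse)
print(fine)
```

With segment-level timing all five words land on a single speaker; word-level timing recovers the change at 3.0 s.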

Path 1 — WhisperX (Default Choice)

github.com/m-bain/whisperX · ~22.6k stars · BSD-2-Clause

Architecture: faster-whisper (CTranslate2 backend) + wav2vec2 forced alignment + pyannote diarization + VAD pre-pass. Batched inference, under 8GB VRAM for large-v2, ~70× realtime on a single GPU per their README.

Python (the canonical block every tutorial copies):

import os
import whisperx

device = "cuda"
audio_file = "audio.mp3"
batch_size = 16
# Hugging Face token (pyannote model terms must already be accepted)
HF_TOKEN = os.environ["HF_TOKEN"]

# 1-2. VAD + ASR via faster-whisper
model = whisperx.load_model("large-v2", device, compute_type="float16")
audio = whisperx.load_audio(audio_file)
result = model.transcribe(audio, batch_size=batch_size)

# 3. Forced alignment for word-level timestamps
model_a, metadata = whisperx.load_align_model(
    language_code=result["language"], device=device)
result = whisperx.align(result["segments"], model_a, metadata,
                       audio, device, return_char_alignments=False)

# 4-5. Diarization + word→speaker assignment
# (some WhisperX releases expose this class as whisperx.diarize.DiarizationPipeline)
diarize_model = whisperx.DiarizationPipeline(
    use_auth_token=HF_TOKEN, device=device)
diarize_segments = diarize_model(audio)
result = whisperx.assign_word_speakers(diarize_segments, result)

# result["segments"] now contains { start, end, text, speaker }

CLI:

whisperx audio.wav \
  --model large-v2 \
  --diarize \
  --highlight_words True \
  --hf_token <YOUR_HF_TOKEN>

Practical notes

• Requires a Hugging Face token AND accepting terms for pyannote/speaker-diarization-3.1 (or community-1)
• batch_size default 16 fits under 8GB VRAM with large-v2 — drop to 4-8 if you hit OOM
• compute_type="float16" halves memory vs float32; use int8 for older GPUs
• Pass min_speakers / max_speakers to diarize_model(audio, min_speakers=2, max_speakers=4) when you know N — significantly better than auto-detect
• CUDA 12.8 recommended; version pinning headaches between PyTorch / CTranslate2 / pyannote are well-documented
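
The OOM advice above can be automated. This is a sketch, not part of WhisperX: `transcribe_with_fallback` is a hypothetical helper that takes any callable accepting a `batch_size` keyword and retries with smaller values. It assumes an out-of-memory failure surfaces as a RuntimeError whose message contains "out of memory", which is how CUDA OOM errors typically read; other backends may differ.

```python
def transcribe_with_fallback(transcribe, batch_sizes=(16, 8, 4)):
    """Call transcribe(batch_size=n) with progressively smaller n on OOM.

    `transcribe` is any callable taking a batch_size keyword, for example
    lambda batch_size: model.transcribe(audio, batch_size=batch_size).
    """
    last_error = None
    for batch_size in batch_sizes:
        try:
            return transcribe(batch_size=batch_size)
        except RuntimeError as err:
            # Assumption: OOM shows up as a RuntimeError that mentions it.
            if "out of memory" not in str(err).lower():
                raise
            last_error = err
    raise RuntimeError("out of memory at every batch size tried") from last_error


# Stand-in for a real model so the sketch runs without a GPU.
def fake_transcribe(batch_size):
    if batch_size > 8:
        raise RuntimeError("CUDA out of memory")
    return {"batch_size_used": batch_size}


result = transcribe_with_fallback(fake_transcribe)
print(result)
```

Errors that are not memory-related are re-raised immediately, so a corrupt file does not trigger pointless retries.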

Best for: the default choice. Fast batch processing, multilingual support, most actively maintained, biggest community.

Path 2 — whisper-diarization (No Hugging Face Token)

github.com/MahmoudAshraf97/whisper-diarization · ~5.6k stars · BSD-2

Architecture: Whisper + NVIDIA NeMo (MarbleNet VAD + TitaNet speaker embeddings) + Demucs vocal separation + ctc-forced-aligner + punctuation realignment. Uses NeMo instead of pyannote — no HF token, but heavier (~10GB VRAM recommended).

CLI:

# basic
python diarize.py -a audio.wav

# with options
python diarize.py -a audio.wav \
  --whisper-model large-v3 \
  --device cuda \
  --suppress_numerals

Practical notes

• No Hugging Face token required — uses NeMo models from NVIDIA NGC instead of pyannote
• Demucs vocal separation pre-pass strips music/background → cleaner diarization on podcasts and recorded interviews
• Output: per-speaker SRT + TXT in the same directory as the input
• Heavier than WhisperX — Demucs + NeMo + Whisper all loaded at once
• Slower than WhisperX (no batched inference)

Best for: environments where you can't accept Hugging Face model gating (compliance, air-gapped, enterprise procurement); audio with significant background music or noise where the Demucs pre-pass helps.

Path 3 — Roll Your Own (Whisper + Raw pyannote)

If you want full control or are learning the pipeline, build from primitives. This is also the approach if you need to integrate with an existing custom ASR/diarization stack.

import os

import whisper
from pyannote.audio import Pipeline

# 1-3. Transcribe with Whisper (word_timestamps=True is critical).
# Word timing here comes from Whisper's own cross-attention alignment,
# not from a separate forced-alignment model.
model = whisper.load_model("large-v3")
asr_result = model.transcribe("audio.wav", word_timestamps=True)

# 4. Diarize with pyannote
# (use_auth_token is the pyannote.audio 3.x argument name;
# newer releases may expect token= instead)
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token=os.environ["HF_TOKEN"])
diarization = pipeline("audio.wav")

# Materialize the speaker turns once instead of re-iterating them per word
turns = [(turn.start, turn.end, speaker)
         for turn, _, speaker in diarization.itertracks(yield_label=True)]

# 5. Align: map each word's midpoint timestamp to its speaker turn
for segment in asr_result["segments"]:
    for word in segment.get("words", []):
        midpoint = (word["start"] + word["end"]) / 2
        word["speaker"] = next(
            (spk for start, end, spk in turns if start <= midpoint <= end),
            "UNKNOWN")

# Now each word has a speaker label
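
Word-level labels are usually collapsed into speaker turns before display. A minimal sketch, assuming words shaped like the ones labeled above (dicts with "word", "start", "end", and "speaker" keys); `group_words_into_turns` is a hypothetical helper name.

```python
def group_words_into_turns(words):
    """Merge consecutive words with the same speaker into one turn."""
    turns = []
    for word in words:
        text = word["word"].strip()
        if turns and turns[-1]["speaker"] == word["speaker"]:
            turns[-1]["text"] += " " + text
            turns[-1]["end"] = word["end"]
        else:
            turns.append({
                "speaker": word["speaker"],
                "start": word["start"],
                "end": word["end"],
                "text": text,
            })
    return turns


# Hypothetical labeled words (Whisper's word strings carry a leading space).
words = [
    {"word": " Hello", "start": 0.0, "end": 0.4, "speaker": "SPEAKER_00"},
    {"word": " there.", "start": 0.4, "end": 0.9, "speaker": "SPEAKER_00"},
    {"word": " Hi.", "start": 1.2, "end": 1.5, "speaker": "SPEAKER_01"},
]
for turn in group_words_into_turns(words):
    print(f'[{turn["start"]:.1f}-{turn["end"]:.1f}] {turn["speaker"]}: {turn["text"]}')
```

To use it on a full Whisper result, flatten the per-segment word lists into one list first.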

A good reference for a clean class-based version is the Scalastic three-class architecture (separate WhisperAudioTranscriber, PyannoteDiarizer, SpeakerAligner classes) — useful if you're building this into an existing codebase.

Best for: learning, custom pipelines, integrating diarization into an existing ASR system, or when you want to swap individual components (e.g., a custom alignment model).

Honest Comparison — WhisperX vs whisper-diarization vs Raw pyannote

Search results for "whisper diarization" surface both repos near the top, but side-by-side comparisons are hard to find. Here's the breakdown.

Feature	WhisperX	whisper-diarization	Raw pyannote
Stars	~22.6k	~5.6k	N/A (library)
ASR backend	faster-whisper	Whisper	Your choice
Diarization	pyannote	NeMo TitaNet	pyannote
Vocal separation	No	Demucs	DIY
HF token required	Yes	No	Yes (for pyannote)
VRAM (typical)	<8 GB	~10 GB	Varies
Speed (per their docs)	~70× realtime (batched)	Slower than WhisperX	Slowest (no batching)
Setup difficulty	Easy	Easy	Medium
Best for	Default choice	No HF gating / noisy audio	Learning / custom pipelines

OpenAI's gpt-4o-transcribe-diarize (2026)

The biggest recent change in the diarization landscape: OpenAI shipped a new model in 2026, gpt-4o-transcribe-diarize, that bundles ASR + diarization natively. Many existing tutorials (and older blog posts ranking on this query) don't mention it.

• Native diarization via /v1/audio/transcriptions with diarized_json response format
• Supports up to 4 known_speaker_references[] (2-10s audio clips) to map speaker IDs to known names
• Reported around $0.006/min at launch (verify on platform.openai.com)
• NOT available on the Realtime API yet; /v1/audio/transcriptions only

When to use it: simplest integration if you're already on OpenAI APIs, you don't need custom vocabulary or on-prem deployment, and basic speaker labeling is enough. When it's not enough: if you need consistent speaker identity across files (the "same person in episode 1 and episode 5" problem), custom-trained ASR, or features beyond diarization (summaries, meeting bot, multi-format export).
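
The reference-clip limits listed above (at most 4 clips, each 2-10 s) are easy to check before uploading. This sketch only validates and assembles request fields; it makes no network call. The `known_speaker_references` name comes from the text above, while `known_speaker_names`, the pairing of names with clips, and the helper itself are assumptions for illustration, so check the current API reference before relying on them.

```python
MAX_KNOWN_SPEAKERS = 4
MIN_CLIP_SECONDS = 2.0
MAX_CLIP_SECONDS = 10.0


def build_known_speaker_fields(clips):
    """Validate reference clips and return request fields.

    `clips` is a list of (name, reference, duration_seconds) tuples, where
    `reference` is whatever encoded audio the API expects (left opaque here).
    """
    if len(clips) > MAX_KNOWN_SPEAKERS:
        raise ValueError(
            f"at most {MAX_KNOWN_SPEAKERS} known speakers, got {len(clips)}")
    for name, _, duration in clips:
        if not MIN_CLIP_SECONDS <= duration <= MAX_CLIP_SECONDS:
            raise ValueError(f"clip for {name!r} is {duration}s; expected 2-10s")
    return {
        # Hypothetical field name, not confirmed by the text above.
        "known_speaker_names": [name for name, _, _ in clips],
        "known_speaker_references": [ref for _, ref, _ in clips],
    }


fields = build_known_speaker_fields([
    ("host", "<encoded host clip>", 6.5),
    ("guest", "<encoded guest clip>", 4.0),
])
print(fields["known_speaker_names"])
```

Failing fast on a clip that is too short or too long is cheaper than discovering it from an API error after the upload.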

Pipeline Failure Modes — What Actually Breaks

The READMEs admit these in one-line caveats; in practice they show up constantly. Knowing them up front saves debugging time.

Overlapping speech

Whisper transcribes only the dominant voice. pyannote can flag overlap regions, but the word-to-speaker assignment step can only pick one speaker per word. Both WhisperX and whisper-diarization explicitly admit this in their READMEs. For panel discussions or phone calls with crosstalk, expect noticeably lower accuracy.
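
One way to make this limitation visible rather than silent is to flag words that fall inside overlap regions so a reviewer can check them. A sketch in plain Python, assuming speaker turns are available as (start, end, speaker) tuples; the helper names are hypothetical, and this is not something WhisperX or whisper-diarization does for you.

```python
def overlap_regions(turns):
    """Return (start, end) intervals where turns of two different speakers overlap."""
    regions = []
    ordered = sorted(turns)
    for i, (start_a, end_a, spk_a) in enumerate(ordered):
        for start_b, end_b, spk_b in ordered[i + 1:]:
            if start_b >= end_a:
                break  # sorted by start, so later turns cannot overlap turn a
            if spk_a != spk_b:
                regions.append((start_b, min(end_a, end_b)))
    return regions


def flag_overlapped_words(words, turns):
    """Mark each word whose midpoint lies inside an overlap region."""
    regions = overlap_regions(turns)
    for word in words:
        mid = (word["start"] + word["end"]) / 2
        word["in_overlap"] = any(start <= mid <= end for start, end in regions)
    return words


# Hypothetical turns with 0.5 s of crosstalk between 3.5 s and 4.0 s.
turns = [(0.0, 4.0, "SPEAKER_00"), (3.5, 7.0, "SPEAKER_01")]
words = [
    {"word": "okay", "start": 1.0, "end": 1.4},
    {"word": "but", "start": 3.6, "end": 3.8},
]
flag_overlapped_words(words, turns)
print([(w["word"], w["in_overlap"]) for w in words])
```

Flagged words still carry a single speaker label; the flag only tells you which labels deserve less trust.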

Short backchannels (<1 second)

Speaker embeddings are unreliable on under one second of audio. "Yeah", "mhm", "right" tend to get merged into the surrounding speaker. Usually fine — except for therapy, coaching, or research transcripts where listener affirmations matter analytically.

Speaker switches mid-segment

Whisper sometimes emits a long segment that spans two speakers. After forced alignment, the word-level timestamps let you split — but the split usually lands at the nearest clause boundary, not the actual speaker boundary. Result: one or two words attributed to the wrong speaker at every turn.

Words without alignable tokens

Numbers like "2014." or currency like "£13.60" may have no entry in the wav2vec2 alignment dictionary and get dropped from word timestamps entirely. Common in finance, scientific, or technical content. Workarounds: pre-process numbers into spelled-out form, or use a larger alignment model.
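
A third option, not taken from either README, is to keep such words and estimate their timing from aligned neighbours so the assignment step can still label them. A sketch, assuming aligned words arrive as dicts in which unalignable entries simply lack "start" and "end"; `fill_missing_timestamps` is a hypothetical helper.

```python
def fill_missing_timestamps(words):
    """Give untimed words the gap between their nearest timed neighbours.

    Consecutive untimed words all receive the same gap. Words with no timed
    neighbour on either side are left untouched.
    """
    filled = [dict(w) for w in words]
    for i, word in enumerate(filled):
        if "start" in word and "end" in word:
            continue
        # Nearest originally-timed word before and after position i, if any.
        prev_end = next(
            (w["end"] for w in reversed(words[:i]) if "end" in w), None)
        next_start = next(
            (w["start"] for w in words[i + 1:] if "start" in w), None)
        if prev_end is None and next_start is None:
            continue
        word["start"] = prev_end if prev_end is not None else next_start
        word["end"] = next_start if next_start is not None else prev_end
        word["estimated"] = True  # lets downstream code treat these with care
    return filled


# Hypothetical aligner output where "2014." received no timestamps.
words = [
    {"word": "in", "start": 1.0, "end": 1.2},
    {"word": "2014."},
    {"word": "we", "start": 1.9, "end": 2.0},
]
print(fill_missing_timestamps(words)[1])
```

The estimate is only as good as the neighbouring alignment, so the `estimated` flag is worth carrying through to any review tooling.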

Similar voices or distant mics

Embedding clustering merges similar speakers or splits one speaker into multiple clusters. Two co-hosts with similar voice pitch on the same mic, a meeting recorded from a single laptop in the middle of a table — both produce predictable errors. Fix: provide min_speakers / max_speakers when you know N, or use higher-quality close-mic audio per speaker.

Speaker re-identification across files

pyannote and NeMo both treat each file independently. SPEAKER_00 in file 1 is not necessarily SPEAKER_00 in file 2 — they're just cluster IDs. For podcast series, interview archives, or recurring meetings, you need additional speaker recognition (embedding extraction + cosine similarity against a reference set). No popular open-source pipeline does this end-to-end out of the box.
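
The matching step described above can be sketched in a few lines. This assumes you already have one embedding vector per diarized speaker and per known person (for example from a speaker-embedding model); the vectors, names, and the 0.7 threshold below are made up for illustration and would need tuning on real data.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def match_speakers(file_embeddings, reference_embeddings, threshold=0.7):
    """Map per-file cluster IDs to known names when similarity clears threshold."""
    mapping = {}
    for cluster_id, emb in file_embeddings.items():
        best_name, best_score = None, threshold
        for name, ref in reference_embeddings.items():
            score = cosine_similarity(emb, ref)
            if score >= best_score:
                best_name, best_score = name, score
        # Keep the anonymous ID when no reference is close enough.
        mapping[cluster_id] = best_name or cluster_id
    return mapping


# Toy 3-dimensional embeddings; real ones have hundreds of dimensions.
reference = {"alice": [0.9, 0.1, 0.0], "bob": [0.0, 0.2, 0.95]}
episode = {
    "SPEAKER_00": [0.1, 0.1, 0.9],  # close to bob
    "SPEAKER_01": [0.5, 0.8, 0.1],  # close to nobody
}
print(match_speakers(episode, reference))
```

Note that this greedy version can map two clusters to the same name; a production version would need to resolve such conflicts.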

The GPT-4o Post-Processing Hack

From the OpenAI community thread on Whisper diarization — undocumented in any major tutorial: after running WhisperX or similar, send the rough transcript to GPT-4o and ask it to relabel speakers based on conversational cues. Works surprisingly well for content with clear conversational structure (interviews, podcasts where the host asks questions, customer-support calls).

prompt = f"""Below is a transcript with speakers labeled
SPEAKER_00, SPEAKER_01, etc. Based on conversational context
(questions, responses, name mentions, role indicators), relabel
the speakers with their likely roles or names if mentioned.

If you cannot confidently identify a speaker, keep the generic
SPEAKER_XX label rather than guessing.

Return the transcript with corrected speaker labels.

Transcript:
{transcript}
"""

# Then send to GPT-4o via the chat completions API
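
The prompt needs a `transcript` string. A minimal sketch of building one from diarized segments, assuming each segment is a dict with "speaker" and "text" keys as in the WhisperX result shown earlier; `format_transcript` is a hypothetical helper.

```python
def format_transcript(segments):
    """Render segments as 'SPEAKER_XX: text' lines, merging same-speaker runs."""
    lines = []
    last_speaker = None
    for seg in segments:
        speaker = seg.get("speaker", "UNKNOWN")
        text = seg["text"].strip()
        if not text:
            continue
        if speaker == last_speaker:
            lines[-1] += " " + text
        else:
            lines.append(f"{speaker}: {text}")
            last_speaker = speaker
    return "\n".join(lines)


# Hypothetical diarized segments.
segments = [
    {"speaker": "SPEAKER_00", "text": " Welcome to the show."},
    {"speaker": "SPEAKER_00", "text": " Who are you?"},
    {"speaker": "SPEAKER_01", "text": " I'm Dana."},
]
transcript = format_transcript(segments)
print(transcript)
```

Merging consecutive segments from the same speaker keeps the prompt shorter and gives the model whole turns to reason about.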

When it works: structured conversations (host + guest, interviewer + interviewee, agent + customer) where one speaker introduces themselves or asks distinctive questions.

When it doesn't: technical panel discussions, group meetings with multiple peers, audio without role-distinctive language. GPT-4o will sometimes hallucinate confident-sounding names — explicitly instruct it not to guess.

Self-Host Cost Math — Realistic Numbers

Most tutorials hand-wave on cost. Here are concrete ballparks (verify current spot pricing before budgeting):

GPU	Spot price	WhisperX throughput	Effective cost / audio-hour
T4	~$0.30-0.50/hr	~20-30× realtime	~$0.01-0.025
A10G	~$0.50-1.00/hr	~50-70× realtime	~$0.007-0.02
RTX 4090 (Vast.ai / RunPod)	~$0.40-0.70/hr	~60-80× realtime	~$0.005-0.012

So at scale, self-hosting can land around $0.01-0.03 per audio-hour in raw GPU costs. That sounds cheap — and it is, if you only count GPU time.
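
The effective-cost column is spot price divided by throughput: a GPU running at N× realtime needs 1/N GPU-hours per audio-hour. A small sketch using the ballpark inputs from the table; it ignores idle time, storage, and egress, so real bills will be higher.

```python
def cost_per_audio_hour(spot_price_per_hour, realtime_factor):
    """GPU dollars needed to process one hour of audio."""
    return spot_price_per_hour / realtime_factor


def cost_range(price_low, price_high, speed_low, speed_high):
    """Best case pairs the low price with the high speed; worst case the reverse."""
    return (cost_per_audio_hour(price_low, speed_high),
            cost_per_audio_hour(price_high, speed_low))


# (price_low, price_high, speed_low, speed_high) per GPU, from the table above.
gpus = {
    "T4": (0.30, 0.50, 20, 30),
    "A10G": (0.50, 1.00, 50, 70),
    "RTX 4090": (0.40, 0.70, 60, 80),
}
for name, args in gpus.items():
    low, high = cost_range(*args)
    print(f"{name}: ${low:.3f}-${high:.3f} per audio-hour")
```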

What the math hides:

• Initial integration: 1-2 weeks of engineering for a working production pipeline (queueing, retries, monitoring, GPU pool management)
• Ongoing maintenance: dependency pinning between PyTorch / CUDA / CTranslate2 / pyannote / wav2vec2 is genuinely fragile; pyannote ships a new pipeline roughly yearly
• Hugging Face token management for production (token rotation, gated model acceptance for every new pipeline version)
• GPU pool sizing for variable load; cold starts of 30-90s are real
• Engineering time to handle the failure modes above (overlap detection, speaker count hints, retry logic)

Honest break-even ranges vs managed APIs at $0.10-0.30/audio-hour:

• Under ~500 audio-hr/month: managed wins on dev time + reliability
• 500-5,000 audio-hr/month: depends on whether you have ML engineering capacity
• 5,000+ audio-hr/month: self-host wins economically if you have the engineering team
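
Where the break-even lands depends mostly on how you price engineering time. A rough sketch under loudly stated assumptions: the $1,000/month fixed engineering figure is a placeholder, not a number from this page, and the $0.02 GPU rate is simply a point inside the raw GPU range quoted above. Substitute your own values.

```python
def break_even_hours(managed_price, gpu_cost_per_hour, fixed_engineering):
    """Monthly audio-hours above which self-hosting is cheaper than managed.

    Self-host cost  = hours * gpu_cost_per_hour + fixed_engineering
    Managed cost    = hours * managed_price
    """
    saving_per_hour = managed_price - gpu_cost_per_hour
    if saving_per_hour <= 0:
        return float("inf")  # managed is never more expensive per hour
    return fixed_engineering / saving_per_hour


# Placeholder inputs (assumptions, not measured figures).
GPU_COST = 0.02           # $/audio-hour of raw GPU time
FIXED_ENGINEERING = 1000  # $/month of maintenance time; replace with your own

for managed_price in (0.10, 0.30):
    hours = break_even_hours(managed_price, GPU_COST, FIXED_ENGINEERING)
    print(f"managed at ${managed_price:.2f}/hr: "
          f"break-even near {hours:,.0f} audio-hr/month")
```

The result is very sensitive to the fixed-cost input, which is why the ranges above are bands rather than a single threshold.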

Production Deployment Realities

What every laptop tutorial skips. If you're going past hobby usage:

• GPU cold starts — loading large-v3 + alignment + pyannote takes 30-90 seconds per cold start. Use warm pools or persistent workers. Don't spin up a fresh container per request.
• Queueing — for multi-tenant or batch workloads, queue audio files. Don't try to run multiple diarizations in parallel on one GPU beyond batch_size limits — you'll OOM.
• Memory management — keep models loaded across requests; loading per-request adds the cold-start hit every time. Watch out for memory growth in long-running workers.
• Retries — pyannote pipelines occasionally hang. Implement timeouts + exponential backoff. The Hugging Face Forum has an active thread on this.
• Dependency pinning — pin PyTorch, CUDA, cuDNN, CTranslate2, pyannote, faster-whisper versions in a Docker image. Update one at a time. Skip a major pyannote release and you'll spend a day re-pinning.
• Speaker count hints — if you know N speakers, pass it. Auto-detect works, but a constrained speaker count is usually more accurate.
• Multi-region / latency — if you serve users globally, plan GPU pools per region. A diarization job that takes 5 minutes round-trip from Sydney to a US-East GPU pool feels broken even when it's technically working.
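
The retry advice above can be sketched as a small wrapper. It assumes the job enforces its own timeout (for example by running diarization in a subprocess and raising TimeoutError when the limit is exceeded), since a hung in-process call cannot be reliably interrupted from Python. `retry_with_backoff` is a hypothetical helper, and the `sleep` parameter exists only so the delay can be replaced in tests.

```python
import time


def retry_with_backoff(job, attempts=4, base_delay=1.0, sleep=time.sleep):
    """Run job(); on TimeoutError wait base_delay * 2**n seconds and try again."""
    for attempt in range(attempts):
        try:
            return job()
        except TimeoutError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the failure to the queue
            sleep(base_delay * 2 ** attempt)


# Stand-in job that times out twice before succeeding.
calls = {"n": 0}
delays = []


def flaky_job():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("diarization timed out")
    return "done"


result = retry_with_backoff(flaky_job, sleep=delays.append)
print(result, delays)
```

In a queue-based deployment the final raise would typically send the file to a dead-letter queue rather than crash the worker.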

Self-Host vs Use a Managed Service — Honest Framework

Self-host when…

• You have ML engineering capacity (1+ engineer fluent in PyTorch)
• >1,000 audio-hours/month sustained
• Data residency requirements (on-prem, EU-only, air-gapped)
• You need custom models or fine-tuning
• You're comfortable maintaining Python + CUDA dependencies

Managed wins when…

• <500 audio-hours/month
• No dedicated ML engineer on the team
• You need diarization PLUS other features (summaries, meeting bot, multi-format export)
• Time-to-market matters more than per-hour cost
• You'd rather not manage HF tokens, CUDA, and GPU pools

Managed Services for Whisper + Diarization

Pricing changes — verify each before budgeting. As of mid-2026:

• OpenAI gpt-4o-transcribe-diarize — native diarization on the OpenAI audio API. ~$0.006/min (about $0.36/audio-hour). Simplest if you're already on OpenAI; basic feature set.
• AssemblyAI Universal-2 — strong all-around. Diarization is a small add-on fee. Good docs.
• Deepgram Nova-3 — fast, has real-time streaming with diarization. Base ~$0.46/audio-hour PAYG plus diarization add-on.
• Rev AI Turbo — cheapest mainstream API (~$0.10/audio-hour). Good for batch.
• Speechmatics — strong accuracy reputation, on-prem option available, enterprise pricing.
• VexaScribe — Whisper-based service with diarization, AI summaries, and meeting bot bundled. Best fit if you want speaker-labeled transcripts as part of a full transcription product rather than just a diarization API. From $0.30/audio-hour on the $2/month Starter plan.

Where VexaScribe Fits

If you're reading this because you want speaker-labeled transcripts for your own audio (not because you're building a transcription product yourself), VexaScribe is one option that handles the whole pipeline for you.

We're a Whisper-based transcription service that runs the same pipeline this page describes — VAD, Whisper Large-v3 transcription, forced alignment, pyannote-based diarization (up to 50 speakers) — and adds: AI-generated meeting summaries (action items, decisions, blockers, open questions), a Zoom / Google Meet / Microsoft Teams meeting bot for live calls, word-level SRT and VTT export with proper cue splitting (so subtitles import cleanly into Premiere / Final Cut / DaVinci), and 99-language support.

When VexaScribe is a fit

• You don't want to manage Python + CUDA + pyannote + HF tokens yourself
• You need diarization PLUS summaries, meeting bot, or multi-format export — not just a raw diarization API
• Under ~2,000 audio-hours/month (above this, self-host math becomes attractive)
• You want a product (dashboard + editor + exports), not an API endpoint

When VexaScribe is NOT a fit

• You're building a developer-facing transcription product — look at Deepgram, AssemblyAI, or OpenAI
• You need on-prem deployment for compliance
• You're processing >5,000 audio-hours/month and have ML engineers — self-host wins
• You only need raw Whisper transcription with no speaker labels — OpenAI's whisper-1 API is cheaper

30 minutes free per month if you want to try the output quality on your own files — no credit card required.

Whisper Diarization FAQ

Does the OpenAI Whisper API include speaker diarization?

The original whisper-1 hosted API does NOT include speaker diarization — it returns text and segment timestamps only. In 2026 OpenAI released gpt-4o-transcribe-diarize, a separate model exposed through the same /v1/audio/transcriptions endpoint, which DOES return speaker-labeled output via a diarized_json response format. It supports up to 4 known_speaker_references[] (2-10s audio clips) to map anonymous speaker IDs to known names. Verify current pricing on platform.openai.com — it was reported around $0.006/min at launch.

WhisperX vs whisper-diarization — which should I use?

WhisperX (github.com/m-bain/whisperX, ~22.6k stars) is the default choice: faster-whisper backend, pyannote diarization, batched inference at ~70× realtime on a single GPU per their README, under 8GB VRAM for large-v2. It requires a Hugging Face token and accepting the pyannote model terms. whisper-diarization (github.com/MahmoudAshraf97/whisper-diarization, ~5.6k stars) uses NVIDIA NeMo models (MarbleNet VAD + TitaNet embeddings) plus Demucs vocal separation. It needs no Hugging Face token and is heavier (~10GB VRAM), but does better on audio with background music. Pick WhisperX if you want speed and the most active project. Pick whisper-diarization if you can't use HF-gated models or your audio has music/noise.

Do I need a Hugging Face token for Whisper diarization?

Yes, if you use pyannote.audio (which WhisperX depends on). You need to (1) create a free HF account, (2) accept the gated terms for pyannote/speaker-diarization-3.1 (or community-1), and (3) generate a token at hf.co/settings/tokens. This is the single biggest packaging complaint in the WhisperX issue tracker. To avoid it entirely, use MahmoudAshraf97/whisper-diarization (NeMo-based, no HF token), or use the raw NVIDIA NeMo pipeline directly.

Can I run Whisper diarization locally without a GPU?

Technically yes: set device="cpu". Expect processing to take hours per hour of audio rather than minutes, which is not practical for production. A consumer GPU (RTX 3060/4070, 8GB+) is the realistic minimum for batch processing. For lightweight CPU work, whisper.cpp's tinydiarize (-tdrz flag) provides turn segmentation only (it detects "speaker changed" but does not cluster the same person across the file), which is much cheaper but a different capability.

How accurate is pyannote speaker diarization?

Per pyannote's own benchmark table (pyannote/pyannote-audio README), community-1 reports a diarization error rate (DER) of about 17.0% on AMI (meeting audio), 20.2% on DIHARD 3 (hard diverse audio), and 11.2% on VoxConverse (videos). Their commercial precision-2 model is reported at 25-40% lower DER on the same benchmarks. Important caveat: DER numbers vary substantially with dataset version, collar settings, and overlap handling, so different sources publish different numbers for what looks like the same test. Always cite the specific benchmark.
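
For reference, DER is the sum of three error durations divided by the total reference speech time. A minimal sketch with made-up durations (not taken from any benchmark):

```python
def diarization_error_rate(missed, false_alarm, confusion, total_speech):
    """DER = (missed speech + false alarm + speaker confusion) / reference speech.

    All arguments are durations in seconds.
    """
    if total_speech <= 0:
        raise ValueError("total_speech must be positive")
    return (missed + false_alarm + confusion) / total_speech


# Made-up durations for a recording with 540 s of reference speech.
der = diarization_error_rate(
    missed=30.0, false_alarm=12.0, confusion=48.0, total_speech=540.0)
print(f"DER = {der:.1%}")
```

A forgiveness collar excludes a short window around each reference boundary from scoring, and some evaluations skip overlapped regions entirely; both shrink the numerator and denominator, which is why one system can report different DER values on the same audio.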

Why are speaker labels generic (SPEAKER_00) instead of real names?

Diarization solves "who spoke when" without knowing who anyone is. It clusters voice embeddings and assigns anonymous IDs (SPEAKER_00, SPEAKER_01). To get real names you need an additional step: either (a) provide reference audio for each known speaker and use speaker recognition to map clusters to identities, (b) post-process with an LLM that infers names from conversational cues (the GPT-4o hack covered above), or (c) use OpenAI's gpt-4o-transcribe-diarize with known_speaker_references[]. None of the popular open-source pipelines do speaker recognition end-to-end by default.

Can Whisper detect overlapping speakers?

Whisper itself transcribes only the dominant voice when speakers overlap — it does not produce multi-channel output. pyannote.audio can flag overlap regions, but the word-to-speaker assignment step that follows can only pick one speaker per word. Both the WhisperX and whisper-diarization READMEs explicitly state that overlapping speech is not handled well. For audio with significant overlap (panel discussions, phone calls with crosstalk), expect noticeably lower accuracy than for clean turn-taking conversation.

How do I make Speaker A in file 1 = Speaker A in file 2?

Out of the box with WhisperX or whisper-diarization, you can't — each file is diarized independently and speaker labels are anonymous per file. To get consistent speaker identity across files (e.g., for a podcast series with recurring guests), you need to: (1) extract a speaker embedding for each known person from reference audio, (2) compare your file's diarized speaker embeddings to the reference set using cosine similarity, (3) relabel matching speakers. This is non-trivial to build reliably. Managed services like AssemblyAI's speaker recognition and VexaScribe's account-level speaker library handle this if you don't want to build it yourself.

How Accurate Is Whisper?

Open ASR Leaderboard WER data, real-world accuracy by language, and benchmark caveats.

Whisper Transcription

OpenAI Whisper without the setup — VexaScribe runs the pipeline for you.

Speaker Labels & Diarization

How automatic speaker labels work end-to-end in a transcription product.

What Is ASR?

Plain-English explanation of automatic speech recognition — the technology Whisper is built on.
