What Is a Transcript? Definition, Types, Examples & How Transcripts Are Made (2026)

A transcript is a written record of spoken content, such as an audio or video recording or a court proceeding; the same word also names a student's official academic record. This guide covers what a modern audio/video transcript actually contains, how it differs from subtitles and captions, and how transcripts are made in 2026.

TL;DR

A transcript is a written record of spoken content — a speech, meeting, interview, podcast, court proceeding, or lecture. In 2026 the term most commonly refers to an audio/video transcript: structured text with optional timestamps and speaker labels, usually generated by AI in minutes. A separate meaning — an academic transcript — is a student's official grade record. This page covers audio/video transcripts in depth, with a clear disambiguation between transcripts, subtitles, captions, notes, minutes, and summaries.

“Transcript” — Three Distinct Meanings

The word transcript is used in three related senses in English. Merriam-Webster defines a transcript broadly as “a written, printed, or typed copy,” but in practice the meaning depends on context. If someone asks for “the transcript” at a podcast studio, they mean sense 1; at a university registrar, sense 2; at a courthouse, sense 3. All three are legitimate uses of the same word.

Sense 1

Audio/video transcript

A text record of spoken content from an audio or video recording — a podcast, a meeting, an interview, a lecture, a YouTube video. This is the sense this page focuses on, and by far the most common in modern usage.

Sense 2

Academic transcript

An official record of a student's grades and coursework from a school, college, or university. Commonly requested for graduate school admissions, professional licensing, and some job applications. Issued, and often sealed, by the institution's registrar.

Sense 3

Legal transcript

A verbatim record of court proceedings, depositions, or legal interviews, typically prepared by a certified court reporter using stenography or voice writing. Once certified, it serves as the official record of the proceeding.

What an Audio/Video Transcript Actually Contains

A modern audio/video transcript is built around three core components: text (what was said), timestamps (when it was said), and speakers (who said it). Which of these appear depends on the transcript's purpose. Here are three realistic examples of the same podcast excerpt, formatted three different ways.

Example 1 — Plain text transcript (raw, no formatting)

Welcome to the podcast. Today we're talking about
productivity tips. Thanks for having me on. I've been
working remotely for five years. Great to have you.
What's your number one tip?

Simplest form. Good for reading and copy-pasting into a document. No speaker attribution, no timing.

Example 2 — Timestamped transcript

[00:00:00] Welcome to the podcast. Today we're
talking about productivity tips.
[00:00:08] Thanks for having me on. I've been
working remotely for five years.
[00:00:15] Great to have you. What's your
number one tip?

Adds a jump-to point at the start of each segment. Useful for podcasts, video editing, and citing quotes with exact timing.

Example 3 — Speaker-labeled and timestamped

Sarah (00:00:00): Welcome to the podcast. Today
we're talking about productivity tips.
Guest (00:00:08): Thanks for having me on. I've
been working remotely for five years.
Sarah (00:00:15): Great to have you. What's your
number one tip?

The standard for interviews, meetings, and multi-speaker content. Requires speaker diarization (splitting the audio by who spoke when), plus a step that maps each anonymous speaker to a name.
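Examples 2 and 3 are two renderings of the same underlying data: segments with a start time, some text, and (after diarization) a speaker. The Python sketch below assumes a hypothetical `Segment` shape and anonymous labels such as `SPEAKER_00` (an assumed format; diarization tools vary). The mapping from labels to names like Sarah and Guest is supplied by hand here, which mirrors the separate naming step described above.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    speaker: str  # anonymous diarization label, e.g. "SPEAKER_00" (assumed format)
    start: float  # start time in seconds
    text: str


def hms(seconds: float) -> str:
    """Format seconds as HH:MM:SS, truncating any fraction of a second."""
    total = int(seconds)
    return f"{total // 3600:02d}:{total % 3600 // 60:02d}:{total % 60:02d}"


def timestamped(segments: list[Segment]) -> str:
    """Example 2 style: one '[HH:MM:SS] text' line per segment, no speakers."""
    return "\n".join(f"[{hms(s.start)}] {s.text}" for s in segments)


def speaker_labeled(segments: list[Segment], names: dict[str, str]) -> str:
    """Example 3 style: merge consecutive segments from the same speaker into
    one turn, then print 'Name (HH:MM:SS): text' using the turn's start time."""
    turns: list[Segment] = []
    for seg in segments:
        if turns and turns[-1].speaker == seg.speaker:
            turns[-1].text += " " + seg.text  # same speaker kept talking
        else:
            # Copy the segment so merging never mutates the caller's data.
            turns.append(Segment(seg.speaker, seg.start, seg.text))
    return "\n".join(
        f"{names.get(t.speaker, t.speaker)} ({hms(t.start)}): {t.text}" for t in turns
    )


segments = [
    Segment("SPEAKER_00", 0.0, "Welcome to the podcast."),
    Segment("SPEAKER_00", 3.1, "Today we're talking about productivity tips."),
    Segment("SPEAKER_01", 8.4, "Thanks for having me on."),
    Segment("SPEAKER_01", 11.0, "I've been working remotely for five years."),
    Segment("SPEAKER_00", 15.2, "Great to have you. What's your number one tip?"),
]
print(timestamped(segments))
print()
print(speaker_labeled(segments, {"SPEAKER_00": "Sarah", "SPEAKER_01": "Guest"}))
```

Unknown labels fall back to the raw diarization label, so a missing name shows up as `SPEAKER_02` rather than disappearing.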

Beyond the basics. AI-enhanced transcripts increasingly ship with additional fields: chapter markers with auto-generated titles, sentiment analysis per speaker, extracted action items, keyword tags, and per-word confidence scores. These aren't part of the classical definition of a transcript, but modern tools such as VexaScribe, Otter, and Fireflies commonly offer some or all of them.
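Per-word confidence scores are most useful as a review queue: instead of proofreading everything, an editor checks the words the model was least sure about. A minimal Python sketch, assuming a hypothetical JSON export with `words`, `word`, `start`, and `confidence` fields (real tools each use their own schema) and an arbitrary 0.6 cutoff:

```python
import json

# Hypothetical export: these field names are illustrative, not any vendor's schema.
transcript_json = """
{"words": [
  {"word": "Welcome", "start": 0.0, "confidence": 0.98},
  {"word": "to",      "start": 0.4, "confidence": 0.97},
  {"word": "the",     "start": 0.5, "confidence": 0.95},
  {"word": "podcats", "start": 0.6, "confidence": 0.41}
]}
"""


def low_confidence_words(raw: str, threshold: float = 0.6) -> list[tuple[float, str, float]]:
    """Return (start_seconds, word, confidence) for every word scored below threshold."""
    data = json.loads(raw)
    return [
        (w["start"], w["word"], w["confidence"])
        for w in data["words"]
        if w["confidence"] < threshold
    ]


# Print a review list an editor can jump through by timestamp.
for start, word, conf in low_confidence_words(transcript_json):
    print(f"{start:6.2f}s  {word!r}  confidence={conf:.2f}")
```

The right threshold depends on the model and audio; a lower cutoff means fewer flags but more missed errors.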

Transcript vs Subtitle vs Caption vs Notes vs Minutes vs Summary

These six terms are frequently confused — and using the wrong one can mean shipping the wrong deliverable. A podcaster asking for “captions” probably wants a full-length transcript; a lawyer asking for “minutes” probably means a verbatim transcript of the hearing. Here is the practical difference between all six, followed by a walk-through of when each fits.

Type | What It Is | Where It's Used | Timestamps | Length
Transcript | Full text of spoken content | Podcasts, meetings, interviews, lectures | Optional | Long (matches source)
Subtitle | Text overlay, typically translated dialogue | Foreign-language film and video | Required (sync to video) | Short lines (32–42 chars)
Caption | Text overlay for same-language dialogue plus sound effects | Accessibility for Deaf and hard-of-hearing viewers | Required | Short lines (32–42 chars)
Meeting Notes | Selective bullets of what was discussed | Team standups, internal meetings | No | Short, curated
Meeting Minutes | Formal record of decisions and actions | Board meetings, corporate governance | No | Structured, moderate
Summary | Condensed key points | Executive briefings, digests | No | Very short

Transcript vs subtitle. A subtitle is a short, timed text overlay, traditionally a translation of dialogue for foreign-language film. Subtitles are constrained by reading speed (style guides typically cap it at roughly 15–20 characters per second) and by screen space. A transcript is the full, unconstrained text, meant to be read separately from the video. If you need a text file that a video player displays at the right moment, that's a subtitle; see our subtitle generator.
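Reading speed is simple arithmetic: characters in the cue divided by the seconds it stays on screen. A minimal Python check, assuming a 17 characters-per-second ceiling, 42-character lines, and two lines per cue (common style-guide values, not universal rules; adjust them to the guide you follow). It counts every character except line breaks:

```python
def check_cue(text: str, start: float, end: float, max_cps: float = 17.0,
              max_line_chars: int = 42, max_lines: int = 2) -> list[str]:
    """Return the problems found in one subtitle cue; an empty list means it passes."""
    duration = end - start
    if duration <= 0:
        return ["cue has zero or negative duration"]
    lines = text.split("\n")
    chars = sum(len(line) for line in lines)  # line breaks are not counted
    problems = []
    cps = chars / duration
    if cps > max_cps:
        problems.append(f"reading speed {cps:.1f} cps exceeds {max_cps:g}")
    if len(lines) > max_lines:
        problems.append(f"{len(lines)} lines exceeds the {max_lines}-line limit")
    for number, line in enumerate(lines, start=1):
        if len(line) > max_line_chars:
            problems.append(f"line {number} has {len(line)} chars (max {max_line_chars})")
    return problems


# Two lines crammed into two seconds: too fast to read, and line 2 runs long.
print(check_cue("Welcome to the podcast.\nToday we're talking about productivity tips.", 0.0, 2.0))
# A short line with two seconds on screen passes.
print(check_cue("Great to have you.", 15.0, 17.0))
```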

Transcript vs caption. Captions are same-language text overlays that also describe non-speech audio, such as [door slam], [music playing], or [laughter]. They exist for Deaf and hard-of-hearing viewers: WCAG 2.1 calls for them on prerecorded video with audio (Success Criterion 1.2.2), and laws such as the Americans with Disabilities Act require them for many types of published video. A transcript rarely includes sound effects unless it's prepared specifically for accessibility.

Transcript vs meeting notes. Notes are curated bullets of what was discussed: someone read the transcript (or listened live) and picked out what mattered. A transcript captures everything; notes are selective. If your team just wants the highlights, see audio to notes.

Transcript vs meeting minutes. Minutes are the formal, structured record of a governance meeting — attendees, agenda items, motions, votes, decisions, action items. Minutes cite the transcript rather than reproduce it. See meeting minutes generator.

Transcript vs summary. A summary is the condensed version — a paragraph or a page instead of dozens of pages. It answers “what happened” without preserving the exact words. See transcript to summary.

Verbatim vs Clean-Read vs Edited Transcripts

Even within “audio/video transcript,” there are three distinct styles. Which one you want depends entirely on what the transcript is for.

Verbatim

Every “um” and pause

Captures every filler word (“um,” “uh,” “you know”), stutter, laugh, cough, and pause. Required for legal proceedings, expert-witness testimony, qualitative research where speech patterns matter, and linguistic analysis.

Clean-Read (Intelligent Verbatim)

Fillers removed, meaning preserved

Filler words removed, minor grammar cleaned, false starts trimmed — but content and voice preserved. Standard for publishing, podcasts, journalism, and content marketing. The default output of most AI transcription tools.

Edited

Restructured for readability

Heavily reshaped: interview quotes condensed, sentences rewritten, order sometimes changed. Used in book excerpts, magazine profiles, and long-form journalism where the quote needs to read cleanly.

Rule of thumb: match the transcript style to the deliverable. A deposition needs verbatim. A podcast show-notes transcript needs clean-read. A quote in a magazine profile needs edited.
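The step from verbatim to clean-read can be roughly approximated with pattern matching. The Python sketch below is deliberately naive: the filler list is a hypothetical sample, and collapsing repeated words will also merge legitimate doubles such as “that that,” which is why real clean-read output still needs an editor's (or a model's) judgment.

```python
import re

# Hypothetical, minimal filler list; real style guides are longer and context-aware
# ("you know" is sometimes meaningful, so a word list alone will over-delete).
FILLERS = re.compile(r"\b(?:um+|uh+|er+|you know)\b,?\s*", re.IGNORECASE)
# A word immediately repeated one or more times, e.g. "I I" or "the the the".
REPEATS = re.compile(r"\b(\w+)(?:\s+\1\b)+", re.IGNORECASE)


def clean_read(verbatim: str) -> str:
    """Approximate clean-read text from a verbatim transcript line."""
    text = FILLERS.sub("", verbatim)     # drop filler words (and a trailing comma)
    text = REPEATS.sub(r"\1", text)      # collapse stutters to a single word
    text = re.sub(r"\s{2,}", " ", text)  # tidy any leftover double spaces
    return text.strip()


print(clean_read("So um I I think the the main tip is focus."))
```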

How Transcripts Are Made in 2026

Four production paths dominate today. The right one depends on how much accuracy you need, how fast you need it, and what you can spend. A decade ago the choice was mostly between a human transcriptionist and doing it yourself. Since OpenAI released Whisper in September 2022, AI has become the default for most non-legal work, and hybrid AI-plus-human review has quietly taken over the middle of the market.

AI transcription

~$0.30/hr · 5–10 min turnaround · 92–96% accurate on clean audio

Whisper, Deepgram, AssemblyAI, Otter, VexaScribe. Runs speech recognition models over the audio and returns a formatted transcript in minutes. Sweet spot for podcasts, meetings, interviews, and lectures where 95%-ish accuracy is fine. See how accurate is Whisper for the WER benchmarks behind these numbers, and AI transcription for a broader overview.

Human transcription

~$60–150/hr of audio · 24–48h turnaround · 99%+ verbatim

Services like Rev and GoTranscript. A human transcriptionist listens to the audio and types out the transcript. Best for legal, medical, and verbatim requirements where every word matters and errors carry real cost. Rev's consumer price is around $1.50/min ($90/hr of audio); GoTranscript sits in a similar range.

Hybrid (AI first pass, human review)

~$1–2/min ($60–120/hr of audio) · ~24h turnaround · 98%+ accuracy

AI produces the initial draft; a human editor corrects errors, fixes speaker labels, and normalizes punctuation. The sweet spot for content publishing: you get near-human accuracy, delivered inside a day, often for less than pure-human transcription. See AI vs human transcription for the full tradeoff.

DIY manual

Free · 4–6 hours of work per hour of audio · quality varies

Playing the audio in a media player and typing along in a text editor. Rarely worth it in 2026: even a $10/mo AI subscription usually pays for itself on the first hour of audio. Still occasionally useful for extremely short or sensitive recordings where nothing should leave your machine.
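To compare the four paths for a specific workload, scale the per-hour figures quoted above. A small Python sketch using this article's approximate prices (AI at the ~$0.30/hr usage rate rather than a flat subscription; hybrid's $1–2/min converted to $60–120/hr). Swap in real quotes before budgeting:

```python
# Approximate (low, high) USD prices per hour of audio, as quoted in this article.
PRICE_PER_AUDIO_HOUR = {
    "AI transcription": (0.30, 0.30),
    "Human transcription": (60.0, 150.0),
    "Hybrid (AI + human review)": (60.0, 120.0),  # $1-2/min x 60 min
}
# DIY costs no money but takes 4-6 hours of typing per hour of audio.
DIY_WORK_HOURS_PER_AUDIO_HOUR = (4, 6)


def estimate_costs(audio_hours: float) -> dict[str, tuple[float, float]]:
    """Scale each paid path's (low, high) price range to the workload."""
    return {
        path: (low * audio_hours, high * audio_hours)
        for path, (low, high) in PRICE_PER_AUDIO_HOUR.items()
    }


def diy_hours(audio_hours: float) -> tuple[float, float]:
    """Hours of your own time a DIY transcript would take (low, high)."""
    low, high = DIY_WORK_HOURS_PER_AUDIO_HOUR
    return low * audio_hours, high * audio_hours


hours = 20  # e.g. a season of twenty one-hour podcast episodes
for path, (low, high) in estimate_costs(hours).items():
    print(f"{path}: ${low:,.2f} to ${high:,.2f}")
low_h, high_h = diy_hours(hours)
print(f"DIY manual: $0, but about {low_h:g} to {high_h:g} hours of typing")
```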

Common Use Cases for Transcripts

Transcripts show up everywhere spoken content needs to be searched, quoted, repurposed, or archived. The seven scenarios below cover the bulk of real-world demand across professional publishing, legal, research, and enterprise workflows.

Content marketing and repurposing

One 60-minute webinar becomes five blog posts, a LinkedIn carousel, an X thread, and a YouTube description. The transcript is the raw material every other asset is cut from.

Legal proceedings

Deposition transcripts, court records, arbitration hearings, and attorney-client interviews. See legal transcription service and deposition transcription.

Academic research

Interview transcripts loaded into NVivo or ATLAS.ti for qualitative coding, thematic analysis, and grounded theory work. Also used for focus groups and ethnographic fieldwork.

Journalism

Source interviews, on-record quotes, and fact-checking. A verifiable transcript is often the difference between a defensible quote and a retraction.

Accessibility compliance

WCAG 2.1 Success Criterion 1.2.2 requires captions for prerecorded video, and SC 1.2.1 requires a transcript for prerecorded audio-only content. Public-sector and enterprise publishers treat this as a hard requirement.

SEO for audio and video

Search engines rank pages mainly on their text. Publishing a transcript alongside a YouTube video or podcast episode gives Google and Bing far more indexable text to rank, which can bring in organic traffic the video or audio alone would not.

Meeting records

A searchable archive of what was said and decided — invaluable when someone joins mid-project, when a decision gets challenged three months later, or when you need to build a decision log without babysitting notes in real time.


Frequently Asked Questions

Is a transcript the same as captions?

No. Captions are text overlays on video, synchronized with the audio, and include sound effects and non-speech cues for accessibility. A transcript is a standalone text document of what was said, usually longer and without timing constraints. Captions are typically constrained to 32-42 characters per line and 1-2 lines on screen; transcripts have no such limit.
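Fitting transcript text into those limits is mostly line wrapping. A minimal Python sketch using the standard library's `textwrap.wrap`, assuming 42-character lines and at most two lines per caption. It breaks at spaces where possible and does not assign timing, which real captioning also needs:

```python
import textwrap


def to_caption_blocks(text: str, max_chars: int = 42, max_lines: int = 2) -> list[list[str]]:
    """Wrap text into lines of at most max_chars characters (breaking at spaces
    where possible), then group the lines into on-screen blocks of max_lines."""
    lines = textwrap.wrap(text, width=max_chars)
    return [lines[i:i + max_lines] for i in range(0, len(lines), max_lines)]


text = ("Welcome to the podcast. Today we're talking about productivity tips. "
        "Thanks for having me on. I've been working remotely for five years.")
for block in to_caption_blocks(text):
    print("\n".join(block))
    print()  # blank line between caption blocks
```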

Do I need a transcript or subtitles for YouTube?

Both, because they serve different purposes. Captions display on-screen during playback and are what accessibility rules require; subtitles translate the dialogue for viewers of other languages; a transcript is a full-text version that viewers and search engines can read separately. For accessibility, SEO, and repurposing the content into blog posts, add both. If you upload a transcript file, YouTube can automatically sync it to the video to produce timed captions.

How accurate are AI transcripts?

Modern AI (OpenAI Whisper Large-v3, Deepgram Nova-3, AssemblyAI Universal-2) hits roughly 95-97% accuracy on clean single-speaker English (around 3-5% Word Error Rate). Accuracy drops to 82-92% for meeting audio with multiple speakers and background noise, and lower still for strong accents or technical vocabulary. For legal or medical work requiring 99%+ accuracy, human transcription or a human-review pass is still standard.
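Accuracy in these figures is roughly 1 minus the Word Error Rate. WER counts the substitutions, deletions, and insertions needed to turn the AI output into a reference transcript, divided by the number of reference words. A minimal Python sketch using word-level edit distance; it only lowercases and splits on whitespace, whereas published benchmarks apply fuller text normalization (punctuation, numbers) before scoring:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """(substitutions + deletions + insertions) / number of reference words,
    found as the word-level Levenshtein distance between the two texts."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")
    # prev[j] = edits needed to turn the reference prefix processed so far
    # into the first j hypothesis words (starts as j insertions).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)  # i deletions against an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            curr[j] = min(
                prev[j] + 1,                           # deletion: reference word dropped
                curr[j - 1] + 1,                       # insertion: extra word in output
                prev[j - 1] + (ref_word != hyp_word),  # substitution, or free match
            )
        prev = curr
    return prev[-1] / len(ref)


reference = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jumped over a lazy dog"
wer = word_error_rate(reference, hypothesis)  # two substitutions over nine words
print(f"WER = {wer:.3f}, accuracy ~ {1 - wer:.1%}")
```

Because insertions count too, WER can exceed 1.0 on very noisy output, so “accuracy” stops being meaningful at that extreme.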

Are transcripts admissible in court?

Court transcripts are prepared by certified court reporters (typically stenographers or CVR-certified voice writers) and, once certified, serve as the official record of the proceeding. AI-generated transcripts are useful for reference, internal review, and litigation prep, but in general they cannot serve as an official court record without human verification, certification, and signature by a qualified court reporter. Exact rules vary by jurisdiction.

How long does transcription take?

AI transcription takes about 5-10 minutes for a 1-hour recording, often faster on GPU infrastructure. Human transcription takes 4-6 hours of work per hour of audio and is usually delivered within 24-48 hours. Hybrid workflows (AI first pass plus human review) typically deliver in around 24 hours. Turnaround depends on audio length, complexity, number of speakers, and service tier.

What file formats do transcripts come in?

Common transcript formats are: plain text (.txt) for raw copy, Word document (.docx) for editing, PDF (.pdf) for sharing and archiving, SubRip subtitle (.srt) and WebVTT (.vtt) for video captions, and JSON for programmatic use with per-word timestamps, speaker labels, and confidence scores.
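SRT and WebVTT differ in small but strict ways: SRT numbers each cue and uses a comma before the milliseconds (00:00:08,200), while WebVTT starts with a WEBVTT header line and uses a period (00:00:08.200). A minimal Python sketch that writes both from (start, end, text) segments, a hypothetical input shape chosen for illustration:

```python
def fmt_time(seconds: float, ms_sep: str) -> str:
    """HH:MM:SS plus milliseconds; SRT separates them with ',', WebVTT with '.'."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1_000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}{ms_sep}{ms:03d}"


def to_srt(segments: list[tuple[float, float, str]]) -> str:
    """Numbered cues, each followed by a blank line."""
    cues = [
        f"{n}\n{fmt_time(start, ',')} --> {fmt_time(end, ',')}\n{text}\n"
        for n, (start, end, text) in enumerate(segments, start=1)
    ]
    return "\n".join(cues)


def to_vtt(segments: list[tuple[float, float, str]]) -> str:
    """'WEBVTT' header, a blank line, then cues (cue identifiers are optional in WebVTT)."""
    cues = [
        f"{fmt_time(start, '.')} --> {fmt_time(end, '.')}\n{text}\n"
        for start, end, text in segments
    ]
    return "WEBVTT\n\n" + "\n".join(cues)


segments = [
    (0.0, 8.2, "Welcome to the podcast."),
    (8.2, 15.0, "Thanks for having me on."),
]
print(to_srt(segments))
print(to_vtt(segments))
```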

Need a transcript for your own audio or video? Try VexaScribe free.