What Is Transcription? Audio & Video Explained

Transcription converts spoken audio or video recordings into written text. This guide covers the types of transcription, file formats, use cases, legal requirements, and how AI transcription works.

Plain Language · All Types Covered · With Examples

Supported formats:

MP3, WAV, M4A, MP4, MOV, WEBM

What Is Transcription? (The Simple Definition)

Transcription is the process of converting spoken words from an audio or video recording into written text. (This page covers audio/video transcription — not biological transcription, which is a different process in genetics.)

For most of history, transcription was done by hand: stenographers took shorthand notes in courtrooms and boardrooms. Magnetic tape recorders made it possible to replay speech and type it out later. Digital recorders and software made files easier to manage and share. Today, AI transcription services process 1 hour of audio in under 5 minutes.

The 4 Types of Transcription

Not all transcription is the same. The type you need depends on your use case.

| Type | What It Includes | Used For | Example |
|---|---|---|---|
| Full Verbatim | Every word, filler (um, uh), pauses, laughter, non-verbal sounds | Legal, psychological research, linguistics | [laughs] Um, I— I think the, the contract... |
| Clean Verbatim | All meaningful words, fillers removed, grammar intact | Business meetings, journalism, most professional use | I think the contract covers this. |
| Edited | Grammar corrected, restructured for readability | Publishing, marketing, blog posts | The contract covers this scenario. |
| Phonetic | Sound-based notation of speech sounds | Linguistic analysis, dialect research | Not used in commercial transcription |

Most business transcription uses “clean verbatim” by default.
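
To make the difference between full verbatim and clean verbatim concrete, here is a minimal Python sketch that strips fillers, bracketed non-verbal tags, and punctuated false starts from the full verbatim example in the table. The filler list and the function name `to_clean_verbatim` are illustrative assumptions, not a standard or NovaScribe's method; real style guides vary on what counts as a filler.

```python
import re

# Illustrative filler list (an assumption); style guides differ on this.
FILLERS = ("um", "uh", "er", "ah")


def to_clean_verbatim(text: str) -> str:
    """Approximate a clean verbatim rendering of a full verbatim line."""
    # Drop bracketed non-verbal tags such as [laughs] or [pause].
    text = re.sub(r"\[[^\]]*\]", " ", text)
    # Drop standalone filler words, plus a trailing comma if present.
    filler_pattern = r"\b(?:" + "|".join(FILLERS) + r")\b,?"
    text = re.sub(filler_pattern, " ", text, flags=re.IGNORECASE)
    # Collapse false starts joined by punctuation: "I— I" -> "I", "the, the" -> "the".
    # Plain repetitions without punctuation ("we we") are left untouched.
    text = re.sub(r"\b(\w+)(?:\s*[,—-]+\s*\1\b)+", r"\1", text, flags=re.IGNORECASE)
    # Normalize whitespace left behind by the removals.
    return re.sub(r"\s+", " ", text).strip()


print(to_clean_verbatim("[laughs] Um, I— I think the, the contract..."))
# prints: I think the contract...
```

Rules like these only approximate clean verbatim. Deciding which words are meaningful still takes human or model judgment.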

What Is a Transcript Used For?

Transcripts serve every industry that relies on spoken communication.

Legal

Depositions, court records, and legal proceedings. Verbatim accuracy for official documentation.

Research

Qualitative interviews, focus groups, and oral histories. Make recordings searchable and citable.

Journalism

Interview records, source quotes, and fact-checking. Never misquote a source again.

Content Creation

Podcast show notes, blog posts, YouTube chapters, and social media clips from your recordings.

Accessibility

Making audio and video content available to deaf and hard-of-hearing users. Required by ADA and WCAG.

Education

Lecture notes, study guides, and course materials. Helps students review and retain information.

Transcription vs Captions vs Subtitles: What's the Difference?

These three terms are often confused. Here's a clear breakdown.

| | Transcription | Captions | Subtitles |
|---|---|---|---|
| Has timestamps | No (optional) | Yes | Yes |
| Language | Same as audio | Same as audio | Different language |
| Purpose | Reading / searching | Accessibility | Translation |
| File format | TXT, DOCX, PDF | SRT, VTT | SRT, VTT |
| Used on | Documents | Video players | Video players |

How Transcription Works: Manual vs AI

Two approaches, very different trade-offs.

Manual (Human) Transcription

Method: Trained typist listens and types
Speed: 4–6 hrs per hour of audio
Accuracy: 99%+
Cost: $0.79–$3.00 / min

AI Transcription

Method: ASR (speech recognition) model
Speed: 2–5 min per hour of audio
Accuracy: 95–98% (ideal conditions)
Cost: $0.003–$0.25 / min
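
To see what these ranges mean for a specific recording, the following Python sketch multiplies the per-minute costs and per-hour speeds quoted above by a recording's length. The `Method` class and `estimate` function are hypothetical names introduced for illustration, and the result is a rough range rather than a quote from any provider.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class Method:
    name: str
    cost_per_min_usd: Tuple[float, float]         # (low, high) per audio minute
    work_min_per_audio_hour: Tuple[float, float]  # (low, high) minutes of work


# Ranges taken from the comparison above (4-6 hrs = 240-360 min).
MANUAL = Method("manual", (0.79, 3.00), (240, 360))
AI = Method("ai", (0.003, 0.25), (2, 5))


def estimate(method: Method, audio_minutes: float) -> Dict[str, Tuple[float, float]]:
    """Return (low, high) cost in USD and processing time in minutes."""
    audio_hours = audio_minutes / 60
    return {
        "cost_usd": tuple(round(r * audio_minutes, 2) for r in method.cost_per_min_usd),
        "processing_min": tuple(
            round(m * audio_hours, 1) for m in method.work_min_per_audio_hour
        ),
    }


print(estimate(MANUAL, 60))  # cost 47.4 to 180.0 USD, 240 to 360 minutes
print(estimate(AI, 60))      # cost 0.18 to 15.0 USD, 2 to 5 minutes
```

For a one-hour recording, the quoted rates put manual transcription at roughly $47 to $180 and AI transcription at roughly $0.18 to $15.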

Transcription File Formats Explained

Different formats serve different purposes. NovaScribe exports all of them.

.TXT

Plain Text

No formatting, universally compatible with any text editor or app.

.DOCX

Microsoft Word

Easy to edit and share. The most common format for professional transcripts.

.PDF

PDF Document

Read-only format for distribution and archiving.

.SRT

SubRip Subtitles

Timed subtitles for video platforms like YouTube and Vimeo.

.VTT

WebVTT Captions

Web captions for HTML5 video players and streaming platforms.
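
The two timed formats differ mainly in small syntax details: SRT numbers each cue and uses a comma before the milliseconds, while WebVTT starts with a `WEBVTT` header line and uses a period. The Python sketch below renders the same segments both ways. The `(start, end, text)` segment tuples and the function names are assumptions made for illustration, not NovaScribe's export code.

```python
from typing import List, Tuple

Segment = Tuple[float, float, str]  # (start seconds, end seconds, text)


def format_timestamp(seconds: float, decimal_mark: str) -> str:
    """Format seconds as HH:MM:SS<mark>mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}{decimal_mark}{ms:03d}"


def to_srt(segments: List[Segment]) -> str:
    """SRT: numbered cues, comma before milliseconds, blank line between cues."""
    blocks = []
    for index, (start, end, text) in enumerate(segments, start=1):
        timing = f"{format_timestamp(start, ',')} --> {format_timestamp(end, ',')}"
        blocks.append(f"{index}\n{timing}\n{text}\n")
    return "\n".join(blocks)


def to_vtt(segments: List[Segment]) -> str:
    """WebVTT: 'WEBVTT' header, period before milliseconds."""
    cues = []
    for start, end, text in segments:
        timing = f"{format_timestamp(start, '.')} --> {format_timestamp(end, '.')}"
        cues.append(f"{timing}\n{text}\n")
    return "WEBVTT\n\n" + "\n".join(cues)


segments = [
    (0.0, 2.5, "I think the contract covers this."),
    (2.5, 5.0, "Let's review section two."),
]
print(to_srt(segments))
print(to_vtt(segments))
```

A plain TXT transcript would keep only the text of each segment, which is why it suits reading and searching rather than playback.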

When Is Transcription Legally Required?

For many organizations, transcription is not optional — it's a legal obligation.

ADA (Americans with Disabilities Act)

Requires accessible audio and video content for public-facing businesses. Organizations must provide transcripts or captions so deaf and hard-of-hearing users can access the same content.

Section 508 (US Federal)

US federal agencies must caption and transcribe all audio and video content. This applies to training materials, public announcements, recorded meetings, and online video.

WCAG 2.1 Level AA

Web Content Accessibility Guidelines require captions for all pre-recorded video content on websites. Level AA is the standard required by most accessibility laws and policies worldwide.

Consult a legal professional for specific compliance requirements.

How NovaScribe Transcription Works

Upload Your File

MP3, WAV, M4A, MP4, MOV, WEBM, and more. Drag and drop or browse to select your audio or video file.

AI Transcribes

NovaScribe's AI processes your file and generates an accurate text transcript — results ready in minutes.

Export Your Transcript

Download as TXT, DOCX, PDF, SRT, or VTT. Edit in-browser before exporting, or share a link directly.

Affordable Pricing

30-min interview: ~$0.15
1-hour meeting: ~$0.30
2-hour lecture: ~$0.60

Based on Pro plan ($10/mo for 2,500 minutes). All export formats included at no extra cost.
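
As a rough check on per-recording cost, the Python sketch below divides a plan's monthly price by its included minutes and multiplies by the recording length. It assumes the full monthly allotment is used. With the Pro plan figures quoted here ($10 for 2,500 minutes), the effective rate is $0.004 per minute, so the computed values come out slightly under the rounded estimates listed above. The function names are hypothetical.

```python
def effective_rate(monthly_price_usd: float, included_minutes: int) -> float:
    """Per-minute cost, assuming every included minute is used."""
    return monthly_price_usd / included_minutes


def recording_cost(audio_minutes: float, rate_per_min: float) -> float:
    """Cost of one recording at the given per-minute rate, rounded to cents."""
    return round(audio_minutes * rate_per_min, 2)


rate = effective_rate(10.0, 2500)  # 0.004 USD per minute
for label, minutes in [("30-min interview", 30), ("1-hour meeting", 60), ("2-hour lecture", 120)]:
    print(f"{label}: ${recording_cost(minutes, rate):.2f}")
# prints $0.12, $0.24, and $0.48 respectively
```

If fewer minutes are used in a month, the effective per-minute cost rises accordingly.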


Transcription FAQ

What is the difference between transcription and translation?

Transcription converts spoken words into written text in the same language — a spoken English interview becomes a written English transcript. Translation converts content from one language to another — a written English document becomes written Spanish. Some services combine both: transcription + translation produces a written transcript in a different language than the original speech.

What is verbatim transcription?

Verbatim transcription captures every spoken word exactly as said, including filler words (um, uh, like), false starts, repetitions, laughter, pauses, and non-verbal sounds. It’s used in legal proceedings, psychological research, and linguistic studies where the exact manner of speech is important. Most business transcription uses ‘clean verbatim’ instead, which removes fillers while keeping all meaningful content.

What is the difference between transcription, captions, and subtitles?

Transcription is a text document without timing data — for reading and searching. Captions are timed text synchronized with audio/video, shown in the same language as the spoken content — designed for accessibility (deaf and hard-of-hearing audiences). Subtitles are also timed and synchronized, but typically in a different language than the audio — used for translation. NovaScribe exports all three formats: TXT/DOCX for transcripts, SRT/VTT for captions and subtitles.

How long does transcription take?

AI transcription processes 1 hour of audio in 2–5 minutes. A human transcriber typically needs 4–6 hours of working time per hour of audio, and services quote delivery turnaround separately, with rush orders commonly returned within 12–24 hours. The actual time depends on audio quality, number of speakers, and subject complexity. AI services like NovaScribe return results within minutes, even for long files.

What file formats can I get my transcript in?

Common transcript formats include: TXT (plain text, universally compatible), DOCX (Microsoft Word, most common for editing), PDF (read-only sharing), SRT (SubRip — timed subtitles for video platforms), VTT (WebVTT — web captions for HTML5 video), and JSON (structured data for developers). NovaScribe exports TXT, DOCX, PDF, SRT, and VTT.

Is transcription important for accessibility?

Yes — transcription is a key accessibility tool. The Americans with Disabilities Act (ADA) and Web Content Accessibility Guidelines (WCAG 2.1 AA) require that audio and video content be accessible to deaf and hard-of-hearing users. Transcripts and captions fulfill this requirement. Universities, government agencies, and companies subject to Section 508 compliance must provide transcripts or captions for all recorded audio/video content.

How accurate is AI transcription?

AI transcription reaches 95–98% accuracy in ideal conditions — clear audio, single speaker, standard accent, general vocabulary. In challenging conditions (multiple speakers, background noise, heavy accents, technical jargon), accuracy typically falls to 70–90%. For most business use cases like meeting notes, podcast show notes, and YouTube captions, AI accuracy is more than sufficient.
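
Accuracy figures like these are commonly derived from word error rate (WER): the number of word substitutions, deletions, and insertions needed to turn the AI output into a trusted reference transcript, divided by the number of reference words, with accuracy taken as roughly 1 minus WER. The source does not state how its percentages were measured, so the Python sketch below is only one plausible way to check accuracy on your own recordings. The function name and the simple lowercase, whitespace tokenization are assumptions.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Levenshtein distance over words, keeping one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


reference = "the contract covers this scenario in full detail for everyone"
hypothesis = "the contact covers this scenario in full detail for everyone"
wer = word_error_rate(reference, hypothesis)
print(f"WER: {wer:.0%}, accuracy: {1 - wer:.0%}")
# prints: WER: 10%, accuracy: 90%
```

One wrong word in ten gives 90% accuracy, which shows why even a 95–98% transcript usually needs a proofreading pass before quoting or publishing.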

What is the difference between transcription and dictation?

Dictation is the real-time process of speaking for immediate capture — like speaking to a voice assistant or dictating a letter. Transcription is the conversion of pre-recorded audio into text after the fact. The key difference is timing: dictation happens live, transcription happens later. Many AI transcription tools can also handle dictation (real-time speech-to-text), but the primary use case is post-recording conversion.

Note: This guide covers audio and video transcription for business, legal, research, and content creation. For biological transcription (DNA to RNA), refer to molecular biology resources.

Ready to convert your audio or video to text? NovaScribe supports common audio and video formats and a wide range of use cases, from quick meeting notes to captions for accessibility compliance.