Verified July 2026

Video to SRT: Generate + Add Subtitles to Any Video (2026 Guide)

Video to SRT covers two related jobs. The first is generating SRT subtitles from a video: upload MP4/MOV/WEBM, transcribe with Whisper Large-v3 (5-10 minutes for a 1-hour video), and export .srt. The second is combining an existing SRT with a video, either as soft-subs (uploaded alongside the video so viewers can toggle them) or as burned-in, hardcoded captions. VexaScribe handles job one with ~92-97% English accuracy, higher than YouTube auto-captions on accented, technical, or noisy audio. Job two typically uses FFmpeg (open-source) for burn-in or platform upload for soft-subs.

SRT (SubRip Subtitle) is a plain-text subtitle format in use since ~2001. It has no formal standards body behind it, but it is a de facto standard accepted by YouTube, Vimeo, LinkedIn, Facebook, and most video editors. This page covers both jobs, generation and combination, with accuracy numbers, format compatibility, decision matrices, and platform-specific requirements. All numbers are sourced from Whisper documentation, W3C specifications, FFmpeg documentation, and platform docs; links are in the Sources section.

By VexaScribe Editorial · Published July 4, 2026 · Verified July 2026

Video to SRT at a Glance

Video formats supported: MP4 · MOV · WEBM · AVI · MKV

AI accuracy (clean English): ~92-97% with Whisper Large-v3

Platforms with SRT upload: YouTube · Vimeo · LinkedIn · TikTok...

2 jobs: generate OR combine (we cover both)

The phrase “video to SRT” hides two very different tasks, and users searching for it split roughly evenly between them. The first job is generation: you have a raw video file and need a subtitle file that matches the spoken content, cue by cue, with per-cue timestamps in HH:MM:SS,mmm format. The second job is combination: you already have both files, a finished SRT and a finished video, and you want them to travel together, either as a separate soft-sub track that viewers toggle or as hardcoded captions rendered permanently into the video pixels. Different tools solve each. This page walks through both, with accuracy numbers, format compatibility, decision matrices, and FFmpeg commands you can copy directly.

Which Job Do You Have?

Pick the workflow that matches the files you actually have on disk right now.

JOB A

You have a video, need SRT

You have MP4/MOV/WEBM/AVI/MKV on disk and no subtitle file yet. Goal: produce a valid SRT file with per-cue timing that matches the spoken content. Solution: upload to VexaScribe, run Whisper Large-v3, export .srt. Covered below under “Job A: Generate SRT from Video”.

JOB B

You have video AND SRT, want them combined

Both files exist. The fork: soft-subs (upload SRT alongside the video via the platform) or burn-in (render captions into the video pixels with FFmpeg or a video editor). Covered under “Job B: Add an Existing SRT to Video” and in the “Burn-in vs Soft-subs: Decision Matrix” section.

Job A: Generate SRT from Video

Five-step workflow to convert a video file into a valid SRT subtitle file with per-cue timestamps. Total wall-clock time for a 1-hour video: 10-15 minutes end-to-end, most of which is Whisper Large-v3 processing.

Upload the video file

Drag-and-drop the video into the VexaScribe upload area, or paste a direct URL. Supported formats: MP4 (H.264 + AAC), MOV (Apple QuickTime, iPhone recordings), WEBM (VP8/VP9 + Opus, YouTube download format), AVI, and MKV. File size cap is 5 GB per upload — comfortably enough for a 4+ hour piece at typical bitrates.

Configure language, speakers, and output

Whisper Large-v3 auto-detects language across 99 supported languages, but specifying the language explicitly is more reliable — Whisper occasionally confuses linguistically similar languages when auto-detecting. Enable speaker detection if the video has more than one person. Choose SRT as the export format (VTT, DOCX, PDF, and TXT are also available).

Wait 5-10 minutes for processing

VexaScribe's backend extracts audio from the video container (any format FFmpeg can decode), then Whisper Large-v3 (1.5B parameters, November 2023, MIT license) transcribes and times the cues on GPU infrastructure. Progress is visible in the dashboard; you do not need to keep the tab open.

Export the SRT file

The finished SRT contains numbered cues, HH:MM:SS,mmm start-arrow-end timing lines, and text lines. Standard SubRip format — universally accepted by YouTube, Vimeo, LinkedIn, FFmpeg, VLC, Premiere, DaVinci Resolve, and virtually every other video tool.

Upload SRT to your platform, or proceed to Job B

If you just needed the SRT for a platform's soft-sub upload (YouTube Studio → Subtitles → Add), stop here. If you need the captions embedded in the video pixels, continue to Job B for FFmpeg burn-in or desktop-editor rendering.
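The SRT structure described in the export step (a cue number, a timing line, then the text) can be sketched in a few lines of Python. This is only an illustration of the file format, not VexaScribe's exporter; the `Cue` class and both function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Cue:
    start_ms: int  # cue start, in milliseconds from the beginning of the video
    end_ms: int    # cue end, in milliseconds
    text: str      # caption text; may contain line breaks


def format_timestamp(ms: int) -> str:
    """Render milliseconds as the SRT timestamp HH:MM:SS,mmm."""
    hours, rest = divmod(ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    seconds, millis = divmod(rest, 1_000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"


def render_srt(cues: list[Cue]) -> str:
    """Join cues into SRT text: index, timing line, text, blank line."""
    blocks = []
    for index, cue in enumerate(cues, start=1):
        timing = f"{format_timestamp(cue.start_ms)} --> {format_timestamp(cue.end_ms)}"
        blocks.append(f"{index}\n{timing}\n{cue.text}\n")
    return "\n".join(blocks)
```

Calling `render_srt([Cue(0, 2500, "Hello"), Cue(2500, 5000, "World")])` yields two numbered cues separated by a blank line, which is the layout players and platforms expect.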

Cost: 30 min free at signup, then $2/mo (200 min), $5/mo (1,000 min), $10/mo (2,500 min), or $20/mo (6,000 min). No card required for the free tier.

Accuracy: ~92-97% with Whisper Large-v3 on clean English video. Full breakdown by scenario under “Accuracy by Video Scenario” below. See also “how accurate is Whisper” for the wider accuracy curve.

Job B: Add an Existing SRT to Video

You have a finished SRT and a finished video, both on disk. The next decision is a fork: soft-subs or burn-in. The two paths solve different problems and have different trade-offs. The decision matrix further down lays them out in a table; this section walks through each path operationally.

Soft-subs (recommended when platform supports it)

Soft-subs are separate subtitle tracks that ride alongside the video — either as a separate SRT file uploaded to the platform, or as a subtitle stream muxed into the video container. Viewers toggle them on and off, choose language, and often adjust font size and background color. This is the accessibility-preferred path and the correct default for long-form content on platforms that support it.

How: Upload the SRT alongside the video via the platform's caption feature. Every major long-form video host supports this workflow.

Where it works: YouTube (Studio → Subtitles → Add subtitle track), LinkedIn (during publish), Vimeo (Advanced settings during publish), Facebook (publisher-only feature), Kaltura, Wistia, self-hosted HTML5 players via the <track> element.

Advantages: Small file size (SRT is text only), editable (fix a typo without re-rendering the entire video), viewer control (accessibility features like font resizing work), multi-language friendly (one SRT per language, viewer picks).

Disadvantages: Depends on the player supporting soft-subs (TikTok and Instagram Reels do not, in practice), styling is player-controlled (you cannot force a specific font or drop-shadow), some social platforms silently degrade to burn-in-only for maximum feed compatibility.
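One practical detail for the self-hosted HTML5 case: browsers load WebVTT through the <track> element, not SRT, so an SRT usually has to be converted first. The two formats are close enough that a minimal conversion is a header plus a change of millisecond separator. The sketch below assumes well-formed SRT input; the function name is hypothetical.

```python
import re

# Matches an SRT timing line such as "00:00:01,000 --> 00:00:04,200".
_TIMING = re.compile(
    r"(\d{2}:\d{2}:\d{2}),(\d{3}) --> (\d{2}:\d{2}:\d{2}),(\d{3})"
)


def srt_to_vtt(srt_text: str) -> str:
    """Convert SRT text to WebVTT: add the header, use '.' for milliseconds."""
    # Normalise Windows line endings and drop a UTF-8 byte-order mark if present.
    body = srt_text.replace("\r\n", "\n").lstrip("\ufeff").strip()
    body = _TIMING.sub(r"\1.\2 --> \3.\4", body)
    return "WEBVTT\n\n" + body + "\n"
```

The SRT cue numbers are left in place because WebVTT allows an optional identifier line before each timing line.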

Burn-in (when soft-subs aren't supported)

Burn-in (also called “hardcoded” or “open captions”) renders the SRT text into the video pixels themselves. The result is a new video file where the captions are permanent — any player, no toggle. This is the correct choice for social platforms that do not support soft-subs and for content where consistent styling matters more than viewer control.

How (FFmpeg — free, open source): Two commands cover the common cases. The first is a soft-sub mux (essentially instant, no re-encoding); the second is true burn-in (requires re-encoding, adds 5-15% file size).

Soft-sub mux (adds SRT as a subtitle stream inside the MP4 container):

ffmpeg -i input.mp4 -i input.srt -c copy -c:s mov_text output.mp4

Hard-burn (renders SRT into video pixels; requires re-encoding):

ffmpeg -i input.mp4 -vf "subtitles=input.srt" output.mp4

How (video editor): Kapwing and VEED (browser-based) and Adobe Premiere and DaVinci Resolve (desktop) all import SRT files and burn them into a new render. Slower to set up than FFmpeg but easier if you also want to restyle the captions (font, color, drop-shadow, positioning).

Advantages: Universal player support (any device, any app, any platform), consistent styling (you control font and appearance), works on TikTok / Instagram Reels / any surface that ignores soft-sub tracks, avoids the “captions are off by default” problem on autoplay feeds.

Disadvantages: Larger output file (5-15% typical increase from re-encoding), not editable without re-rendering the entire video, no viewer control (accessibility fails for viewers who need larger text), one video per language.
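If you batch-process files, the two FFmpeg commands in this section can be wrapped in small helpers that build the argument lists. This is a sketch under a few assumptions: `ffmpeg` is on the PATH, the build includes libass (which the subtitles filter needs), file names contain no characters that need escaping inside a filter string, and `-c:a copy` is added to the burn-in so the audio is not re-encoded. The function names are hypothetical.

```python
import subprocess


def mux_soft_subs(video: str, srt: str, output: str) -> list[str]:
    """Arguments for adding the SRT as a toggleable subtitle stream (no re-encode)."""
    return [
        "ffmpeg", "-i", video, "-i", srt,
        "-c", "copy",          # copy video and audio streams untouched
        "-c:s", "mov_text",    # MP4 stores text subtitles as mov_text
        output,
    ]


def burn_in(video: str, srt: str, output: str) -> list[str]:
    """Arguments for rendering the SRT into the picture (video is re-encoded)."""
    return [
        "ffmpeg", "-i", video,
        "-vf", f"subtitles={srt}",  # assumes a simple filename with no special characters
        "-c:a", "copy",             # assumption: keep the original audio as-is
        output,
    ]


def run(args: list[str]) -> None:
    """Run FFmpeg and raise CalledProcessError if it exits with an error."""
    subprocess.run(args, check=True)
```

Passing the arguments as a list rather than a shell string avoids quoting problems with spaces in file names, for example `run(burn_in("input.mp4", "input.srt", "output.mp4"))`.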

Burn-in vs Soft-subs: Decision Matrix

Head-to-head across the seven dimensions that actually decide the fork. If your platform supports soft-subs and your content is long-form, soft-subs win. If your platform does not (TikTok, Instagram Reels), or the captions are load-bearing for feed engagement, burn-in wins.

Dimension	Soft-subs	Burn-in
File size	Small (SRT is text only)	Larger (5-15% typical increase)
Editability	Easy (edit SRT text file)	Hard (re-render required)
Player compatibility	Depends on player supporting soft-subs	Universal (any player)
Accessibility	Best (viewer can adjust font, disable)	Locked in — WCAG 2.1 SC 1.2.2 still met
Platform support	YouTube, LinkedIn, Vimeo, Facebook, Kaltura	Universal (works everywhere)
Multi-language	One SRT per language, viewer picks	One video per language
Best for	Long-form content, WCAG compliance	Social clips (TikTok, Reels)

Source: WCAG 2.1 SC 1.2.2 (June 2018) accepts either; verified July 2026.

Video Format Compatibility

Whisper Large-v3 does not care about the video container — it operates on the audio track. VexaScribe extracts audio via FFmpeg before invoking Whisper, so anything FFmpeg can decode works. In practice, that covers every container format used in the modern web-video pipeline.

Format	Extension	Supported	Notes
MP4 (H.264 + AAC)	.mp4	✓	Most common; our recommended input
MOV (Apple QuickTime)	.mov	✓	iPhone recordings, Final Cut exports
WEBM (VP8/VP9 + Opus)	.webm	✓	YouTube download format
AVI	.avi	✓	Older format, works fine
MKV (Matroska)	.mkv	✓	Container format, multiple streams supported

Source: Whisper Large-v3 accepts any format FFmpeg can decode; verified July 2026.

MP4 (H.264 + AAC) is our recommended input because it is universally decodable, well-optimized in modern FFmpeg builds, and matches what almost every camera and screen recorder outputs. MOV files from iPhone (typically HEVC + AAC) also work but occasionally have variable-frame-rate quirks that older FFmpeg builds handle poorly; a modern FFmpeg (7.x, released 2024) resolves them. If you have a container we do not list — OGV, FLV, 3GP — it almost certainly still works; FFmpeg's demuxer list is long.
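To make the "audio first" point concrete, here is roughly what an extraction step can look like if you run it yourself. VexaScribe's actual pipeline is not published, so treat this as an assumption: it builds an FFmpeg command that drops the picture and writes 16 kHz mono WAV, the sample rate Whisper works at internally. The function name is hypothetical.

```python
def extract_audio_args(video: str, wav_out: str) -> list[str]:
    """FFmpeg arguments that pull the audio track out as 16 kHz mono WAV."""
    return [
        "ffmpeg", "-i", video,
        "-vn",                # drop the video stream
        "-ac", "1",           # downmix to mono
        "-ar", "16000",       # resample to 16 kHz
        "-c:a", "pcm_s16le",  # uncompressed 16-bit PCM
        wav_out,
    ]
```

The same arguments apply whether the input is MP4, MOV, WEBM, AVI, or MKV, which is why the container matters so little for transcription.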

Accuracy by Video Scenario

Video-to-SRT accuracy is fundamentally audio accuracy — Whisper Large-v3 does not look at the picture. That means the honest breakdown depends on what kind of audio your video has: single-speaker with a good microphone tracks near the top of the range; multi-speaker panels, background music, and outdoor field recordings degrade in predictable ways. Numbers below are aggregated from the Whisper paper (Radford et al., OpenAI 2022), independent benchmarks (Open ASR Leaderboard), and VexaScribe's internal test corpus of known-transcript videos.

Scenario	Accuracy	Notes
Talking-head vlog (single speaker, good mic)	~92-95%	Best case
Screen recording tutorial (voice-over)	~90-94%	Clean audio, clear diction
Interview / 2-person podcast video	~85-92%	Speaker attribution mostly correct
Panel discussion (3+ speakers)	~78-88%	Speaker changes may be missed
Vlog with background music/noise	~78-88%	Music competes with speech
Field recording (outdoor, event, live)	~65-78%	Wind, crowd noise, distance

Source: Whisper paper (arxiv.org/abs/2212.04356) + internal testing; verified July 2026.

Practical read: for a talking-head vlog, a screen-recording tutorial, or a two-person podcast video, the SRT you get is broadcast-adjacent quality with light editing (a few minutes per hour to fix product names and proper nouns). For a panel discussion, an event-floor interview with crowd noise, or a field recording — expect meaningful cleanup time. Turn on speaker detection early; correcting speaker attribution after the fact is the most time-consuming edit.
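The page does not define how accuracy is measured. A common convention, assumed here, is accuracy = 1 - WER, where word error rate is the word-level edit distance between a known-good transcript and the machine output, divided by the number of reference words. A minimal sketch for checking a sample of your own video against a hand-corrected transcript:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")
    # prev[j] holds the edit distance between the reference words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def accuracy(reference: str, hypothesis: str) -> float:
    """Accuracy as 1 - WER, floored at zero."""
    return max(0.0, 1.0 - word_error_rate(reference, hypothesis))
```

For example, one wrong word in a four-word reference gives a WER of 0.25 and an accuracy of 0.75. Punctuation is not stripped here, so normalise both transcripts the same way before comparing.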

Platform Requirements Matrix

Once you have the SRT, the last question is what your target platform actually accepts. Most major platforms take SRT, but a few (Instagram Reels most notably) do not accept soft-subs at all and require burn-in. Others (TikTok) technically accept SRT via the desktop uploader but reward burn-in with better feed engagement. Zoom and Wistia both auto-generate SRT internally but let you upload a higher-quality one from an outside source.

Platform	Accepted	Preferred	Soft-sub support	Notes
YouTube	SRT, VTT, SBV, DFXP, TTML	SRT	✓ (viewer toggle)	Studio → Subtitles → Add subtitle track
Vimeo	SRT, VTT, DFXP	SRT	✓	Advanced settings during publish
LinkedIn	SRT	SRT	✓	Upload during publish; auto-plays with captions
Facebook	SRT	SRT	✓	Publisher-only feature
TikTok	SRT (desktop)	Burn-in preferred	⚠ (limited)	Auto-captions default; SRT via desktop uploader
Instagram Reels	Not directly	Burn-in required	✗	Use CapCut/Kapwing to burn-in first
Zoom recordings	SRT export from Zoom	VTT for playback	✓	Auto-transcribes with paid tier
Wistia	SRT	SRT	✓	Custom player supports both

Sources: YouTube, Vimeo, LinkedIn, Facebook, TikTok, Instagram, Zoom, Wistia platform documentation; verified July 2026.

For a WCAG 2.1 SC 1.2.2 compliance workflow (Level A, prerecorded, June 2018) on YouTube, LinkedIn, or Vimeo, the correct path is soft-sub upload: a properly reviewed SRT is more accurate than unreviewed auto-captions and lets viewers control their experience. For TikTok, Instagram Reels, and short-form social where captions drive silent-autoplay engagement, burn-in is the practical default. If you cross-post the same content to both surfaces, generate the SRT once (Job A), upload it as a soft-sub to long-form platforms, then burn it in with FFmpeg for the short-form cuts.
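For cross-posting, the platform matrix above can be turned into a small lookup that sorts target platforms into the two delivery paths. The mapping below is transcribed from the table (TikTok is treated as burn-in because that is its listed preference); the names `DELIVERY` and `plan_delivery` are hypothetical, and the entries will need updating as platform behavior changes.

```python
# Delivery mode per platform, transcribed from the platform matrix above.
DELIVERY = {
    "youtube": "soft-sub",
    "vimeo": "soft-sub",
    "linkedin": "soft-sub",
    "facebook": "soft-sub",
    "wistia": "soft-sub",
    "tiktok": "burn-in",
    "instagram reels": "burn-in",
}


def plan_delivery(platforms: list[str]) -> dict[str, list[str]]:
    """Group target platforms by how the captions should be delivered."""
    plan: dict[str, list[str]] = {"soft-sub": [], "burn-in": []}
    for name in platforms:
        mode = DELIVERY.get(name.strip().lower())
        if mode is None:
            raise KeyError(f"no entry for platform: {name!r}")
        plan[mode].append(name)
    return plan
```

A call such as `plan_delivery(["YouTube", "TikTok", "LinkedIn"])` tells you which uploads get the SRT as a separate track and which need a burned-in render first.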

Sources & Verification

Every technical claim on this page traces back to a primary source. Product versions, standards dates, and platform behaviors are cross-checked against vendor documentation and standards bodies.

OpenAI Whisper paper (Radford et al. 2022): arxiv.org/abs/2212.04356. Original architecture, training corpus (680,000 hours), and zero-shot WER results.
Whisper Large-v3 model card: huggingface.co/openai/whisper-large-v3. Official model card (November 2023, 1.5B parameters, MIT license).
WebVTT specification (W3C Candidate Recommendation, April 2019): w3.org/TR/webvtt1. Official specification for the sibling caption format used inside HTML5 <track> elements.
SubRip format description (Library of Congress): loc.gov/preservation/digital/formats/fdd/fdd000569.shtml. LOC format registry entry for SRT (SubRip Subtitle, ~2001).
WCAG 2.1 SC 1.2.2 (Captions Prerecorded): w3.org/TR/WCAG21. W3C accessibility guideline (June 2018) referenced in the decision matrix.
YouTube Studio caption upload: support.google.com/youtube/answer/2734796. Official YouTube documentation for uploading SRT and other caption formats.
FFmpeg subtitle documentation: ffmpeg.org/ffmpeg-filters.html#subtitles. Official FFmpeg documentation for the subtitles filter used in the hard-burn command.

All numbers verified July 2026 against primary sources listed above.

VexaScribe vs Alternatives

VexaScribe is the best value on the market for the generate side of video-to-SRT: 30 minutes free at signup, then $2/mo for 200 minutes running Whisper Large-v3 (~92-97% clean English accuracy). No built-in video editor, so the combine side (Job B) is off-platform — FFmpeg for the free path, a desktop editor for the styled path. Honest positioning: if you also need to restyle burn-in captions with per-frame animations, VexaScribe is not the whole answer.

Sonix is the enterprise-workflow option, entry pricing around $10/mo, similar generate accuracy (also Whisper-based). SSO, admin dashboard, and Zapier integrations make sense for teams handling hundreds of hours per month across multiple contributors. Overkill for individual creators.

VEED and Kapwing both bundle a full browser-based video editor with a subtitle generator. Pricing is $12-18/mo depending on tier. The advantage: you never leave the tab — upload video, generate SRT, restyle captions, burn-in, export the finished MP4 all in one place. The downside: if you only need the generate half, you are paying editor overhead for capability you never use, and rendering the final MP4 in a browser is slower than a native FFmpeg pipeline.

Descript is a different product shape entirely — a video editor built around the transcript, priced around $12/mo. Editing the transcript edits the video. Great for podcast video content where the workflow is script-first; not the right shape when you have a finished video and just need an SRT file to hand to a platform.

Common Use Cases

The six scenarios below cover the vast majority of what people mean when they search “video to SRT.” Match your situation to one and the workflow becomes concrete.

YouTube creator replacing auto-captions

You uploaded a video, but YouTube's auto-captions are wrong on your product names or technical vocabulary. Workflow: Job A (upload video to VexaScribe, export SRT), then YouTube Studio → Subtitles → Add subtitle track → upload the SRT with timing. This replaces the auto-caption track and meets WCAG 2.1 SC 1.2.2 properly.

LinkedIn thought-leadership post

LinkedIn videos autoplay silently in the feed. Soft-sub captions are the correct choice (viewer can toggle, accessibility works). Workflow: Job A generates the SRT, then upload it during publish via LinkedIn's video-publish flow. Do not burn in on LinkedIn: it is unnecessary and takes caption control away from viewers.

TikTok / Instagram Reels short-form cut

Both platforms reward burn-in captions in the feed. Workflow: Job A generates SRT from the source cut, then Job B burn-in with FFmpeg (ffmpeg -i input.mp4 -vf "subtitles=input.srt" output.mp4) or CapCut/Kapwing for styled captions. Upload the finished MP4 with baked-in text.

Vimeo portfolio / marketing video

Vimeo supports SRT and VTT as soft-subs via the Advanced settings during publish. Workflow: Job A → SRT export → upload SRT alongside the video. Vimeo lets you upload multiple language tracks for one video, so this is a good place for multilingual soft-subs if you translate the SRT (see /transcribe-and-translate-audio).

Course lecture / training video with WCAG requirement

For an educational surface that has to clear WCAG 2.1 SC 1.2.2 (Level A, prerecorded, June 2018), the correct workflow is Job A → export SRT → soft-sub upload to your LMS or hosted player. A properly reviewed SRT meets the standard; auto-captions typically do not. Related: /lecture-transcription for the workflow specific to educational content.

Corporate video that must play everywhere

A brand video destined for embed on multiple sites (some of which may not support soft-subs) benefits from burn-in for the widest reach. Workflow: Job A → SRT export → Job B FFmpeg burn-in or Adobe Premiere with the SRT imported as a caption track. Ship the finished MP4 with hardcoded captions; nothing to configure downstream.

Video to SRT FAQ

How do I convert video to SRT?

Upload the video file (MP4, MOV, WEBM, AVI, or MKV — up to 5 GB) to VexaScribe, choose language and speaker detection, wait 5-10 minutes for Whisper Large-v3 processing, then export as SRT. The output includes per-cue timestamps ready for YouTube, Vimeo, LinkedIn, and virtually any video platform.

How do I add an SRT file to a video?

Two approaches. Soft-subs: upload the SRT alongside the video via the platform's caption feature (YouTube Studio → Subtitles → Add subtitle track; Vimeo Advanced settings; LinkedIn during publish). Burn-in: use FFmpeg (`ffmpeg -i input.mp4 -vf "subtitles=input.srt" output.mp4`) or a video editor like Kapwing or VEED to render captions into the video pixels.

Should I use burn-in or soft-subs?

Soft-subs when your platform supports them (YouTube, LinkedIn, Vimeo, Facebook, Kaltura) — smaller file, editable, accessibility-friendly. Burn-in when your platform doesn't (TikTok, Instagram Reels) or when consistent styling matters — universal player support, no viewer control.

Does YouTube prefer SRT or VTT?

YouTube accepts both, but SRT is the more common upload format because it's universal across editors and desktop players. VTT is preferred when serving via a custom HTML5 web player because it supports CSS styling and positioning cues.

Can I burn in subtitles without re-encoding the video?

Soft-sub mux with FFmpeg (`-c copy -c:s mov_text`) is essentially instantaneous because it doesn't re-encode video. True burn-in (hard-coded captions rendered into pixels) always requires re-encoding, adding 5-15% file size and processing time equivalent to a full transcode.

What's the difference between soft-sub SRT and hardcoded captions?

Soft-subs are separate files (or muxed streams) that the video player displays alongside the video; viewers can toggle them on or off, choose a language, and sometimes adjust the styling. Hardcoded (burn-in) captions are rendered into the video pixels; they appear identically on every player but can't be toggled off or edited without re-rendering.

Does TikTok accept SRT files?

TikTok's desktop uploader accepts SRT for caption upload, but the mobile app defaults to auto-captions. Because engagement often depends on styled captions, most TikTok creators burn-in subtitles via CapCut, Kapwing, or VEED rather than rely on the platform's default caption display.

How accurate is video-to-SRT AI transcription in 2026?

Whisper Large-v3 (November 2023, MIT license) hits ~92-95% on clean single-speaker English video, drops to ~85-92% for 2-person interviews, ~78-88% for panels with 3+ speakers, ~78-88% with background music/noise, and ~65-78% for field recordings. Higher than YouTube auto-captions on accented, technical, or noisy audio; roughly equivalent on clean single-speaker English.

Convert video to SRT with VexaScribe (30 min free at signup, no card required) →