transcription
Audio/video transcription using OpenAI Whisper. Covers installation, model selection, transcript formats (SRT, VTT, JSON), timing synchronization, and speaker diarization. Use when transcribing media or generating subtitles.
Install
mkdir -p .claude/skills/transcription && curl -L -o skill.zip "https://mcp.directory/api/skills/download/6525" && unzip -o skill.zip -d .claude/skills/transcription && rm skill.zip

Installs to .claude/skills/transcription
About this skill
plugin: video-editing updated: 2026-01-20
Transcription with Whisper
Production-ready patterns for audio/video transcription using OpenAI Whisper.
System Requirements
- Python 3.8 or newer (for the Python-based options below)
- ffmpeg on your PATH (Whisper shells out to it to decode audio)
- Optional: an NVIDIA GPU with enough VRAM for the larger models (see table below)
Installation Options
Option 1: OpenAI Whisper (Python)
# macOS/Linux/Windows
pip install openai-whisper
# Verify
whisper --help
Option 2: whisper.cpp (C++, faster on CPU)
# macOS
brew install whisper-cpp
# Linux - build from source
git clone https://github.com/ggerganov/whisper.cpp
cd whisper.cpp && make
# Windows - use pre-built binaries or build with cmake
Option 3: Insanely Fast Whisper (GPU accelerated)
pip install insanely-fast-whisper
Model Selection
| Model | Size | VRAM | Accuracy | Speed | Use Case |
|---|---|---|---|---|---|
| tiny | 39M | ~1GB | Low | Fastest | Quick previews |
| base | 74M | ~1GB | Medium | Fast | Draft transcripts |
| small | 244M | ~2GB | Good | Medium | General use |
| medium | 769M | ~5GB | Better | Slow | Quality transcripts |
| large-v3 | 1550M | ~10GB | Best | Slowest | Final production |
Recommendation: Start with small for speed/quality balance. Use large-v3 for final delivery.
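The table above can be turned into a simple picker. The sketch below is a hypothetical helper (not part of Whisper itself) that selects the most accurate model whose approximate VRAM footprint fits the hardware; the gigabyte values are the rough estimates from the table.

```python
# Approximate VRAM footprints from the model table above (GB).
MODEL_VRAM_GB = {"tiny": 1, "base": 1, "small": 2, "medium": 5, "large-v3": 10}

def pick_model(available_vram_gb: float) -> str:
    """Return the most accurate model whose approximate VRAM need fits."""
    fitting = [m for m, gb in MODEL_VRAM_GB.items() if gb <= available_vram_gb]
    return fitting[-1] if fitting else "tiny"  # fall back to CPU-friendly tiny

print(pick_model(6))   # -> "medium": large-v3 needs ~10 GB and does not fit
```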
Basic Transcription
Using OpenAI Whisper
# Basic transcription (auto-detect language)
whisper audio.mp3 --model small
# Specify language and output format
whisper audio.mp3 --model medium --language en --output_format srt
# Multiple output formats
whisper audio.mp3 --model small --output_format all
# With timestamps and word-level timing
whisper audio.mp3 --model small --word_timestamps True
Using whisper.cpp
# Download model first
./models/download-ggml-model.sh base.en
# Transcribe
./main -m models/ggml-base.en.bin -f audio.wav -osrt
# CSV output (start, end, text per row)
./main -m models/ggml-base.en.bin -f audio.wav -ocsv
Output Formats
SRT (SubRip Subtitle)
1
00:00:01,000 --> 00:00:04,500
Hello and welcome to this video.
2
00:00:05,000 --> 00:00:08,200
Today we'll discuss video editing.
VTT (WebVTT)
WEBVTT
00:00:01.000 --> 00:00:04.500
Hello and welcome to this video.
00:00:05.000 --> 00:00:08.200
Today we'll discuss video editing.
JSON (with word-level timing)
{
"text": "Hello and welcome to this video.",
"segments": [
{
"id": 0,
"start": 1.0,
"end": 4.5,
"text": " Hello and welcome to this video.",
"words": [
{"word": "Hello", "start": 1.0, "end": 1.3},
{"word": "and", "start": 1.4, "end": 1.5},
{"word": "welcome", "start": 1.6, "end": 2.0},
{"word": "to", "start": 2.1, "end": 2.2},
{"word": "this", "start": 2.3, "end": 2.5},
{"word": "video", "start": 2.6, "end": 3.0}
]
}
]
}
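When you need subtitles from the JSON output rather than asking Whisper to emit SRT directly, the conversion is mechanical. A minimal sketch (the helper names are our own): format each segment's start/end as `HH:MM:SS,mmm` and join numbered blocks with blank lines.

```python
def secs_to_srt(t: float) -> str:
    """Format seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    ms = round(t * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def segments_to_srt(segments):
    """Render Whisper JSON segments as an SRT document."""
    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(
            f"{i}\n{secs_to_srt(seg['start'])} --> {secs_to_srt(seg['end'])}\n"
            f"{seg['text'].strip()}"
        )
    return "\n\n".join(blocks) + "\n"
```

For VTT, the only differences are the `WEBVTT` header, dots instead of commas in timestamps, and optional cue numbers.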
Audio Extraction for Transcription
Before transcribing video, extract the audio in a Whisper-friendly format:
# Extract audio as WAV (16kHz, mono - optimal for Whisper)
ffmpeg -i video.mp4 -ar 16000 -ac 1 -c:a pcm_s16le audio.wav
# Extract as high-quality WAV for archival
ffmpeg -i video.mp4 -vn -c:a pcm_s16le audio.wav
# Extract as compressed MP3 (smaller, still works)
ffmpeg -i video.mp4 -vn -c:a libmp3lame -q:a 2 audio.mp3
Timing Synchronization
Convert Whisper JSON to FCP Timing
import json
def whisper_to_fcp_timing(whisper_json_path, fps=24):
"""Convert Whisper JSON output to FCP-compatible timing."""
with open(whisper_json_path) as f:
data = json.load(f)
segments = []
for seg in data.get("segments", []):
segments.append({
"start_time": seg["start"],
"end_time": seg["end"],
"start_frame": int(seg["start"] * fps),
"end_frame": int(seg["end"] * fps),
"text": seg["text"].strip(),
"words": seg.get("words", [])
})
return segments
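The frame indices produced above are easier to spot-check as timecodes. A small companion helper (our own, not an FCP API) converts a frame index to a non-drop-frame `HH:MM:SS:FF` timecode at an integer frame rate:

```python
def frames_to_timecode(frame: int, fps: int = 24) -> str:
    """Convert a frame index to a non-drop-frame HH:MM:SS:FF timecode."""
    f = frame % fps               # frames within the current second
    total_s = frame // fps
    h, rem = divmod(total_s, 3600)
    m, s = divmod(rem, 60)
    return f"{h:02d}:{m:02d}:{s:02d}:{f:02d}"

print(frames_to_timecode(24))   # -> "00:00:01:00"
```

Note this assumes an integer fps; fractional NTSC rates (29.97) need drop-frame timecode, which is out of scope here.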
Frame-Accurate Timing
# Get exact frame count and duration
ffprobe -v error -count_frames -select_streams v:0 \
-show_entries stream=nb_read_frames,duration,r_frame_rate \
-of json video.mp4
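The `r_frame_rate` field in that ffprobe output is a fraction string such as `30000/1001` (29.97 NTSC), so parse it exactly rather than with `float()` on the raw string. A sketch, with a hard-coded sample payload standing in for real ffprobe output:

```python
import json
from fractions import Fraction

def parse_ffprobe_fps(ffprobe_json: str) -> float:
    """Extract the exact frame rate from `ffprobe -of json` output."""
    stream = json.loads(ffprobe_json)["streams"][0]
    return float(Fraction(stream["r_frame_rate"]))  # e.g. "30000/1001"

sample = '{"streams": [{"r_frame_rate": "30000/1001", "nb_read_frames": "720"}]}'
print(parse_ffprobe_fps(sample))   # ~29.97
```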
Speaker Diarization
For multi-speaker content, use pyannote.audio:
pip install pyannote.audio
from pyannote.audio import Pipeline

# Gated model: accept its terms on Hugging Face and pass an access token
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization@2.1", use_auth_token="YOUR_HF_TOKEN"
)
diarization = pipeline("audio.wav")
for turn, _, speaker in diarization.itertracks(yield_label=True):
print(f"{turn.start:.1f}s - {turn.end:.1f}s: {speaker}")
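Whisper and pyannote produce independent timelines, so the usual last step is to label each transcript segment with the speaker whose diarization turn overlaps it the most. A sketch of that merge, assuming turns have been collected as `(start, end, speaker)` tuples:

```python
def assign_speakers(segments, turns):
    """Label each Whisper segment with the speaker whose diarization
    turn overlaps it the most. turns: list of (start, end, speaker)."""
    labeled = []
    for seg in segments:
        best, best_overlap = "UNKNOWN", 0.0
        for start, end, speaker in turns:
            # Length of the intersection of the two time intervals
            overlap = min(seg["end"], end) - max(seg["start"], start)
            if overlap > best_overlap:
                best, best_overlap = speaker, overlap
        labeled.append({**seg, "speaker": best})
    return labeled
```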
Batch Processing
#!/bin/bash
# Transcribe all videos in directory
MODEL="small"
OUTPUT_DIR="transcripts"
mkdir -p "$OUTPUT_DIR"
for video in *.mp4 *.mov *.avi; do
[[ -f "$video" ]] || continue
base="${video%.*}"
# Extract audio
ffmpeg -y -i "$video" -ar 16000 -ac 1 -c:a pcm_s16le "/tmp/${base}.wav"
# Transcribe
whisper "/tmp/${base}.wav" --model "$MODEL" \
--output_format all \
--output_dir "$OUTPUT_DIR"
# Cleanup temp audio
rm "/tmp/${base}.wav"
echo "Transcribed: $video"
done
Quality Optimization
Improve Accuracy
- Noise reduction before transcription:
ffmpeg -i noisy_audio.wav -af "highpass=f=200,lowpass=f=3000,afftdn=nf=-25" clean_audio.wav
- Use language hint:
whisper audio.mp3 --language en --model medium
- Provide initial prompt for context:
whisper audio.mp3 --initial_prompt "Technical discussion about video editing software."
Performance Tips
- GPU acceleration (if available):
whisper audio.mp3 --model large-v3 --device cuda
- Process in chunks for long videos:
# Split audio into 10-minute chunks (segment muxer, no re-encode)
ffmpeg -i long_audio.wav -f segment -segment_time 600 -c copy chunk_%03d.wav
# Transcribe each chunk, then merge results, adding each chunk's start offset
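The merge step above is the only non-obvious part: each chunk's segments start at zero, so every timestamp must be shifted by the chunk's position in the original file. A sketch, assuming fixed-length chunks and per-chunk segment lists in order:

```python
def merge_chunks(chunk_results, chunk_len_s=600.0):
    """Merge per-chunk Whisper segment lists into one timeline by adding
    each chunk's start offset (chunks assumed chunk_len_s seconds long)."""
    merged = []
    for i, segments in enumerate(chunk_results):
        offset = i * chunk_len_s
        for seg in segments:
            merged.append({**seg,
                           "start": seg["start"] + offset,
                           "end": seg["end"] + offset})
    return merged
```

If you split on silence instead of fixed lengths, record each chunk's real start time and use that as the offset.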
Error Handling
# Validate audio file before transcription
validate_audio() {
local file="$1"
if ffprobe -v error -select_streams a:0 -show_entries stream=codec_type -of csv=p=0 "$file" 2>/dev/null | grep -q "audio"; then
return 0
else
echo "Error: No audio stream found in $file"
return 1
fi
}
# Check Whisper installation
check_whisper() {
if command -v whisper &> /dev/null; then
echo "Whisper available"
return 0
else
echo "Error: Whisper not installed. Run: pip install openai-whisper"
return 1
fi
}
Related Skills
- ffmpeg-core - Audio extraction and preprocessing
- final-cut-pro - Import transcripts as titles/markers