transcription

Name: transcription
Author: MadAppGang

Audio/video transcription using OpenAI Whisper. Covers installation, model selection, transcript formats (SRT, VTT, JSON), timing synchronization, and speaker diarization. Use when transcribing media or generating subtitles.

Install

mkdir -p .claude/skills/transcription && curl -L -o skill.zip "https://mcp.directory/api/skills/download/6525" && unzip -o skill.zip -d .claude/skills/transcription && rm skill.zip

Installs to .claude/skills/transcription

About this skill

plugin: video-editing
updated: 2026-01-20

Transcription with Whisper

Production-ready patterns for audio/video transcription using OpenAI Whisper.

System Requirements

Installation Options

Option 1: OpenAI Whisper (Python)

# macOS/Linux/Windows
pip install openai-whisper

# Verify
whisper --help

Option 2: whisper.cpp (C++ - faster)

# macOS
brew install whisper-cpp

# Linux - build from source
git clone https://github.com/ggerganov/whisper.cpp
cd whisper.cpp && make

# Windows - use pre-built binaries or build with cmake

Option 3: Insanely Fast Whisper (GPU accelerated)

pip install insanely-fast-whisper

Model Selection

Model	Parameters	VRAM	Accuracy	Speed	Use Case
tiny	39M	~1GB	Low	Fastest	Quick previews
base	74M	~1GB	Medium	Fast	Draft transcripts
small	244M	~2GB	Good	Medium	General use
medium	769M	~5GB	Better	Slow	Quality transcripts
large-v3	1550M	~10GB	Best	Slowest	Final production

Recommendation: Start with small for speed/quality balance. Use large-v3 for final delivery.
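The table can also be read as a lookup. The sketch below is a hypothetical helper (`pick_model` is not part of Whisper) that returns the largest model whose approximate VRAM figure from the table fits a given budget. The figures are the rough ones listed above, so treat the result as a starting point rather than a guarantee.

```python
# Approximate VRAM needs in GB, copied from the table above (smallest to largest).
MODEL_VRAM_GB = [
    ("tiny", 1),
    ("base", 1),
    ("small", 2),
    ("medium", 5),
    ("large-v3", 10),
]


def pick_model(available_vram_gb):
    """Return the largest model that fits the VRAM budget, or None if none fit."""
    choice = None
    for name, needed in MODEL_VRAM_GB:
        if needed <= available_vram_gb:
            choice = name  # later entries are larger, so keep overwriting
    return choice


print(pick_model(8))   # medium
print(pick_model(24))  # large-v3
```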

Basic Transcription

Using OpenAI Whisper

# Basic transcription (auto-detect language)
whisper audio.mp3 --model small

# Specify language and output format
whisper audio.mp3 --model medium --language en --output_format srt

# Multiple output formats
whisper audio.mp3 --model small --output_format all

# With timestamps and word-level timing
whisper audio.mp3 --model small --word_timestamps True
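When these commands are driven from a script, it can help to assemble the argument list in one place. The following is a minimal sketch that uses only the flags shown above; `build_whisper_command` is a hypothetical helper name, and it only builds the command without running it.

```python
import shlex


def build_whisper_command(audio_path, model="small", language=None,
                          output_format=None, word_timestamps=False):
    """Assemble a whisper CLI invocation as an argument list for subprocess.run."""
    cmd = ["whisper", audio_path, "--model", model]
    if language:
        cmd += ["--language", language]
    if output_format:
        cmd += ["--output_format", output_format]
    if word_timestamps:
        cmd += ["--word_timestamps", "True"]
    return cmd


cmd = build_whisper_command("audio.mp3", model="medium",
                            language="en", output_format="srt")
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

Passing a list to `subprocess.run` avoids shell quoting problems with file names that contain spaces.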

Using whisper.cpp

# Download model first
./models/download-ggml-model.sh base.en

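# Transcribe to SRT
# (newer whisper.cpp builds name the binary whisper-cli, e.g. ./build/bin/whisper-cli, instead of ./main)
./main -m models/ggml-base.en.bin -f audio.wav -osrt

# CSV output (start, end, text per segment)
./main -m models/ggml-base.en.bin -f audio.wav -ocsv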

Output Formats

SRT (SubRip Subtitle)

1
00:00:01,000 --> 00:00:04,500
Hello and welcome to this video.

2
00:00:05,000 --> 00:00:08,200
Today we'll discuss video editing.

VTT (WebVTT)

WEBVTT

00:00:01.000 --> 00:00:04.500
Hello and welcome to this video.

00:00:05.000 --> 00:00:08.200
Today we'll discuss video editing.
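The two subtitle formats differ mainly in the decimal mark of the timestamp (comma in SRT, dot in VTT), the `WEBVTT` header, and the cue numbers. As a rough illustration, the sketch below renders both from the same list of `(start, end, text)` tuples; the function names are hypothetical and the tuples reuse the sample cues above.

```python
def format_timestamp(seconds, decimal_mark):
    """Format seconds as HH:MM:SS<mark>mmm; SRT uses ',' and VTT uses '.'."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}{decimal_mark}{ms:03d}"


def to_srt(segments):
    """Render (start, end, text) tuples as numbered SRT cues."""
    cues = []
    for index, (start, end, text) in enumerate(segments, start=1):
        span = f"{format_timestamp(start, ',')} --> {format_timestamp(end, ',')}"
        cues.append(f"{index}\n{span}\n{text}\n")
    return "\n".join(cues)


def to_vtt(segments):
    """Render the same tuples as WebVTT: a header line, then unnumbered cues."""
    cues = ["WEBVTT\n"]
    for start, end, text in segments:
        span = f"{format_timestamp(start, '.')} --> {format_timestamp(end, '.')}"
        cues.append(f"{span}\n{text}\n")
    return "\n".join(cues)


segments = [
    (1.0, 4.5, "Hello and welcome to this video."),
    (5.0, 8.2, "Today we'll discuss video editing."),
]
print(to_srt(segments))
print(to_vtt(segments))
```

Working in whole milliseconds before splitting into hours, minutes, and seconds keeps floating-point error out of the formatted fields.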

JSON (with word-level timing)

{
  "text": "Hello and welcome to this video.",
  "segments": [
    {
      "id": 0,
      "start": 1.0,
      "end": 4.5,
      "text": " Hello and welcome to this video.",
      "words": [
        {"word": "Hello", "start": 1.0, "end": 1.3},
        {"word": "and", "start": 1.4, "end": 1.5},
        {"word": "welcome", "start": 1.6, "end": 2.0},
        {"word": "to", "start": 2.1, "end": 2.2},
        {"word": "this", "start": 2.3, "end": 2.5},
        {"word": "video", "start": 2.6, "end": 3.0}
      ]
    }
  ]
}
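Word-level timing makes it possible to cut captions shorter than a full segment. The sketch below is one possible approach, not something Whisper provides: a hypothetical `words_to_captions` helper that groups the `words` array into captions of at most `max_words` words, taking the start of the first word and the end of the last.

```python
def words_to_captions(words, max_words=3):
    """Group word timings into short captions of at most max_words words."""
    captions = []
    for i in range(0, len(words), max_words):
        group = words[i:i + max_words]
        captions.append({
            "start": group[0]["start"],
            "end": group[-1]["end"],
            # strip() because real Whisper output often pads words with spaces
            "text": " ".join(w["word"].strip() for w in group),
        })
    return captions


# Word timings taken from the sample JSON above.
words = [
    {"word": "Hello", "start": 1.0, "end": 1.3},
    {"word": "and", "start": 1.4, "end": 1.5},
    {"word": "welcome", "start": 1.6, "end": 2.0},
    {"word": "to", "start": 2.1, "end": 2.2},
    {"word": "this", "start": 2.3, "end": 2.5},
    {"word": "video", "start": 2.6, "end": 3.0},
]
for caption in words_to_captions(words):
    print(caption)
```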

Audio Extraction for Transcription

Before transcribing video, extract the audio in a suitable format:

# Extract audio as WAV (16kHz, mono - optimal for Whisper)
ffmpeg -i video.mp4 -vn -ar 16000 -ac 1 -c:a pcm_s16le audio.wav

# Extract as high-quality WAV for archival (keeps source sample rate and channels)
ffmpeg -i video.mp4 -vn -c:a pcm_s16le audio_archive.wav

# Extract as compressed MP3 (smaller, still works)
ffmpeg -i video.mp4 -vn -c:a libmp3lame -q:a 2 audio.mp3
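To confirm that an extracted WAV matches the 16 kHz mono 16-bit layout used above, the header can be checked with Python's standard `wave` module. This is a small sketch; `is_whisper_ready` and `write_silence` are hypothetical helper names, and the demo writes short silent files only so the check has something to read.

```python
import os
import tempfile
import wave


def is_whisper_ready(wav_path):
    """Return True if the WAV file is 16 kHz, mono, 16-bit PCM."""
    with wave.open(wav_path, "rb") as wav:
        return (
            wav.getframerate() == 16000
            and wav.getnchannels() == 1
            and wav.getsampwidth() == 2  # bytes per sample, i.e. 16-bit
        )


def write_silence(wav_path, rate, channels):
    """Demo helper: write 0.1 s of 16-bit silence with the given layout."""
    with wave.open(wav_path, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * (channels * rate // 10))


with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good.wav")
    bad = os.path.join(tmp, "bad.wav")
    write_silence(good, rate=16000, channels=1)
    write_silence(bad, rate=44100, channels=2)
    print(is_whisper_ready(good), is_whisper_ready(bad))  # True False
```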

Timing Synchronization

Convert Whisper JSON to FCP Timing

import json

def whisper_to_fcp_timing(whisper_json_path, fps=24):
    """Convert Whisper JSON output to FCP-compatible timing.

    Pass the clip's real frame rate (e.g. 24000 / 1001 for 23.976 fps).
    """
    with open(whisper_json_path, encoding="utf-8") as f:
        data = json.load(f)

    segments = []
    for seg in data.get("segments", []):
        segments.append({
            "start_time": seg["start"],
            "end_time": seg["end"],
            # round() rather than int(): float error can leave a value just
            # below a frame boundary, and truncation would land a frame early
            "start_frame": round(seg["start"] * fps),
            "end_frame": round(seg["end"] * fps),
            "text": seg["text"].strip(),
            "words": seg.get("words", [])
        })

    return segments
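Frame numbers are often easier to check against an editor timeline when shown as timecode. The sketch below formats a frame index as `HH:MM:SS:FF`. It assumes an integer frame rate and non-drop-frame counting, so it does not cover 29.97 fps drop-frame timecode; `frames_to_timecode` is a hypothetical helper name.

```python
def frames_to_timecode(frame, fps=24):
    """Format a frame index as non-drop-frame HH:MM:SS:FF (integer fps only)."""
    seconds, ff = divmod(frame, fps)
    minutes, ss = divmod(seconds, 60)
    hh, mm = divmod(minutes, 60)
    return f"{hh:02d}:{mm:02d}:{ss:02d}:{ff:02d}"


# A segment ending at 4.5 s at 24 fps ends on frame 108.
print(frames_to_timecode(108))    # 00:00:04:12
print(frames_to_timecode(86400))  # 01:00:00:00
```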

Frame-Accurate Timing

# Get exact frame count and duration
ffprobe -v error -count_frames -select_streams v:0 \
  -show_entries stream=nb_read_frames,duration,r_frame_rate \
  -of json video.mp4
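The JSON from this command reports `r_frame_rate` as a fraction string such as `24000/1001`, and the numeric fields as strings. A minimal sketch of parsing it follows; the sample payload is illustrative rather than real ffprobe output, `parse_video_timing` is a hypothetical name, and the code assumes all three fields are present in the first stream.

```python
import json
from fractions import Fraction


def parse_video_timing(ffprobe_json):
    """Extract exact frame rate, frame count, and duration from ffprobe JSON."""
    stream = json.loads(ffprobe_json)["streams"][0]
    return {
        "fps": Fraction(stream["r_frame_rate"]),  # keeps 24000/1001 exact
        "frames": int(stream["nb_read_frames"]),
        "duration": float(stream["duration"]),
    }


# Illustrative payload shaped like the ffprobe output requested above.
SAMPLE = (
    '{"streams": [{"r_frame_rate": "24000/1001", '
    '"nb_read_frames": "240", "duration": "10.010000"}]}'
)
info = parse_video_timing(SAMPLE)
print(info["fps"], info["frames"], info["duration"])
print(round(4.5 * info["fps"]))  # frame index for t = 4.5 s
```

Keeping the rate as a `Fraction` avoids hard-coding 23.976 or 29.97 when converting transcript times to frames.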

Speaker Diarization

For multi-speaker content, use pyannote.audio:

pip install pyannote.audio

from pyannote.audio import Pipeline

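# Model ID below is an assumed example; gated pyannote pipelines also need a
# Hugging Face access token passed to from_pretrained
pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization-3.1")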
diarization = pipeline("audio.wav")

for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{turn.start:.1f}s - {turn.end:.1f}s: {speaker}")
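Diarization yields speaker turns, while Whisper yields text segments, so the two still need to be joined. One common approach, sketched here under the assumption that the turns have been collected as `(start, end, speaker)` tuples (for example from the `itertracks` loop above), is to label each segment with the speaker whose turns overlap it the most. `assign_speakers` is a hypothetical helper and the speaker labels are illustrative.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two time ranges (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(segments, turns):
    """Label each segment with the speaker whose turns overlap it the most.

    segments: dicts with "start", "end", "text"; turns: (start, end, speaker).
    Segments that overlap no turn get speaker None.
    """
    labeled = []
    for seg in segments:
        totals = {}
        for start, end, speaker in turns:
            shared = overlap(seg["start"], seg["end"], start, end)
            if shared > 0:
                totals[speaker] = totals.get(speaker, 0.0) + shared
        speaker = max(totals, key=totals.get) if totals else None
        labeled.append({**seg, "speaker": speaker})
    return labeled


turns = [(0.0, 4.8, "SPEAKER_00"), (4.8, 9.0, "SPEAKER_01")]
segments = [
    {"start": 1.0, "end": 4.5, "text": "Hello and welcome to this video."},
    {"start": 5.0, "end": 8.2, "text": "Today we'll discuss video editing."},
]
for seg in assign_speakers(segments, turns):
    print(seg["speaker"], seg["text"])
```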

Batch Processing

#!/bin/bash
# Transcribe all videos in directory

MODEL="small"
OUTPUT_DIR="transcripts"
mkdir -p "$OUTPUT_DIR"

for video in *.mp4 *.mov *.avi; do
  [[ -f "$video" ]] || continue

  base="${video%.*}"

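  # Extract audio; skip this file if extraction fails
  ffmpeg -nostdin -i "$video" -vn -ar 16000 -ac 1 -c:a pcm_s16le "/tmp/${base}.wav" -y \
    || { echo "Skipping $video: audio extraction failed" >&2; continue; }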

  # Transcribe
  whisper "/tmp/${base}.wav" --model "$MODEL" \
    --output_format all \
    --output_dir "$OUTPUT_DIR"

  # Cleanup temp audio
  rm "/tmp/${base}.wav"

  echo "Transcribed: $video"
done

Quality Optimization

Improve Accuracy

Noise reduction before transcription:

ffmpeg -i noisy_audio.wav -af "highpass=f=200,lowpass=f=3000,afftdn=nf=-25" clean_audio.wav

Use language hint:

whisper audio.mp3 --language en --model medium

Provide initial prompt for context:

whisper audio.mp3 --initial_prompt "Technical discussion about video editing software."

Performance Tips

GPU acceleration (if available):

whisper audio.mp3 --model large-v3 --device cuda

Process in chunks for long videos:

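# Split audio into 10-minute chunks
ffmpeg -i audio.wav -f segment -segment_time 600 -c copy chunk_%03d.wav
# Transcribe each chunk
for f in chunk_*.wav; do whisper "$f" --model small --output_format json; done
# Merge results, adding each chunk's start offset (chunk index * 600 s) to its timestamps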
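The merge step can be sketched as follows, assuming every chunk has the same nominal length (600 seconds for 10-minute chunks) and that each chunk's Whisper JSON has already been loaded into a dict, in chunk order. `merge_chunks` is a hypothetical helper. Speech cut at a chunk boundary may still be transcribed poorly, which this sketch does not address.

```python
def merge_chunks(chunk_results, chunk_seconds=600):
    """Concatenate per-chunk Whisper segments, shifting times by the chunk offset.

    chunk_results: parsed Whisper JSON dicts, ordered by chunk index.
    """
    merged = []
    for index, result in enumerate(chunk_results):
        offset = index * chunk_seconds
        for seg in result.get("segments", []):
            merged.append({
                "id": len(merged),  # renumber across the whole recording
                "start": seg["start"] + offset,
                "end": seg["end"] + offset,
                "text": seg["text"],
            })
    return merged


chunks = [
    {"segments": [{"start": 1.0, "end": 4.5,
                   "text": " Hello and welcome to this video."}]},
    {"segments": [{"start": 5.0, "end": 8.2,
                   "text": " Today we'll discuss video editing."}]},
]
for seg in merge_chunks(chunks):
    print(seg["id"], seg["start"], seg["end"], seg["text"].strip())
```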

Error Handling

# Validate audio file before transcription
validate_audio() {
  local file="$1"
  if ffprobe -v error -select_streams a:0 -show_entries stream=codec_type -of csv=p=0 "$file" 2>/dev/null | grep -q "audio"; then
    return 0
  else
    echo "Error: No audio stream found in $file" >&2
    return 1
  fi
}

# Check Whisper installation
check_whisper() {
  if command -v whisper &> /dev/null; then
    echo "Whisper available"
    return 0
  else
    echo "Error: Whisper not installed. Run: pip install openai-whisper" >&2
    return 1
  fi
}
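For pipelines written in Python, the same installation check can be done with the standard library. A minimal sketch, assuming the three command-line tools used in this document are the ones required; `missing_tools` is a hypothetical helper name.

```python
import shutil


def missing_tools(tools=("whisper", "ffmpeg", "ffprobe")):
    """Return the names from tools that are not found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


missing = missing_tools()
if missing:
    print("Missing:", ", ".join(missing))
else:
    print("All tools available")
```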

Related Skills

ffmpeg-core - Audio extraction and preprocessing
final-cut-pro - Import transcripts as titles/markers



