# whisper-transcription

Transcribe audio/video to text with word-level timestamps using OpenAI Whisper. Use when you need speech-to-text with accurate timing information for each word.
## Install

```bash
mkdir -p .claude/skills/whisper-transcription && \
  curl -L -o skill.zip "https://mcp.directory/api/skills/download/3254" && \
  unzip -o skill.zip -d .claude/skills/whisper-transcription && \
  rm skill.zip
```

Installs to `.claude/skills/whisper-transcription`.
## About this skill

### Whisper Transcription

OpenAI Whisper provides accurate speech-to-text with word-level timestamps.
### Installation

```bash
pip install openai-whisper
```

Whisper also requires `ffmpeg` to be installed on the system for audio decoding.
### Model Selection

Use the `tiny` model for fast transcription; it is sufficient for most tasks and runs much faster:

| Model | Size | Speed | Accuracy |
|---|---|---|---|
| tiny | 39 MB | Fastest | Good for clear speech |
| base | 74 MB | Fast | Better accuracy |
| small | 244 MB | Medium | High accuracy |

Recommendation: start with `tiny`; it handles clear interview/podcast audio well.
### Basic Usage with Word Timestamps

```python
import whisper
import json


def transcribe_with_timestamps(audio_path, output_path):
    """
    Transcribe audio and get word-level timestamps.

    Args:
        audio_path: Path to audio/video file
        output_path: Path to save JSON output
    """
    # Use tiny model for speed
    model = whisper.load_model("tiny")

    # Transcribe with word timestamps
    result = model.transcribe(
        audio_path,
        word_timestamps=True,
        language="en"  # Specify language for better accuracy
    )

    # Extract words with timestamps
    words = []
    for segment in result["segments"]:
        if "words" in segment:
            for word_info in segment["words"]:
                words.append({
                    "word": word_info["word"].strip(),
                    "start": word_info["start"],
                    "end": word_info["end"]
                })

    with open(output_path, "w") as f:
        json.dump(words, f, indent=2)

    return words
```
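The JSON written to `output_path` is a flat list of word entries. A minimal sketch of the expected shape (the words and timestamps below are illustrative, not real model output):

```python
import json

# Illustrative output of transcribe_with_timestamps (values are made up)
sample_words = [
    {"word": "Hello", "start": 0.0, "end": 0.42},
    {"word": "world", "start": 0.48, "end": 0.91},
]

print(json.dumps(sample_words, indent=2))
```

Downstream tools (the word/phrase finders below) only rely on the `word`, `start`, and `end` keys.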
### Detecting Specific Words

```python
def find_words(transcription, target_words):
    """
    Find specific words in transcription with their timestamps.

    Args:
        transcription: List of word dicts with 'word', 'start', 'end'
        target_words: Set of words to find (lowercase)

    Returns:
        List of matches with word and timestamp
    """
    matches = []
    target_lower = {w.lower() for w in target_words}

    for item in transcription:
        word = item["word"].lower().strip()
        # Remove punctuation for matching
        clean_word = ''.join(c for c in word if c.isalnum())
        if clean_word in target_lower:
            matches.append({
                "word": clean_word,
                "timestamp": item["start"]
            })

    return matches
```
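A quick way to see the matching behavior is to run it on a hand-written transcription snippet (this is not real Whisper output; note that Whisper word strings often carry a leading space and trailing punctuation, which the cleaning step removes):

```python
def find_words(transcription, target_words):
    # Condensed copy of find_words above, so this example runs standalone
    target_lower = {w.lower() for w in target_words}
    matches = []
    for item in transcription:
        # Lowercase, strip whitespace, then drop punctuation before matching
        clean = ''.join(c for c in item["word"].lower().strip() if c.isalnum())
        if clean in target_lower:
            matches.append({"word": clean, "timestamp": item["start"]})
    return matches


# Hand-written snippet mimicking Whisper's word format
transcription = [
    {"word": " Um,", "start": 0.0, "end": 0.3},
    {"word": " I", "start": 0.4, "end": 0.5},
    {"word": " think", "start": 0.5, "end": 0.8},
    {"word": " um", "start": 1.1, "end": 1.3},
]

print(find_words(transcription, {"um"}))
# → [{'word': 'um', 'timestamp': 0.0}, {'word': 'um', 'timestamp': 1.1}]
```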
### Complete Example: Find Filler Words

```python
import whisper
import json

# Filler words to detect
FILLER_WORDS = {
    "um", "uh", "hum", "hmm", "mhm",
    "like", "so", "well", "yeah", "okay",
    "basically", "actually", "literally"
}


def detect_fillers(audio_path, output_path):
    # Load tiny model (fast!)
    model = whisper.load_model("tiny")

    # Transcribe
    result = model.transcribe(audio_path, word_timestamps=True, language="en")

    # Find fillers
    fillers = []
    for segment in result["segments"]:
        for word_info in segment.get("words", []):
            word = word_info["word"].lower().strip()
            clean = ''.join(c for c in word if c.isalnum())
            if clean in FILLER_WORDS:
                fillers.append({
                    "word": clean,
                    "timestamp": round(word_info["start"], 2)
                })

    with open(output_path, "w") as f:
        json.dump(fillers, f, indent=2)

    return fillers


# Usage
detect_fillers("/root/input.mp4", "/root/annotations.json")
```
### Audio Extraction (if needed)

Whisper can process video files directly, but extracting a clean audio track first can improve results:

```bash
# Extract audio as 16 kHz mono WAV
ffmpeg -i input.mp4 -vn -acodec pcm_s16le -ar 16000 -ac 1 audio.wav
```
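If you prefer to drive the extraction from Python, the same command can be built and run with `subprocess`. This is a minimal sketch (the `build_extract_cmd` / `extract_audio` helper names are this example's own, and running it requires `ffmpeg` on `PATH`):

```python
import subprocess


def build_extract_cmd(video_path, wav_path):
    """Mirror the ffmpeg invocation above: strip video, 16 kHz mono 16-bit PCM."""
    return [
        "ffmpeg", "-i", video_path,
        "-vn",                   # drop the video stream
        "-acodec", "pcm_s16le",  # 16-bit PCM WAV
        "-ar", "16000",          # 16 kHz sample rate (what Whisper expects)
        "-ac", "1",              # mono
        wav_path,
    ]


def extract_audio(video_path, wav_path):
    # Requires ffmpeg on PATH; raises CalledProcessError on failure
    subprocess.run(build_extract_cmd(video_path, wav_path), check=True)
    return wav_path
```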
### Multi-Word Phrases

For detecting phrases like "you know" or "I mean":

```python
def find_phrases(transcription, phrases):
    """Find multi-word phrases in transcription."""
    matches = []
    # Normalize: lowercase, strip whitespace, and drop punctuation
    # so a transcribed "know," still matches the phrase word "know"
    words = [
        ''.join(c for c in w["word"].lower().strip() if c.isalnum())
        for w in transcription
    ]

    for phrase in phrases:
        phrase_words = phrase.lower().split()
        phrase_len = len(phrase_words)
        for i in range(len(words) - phrase_len + 1):
            if words[i:i + phrase_len] == phrase_words:
                matches.append({
                    "word": phrase,
                    "timestamp": transcription[i]["start"]
                })

    return matches
```
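Phrase matching can be exercised the same way on a hand-written snippet (not real Whisper output). This standalone variant also strips punctuation during normalization, so a transcribed "know," matches:

```python
def find_phrases(transcription, phrases):
    # Condensed, punctuation-stripping variant of find_phrases, standalone
    matches = []
    words = [
        ''.join(c for c in w["word"].lower().strip() if c.isalnum())
        for w in transcription
    ]
    for phrase in phrases:
        pw = phrase.lower().split()
        for i in range(len(words) - len(pw) + 1):
            if words[i:i + len(pw)] == pw:
                matches.append({"word": phrase,
                                "timestamp": transcription[i]["start"]})
    return matches


# Hand-written snippet mimicking Whisper's word format
transcription = [
    {"word": " you", "start": 0.0, "end": 0.2},
    {"word": " know,", "start": 0.2, "end": 0.5},
    {"word": " it", "start": 0.6, "end": 0.7},
    {"word": " works", "start": 0.7, "end": 1.0},
]

print(find_phrases(transcription, ["you know"]))
# → [{'word': 'you know', 'timestamp': 0.0}]
```

The reported timestamp is the start time of the phrase's first word.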