ai-multimodal

Process and generate multimedia content using the Google Gemini API. Capabilities: analyze audio files (transcription with timestamps, summarization, speech understanding, music/sound analysis, up to 9.5 hours); understand images (captioning, object detection, OCR, visual Q&A, segmentation); process videos (scene detection, Q&A, temporal analysis, YouTube URLs, up to 6 hours); extract from documents (PDF tables, forms, charts, diagrams, multi-page); and generate images (text-to-image, editing, composition, refinement). Use when working with audio/video files, analyzing images or screenshots, processing PDF documents, extracting structured data from media, creating images from text prompts, or implementing multimodal AI features. Supports multiple models (Gemini 2.5/2.0) with context windows up to 2M tokens.

Install

mkdir -p .claude/skills/ai-multimodal && curl -L -o skill.zip "https://mcp.directory/api/skills/download/321" && unzip -o skill.zip -d .claude/skills/ai-multimodal && rm skill.zip

Installs to .claude/skills/ai-multimodal

About this skill

AI Multimodal Processing Skill

Process audio, images, videos, documents, and generate images using Google Gemini's multimodal API. Unified interface for all multimedia content understanding and generation.

Core Capabilities

Audio Processing

Transcription with timestamps (up to 9.5 hours)
Audio summarization and analysis
Speech understanding and speaker identification
Music and environmental sound analysis
Text-to-speech generation with controllable voice

Image Understanding

Image captioning and description
Object detection with bounding boxes (2.0+)
Pixel-level segmentation (2.5+)
Visual question answering
Multi-image comparison (up to 3,600 images)
OCR and text extraction
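Detection results usually need converting before they can be drawn on the source image. The sketch below assumes the convention described in the Gemini API documentation, where each box is `[ymin, xmin, ymax, xmax]` on a 0-1000 scale; the function name `to_pixel_box` is hypothetical and not part of this skill.

```python
def to_pixel_box(box_2d, width, height):
    """Convert a [ymin, xmin, ymax, xmax] box on a 0-1000 scale
    into pixel coordinates (left, top, right, bottom).

    Assumption: the 0-1000 normalization and the y-first ordering
    follow the Gemini API documentation; verify against the model
    output you actually receive.
    """
    ymin, xmin, ymax, xmax = box_2d
    left = round(xmin * width / 1000)
    top = round(ymin * height / 1000)
    right = round(xmax * width / 1000)
    bottom = round(ymax * height / 1000)
    return left, top, right, bottom


if __name__ == "__main__":
    # A box covering the middle band of a 1920x1080 frame.
    print(to_pixel_box([100, 200, 500, 800], 1920, 1080))
```

The returned tuple uses the (left, top, right, bottom) order that Pillow's drawing and cropping functions expect.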

Video Analysis

Scene detection and summarization
Video Q&A with temporal understanding
Transcription with visual descriptions
YouTube URL support
Long video processing (up to 6 hours)
Frame-level analysis

Document Extraction

Native PDF vision processing (up to 1,000 pages)
Table and form extraction
Chart and diagram analysis
Multi-page document understanding
Structured data output (JSON schema)
Format conversion (PDF to HTML/JSON)

Image Generation

Text-to-image generation
Image editing and modification
Multi-image composition (up to 3 images)
Iterative refinement
Multiple aspect ratios (1:1, 16:9, 9:16, 4:3, 3:4)
Controllable style and quality

Capability Matrix

Task	Audio	Image	Video	Document	Generation
Transcription	✓	-	✓	-	-
Summarization	✓	✓	✓	✓	-
Q&A	✓	✓	✓	✓	-
Object Detection	-	✓	✓	-	-
Text Extraction	-	✓	-	✓	-
Structured Output	✓	✓	✓	✓	-
Creation	TTS	-	-	-	✓
Timestamps	✓	-	✓	-	-
Segmentation	-	✓	-	-	-

Model Selection Guide

Gemini 2.5 Series (Recommended)

gemini-2.5-pro: Highest quality, all features, 1M-2M context
gemini-2.5-flash: Best balance, all features, 1M-2M context
gemini-2.5-flash-lite: Lightweight, segmentation support
gemini-2.5-flash-image: Image generation only

Gemini 2.0 Series

gemini-2.0-flash: Fast processing, object detection
gemini-2.0-flash-lite: Lightweight option

Feature Requirements

Segmentation: Requires 2.5+ models
Object Detection: Requires 2.0+ models
Multi-video: Requires 2.5+ models
Image Generation: Requires flash-image model

Context Windows

2M tokens: ~6 hours video (low-res) or ~2 hours (default)
1M tokens: ~3 hours video (low-res) or ~1 hour (default)
Audio: 32 tokens/second (1 min = 1,920 tokens)
PDF: 258 tokens/page (fixed)
Image: 258-1,548 tokens based on size
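The rates above can be turned into a quick pre-flight check. This is a minimal sketch using only the audio and PDF figures listed here; the function names are hypothetical, and the estimate ignores prompt text and output tokens, so leave headroom.

```python
AUDIO_TOKENS_PER_SECOND = 32   # from the list above
PDF_TOKENS_PER_PAGE = 258      # from the list above


def estimate_tokens(audio_seconds=0, pdf_pages=0):
    """Rough input-token estimate for audio and PDF content."""
    return (audio_seconds * AUDIO_TOKENS_PER_SECOND
            + pdf_pages * PDF_TOKENS_PER_PAGE)


def fits_in_window(tokens, window_tokens=1_000_000, reserve=0):
    """True if `tokens` plus a reserve for prompt/output fits the window."""
    return tokens + reserve <= window_tokens


if __name__ == "__main__":
    nine_and_a_half_hours = int(9.5 * 3600)
    tokens = estimate_tokens(audio_seconds=nine_and_a_half_hours)
    print(tokens, fits_in_window(tokens), fits_in_window(tokens, 2_000_000))
```

By this arithmetic, 9.5 hours of audio comes to 1,094,400 tokens, which exceeds a 1M window but fits a 2M one.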

Quick Start

Prerequisites

API Key Setup: Supports both Google AI Studio and Vertex AI.

The skill checks for GEMINI_API_KEY in this order:

Process environment: export GEMINI_API_KEY="your-key"
Project root: .env
.claude/.env
.claude/skills/.env
.claude/skills/ai-multimodal/.env
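The lookup order above could be implemented roughly as follows. This is an illustrative sketch, not the skill's actual code: `find_api_key` and `read_key_from_env_file` are hypothetical names, and the file parser handles only plain `KEY=value` lines to stay standard-library-only (the skill itself lists python-dotenv as a dependency).

```python
import os
from pathlib import Path

# Checked in order after the process environment, relative to the project root.
ENV_FILE_CANDIDATES = [
    ".env",
    ".claude/.env",
    ".claude/skills/.env",
    ".claude/skills/ai-multimodal/.env",
]


def read_key_from_env_file(path, name="GEMINI_API_KEY"):
    """Return the value of `name` from a simple KEY=value file, or None."""
    if not path.is_file():
        return None
    for line in path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        if key.strip() == name:
            # Drop optional surrounding quotes; treat an empty value as unset.
            return value.strip().strip('"').strip("'") or None
    return None


def find_api_key(project_root=".", environ=None):
    """Return the first key found: process environment first, then .env files."""
    environ = os.environ if environ is None else environ
    key = environ.get("GEMINI_API_KEY")
    if key:
        return key
    root = Path(project_root)
    for candidate in ENV_FILE_CANDIDATES:
        key = read_key_from_env_file(root / candidate)
        if key:
            return key
    return None
```

The first match wins, so a key exported in the shell overrides any `.env` file, and a project-root `.env` overrides the skill-level files.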

Get API key: https://aistudio.google.com/apikey

For Vertex AI:

export GEMINI_USE_VERTEX=true
export VERTEX_PROJECT_ID=your-gcp-project-id
export VERTEX_LOCATION=us-central1  # Optional
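A sketch of how these variables might map onto the google-genai client constructor. `build_client_kwargs` is a hypothetical helper, and falling back to `us-central1` when `VERTEX_LOCATION` is unset is an assumption based on the "Optional" note above; the skill's own scripts may differ.

```python
import os


def build_client_kwargs(environ=None):
    """Build keyword arguments for genai.Client from environment variables."""
    environ = os.environ if environ is None else environ
    if environ.get("GEMINI_USE_VERTEX", "").lower() == "true":
        return {
            "vertexai": True,
            "project": environ["VERTEX_PROJECT_ID"],
            # Assumed default; the source only marks this variable as optional.
            "location": environ.get("VERTEX_LOCATION", "us-central1"),
        }
    return {"api_key": environ["GEMINI_API_KEY"]}


# Usage (requires the google-genai package):
#   from google import genai
#   client = genai.Client(**build_client_kwargs())
```

Keeping the environment parsing separate from client creation makes the selection logic easy to test without network access.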

Install SDK:

pip install google-genai python-dotenv pillow

Common Patterns

Transcribe Audio:

python scripts/gemini_batch_process.py \
  --files audio.mp3 \
  --task transcribe \
  --model gemini-2.5-flash

Analyze Image:

python scripts/gemini_batch_process.py \
  --files image.jpg \
  --task analyze \
  --prompt "Describe this image" \
  --output docs/assets/<output-name>.md \
  --model gemini-2.5-flash

Process Video:

python scripts/gemini_batch_process.py \
  --files video.mp4 \
  --task analyze \
  --prompt "Summarize key points with timestamps" \
  --output docs/assets/<output-name>.md \
  --model gemini-2.5-flash

Extract from PDF:

python scripts/gemini_batch_process.py \
  --files document.pdf \
  --task extract \
  --prompt "Extract table data as JSON" \
  --output docs/assets/<output-name>.md \
  --format json

Generate Image:

python scripts/gemini_batch_process.py \
  --task generate \
  --prompt "A futuristic city at sunset" \
  --output docs/assets/<output-file-name> \
  --model gemini-2.5-flash-image \
  --aspect-ratio 16:9

Optimize Media:

# Prepare large video for processing
python scripts/media_optimizer.py \
  --input large-video.mp4 \
  --output docs/assets/<output-file-name> \
  --target-size 100MB

# Batch optimize multiple files
python scripts/media_optimizer.py \
  --input-dir ./videos \
  --output-dir docs/assets/optimized \
  --quality 85

Convert Documents to Markdown:

# Convert DOCX to Markdown
python scripts/document_converter.py \
  --input document.docx \
  --output docs/assets/document.md

# Extract pages
python scripts/document_converter.py \
  --input large.pdf \
  --output docs/assets/chapter1.md \
  --pages 1-20

Supported Formats

Audio

WAV, MP3, AAC, FLAC, OGG Vorbis, AIFF
Max 9.5 hours per request
Auto-downsampled to 16 Kbps mono

Images

PNG, JPEG, WEBP, HEIC, HEIF
Max 3,600 images per request
Resolution: both dimensions ≤384px = 258 tokens; larger images are tiled

Video

MP4, MPEG, MOV, AVI, FLV, MPG, WebM, WMV, 3GPP
Max 6 hours (low-res) or 2 hours (default)
YouTube URLs supported (public only)

Documents

PDF only for vision processing
Max 1,000 pages
TXT, HTML, Markdown supported (text-only)

Size Limits

Inline: <20MB total request
File API: 2GB per file, 20GB project quota
Retention: 48 hours auto-delete
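The size limits above suggest a simple routing rule. The sketch below is illustrative: `choose_upload_method` is a hypothetical name, and it assumes MB and GB are binary units (the source does not say). The inline limit applies to the whole request, so a file close to 20MB should still go through the File API.

```python
INLINE_LIMIT_BYTES = 20 * 1024 * 1024   # "<20MB total request" (binary MB assumed)
FILE_API_LIMIT_BYTES = 2 * 1024 ** 3    # "2GB per file" (binary GB assumed)


def choose_upload_method(file_bytes):
    """Return 'inline', 'file_api', or 'too_large' for a file of this size."""
    if file_bytes > FILE_API_LIMIT_BYTES:
        return "too_large"   # compress or split the file first
    if file_bytes < INLINE_LIMIT_BYTES:
        return "inline"
    return "file_api"


# Usage with a real file:
#   from pathlib import Path
#   method = choose_upload_method(Path("video.mp4").stat().st_size)
```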

Reference Navigation

For detailed implementation guidance, see:

Audio Processing

references/audio-processing.md - Transcription, analysis, TTS
- Timestamp handling and segment analysis
- Multi-speaker identification
- Non-speech audio analysis
- Text-to-speech generation

Image Understanding

references/vision-understanding.md - Captioning, detection, OCR
- Object detection and localization
- Pixel-level segmentation
- Visual question answering
- Multi-image comparison

Video Analysis

references/video-analysis.md - Scene detection, temporal understanding
- YouTube URL processing
- Timestamp-based queries
- Video clipping and FPS control
- Long video optimization

Document Extraction

references/document-extraction.md - PDF processing, structured output
- Table and form extraction
- Chart and diagram analysis
- JSON schema validation
- Multi-page handling

Image Generation

references/image-generation.md - Text-to-image, editing
- Prompt engineering strategies
- Image editing and composition
- Aspect ratio selection
- Safety settings

Cost Optimization

Token Costs

Model Pricing (per 1M tokens):

Gemini 2.5 Flash: $1.00/1M input, $0.10/1M output
Gemini 2.5 Pro: $3.00/1M input, $12.00/1M output
Gemini 1.5 Flash: $0.70/1M input, $0.175/1M output

Token Rates:

Audio: 32 tokens/second (1 min = 1,920 tokens)
Video: ~300 tokens/second (default) or ~100 (low-res)
PDF: 258 tokens/page (fixed)
Image: 258-1,548 tokens based on size
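As a worked example of the token rates, the sketch below estimates input tokens and cost for timed media. Function names are hypothetical, and the price is passed in as a parameter rather than hard-coded, since rates change; check current pricing before relying on a figure.

```python
# Approximate rates from the list above.
TOKENS_PER_SECOND = {
    "audio": 32,
    "video": 300,          # default resolution
    "video_low_res": 100,
}


def media_tokens(kind, seconds):
    """Approximate input tokens for `seconds` of audio or video."""
    return TOKENS_PER_SECOND[kind] * seconds


def input_cost_usd(tokens, usd_per_million):
    """Input cost for a token count at a given price per 1M tokens."""
    return tokens / 1_000_000 * usd_per_million


if __name__ == "__main__":
    tokens = media_tokens("video", 10 * 60)   # 10-minute video, default resolution
    print(tokens, round(input_cost_usd(tokens, 1.00), 4))
```

A 10-minute video at default resolution comes to about 180,000 tokens; the same clip at low resolution is about 60,000.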

TTS Pricing:

Flash TTS: $10/1M tokens
Pro TTS: $20/1M tokens

Best Practices

Use gemini-2.5-flash for most tasks (best price/performance)
Use File API for files >20MB or repeated queries
Optimize media before upload (see media_optimizer.py)
Process specific segments instead of full videos
Use lower FPS for static content
Implement context caching for repeated queries
Batch process multiple files in parallel

Rate Limits

Free Tier:

10-15 RPM (requests per minute)
1M-4M TPM (tokens per minute)
1,500 RPD (requests per day)
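To stay under a requests-per-minute cap on the client side, a sliding-window limiter is one option. This is a minimal single-threaded sketch; `RequestRateLimiter` is a hypothetical class, not part of the skill, and the injectable `clock` and `sleep` exist only to make it testable.

```python
import time
from collections import deque


class RequestRateLimiter:
    """Allow at most `max_requests` calls within any `period_seconds` window."""

    def __init__(self, max_requests, period_seconds=60.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.max_requests = max_requests
        self.period = period_seconds
        self.clock = clock
        self.sleep = sleep
        self.sent = deque()   # timestamps of recent requests, oldest first

    def acquire(self):
        """Block until one more request may be sent, then record it."""
        while True:
            now = self.clock()
            # Forget requests that have left the window.
            while self.sent and now - self.sent[0] >= self.period:
                self.sent.popleft()
            if len(self.sent) < self.max_requests:
                self.sent.append(now)
                return
            # Wait until the oldest request falls out of the window.
            self.sleep(self.period - (now - self.sent[0]))


# Usage: limiter = RequestRateLimiter(10); call limiter.acquire() before each request.
```

The class is not thread-safe; guard `acquire` with a lock if requests are issued from multiple threads.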

YouTube Limits:

Free tier: 8 hours/day
Paid tier: No length limits
Public videos only

Storage Limits:

20GB per project
2GB per file
48-hour retention

Error Handling

Common errors and solutions:

400: Invalid format/size - validate before upload
401: Invalid API key - check configuration
403: Permission denied - verify API key restrictions
404: File not found - ensure file uploaded and active
429: Rate limit exceeded - implement exponential backoff
500: Server error - retry with backoff
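The retry advice for 429 and 500 can be sketched as exponential backoff with jitter. `ApiError` here is a hypothetical stand-in exception carrying an HTTP status; adapt the `except` clause to whatever error type your client actually raises.

```python
import random
import time

RETRYABLE_STATUS = {429, 500}   # per the list above; other codes fail fast


class ApiError(Exception):
    """Hypothetical error type carrying an HTTP status code."""

    def __init__(self, status, message=""):
        super().__init__(message or f"HTTP {status}")
        self.status = status


def call_with_backoff(fn, max_attempts=5, base_delay=1.0, max_delay=30.0,
                      sleep=time.sleep):
    """Call fn(), retrying retryable errors with exponentially growing delays."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except ApiError as err:
            last_attempt = attempt == max_attempts - 1
            if err.status not in RETRYABLE_STATUS or last_attempt:
                raise
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Add up to 50% jitter so parallel workers do not retry in lockstep.
            sleep(delay + random.uniform(0, delay / 2))
```

Errors such as 400, 401, 403, and 404 are raised immediately, since retrying will not fix a bad request, key, or missing file.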

Scripts Overview

All scripts support unified API key detection and error handling:

gemini_batch_process.py: Batch process multiple media files

Supports all modalities (audio, image, video, PDF)
Progress tracking and error recovery
Output formats: JSON, Markdown, CSV
Rate limiting and retry logic
Dry-run mode

media_optimizer.py: Prepare media for Gemini API

Compress videos/audio for size limits
Resize images appropriately
Split long videos into chunks
Format conversion
Quality vs size optimization

document_converter.py: Convert documents to PDF

Convert DOCX, XLSX, PPTX to PDF
Extract page ranges
Optimize PDFs for Gemini
Extract images from PDFs
Batch conversion support

Run any script with --help for detailed usage.


Author

mrgoonie
