pdf-processing
Comprehensive PDF processing techniques for handling large files that exceed Claude Code's reading limits, including chunking strategies, text/table extraction, and OCR for scanned documents. Use when working with PDFs larger than 10-15MB or more than 30-50 pages.
Install
mkdir -p .claude/skills/pdf-processing && curl -L -o skill.zip "https://mcp.directory/api/skills/download/300" && unzip -o skill.zip -d .claude/skills/pdf-processing && rm skill.zip
Installs to .claude/skills/pdf-processing
About this skill
PDF Processing for Claude Code
Provides comprehensive techniques and utilities for processing PDF files in Claude Code, especially large files that exceed direct reading capabilities.
Overview
Claude Code can read PDF files directly using the Read tool, but has critical limitations:
- Official limits: 32MB max file size, 100 pages max
- Real-world limits: Much lower (10-15MB, 30-50 pages)
- Known issue: Claude Code crashes with large PDFs, causing session termination and context loss
- Token cost: 1,500-3,000 tokens per page for text + additional for images
This skill provides workarounds, utilities, and best practices for handling PDFs of any size.
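The per-page token cost above makes it easy to budget before reading. A minimal sketch (the function name and the 1,500-3,000 range applied per page are taken from the figures quoted above; images add more on top):

```python
def estimate_token_cost(pages, tokens_per_page_low=1500, tokens_per_page_high=3000):
    """Rough token budget for reading a PDF directly (text only; images cost extra)."""
    return pages * tokens_per_page_low, pages * tokens_per_page_high

# A 50-page PDF could consume roughly 75,000-150,000 tokens before images
low, high = estimate_token_cost(50)
print(f"Estimated: {low:,}-{high:,} tokens")
```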
Quick Start
Check if PDF is Too Large for Direct Reading
import os

def is_pdf_too_large(filepath, max_mb=10):
    """Check if PDF exceeds safe processing size."""
    size_mb = os.path.getsize(filepath) / (1024 * 1024)
    return size_mb > max_mb

# Use before attempting to read
if is_pdf_too_large("document.pdf"):
    print("PDF too large - use chunking strategies")
else:
    # Safe to read directly with Claude Code
    pass
Extract Text from PDF
import fitz  # PyMuPDF - fastest option

def extract_text_fast(pdf_path):
    """Extract all text from PDF quickly."""
    doc = fitz.open(pdf_path)
    text = ""
    for page in doc:
        text += page.get_text()
    doc.close()
    return text

# Usage
text = extract_text_fast("document.pdf")
Split Large PDF into Chunks
import os
from pypdf import PdfReader, PdfWriter

def chunk_pdf(input_path, pages_per_chunk=25, output_dir="chunks"):
    """Split PDF into smaller files."""
    reader = PdfReader(input_path)
    total_pages = len(reader.pages)
    os.makedirs(output_dir, exist_ok=True)
    for i in range(0, total_pages, pages_per_chunk):
        writer = PdfWriter()
        end = min(i + pages_per_chunk, total_pages)
        for page_num in range(i, end):
            writer.add_page(reader.pages[page_num])
        output_file = f"{output_dir}/chunk_{i//pages_per_chunk:03d}_pages_{i+1}-{end}.pdf"
        with open(output_file, "wb") as output:
            writer.write(output)
        print(f"Created {output_file}")

# Usage
chunk_pdf("large_document.pdf", pages_per_chunk=30)
Extract Tables from PDF
import pdfplumber

def extract_tables(pdf_path):
    """Extract all tables from PDF with high accuracy."""
    tables = []
    with pdfplumber.open(pdf_path) as pdf:
        for page_num, page in enumerate(pdf.pages, 1):
            page_tables = page.extract_tables()
            for table_num, table in enumerate(page_tables, 1):
                tables.append({
                    'page': page_num,
                    'table_num': table_num,
                    'data': table
                })
    return tables

# Usage
tables = extract_tables("report.pdf")
for t in tables:
    print(f"Page {t['page']}, Table {t['table_num']}")
    print(t['data'])
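The `tables` list above holds each table as a list of rows. A hedged helper for persisting them (the `tables_to_csv` name and file-naming scheme are illustrative, not part of the skill's scripts); it uses only the standard library and handles the `None` cells pdfplumber can emit for empty positions:

```python
import csv
import os

def tables_to_csv(tables, output_dir="tables"):
    """Write each extracted table (the structure returned above) to its own CSV file."""
    os.makedirs(output_dir, exist_ok=True)
    paths = []
    for t in tables:
        path = os.path.join(output_dir, f"page{t['page']}_table{t['table_num']}.csv")
        with open(path, "w", newline="", encoding="utf-8") as f:
            writer = csv.writer(f)
            for row in t['data']:
                # pdfplumber uses None for empty cells - write them as ""
                writer.writerow(["" if cell is None else cell for cell in row])
        paths.append(path)
    return paths
```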
Python Libraries
pypdf (formerly PyPDF2)
- Best for: Basic PDF operations (split, merge, rotate)
- Speed: Slower than alternatives
- Install:
pip install pypdf
PyMuPDF (fitz)
- Best for: Fast text extraction, general-purpose processing
- Speed: 10-20x faster than pypdf
- Install:
pip install PyMuPDF
pdfplumber
- Best for: Table extraction, precise text with coordinates
- Speed: Moderate (0.10s per page)
- Install:
pip install pdfplumber
pdf2image
- Best for: Converting PDF pages to images
- Requires: Poppler (system dependency)
- Install:
pip install pdf2image
pytesseract
- Best for: OCR on scanned PDFs
- Requires: Tesseract (system dependency)
- Install:
pip install pytesseract
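Since pdf2image and pytesseract fail at runtime when their system binaries (Poppler's `pdftoppm`, Tesseract's `tesseract`) are missing, it can help to verify everything up front. A sketch using only the standard library (the function name is illustrative; the module names checked are the import names of the packages listed above):

```python
import importlib.util
import shutil

def check_pdf_dependencies():
    """Report which PDF libraries and system binaries are available."""
    status = {}
    # Python packages (note: some import names differ from the pip package name)
    for pip_name, module in [("pypdf", "pypdf"), ("PyMuPDF", "fitz"),
                             ("pdfplumber", "pdfplumber"), ("pdf2image", "pdf2image"),
                             ("pytesseract", "pytesseract")]:
        status[pip_name] = importlib.util.find_spec(module) is not None
    # System binaries required by pdf2image (Poppler) and pytesseract (Tesseract)
    status["poppler (pdftoppm)"] = shutil.which("pdftoppm") is not None
    status["tesseract"] = shutil.which("tesseract") is not None
    return status

for name, ok in check_pdf_dependencies().items():
    print(f"{'OK     ' if ok else 'MISSING'} {name}")
```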
Chunking Strategies
1. Page-Based Splitting
Split PDF into fixed page batches.
When to use: Document structure is irrelevant; you need simple, predictable chunks
Optimal size: 20-30 pages per chunk (stays under 10MB typically)
# See Quick Start "Split Large PDF into Chunks"
chunk_pdf("document.pdf", pages_per_chunk=25)
2. Size-Based Splitting
Monitor file size and split when threshold is reached.
When to use: Avoiding crashes is critical; page count is unreliable indicator
from io import BytesIO
from pypdf import PdfReader, PdfWriter

def chunk_by_size(pdf_path, max_mb=8):
    """Split PDF keeping chunks under size limit."""
    reader = PdfReader(pdf_path)
    writer = PdfWriter()
    chunk_num = 0
    for page in reader.pages:
        writer.add_page(page)
        # Check size by writing to an in-memory buffer
        buffer = BytesIO()
        writer.write(buffer)
        size_mb = buffer.tell() / (1024 * 1024)
        if size_mb >= max_mb:
            # Save chunk
            output = f"chunk_{chunk_num:03d}.pdf"
            with open(output, "wb") as f:
                writer.write(f)
            chunk_num += 1
            writer = PdfWriter()  # Start new chunk
    # Save any remaining pages as the final chunk
    if len(writer.pages) > 0:
        with open(f"chunk_{chunk_num:03d}.pdf", "wb") as f:
            writer.write(f)
3. Overlapping Chunks
Include overlap between chunks to maintain context.
When to use: Content spans pages; losing context between chunks is problematic
Optimal overlap: 1-2 pages (or 10-20% of chunk size)
from pypdf import PdfReader, PdfWriter

def chunk_with_overlap(pdf_path, pages_per_chunk=25, overlap=2):
    """Split PDF with overlapping pages for context preservation."""
    reader = PdfReader(pdf_path)
    total_pages = len(reader.pages)
    chunk_num = 0
    start = 0
    while start < total_pages:
        writer = PdfWriter()
        end = min(start + pages_per_chunk, total_pages)
        for page_num in range(start, end):
            writer.add_page(reader.pages[page_num])
        output = f"chunk_{chunk_num:03d}_pages_{start+1}-{end}.pdf"
        with open(output, "wb") as f:
            writer.write(f)
        chunk_num += 1
        if end == total_pages:
            break  # Last chunk written; stepping back by overlap would loop forever
        start = end - overlap  # Move forward with overlap
4. Text Extraction First
Extract text, then chunk the text instead of PDF.
When to use: You only need text content, not layout/images
Advantage: Much smaller, faster to process, no crashes
import fitz

def extract_and_chunk_text(pdf_path, chars_per_chunk=10000):
    """Extract text and split into manageable chunks."""
    doc = fitz.open(pdf_path)
    full_text = ""
    for page in doc:
        full_text += f"\n\n--- Page {page.number + 1} ---\n\n"
        full_text += page.get_text()
    doc.close()
    # Split text into chunks
    chunks = []
    for i in range(0, len(full_text), chars_per_chunk):
        chunks.append(full_text[i:i + chars_per_chunk])
    return chunks

# Usage
text_chunks = extract_and_chunk_text("large.pdf")
for i, chunk in enumerate(text_chunks):
    with open(f"text_chunk_{i:03d}.txt", "w", encoding="utf-8") as f:
        f.write(chunk)
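Fixed character windows can cut a sentence or table mid-way. If that matters, the overlap idea from strategy 3 applies to extracted text as well; a minimal sketch (the function name is illustrative):

```python
def chunk_text_with_overlap(text, chars_per_chunk=10000, overlap=500):
    """Split text into chunks that share `overlap` trailing characters of context."""
    if overlap >= chars_per_chunk:
        raise ValueError("overlap must be smaller than chars_per_chunk")
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chars_per_chunk, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break  # Final chunk; stop instead of stepping back past the end
        start = end - overlap
    return chunks
```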
Handling Different PDF Types
Text-Based PDFs (Native Text)
PDFs created digitally with searchable text.
Detection:
import fitz

doc = fitz.open("document.pdf")
text = doc[0].get_text()  # First page
doc.close()
if len(text.strip()) > 50:
    print("Text-based PDF")
else:
    print("Likely scanned PDF")
Best approach: Direct text extraction with PyMuPDF or pdfplumber
Scanned PDFs (Images of Text)
PDFs created by scanning physical documents.
Requires: OCR (Optical Character Recognition)
Approach:
from pdf2image import convert_from_path
import pytesseract

def ocr_pdf(pdf_path):
    """Extract text from scanned PDF using OCR."""
    # Convert to images
    images = convert_from_path(pdf_path, dpi=300)
    # OCR each page
    text = ""
    for i, image in enumerate(images, 1):
        page_text = pytesseract.image_to_string(image)
        text += f"\n\n--- Page {i} ---\n\n{page_text}"
    return text
Performance note: OCR is much slower than direct text extraction
Mixed PDFs
Some pages have text, others are scanned.
Approach: Detect page-by-page and use appropriate method
import fitz
from pdf2image import convert_from_path
import pytesseract

def extract_mixed_pdf(pdf_path):
    """Handle PDFs with both text and scanned pages."""
    doc = fitz.open(pdf_path)
    full_text = ""
    for page_num, page in enumerate(doc):
        text = page.get_text()
        if len(text.strip()) > 50:
            # Has text - use direct extraction
            full_text += f"\n\n--- Page {page_num + 1} (text) ---\n\n{text}"
        else:
            # Likely scanned - use OCR
            images = convert_from_path(pdf_path, first_page=page_num + 1, last_page=page_num + 1, dpi=300)
            ocr_text = pytesseract.image_to_string(images[0])
            full_text += f"\n\n--- Page {page_num + 1} (OCR) ---\n\n{ocr_text}"
    doc.close()
    return full_text
Helper Scripts
This skill includes pre-built scripts in the scripts/ directory:
- chunk_pdf.py: Flexible PDF chunking with multiple strategies
- extract_text.py: Unified text extraction (handles text-based and OCR)
- extract_tables.py: Advanced table extraction with formatting
- process_large_pdf.py: Orchestrate complete large PDF processing workflow
Using Helper Scripts
# Chunk a large PDF
python .claude/skills/pdf-processing/scripts/chunk_pdf.py large_doc.pdf --pages 30 --overlap 2
# Extract all text
python .claude/skills/pdf-processing/scripts/extract_text.py document.pdf --output text.txt
# Extract tables to CSV
python .claude/skills/pdf-processing/scripts/extract_tables.py report.pdf --output tables/
# Process large PDF end-to-end
python .claude/skills/pdf-processing/scripts/process_large_pdf.py huge_doc.pdf --strategy chunk --output processed/
Error Handling
Preventing Crashes
Key principle: Never trust PDF size alone - always check before reading
import os
import fitz

def safe_pdf_read(pdf_path, max_pages=30, max_mb=10):
    """Safely check if PDF can be read directly."""
    # Check file size and page count against both limits
    size_mb = os.path.getsize(pdf_path) / (1024 * 1024)
    doc = fitz.open(pdf_path)
    page_count = len(doc)
    doc.close()
    return size_mb <= max_mb and page_count <= max_pages
---
*Content truncated.*