pdf-processing


Comprehensive PDF processing techniques for handling large files that exceed Claude Code's reading limits, including chunking strategies, text/table extraction, and OCR for scanned documents. Use when working with PDFs larger than 10-15MB or more than 30-50 pages.

Install

mkdir -p .claude/skills/pdf-processing && curl -L -o skill.zip "https://mcp.directory/api/skills/download/300" && unzip -o skill.zip -d .claude/skills/pdf-processing && rm skill.zip

Installs to .claude/skills/pdf-processing

About this skill

PDF Processing for Claude Code

Provides comprehensive techniques and utilities for processing PDF files in Claude Code, especially large files that exceed direct reading capabilities.

Overview

Claude Code can read PDF files directly using the Read tool, but has critical limitations:

  • Official limits: 32MB max file size, 100 pages max
  • Real-world limits: Much lower (10-15MB, 30-50 pages)
  • Known issue: Claude Code crashes with large PDFs, causing session termination and context loss
  • Token cost: roughly 1,500-3,000 tokens per page of text, plus extra for any embedded images

This skill provides workarounds, utilities, and best practices for handling PDFs of any size.
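The per-page token rates above translate into a quick pre-flight budget check. A minimal sketch (the helper name and default rates are illustrative, taken from the figures listed):

```python
def estimate_token_cost(page_count, tokens_per_page=(1500, 3000)):
    """Rough pre-flight estimate of context cost for reading a PDF directly."""
    low, high = tokens_per_page
    return page_count * low, page_count * high

# A 40-page PDF may consume roughly 60,000-120,000 tokens before images
low, high = estimate_token_cost(40)
```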

Quick Start

Check if PDF is Too Large for Direct Reading

import os

def is_pdf_too_large(filepath, max_mb=10):
    """Check if PDF exceeds safe processing size."""
    size_mb = os.path.getsize(filepath) / (1024 * 1024)
    return size_mb > max_mb

# Use before attempting to read
if is_pdf_too_large("document.pdf"):
    print("PDF too large - use chunking strategies")
else:
    # Safe to read directly with Claude Code
    pass

Extract Text from PDF

import fitz  # PyMuPDF - fastest option

def extract_text_fast(pdf_path):
    """Extract all text from PDF quickly."""
    doc = fitz.open(pdf_path)
    text = ""
    for page in doc:
        text += page.get_text()
    doc.close()
    return text

# Usage
text = extract_text_fast("document.pdf")

Split Large PDF into Chunks

import os
from pypdf import PdfReader, PdfWriter

def chunk_pdf(input_path, pages_per_chunk=25, output_dir="chunks"):
    """Split PDF into smaller files."""
    reader = PdfReader(input_path)
    total_pages = len(reader.pages)

    os.makedirs(output_dir, exist_ok=True)

    for i in range(0, total_pages, pages_per_chunk):
        writer = PdfWriter()
        end = min(i + pages_per_chunk, total_pages)

        for page_num in range(i, end):
            writer.add_page(reader.pages[page_num])

        output_file = f"{output_dir}/chunk_{i//pages_per_chunk:03d}_pages_{i+1}-{end}.pdf"
        with open(output_file, "wb") as output:
            writer.write(output)

        print(f"Created {output_file}")

# Usage
chunk_pdf("large_document.pdf", pages_per_chunk=30)

Extract Tables from PDF

import pdfplumber

def extract_tables(pdf_path):
    """Extract all tables from PDF with high accuracy."""
    tables = []

    with pdfplumber.open(pdf_path) as pdf:
        for page_num, page in enumerate(pdf.pages, 1):
            page_tables = page.extract_tables()
            for table_num, table in enumerate(page_tables, 1):
                tables.append({
                    'page': page_num,
                    'table_num': table_num,
                    'data': table
                })

    return tables

# Usage
tables = extract_tables("report.pdf")
for t in tables:
    print(f"Page {t['page']}, Table {t['table_num']}")
    print(t['data'])

Python Libraries

pypdf (formerly PyPDF2)

  • Best for: Basic PDF operations (split, merge, rotate)
  • Speed: Slower than alternatives
  • Install: pip install pypdf

PyMuPDF (fitz)

  • Best for: Fast text extraction, general-purpose processing
  • Speed: 10-20x faster than pypdf
  • Install: pip install PyMuPDF

pdfplumber

  • Best for: Table extraction, precise text with coordinates
  • Speed: Moderate (0.10s per page)
  • Install: pip install pdfplumber

pdf2image

  • Best for: Converting PDF pages to images
  • Requires: Poppler (system dependency)
  • Install: pip install pdf2image

pytesseract

  • Best for: OCR on scanned PDFs
  • Requires: Tesseract (system dependency)
  • Install: pip install pytesseract
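Because pdf2image and pytesseract fail at runtime when their system binaries are missing, it is worth probing for them up front. A sketch using shutil.which (the binary names pdftoppm and tesseract are the usual ones installed by Poppler and Tesseract; the helper name is illustrative):

```python
import shutil

def check_system_deps():
    """Report whether the system binaries behind pdf2image and pytesseract are on PATH."""
    return {
        "poppler": shutil.which("pdftoppm") is not None,     # used by pdf2image
        "tesseract": shutil.which("tesseract") is not None,  # used by pytesseract
    }
```

If either value is False, the OCR snippets below will fail at runtime even though the pip packages import cleanly.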

Chunking Strategies

1. Page-Based Splitting

Split PDF into fixed page batches.

When to use: Document structure is irrelevant; you need simple, predictable chunks

Optimal size: 20-30 pages per chunk (stays under 10MB typically)

# See Quick Start "Split Large PDF into Chunks"
chunk_pdf("document.pdf", pages_per_chunk=25)
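The page ranges for this strategy can be planned without opening the file at all; a small illustrative helper:

```python
def chunk_ranges(total_pages, pages_per_chunk=25):
    """Return 0-based, end-exclusive (start, end) page ranges for fixed-size chunks."""
    return [(i, min(i + pages_per_chunk, total_pages))
            for i in range(0, total_pages, pages_per_chunk)]
```

For example, chunk_ranges(60, 25) yields (0, 25), (25, 50), and a short final chunk (50, 60).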

2. Size-Based Splitting

Monitor file size and split when threshold is reached.

When to use: Avoiding crashes is critical; page count is an unreliable indicator of file size

import os
from io import BytesIO
from pypdf import PdfReader, PdfWriter

def chunk_by_size(pdf_path, max_mb=8):
    """Split PDF keeping chunks under size limit."""
    reader = PdfReader(pdf_path)
    writer = PdfWriter()
    chunk_num = 0

    for page in reader.pages:
        writer.add_page(page)

        # Check size by serializing the current chunk to memory
        # (re-serializing each page is O(n^2) but fine for moderate files)
        buffer = BytesIO()
        writer.write(buffer)
        size_mb = buffer.tell() / (1024 * 1024)

        if size_mb >= max_mb:
            # Save chunk
            output = f"chunk_{chunk_num:03d}.pdf"
            with open(output, "wb") as f:
                writer.write(f)
            chunk_num += 1
            writer = PdfWriter()  # Start new chunk

    # Don't drop the final partial chunk
    if len(writer.pages) > 0:
        with open(f"chunk_{chunk_num:03d}.pdf", "wb") as f:
            writer.write(f)

3. Overlapping Chunks

Include overlap between chunks to maintain context.

When to use: Content spans pages; losing context between chunks is problematic

Optimal overlap: 1-2 pages (or 10-20% of chunk size)

from pypdf import PdfReader, PdfWriter

def chunk_with_overlap(pdf_path, pages_per_chunk=25, overlap=2):
    """Split PDF with overlapping pages for context preservation."""
    reader = PdfReader(pdf_path)
    total_pages = len(reader.pages)

    chunk_num = 0
    start = 0

    while start < total_pages:
        writer = PdfWriter()
        end = min(start + pages_per_chunk, total_pages)

        for page_num in range(start, end):
            writer.add_page(reader.pages[page_num])

        output = f"chunk_{chunk_num:03d}_pages_{start+1}-{end}.pdf"
        with open(output, "wb") as f:
            writer.write(f)

        if end == total_pages:
            break  # Last chunk written; stepping back by overlap would loop forever
        chunk_num += 1
        start = end - overlap  # Next chunk re-includes the last `overlap` pages

4. Text Extraction First

Extract text, then chunk the text instead of PDF.

When to use: You only need text content, not layout/images

Advantage: Much smaller, faster to process, no crashes

def extract_and_chunk_text(pdf_path, chars_per_chunk=10000):
    """Extract text and split into manageable chunks."""
    import fitz

    doc = fitz.open(pdf_path)
    full_text = ""

    for page in doc:
        full_text += f"\n\n--- Page {page.number + 1} ---\n\n"
        full_text += page.get_text()

    doc.close()

    # Split text into chunks
    chunks = []
    for i in range(0, len(full_text), chars_per_chunk):
        chunks.append(full_text[i:i + chars_per_chunk])

    return chunks

# Usage
text_chunks = extract_and_chunk_text("large.pdf")
for i, chunk in enumerate(text_chunks):
    with open(f"text_chunk_{i:03d}.txt", "w", encoding="utf-8") as f:
        f.write(chunk)
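The same overlap idea from strategy 3 applies to text chunks; a sketch of a character-level variant (the helper name and default overlap are illustrative):

```python
def chunk_text_with_overlap(text, chars_per_chunk=10000, overlap=500):
    """Split text so each chunk repeats the last `overlap` chars of the previous one."""
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chars_per_chunk, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break  # done; stepping back would re-emit the tail forever
        start = end - overlap
    return chunks
```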

Handling Different PDF Types

Text-Based PDFs (Native Text)

PDFs created digitally with searchable text.

Detection:

import fitz

doc = fitz.open("document.pdf")
text = doc[0].get_text()  # First page

if len(text.strip()) > 50:
    print("Text-based PDF")
else:
    print("Likely scanned PDF")

Best approach: Direct text extraction with PyMuPDF or pdfplumber
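The first-page check above can misclassify mixed documents. A sketch extending the same 50-character heuristic to every page (classify_pages and classify_pdf are illustrative helpers):

```python
def classify_pages(page_text_lengths, min_chars=50):
    """Classify a document as 'text', 'scanned', or 'mixed' from per-page text lengths."""
    text_pages = sum(1 for n in page_text_lengths if n > min_chars)
    if text_pages == len(page_text_lengths):
        return "text"
    if text_pages == 0:
        return "scanned"
    return "mixed"

def classify_pdf(pdf_path):
    """Apply the heuristic to every page, not just the first."""
    import fitz  # PyMuPDF
    with fitz.open(pdf_path) as doc:
        return classify_pages([len(page.get_text().strip()) for page in doc])
```

A "mixed" result means the page-by-page approach in the Mixed PDFs section below is needed.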

Scanned PDFs (Images of Text)

PDFs created by scanning physical documents.

Requires: OCR (Optical Character Recognition)

Approach:

from pdf2image import convert_from_path
import pytesseract

def ocr_pdf(pdf_path):
    """Extract text from scanned PDF using OCR."""
    # Convert to images
    images = convert_from_path(pdf_path, dpi=300)

    # OCR each page
    text = ""
    for i, image in enumerate(images, 1):
        page_text = pytesseract.image_to_string(image)
        text += f"\n\n--- Page {i} ---\n\n{page_text}"

    return text

Performance note: OCR is much slower than direct text extraction

Mixed PDFs

Some pages have text, others are scanned.

Approach: Detect page-by-page and use appropriate method

def extract_mixed_pdf(pdf_path):
    """Handle PDFs with both text and scanned pages."""
    import fitz
    from pdf2image import convert_from_path
    import pytesseract

    doc = fitz.open(pdf_path)
    full_text = ""

    for page_num, page in enumerate(doc):
        text = page.get_text()

        if len(text.strip()) > 50:
            # Has text - use direct extraction
            full_text += f"\n\n--- Page {page_num + 1} (text) ---\n\n{text}"
        else:
            # Likely scanned - use OCR
            images = convert_from_path(pdf_path, first_page=page_num+1, last_page=page_num+1, dpi=300)
            ocr_text = pytesseract.image_to_string(images[0])
            full_text += f"\n\n--- Page {page_num + 1} (OCR) ---\n\n{ocr_text}"

    doc.close()
    return full_text

Helper Scripts

This skill includes pre-built scripts in the scripts/ directory:

  • chunk_pdf.py: Flexible PDF chunking with multiple strategies
  • extract_text.py: Unified text extraction (handles text-based and OCR)
  • extract_tables.py: Advanced table extraction with formatting
  • process_large_pdf.py: Orchestrate complete large PDF processing workflow

Using Helper Scripts

# Chunk a large PDF
python .claude/skills/pdf-processing/scripts/chunk_pdf.py large_doc.pdf --pages 30 --overlap 2

# Extract all text
python .claude/skills/pdf-processing/scripts/extract_text.py document.pdf --output text.txt

# Extract tables to CSV
python .claude/skills/pdf-processing/scripts/extract_tables.py report.pdf --output tables/

# Process large PDF end-to-end
python .claude/skills/pdf-processing/scripts/process_large_pdf.py huge_doc.pdf --strategy chunk --output processed/

Error Handling

Preventing Crashes

Key principle: Never rely on file size alone - check both size and page count before reading

def safe_pdf_read(pdf_path, max_pages=30, max_mb=10):
    """Safely check if PDF can be read directly."""
    import os
    import fitz

    # Check file size first - cheap, and catches most problem files
    size_mb = os.path.getsize(pdf_path) / (1024 * 1024)
    if size_mb > max_mb:
        return False

    # Then check page count
    with fitz.open(pdf_path) as doc:
        return doc.page_count <= max_pages

---

*Content truncated.*
