Claude pdf-to-markdown skill: 10 PDF-to-MD pipelines that just work
Ten real PDF-to-Markdown pipelines — single-PDF batch convert, multi-column academic paper, embedded-image export, RAG ingestion, annotated-PDF blockquotes, scanned-OCR fallback, bulk directory walk, bibliography slicing, table-to-CSV pivot, and PDF-version diff — each as a single Claude prompt with the exact Python script it produces.
Already on the page because you searched pdf to markdown skill or claude pdf to markdown? You’re in the right place. The pdf-to-markdown skill wraps two open-source extractors (pymupdf4llm fast-mode, docling accurate-mode) with content-hash caching and image export. The cookbook below is what to do once it’s installed.
Already know what skills are? Skip to the cookbook. First time? Read the explainer then come back. Need the install? It’s on the /skills/pdf-to-markdown page.

On this page · 21 sections
- What this skill does
- The cookbook
- Install + README
- Watch it built
- 01 · Single-PDF batch convert with table preservation
- 02 · Multi-column academic paper (preserve reading order)
- 03 · PDF with embedded images (export to /assets, link from markdown)
- 04 · RAG pipeline: PDF -> chunks -> vector DB
- 05 · Annotated/highlighted PDF -> markdown blockquotes
- 06 · OCR-fallback for scanned PDFs (tesseract integration)
- 07 · Bulk-convert a directory of PDFs (glob pattern)
- 08 · Extract just the bibliography from a paper
- 09 · PDF tables -> markdown tables -> CSV pivot
- 10 · Diff two PDF versions (markdown diff for visibility)
- Community signal
- The contrarian take
- Real pipelines shipped
- Gotchas
- Pairs well with
- FAQ
- Sources
What this skill actually does
Sixty seconds of context before the cookbook — what the pdf-to-markdown skill is, what Claude returns when you invoke it, and the one thing it does NOT do for you.
“Convert entire PDF documents to clean, structured Markdown for full context loading.”
— aliceisjustplaying, the skill author · /skills/pdf-to-markdown
What Claude returns
When triggered, Claude calls a Python venv at `~/.claude/skills/pdf-to-markdown/.venv/` that runs `pymupdf4llm.to_markdown()` (fast mode) or `docling.DocumentConverter` (accurate mode) against your PDF. It returns a `.md` file with headers detected via font size, real Markdown tables, ordered/unordered lists, multi-column reading order preserved, code blocks, and an `images/` sibling directory holding every extracted figure, with relative paths inserted inline. Aggressive content-hash caching at `~/.cache/pdf-to-markdown/` skips re-processing identical PDFs across runs.
What it does NOT do
It does not install the venv or its dependencies for you — you run `uv venv .venv` and `uv pip install pymupdf pymupdf4llm` (or docling) once before triggering. It also does not OCR scanned PDFs by default; pair it with use case 6 for that path.
How you trigger it
- Convert this PDF to Markdown so I can load the whole thing.
- Bring the entire paper into context — extract it as markdown.
- Read this scanned PDF and turn it into clean .md with the tables intact.
Cost when idle
~110 tokens at idle (the skill name + description in the system prompt). The conversion script and venv invocation only run when you trigger them.
The cookbook
Each entry below is a pipeline you could ship today. They run roughly in order of complexity — the early ones convert one PDF, the middle ones lean on the skill’s structure-preservation features (tables, columns, images), and the later ones compose the skill with OCR, glob walks, and downstream tooling. Every entry pairs with one or two skills or MCP servers you already have on mcp.directory.
Install + README
If the skill isn’t on your machine yet, here’s the one-liner. The full install panel (Codex, Copilot, Antigravity variants) is on the skill page — the same UI is embedded below.
One-line install · by aliceisjustplaying
mkdir -p .claude/skills/pdf-to-markdown && curl -L -o skill.zip "https://mcp.directory/api/skills/download/319" && unzip -o skill.zip -d .claude/skills/pdf-to-markdown && rm skill.zip
Installs to .claude/skills/pdf-to-markdown
Watch it get built
The official PyMuPDF tutorial walks through pymupdf4llm.to_markdown(), image extraction, and the LlamaIndex hand-off. The skill's fast-mode default is the same library this video covers — worth watching before the cookbook because it shows the library's contract before you read the prompts.
Single-PDF batch convert with table preservation
Drop one PDF on the prompt, get a clean .md plus an /images directory next to it. Tables come out as real Markdown tables, not as paragraph mush.
For: Solo devs converting one report at a time before shipping it into a wiki or RAG store.
The prompt
Convert `papers/q1-report.pdf` to Markdown using the pdf-to-markdown skill. Use the default fast mode (pymupdf4llm). Write the output to `out/q1-report.md` and place any extracted images in `out/q1-report.images/` with relative paths in the markdown. If a table fails to render as a Markdown table, retry that page in accurate mode (docling) and merge it back in. Confirm tables render in Markdown preview before returning.
What the script looks like
import pymupdf4llm
from pathlib import Path

src = Path('papers/q1-report.pdf')
out_md = Path('out/q1-report.md')
out_imgs = Path('out/q1-report.images')
out_imgs.mkdir(parents=True, exist_ok=True)
md = pymupdf4llm.to_markdown(
    str(src),
    write_images=True,
    image_path=str(out_imgs),
    image_format='png',
)
out_md.write_text(md, encoding='utf-8')
One-line tweak
Pass `pages=[0,1,2]` to `to_markdown` to convert just the first three pages — useful when you only need the executive summary for a quick AI chat.
Multi-column academic paper (preserve reading order)
Two-column ICML or arXiv papers where naive extractors interleave columns. The skill detects the layout and walks each column top-to-bottom before moving across.
For: Researchers and grad students feeding papers into a notebook or NotebookLM.
The prompt
Convert `papers/attention-is-all-you-need.pdf` to Markdown. The PDF is a NeurIPS-style two-column layout with footnotes. Use pdf-to-markdown's accurate mode (docling) so the column reading order is preserved and footnotes land at the end of each section, not mid-paragraph. Strip the running header and page numbers. Save to `out/attention.md` and print the first H2 to stdout so I can sanity-check the section boundaries.
What the script looks like
from docling.document_converter import DocumentConverter
from pathlib import Path

src = 'papers/attention-is-all-you-need.pdf'
conv = DocumentConverter()
result = conv.convert(src)
md = result.document.export_to_markdown()
# Strip the running header and page numbers (heuristic).
lines = [l for l in md.splitlines()
         if not l.strip().startswith('Page ')
         and l.strip() != 'Attention Is All You Need']
Path('out/attention.md').write_text('\n'.join(lines), encoding='utf-8')
One-line tweak
Swap `export_to_markdown()` for `export_to_dict()` to get a structured DoclingDocument JSON when you want to reason over sections programmatically.
PDF with embedded images (export to /assets, link from markdown)
Reports that lean on figures and screenshots. The skill writes each image into a sibling directory and inserts the relative `` reference inline so the markdown renders end-to-end on GitHub or Notion.
For: Technical writers porting investor decks or product PRDs into a wiki.
The prompt
Convert `decks/series-b-update.pdf` to Markdown. Extract every embedded image at full resolution (no downscaling), cache them under `out/series-b-update.images/`, and embed each one inline with a stable filename (`fig-{page}-{idx}.png`). Add a one-line caption above each image based on the surrounding text. Verify the markdown renders on GitHub by previewing the first three image references.
What the script looks like
import pymupdf4llm
from pathlib import Path

md_text = pymupdf4llm.to_markdown(
    'decks/series-b-update.pdf',
    write_images=True,
    image_path='out/series-b-update.images',
    image_format='png',
    dpi=200,
    embed_images=False,  # write to disk, reference by relative path
)
# Stable filenames: pymupdf4llm writes 'page-N-image-K.png'.
Path('out/series-b-update.md').write_text(md_text, encoding='utf-8')
One-line tweak
Set `embed_images=True` to inline the images as base64 data URIs — useful when you need a single self-contained .md to email to someone.
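For reference, the self-contained variant just swaps the relative path for a base64 data URI. A minimal sketch of what that reference looks like, built by hand with the stdlib (the 8-byte payload here is illustrative — a real pipeline passes actual PNG bytes):

```python
import base64

def data_uri_ref(png_bytes: bytes) -> str:
    """Build a markdown image reference with an inline base64 data URI."""
    b64 = base64.b64encode(png_bytes).decode('ascii')
    return f'![figure](data:image/png;base64,{b64})'

# Illustrative payload: the PNG magic bytes, not a renderable image.
ref = data_uri_ref(b'\x89PNG\r\n\x1a\n')
```

The trade-off is file size: base64 inflates each image by roughly a third, and some renderers cap data-URI length, so prefer on-disk images for anything but small decks.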
RAG pipeline: PDF -> chunks -> vector DB
End-to-end: convert the PDF, split the markdown into semantic chunks at heading boundaries, embed, and write to a local vector store. The markdown intermediate is the RAG-friendly representation.
For: Engineers building retrieval over a corpus of PDFs.
The prompt
Build a RAG ingestion script. Walk `corpus/*.pdf`, convert each via pdf-to-markdown (fast mode), split the markdown at every H2 boundary into chunks (each chunk includes its parent H1 as context), and write to a local Chroma collection named `papers`. Use OpenAI text-embedding-3-small. Skip files where the cache key matches a previous run. Print one line per file: `<name> <chunks> <ms>`.
What the script looks like
import pymupdf4llm, re, hashlib, time, chromadb
from pathlib import Path

client = chromadb.PersistentClient('./chroma')
col = client.get_or_create_collection('papers')
for pdf in Path('corpus').glob('*.pdf'):
    t = time.time()
    md = pymupdf4llm.to_markdown(str(pdf))
    chunks = re.split(r'(?m)^## ', md)
    ids = [hashlib.md5(f'{pdf.stem}-{i}'.encode()).hexdigest() for i in range(len(chunks))]
    col.upsert(ids=ids, documents=chunks,
               metadatas=[{'source': pdf.name, 'chunk': i} for i in range(len(chunks))])
    print(f'{pdf.name} {len(chunks)} {int((time.time() - t) * 1000)}ms')
One-line tweak
Replace the `re.split` with a sliding-window splitter (700-token windows, 80-token overlap) when papers have flat structure and few H2s.
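A minimal sliding-window splitter of that shape, approximating tokens by whitespace-delimited words (an assumption for brevity — a real pipeline would count tokens with a tokenizer like tiktoken):

```python
def sliding_chunks(text: str, window: int = 700, overlap: int = 80) -> list[str]:
    """Split text into overlapping word windows; each window starts
    window - overlap words after the previous one."""
    words = text.split()
    step = window - overlap
    chunks = []
    for start in range(0, max(len(words), 1), step):
        chunk = words[start:start + window]
        if chunk:
            chunks.append(' '.join(chunk))
        if start + window >= len(words):
            break
    return chunks
```

Drop it in place of the `re.split` line and embed `chunks` exactly as before — the overlap means a sentence straddling a boundary still appears whole in at least one chunk.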
Annotated/highlighted PDF -> markdown blockquotes
Take a PDF marked up with yellow highlights and stickies. The skill extracts the highlighted spans as Markdown blockquotes with a `> [!NOTE]` callout, preserving the surrounding paragraph as context.
For: Lawyers, analysts, and anyone who annotates PDFs and needs the highlights as a clean review doc.
The prompt
Open `contracts/msa-redlined.pdf`. Use the pdf-to-markdown skill (fast mode), then post-process: walk every highlight annotation in the PDF (PyMuPDF exposes `page.annots()` with `type[0] == 8` for highlight), capture the highlighted text plus the sentence it sits inside, and emit each as a Markdown blockquote with a `> [!NOTE]` callout block. Save to `out/msa-redlined-highlights.md`.
What the script looks like
import pymupdf
from pathlib import Path

doc = pymupdf.open('contracts/msa-redlined.pdf')
out = []
for page_num, page in enumerate(doc, 1):
    for annot in page.annots() or []:
        if annot.type[0] == 8:  # highlight
            # Approximate the highlighted span by the annotation's bounding box.
            text = page.get_textbox(annot.rect)
            out.append(f'> [!NOTE] page {page_num}\n> {text.strip()}\n')
Path('out/msa-redlined-highlights.md').write_text('\n'.join(out), encoding='utf-8')
One-line tweak
Filter annotations by author (`annot.info['title']`) so each reviewer's highlights end up in their own .md file — useful for diffing redlines across legal counsel.
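The per-reviewer split reduces to grouping (author, text) pairs before writing one file per reviewer. A sketch with hypothetical tuples standing in for the real `(annot.info['title'], page.get_textbox(annot.rect))` pairs:

```python
from collections import defaultdict

def group_by_author(annots: list[tuple[str, str]]) -> dict[str, list[str]]:
    """Map reviewer name -> that reviewer's highlights, in document order."""
    grouped = defaultdict(list)
    for author, text in annots:
        grouped[author].append(text)
    return dict(grouped)

# Hypothetical extracted pairs; the real loop yields these from page.annots().
groups = group_by_author([
    ('alice', 'clause 4.2'),
    ('bob', 'term sheet'),
    ('alice', 'clause 9.1'),
])
```

Each key then becomes `out/msa-redlined-highlights.{author}.md`, which makes the per-counsel diff a plain file diff.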
OCR-fallback for scanned PDFs (tesseract integration)
PDFs that are images of pages, not text. The skill detects the missing text layer, runs Tesseract on each page, and produces markdown — slower than the native path but it actually works.
For: Teams ingesting old scanned reports, court filings, or 1990s-era manuals.
The prompt
Convert `archive/1998-financial-report.pdf` to Markdown. The file is scanned (no text layer). Detect that automatically by checking page 1 with `page.get_text()` — if it returns less than 50 chars, fall back to OCR via pytesseract. Use `lang='eng'`. Cache OCR results under `~/.cache/pdf-to-markdown/` so re-runs skip re-OCR. Save the markdown to `out/1998-financial-report.md` and warn me if OCR confidence dips below 60%.
What the script looks like
import pymupdf, pytesseract
from PIL import Image
from io import BytesIO
from pathlib import Path

doc = pymupdf.open('archive/1998-financial-report.pdf')
md = []
for i, page in enumerate(doc):
    text = page.get_text().strip()
    if len(text) < 50:
        pix = page.get_pixmap(dpi=300)
        img = Image.open(BytesIO(pix.tobytes('png')))
        text = pytesseract.image_to_string(img, lang='eng')
    md.append(f'## Page {i+1}\n\n{text}\n')
Path('out/1998-financial-report.md').write_text('\n'.join(md), encoding='utf-8')
One-line tweak
Swap `lang='eng'` for `lang='eng+jpn+chi_sim'` to OCR multilingual scans — install the language packs once with `brew install tesseract-lang`.
Bulk-convert a directory of PDFs (glob pattern)
Point the skill at a folder, get a parallel folder of .md files. Skips any PDF whose cache key matches the previous run so re-running on a 500-PDF directory only touches the new ones.
For: Anyone doing one-time corpus migration: archive cleanup, vendor doc dumps, regulator filings.
The prompt
Walk `vendor-docs/**/*.pdf` recursively. For each PDF, convert to Markdown using pdf-to-markdown (fast mode) and write to `markdown-out/<same relative path>.md`. Use a content hash as the cache key so re-runs only re-process files whose bytes changed. Run 4 conversions in parallel via `concurrent.futures.ProcessPoolExecutor`. Print a final summary: total PDFs, converted, skipped (cache hit), failed.
What the script looks like
import pymupdf4llm, hashlib
from pathlib import Path
from concurrent.futures import ProcessPoolExecutor, as_completed

CACHE = Path.home() / '.cache/pdf-to-markdown'
CACHE.mkdir(parents=True, exist_ok=True)

def convert(pdf):
    h = hashlib.sha1(pdf.read_bytes()).hexdigest()[:16]
    cache_hit = CACHE / f'{h}.md'
    if cache_hit.exists():
        return (pdf, 'skip')
    md = pymupdf4llm.to_markdown(str(pdf))
    cache_hit.write_text(md, encoding='utf-8')
    out = Path('markdown-out') / pdf.relative_to('vendor-docs').with_suffix('.md')
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(md, encoding='utf-8')
    return (pdf, 'ok')

if __name__ == '__main__':  # required: worker processes re-import this module
    pdfs = list(Path('vendor-docs').rglob('*.pdf'))
    with ProcessPoolExecutor(max_workers=4) as ex:
        for f in as_completed([ex.submit(convert, p) for p in pdfs]):
            print(*f.result())
One-line tweak
Lower `max_workers=4` to `2` if you're on an M-series Mac — pymupdf4llm is already multi-threaded internally and over-spawning competes with itself.
Extract just the bibliography from a paper
Skip the body, capture only the references. The skill finds the 'References' or 'Bibliography' section and emits a clean numbered list ready to import into Zotero or BibTeX.
For: Researchers building reading lists from a single paper's references.
The prompt
Open `papers/transformer-survey.pdf`. Convert it to Markdown with pdf-to-markdown, then slice out only the section starting from the heading 'References' (case-insensitive) through end-of-document. Normalize each entry to a single line, strip line-wrap hyphens, and write to `out/transformer-survey.refs.md` as a numbered list.
What the script looks like
import pymupdf4llm, re
from pathlib import Path

md = pymupdf4llm.to_markdown('papers/transformer-survey.pdf')
m = re.search(r'(?im)^#+\s*references\s*$', md)
if not m:
    raise SystemExit('No References section found')
refs = md[m.end():]
# Collapse line wraps and strip soft hyphens.
refs = re.sub(r'-\n', '', refs)
refs = re.sub(r'\n(?!\[)', ' ', refs)
entries = re.findall(r'\[\d+\][^\[]+', refs)
out = '\n'.join(f'{i+1}. {e.strip()}' for i, e in enumerate(entries))
Path('out/transformer-survey.refs.md').write_text(out, encoding='utf-8')
One-line tweak
Pipe the cleaned refs into `anystyle` (Ruby) to get proper BibTeX out: `cat refs.md | anystyle parse --format bib > refs.bib`.
PDF tables -> markdown tables -> CSV pivot
Earnings reports and regulator filings hide all the value in tables. The skill extracts each table as Markdown and lets you pivot the same data into a CSV in one follow-up step.
For: Analysts who want the numbers, not the prose.
The prompt
Open `filings/10K-2025.pdf`. Use pdf-to-markdown's accurate mode to extract every table on pages 30-60. For each table, write a Markdown table to `out/10K-tables.md` AND a parallel CSV under `out/csv/table-{page}-{idx}.csv`. Add a one-line index at the top of `10K-tables.md` linking each table to its CSV. Skip any table with fewer than 2 columns or 2 rows.
What the script looks like
import pandas as pd
from docling.document_converter import DocumentConverter
from pathlib import Path

Path('out/csv').mkdir(parents=True, exist_ok=True)
result = DocumentConverter().convert('filings/10K-2025.pdf', page_range=(30, 60))
md_lines = ['# 10-K tables\n']
for i, table in enumerate(result.document.tables):
    df = table.export_to_dataframe()
    if df.shape[0] < 2 or df.shape[1] < 2:
        continue
    csv_name = f'table-{table.prov[0].page_no}-{i}.csv'
    df.to_csv(f'out/csv/{csv_name}', index=False)
    md_lines.append(f'## Table {i} (p.{table.prov[0].page_no}) [{csv_name}](csv/{csv_name})\n')
    md_lines.append(df.to_markdown(index=False))
Path('out/10K-tables.md').write_text('\n\n'.join(md_lines), encoding='utf-8')
One-line tweak
Add `df.transpose()` before `to_csv` for tables where the first column is the time series and rows are metrics — flips it into a tidy long-format frame.
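Transposing is just a row/column flip; with plain lists (no pandas) it is a `zip` one-liner, shown here so the tweak's effect on a metrics-as-rows table is concrete:

```python
def transpose(rows: list[list]) -> list[list]:
    """Flip rows and columns -- equivalent to df.transpose() on a
    rectangular table (ragged rows are truncated by zip)."""
    return [list(col) for col in zip(*rows)]

# Metrics as rows, quarters as columns...
wide = [['metric', 'Q1', 'Q2'],
        ['revenue', 10, 12],
        ['costs', 4, 5]]
# ...becomes quarters as rows, metrics as columns.
long = transpose(wide)
```

In the pandas version the header row becomes the index after `df.transpose()`, so call `reset_index()` before `to_csv` if you want it back as a column.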
Diff two PDF versions (markdown diff for visibility)
Two PDFs of the same contract, six months apart. Convert both to markdown, diff with `git diff --no-index`, and you finally see what actually changed without scrolling page-by-page.
For: Lawyers reviewing redlines, ops teams diffing vendor MSAs, anyone tracking a living spec.
The prompt
Convert `contracts/msa-2024.pdf` and `contracts/msa-2025.pdf` to Markdown using pdf-to-markdown (fast mode). Strip page numbers and running headers from both before diffing — they create false positives. Run `git diff --no-index --word-diff=plain msa-2024.md msa-2025.md` (plain, not color — ANSI escape codes don't belong in a .md file) and write the diff to `out/msa-diff.md` with a header summarizing how many lines were added and removed.
What the script looks like
import pymupdf4llm, re, subprocess
from pathlib import Path

def clean(p):
    md = pymupdf4llm.to_markdown(p)
    md = re.sub(r'(?m)^\s*Page \d+ of \d+\s*$', '', md)
    return md

a = Path('out/msa-2024.md'); a.write_text(clean('contracts/msa-2024.pdf'))
b = Path('out/msa-2025.md'); b.write_text(clean('contracts/msa-2025.pdf'))
r = subprocess.run(['git', 'diff', '--no-index', '--word-diff=plain', str(a), str(b)],
                   capture_output=True, text=True)
Path('out/msa-diff.md').write_text(f'# MSA diff 2024 -> 2025\n\n\u0060\u0060\u0060diff\n{r.stdout}\n\u0060\u0060\u0060')
One-line tweak
Replace `git diff` with `python -m difflib` when you don't have git on the path — same output shape, no external dependency.
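A stdlib-only sketch of that fallback — `difflib.unified_diff` over the two cleaned markdown strings (filenames here are illustrative):

```python
import difflib

def md_diff(old: str, new: str,
            a_name: str = 'msa-2024.md', b_name: str = 'msa-2025.md') -> str:
    """Unified diff of two markdown strings, git-diff-shaped output."""
    lines = difflib.unified_diff(
        old.splitlines(keepends=True),
        new.splitlines(keepends=True),
        fromfile=a_name, tofile=b_name)
    return ''.join(lines)

diff = md_diff('term: 12 months\nfee: $10\n', 'term: 24 months\nfee: $10\n')
```

`unified_diff` returns an empty iterator when the inputs match, so an empty string doubles as the "no changes" signal.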
Community signal
Three voices from people working on PDF-to-markdown for real: two from the Artifex team that maintains the underlying library, and one from a long-time HN commenter who benchmarked Marker against Nougat.
“Turn PDF and other documents into clean, LLM-ready data — in one line of code.”
Artifex / PyMuPDF team · Blog
The PyMuPDF4LLM project tagline. Sets the bar for what 'pdf-to-markdown' should mean in 2026: not page-by-page scraping, but a one-call markdown export.
“It handles multi-column layouts, tables, images, headers, and scanned pages with automatic OCR.”
Artifex / PyMuPDF team · Blog
From the PyMuPDF4LLM README. The skill's fast-mode default delivers exactly this list — use case 2 (multi-column) and use case 6 (scanned OCR) lean on it.
“Marker extracted considerably more text, finished faster, and did not crash on any pdf, while Nougat took much longer and sometimes crashed.”
hashemian · Hacker News
From the original Marker HN launch thread. Useful framing for when to graduate from pymupdf4llm to a learned-layout extractor on adversarial PDFs.
The contrarian take
Not every extractor handles every PDF. The honest critique on the original Marker HN thread came from crotchfire:
“Within the first three paragraphs it hallucinated spurious paragraph breaks, ignored boldfacing, and hallucinated blockquotes into new sections.”
crotchfire · Hacker News
From the Marker HN launch thread — generalizable warning for any layout-reasoning extractor.
Real failure mode for any extractor that reasons about layout. The skill's fast mode (pymupdf4llm) is a deterministic walk of the text layer — it can't hallucinate, only mis-segment. If you need genuine layout reasoning, the accurate mode (docling) and Marker are the upgrade paths. For high-stakes PDFs, paid services like Mathpix or Adobe Extract still win on accuracy — the trade-off is cost per page versus open-source flexibility.
One more comparison worth naming: there are several PDF-related MCP servers — markitdown, pdf-reader, and markdown-to-pdf (the inverse direction) — that wrap conversion as MCP tools. The trade-off is the usual skill-vs-MCP one: the skill is ~110 idle tokens, the MCP’s tool schemas load every turn. Pick the MCP only when multiple AI clients need to share a converter; otherwise the skill in this cookbook is lighter.
Real pipelines shipped with PDF-to-markdown
Concrete examples from public projects. Most of these don’t use the Claude skill specifically — they’re here to show what production-grade PDF-to-markdown pipelines look like, so you have a target shape in mind when you write the prompt.
- PyMuPDF / Artifex — official RAG/LLM walkthrough using PyMuPDF4LLM as the markdown extractor
- Shravan Kumar (Medium) — 'PyMuPDF4LLM is all You Need for Extracting Data from PDFs' production walkthrough
- S. Anand — tools-in-data-science course module 'Convert PDFs to Markdown' (used by 1000+ students)
- DEV Community / m_sea_bass — 'How to Convert PDFs to Markdown Using PyMuPDF4LLM and Its Evaluation'
- Towards Data Science — 'Preparing PDFs for RAGs' end-to-end pipeline
- themenonlab — 'Best Open-Source PDF-to-Markdown Tools in 2026: Marker vs Docling vs MinerU vs pdf-craft vs PyMuPDF4LLM' comparison
Gotchas (the four that bite)
Sourced from the pymupdf4llm issue tracker and the pdf-to-markdown skill repo.
Scanned PDFs need OCR — fast mode silently returns empty
If a page has no text layer (scanned image), pymupdf4llm returns an empty string for that page without warning. Always check `len(md.strip()) > 0` before assuming success — and if not, fall back to use case 6's pytesseract path.
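The guard is cheap enough to inline everywhere — a sketch that flags which pages need the OCR fallback, using the same 50-character threshold as use case 6 (the threshold is a heuristic, not part of the skill):

```python
def needs_ocr(page_texts: list[str], min_chars: int = 50) -> list[int]:
    """Return 0-based indices of pages whose extracted text is
    suspiciously short -- likely scanned pages with no text layer."""
    return [i for i, t in enumerate(page_texts) if len(t.strip()) < min_chars]

# In the real script each entry is page.get_text() from a pymupdf document.
flagged = needs_ocr(['', 'x' * 200, '   \n  '])
```

Run it before trusting any conversion: an empty `flagged` list means the native path covered every page.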
Merged-cell tables flatten in fast mode
pymupdf4llm renders tables as standard pipe-delimited Markdown, which has no merged-cell syntax. Merged cells get duplicated across the merged span. For accurate-mode tables, use docling (use case 9) or accept the duplication and clean up downstream.
The venv must be created before first run
The skill assumes `~/.claude/skills/pdf-to-markdown/.venv/` exists. If you skipped the `uv venv .venv && uv pip install pymupdf pymupdf4llm` step, the skill errors with 'venv not found'. Run the install once and it caches forever.
Cache invalidation is content-hash, not path
If you edit a PDF in place (annotating, signing), the bytes change, the SHA-1 changes, and the cache correctly misses. But if you rename a PDF without changing its bytes, you get a cache hit on the old content — the output is correct, just confusing if you're tracking conversions by filename.
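That behavior falls straight out of hashing bytes rather than paths. A minimal sketch of a content-addressed key (the 16-hex-char truncation mirrors the bulk-convert script in use case 7; treating it as the skill's exact internal scheme is an assumption):

```python
import hashlib

def cache_key(pdf_bytes: bytes) -> str:
    """Content-addressed cache key: identical bytes give an identical
    key regardless of the file's name or path."""
    return hashlib.sha1(pdf_bytes).hexdigest()[:16]

rename_hits = cache_key(b'%PDF-1.7 body') == cache_key(b'%PDF-1.7 body')
edit_misses = cache_key(b'%PDF-1.7 body') != cache_key(b'%PDF-1.7 body signed')
```

If filename-based tracking matters to you, store a `{path: key}` sidecar next to the cache so renames are visible.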
Pairs well with
Curated to match the cookbook’s actual integrations: the document-conversion adjacent skills (markitdown, markdown-converter, pdf-extraction, pdf-ocr-extraction, literature-review, rag-implementation, markdown-tools) plus the MCP servers the longer use cases (3, 4, 5, 7, 9, 10) lean on.
Related skills
Related MCP servers
Two posts that compose well with this cookbook: What are Claude Code skills? covers the underlying mechanism, and 10 PowerPoint decks with the Claude pptx skill covers a sibling Python-tooling skill — the same cookbook shape, a different output format.
Frequently asked questions
What does the pdf-to-markdown skill actually run under the hood?
Two modes. Fast mode (the default) runs PyMuPDF4LLM, a thin wrapper over PyMuPDF's text extraction with reading-order, multi-column, and image-export bolted on. Accurate mode runs IBM's Docling, which uses a learned layout model to recover semantic structure on hard PDFs. The skill picks fast by default and falls back to accurate when you say so or when a table fails to render.
Is this a Claude pdf to markdown skill or a pdf to markdown MCP server?
Both shapes exist on mcp.directory. This page covers the Claude Code skill (loads ~110 tokens at idle, runs as a venv inside `~/.claude/skills/pdf-to-markdown/`). The MCP-server path uses servers like markitdown and pdf-reader, which load tool schemas every turn but expose the same conversion to multiple AI clients on a shared host. Pick the skill for one developer; pick the MCP server when several agents share one machine.
How accurate is pdf to markdown extraction on academic papers?
On native (text-layer) PDFs in two-column layouts, accurate mode (Docling) preserves reading order and reconstructs section hierarchy with very few errors. On scanned papers, you need OCR (use case 6 above). On heavily formatted PDFs with sidebars, watermarks, or rotated text, expect to hand-fix a few headings — that's where the marker comparison in the contrarian section earns its keep.
Can I use the converted markdown directly in a RAG pipeline?
Yes — that's the whole point. Use case 4 walks the corpus, splits the markdown at heading boundaries, embeds, and writes to Chroma. The argument for going PDF -> markdown -> chunks (versus PDF -> chunks directly) is that the markdown intermediate is structured and human-reviewable: you can spot extraction errors before they pollute your vector store.
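The "each chunk carries its parent H1" detail from use case 4 deserves a concrete sketch — a splitter that prefixes every H2 chunk with the H1 it lives under (pure stdlib; an assumption about how you'd structure it, not the skill's own code):

```python
import re

def h2_chunks(md: str) -> list[str]:
    """Split markdown at H2 boundaries, prefixing each chunk with
    its parent H1 so embeddings keep document-level context."""
    chunks, h1 = [], ''
    for block in re.split(r'(?m)^(?=#{1,2} )', md):
        if block.startswith('# '):            # H1: remember it as context
            h1, _, rest = block.partition('\n')
            if rest.strip():                  # keep any intro text under the H1
                chunks.append(f'{h1}\n{rest.strip()}')
        elif block.strip():                   # H2 block (or preamble)
            chunks.append(f'{h1}\n{block.strip()}' if h1 else block.strip())
    return chunks

chunks = h2_chunks('# Paper\nintro\n## Methods\ndetails\n## Results\nnumbers\n')
```

The lookahead split keeps each heading attached to its body, and H3-and-deeper headings stay inside their H2 chunk.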
Does the skill handle scanned PDFs without a text layer?
Not natively. The fast and accurate modes both rely on a text layer. For scanned PDFs, use case 6 above shows the OCR-fallback pattern: detect missing text via `page.get_text()`, render each page to PNG at 300 dpi, run pytesseract, and merge back. For high-volume OCR work, the dedicated `pdf-ocr-extraction` skill on mcp.directory is faster.
Why is 'nano-pdf' getting impressions on Google but no clicks here?
nano-pdf is a separate skill on mcp.directory aimed at minimal page extraction; it ranks for short-form variants like `nano pdf skill`. This cookbook targets the longer-tail PDF-to-markdown queries: `pdf to markdown skill`, `claude pdf to markdown`, `pdf to markdown mcp`, `pdf to md skill`. If you specifically wanted nano-pdf, the right page is `/skills/nano-pdf` — open that one instead.
How does pdf-to-markdown compare to Marker, Mathpix, or Adobe Extract?
Marker is a learned-layout open-source extractor — better than pymupdf4llm on adversarial PDFs, slower per page, requires a GPU for best results. Mathpix and Adobe Extract are paid services with the highest accuracy on math and scanned content but cost per page. The skill's two modes (pymupdf4llm + docling) are the right open-source default; reach for marker or a paid service when use case 6 (scanned PDFs) and use case 9 (table extraction at scale) start to fail.
Will the markdown output render images and tables correctly on GitHub?
Yes for both, with one caveat. Images are written as separate PNGs to a sibling `images/` directory and embedded with relative paths — that renders on GitHub as long as you commit the `images/` directory alongside the .md. Tables come out as standard pipe-delimited Markdown, which GitHub renders natively. If your PDF has merged-cell tables, swap to accurate mode (Docling) — pymupdf4llm flattens merged cells.
Sources
Primary
- aliceisjustplaying/claude-skill-pdf-to-markdown — SKILL.md, scripts/, README
- pymupdf/pymupdf4llm — fast-mode extractor source
- PyMuPDF4LLM official documentation
- docling-project/docling — accurate-mode extractor
- datalab-to/marker — learned-layout fallback (referenced in contrarian section)
Community
- Artifex / PyMuPDF team — Blog
- hashemian — Hacker News
- vikp (Marker maintainer) — Hacker News
- mannycalavera42 — Hacker News