Claude pdf-to-markdown skill: 10 PDF-to-MD pipelines that just work
Ten real PDF-to-Markdown pipelines — single-PDF batch convert, multi-column academic paper, embedded-image export, RAG ingestion, annotated-PDF blockquotes, scanned-OCR fallback, bulk directory walk, bibliography slicing, table-to-CSV pivot, and PDF-version diff — each as a single Claude prompt with the exact Python script it produces.
Already on the page because you searched pdf to markdown skill or claude pdf to markdown? You’re in the right place. The pdf-to-markdown skill wraps two open-source extractors (pymupdf4llm fast-mode, docling accurate-mode) with content-hash caching and image export. The cookbook below is what to do once it’s installed.
Already know what skills are? Skip to the cookbook. First time? Read the explainer then come back. Need the install? It’s on the /skills/pdf-to-markdown page.

On this page · 21 sections
- What this skill does
- The cookbook
- Install + README
- Watch it built
- 01 · Single-PDF batch convert with table preservation
- 02 · Multi-column academic paper (preserve reading order)
- 03 · PDF with embedded images (export to /assets, link from markdown)
- 04 · RAG pipeline: PDF -> chunks -> vector DB
- 05 · Annotated/highlighted PDF -> markdown blockquotes
- 06 · OCR-fallback for scanned PDFs (tesseract integration)
- 07 · Bulk-convert a directory of PDFs (glob pattern)
- 08 · Extract just the bibliography from a paper
- 09 · PDF tables -> markdown tables -> CSV pivot
- 10 · Diff two PDF versions (markdown diff for visibility)
- Community signal
- The contrarian take
- Real pipelines shipped
- Gotchas
- Pairs well with
- FAQ
- Sources
What this skill actually does
Sixty seconds of context before the cookbook — what the pdf-to-markdown skill is, what Claude returns when you invoke it, and the one thing it does NOT do for you.
“Convert entire PDF documents to clean, structured Markdown for full context loading.”
— aliceisjustplaying, the skill author · /skills/pdf-to-markdown
What Claude returns
When triggered, Claude calls a Python venv at `~/.claude/skills/pdf-to-markdown/.venv/` that runs `pymupdf4llm.to_markdown()` (fast mode) or `docling.DocumentConverter` (accurate mode) against your PDF. It returns a `.md` file with headers detected via font size, real Markdown tables, ordered/unordered lists, multi-column reading order preserved, code blocks, and an `images/` sibling directory holding every extracted figure, with relative paths inserted inline. Aggressive content-hash caching at `~/.cache/pdf-to-markdown/` skips re-processing identical PDFs across runs.
What it does NOT do
It does not install the venv or its dependencies for you — you run `uv venv .venv` and `uv pip install pymupdf pymupdf4llm` (or docling) once before triggering. It also does not OCR scanned PDFs by default; pair it with use case 6 for that path.
How you trigger it
- Convert this PDF to Markdown so I can load the whole thing.
- Bring the entire paper into context — extract it as markdown.
- Read this scanned PDF and turn it into clean .md with the tables intact.
Cost when idle
~110 tokens at idle (the skill name + description in the system prompt). The conversion script and venv invocation only run when you trigger them.
The cookbook
Each entry below is a pipeline you could ship today. They run roughly in order of complexity — the early ones convert one PDF, the middle ones lean on the skill’s structure-preservation features (tables, columns, images), and the later ones compose the skill with OCR, glob walks, and downstream tooling. Every entry pairs with one or two skills or MCP servers you already have on mcp.directory.
Install + README
If the skill isn’t on your machine yet, here’s the one-liner. The full install panel (Codex, Copilot, Antigravity variants) is on the skill page — the same UI is embedded below.
One-line install · by aliceisjustplaying
mkdir -p .claude/skills/pdf-to-markdown && curl -L -o skill.zip "https://mcp.directory/api/skills/download/319" && unzip -o skill.zip -d .claude/skills/pdf-to-markdown && rm skill.zip
Installs to .claude/skills/pdf-to-markdown
Watch it get built
The official PyMuPDF tutorial walks through pymupdf4llm.to_markdown(), image extraction, and the LlamaIndex hand-off. The skill's fast-mode default is the same library this video covers — worth watching before the cookbook because it shows the library's contract before you read the prompts.
Single-PDF batch convert with table preservation
Drop one PDF on the prompt, get a clean .md plus an /images directory next to it. Tables come out as real Markdown tables, not as paragraph mush.
For: Solo devs converting one report at a time before shipping it into a wiki or RAG store.
The prompt
Convert `papers/q1-report.pdf` to Markdown using the pdf-to-markdown skill. Use the default fast mode (pymupdf4llm). Write the output to `out/q1-report.md` and place any extracted images in `out/q1-report.images/` with relative paths in the markdown. If a table fails to render as a Markdown table, retry that page in accurate mode (docling) and merge it back in. Confirm tables render in Markdown preview before returning.
What the script looks like
import pymupdf4llm
from pathlib import Path

src = Path('papers/q1-report.pdf')
out_md = Path('out/q1-report.md')
out_imgs = Path('out/q1-report.images')
out_imgs.mkdir(parents=True, exist_ok=True)
md = pymupdf4llm.to_markdown(
    str(src),
    write_images=True,
    image_path=str(out_imgs),
    image_format='png',
)
out_md.write_text(md, encoding='utf-8')
One-line tweak
Pass `pages=[0,1,2]` to `to_markdown` to convert just the first three pages — useful when you only need the executive summary for a quick AI chat.
Multi-column academic paper (preserve reading order)
Two-column ICML or arXiv papers where naive extractors interleave columns. The skill detects the layout and walks each column top-to-bottom before moving across.
For: Researchers and grad students feeding papers into a notebook or NotebookLM.
The prompt
Convert `papers/attention-is-all-you-need.pdf` to Markdown. The PDF is a NeurIPS-style two-column layout with footnotes. Use pdf-to-markdown's accurate mode (docling) so the column reading order is preserved and footnotes land at the end of each section, not mid-paragraph. Strip the running header and page numbers. Save to `out/attention.md` and print the first H2 to stdout so I can sanity-check the section boundaries.
What the script looks like
from docling.document_converter import DocumentConverter
from pathlib import Path

src = 'papers/attention-is-all-you-need.pdf'
conv = DocumentConverter()
result = conv.convert(src)
md = result.document.export_to_markdown()
# Strip the running header and page numbers (heuristic).
lines = [l for l in md.splitlines()
         if not l.strip().startswith('Page ')
         and l.strip() != 'Attention Is All You Need']
Path('out/attention.md').write_text('\n'.join(lines), encoding='utf-8')
One-line tweak
Swap `export_to_markdown()` for `export_to_dict()` to get a structured DoclingDocument JSON when you want to reason over sections programmatically.
PDF with embedded images (export to /assets, link from markdown)
Reports that lean on figures and screenshots. The skill writes each image into a sibling directory and inserts the relative `` reference inline so the markdown renders end-to-end on GitHub or Notion.
For: Technical writers porting investor decks or product PRDs into a wiki.
The prompt
Convert `decks/series-b-update.pdf` to Markdown. Extract every embedded image at full resolution (no downscaling), cache them under `out/series-b-update.images/`, and embed each one inline with a stable filename (`fig-{page}-{idx}.png`). Add a one-line caption above each image based on the surrounding text. Verify the markdown renders on GitHub by previewing the first three image references.
What the script looks like
import pymupdf4llm
from pathlib import Path

md_text = pymupdf4llm.to_markdown(
    'decks/series-b-update.pdf',
    write_images=True,
    image_path='out/series-b-update.images',
    image_format='png',
    dpi=200,
    embed_images=False,  # write to disk, reference by relative path
)
# Stable filenames: pymupdf4llm writes 'page-N-image-K.png'.
Path('out/series-b-update.md').write_text(md_text, encoding='utf-8')
One-line tweak
Set `embed_images=True` to inline the images as base64 data URIs — useful when you need a single self-contained .md to email to someone.
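For reference, the self-contained variant just swaps the relative path for a base64 data URI. A minimal sketch of what that reference looks like, built by hand with the stdlib (the 8-byte payload here is illustrative — a real pipeline passes actual PNG bytes):

```python
import base64

def data_uri_ref(png_bytes: bytes) -> str:
    """Build a markdown image reference with an inline base64 data URI."""
    b64 = base64.b64encode(png_bytes).decode('ascii')
    return f'![figure](data:image/png;base64,{b64})'

# Illustrative payload: the PNG magic bytes, not a renderable image.
ref = data_uri_ref(b'\x89PNG\r\n\x1a\n')
```

The trade-off is file size: base64 inflates each image by roughly a third, and some renderers cap data-URI length, so prefer on-disk images for anything but small decks.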
RAG pipeline: PDF -> chunks -> vector DB
End-to-end: convert the PDF, split the markdown into semantic chunks at heading boundaries, embed, and write to a local vector store. The markdown intermediate is the RAG-friendly representation.
For: Engineers building retrieval over a corpus of PDFs.
The prompt
Build a RAG ingestion script. Walk `corpus/*.pdf`, convert each via pdf-to-markdown (fast mode), split the markdown at every H2 boundary into chunks (each chunk includes its parent H1 as context), and write to a local Chroma collection named `papers`. Use OpenAI text-embedding-3-small. Skip files where the cache key matches a previous run. Print one line per file: `<name> <chunks> <ms>`.
What the script looks like
import pymupdf4llm, re, hashlib, time, chromadb
from pathlib import Path

client = chromadb.PersistentClient('./chroma')
col = client.get_or_create_collection('papers')
for pdf in Path('corpus').glob('*.pdf'):
    t = time.time()
    md = pymupdf4llm.to_markdown(str(pdf))
    chunks = re.split(r'(?m)^## ', md)
    ids = [hashlib.md5(f'{pdf.stem}-{i}'.encode()).hexdigest() for i in range(len(chunks))]
    col.upsert(ids=ids, documents=chunks,
               metadatas=[{'source': pdf.name, 'chunk': i} for i in range(len(chunks))])
    print(f'{pdf.name} {len(chunks)} {int((time.time() - t) * 1000)}ms')
One-line tweak
Replace the `re.split` with a sliding-window splitter (700-token windows, 80-token overlap) when papers have flat structure and few H2s.
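A minimal sliding-window splitter of that shape, approximating tokens by whitespace-delimited words (an assumption for brevity — a real pipeline would count tokens with a tokenizer like tiktoken):

```python
def sliding_chunks(text: str, window: int = 700, overlap: int = 80) -> list[str]:
    """Split text into overlapping word windows; each window starts
    window - overlap words after the previous one."""
    words = text.split()
    step = window - overlap
    chunks = []
    for start in range(0, max(len(words), 1), step):
        chunk = words[start:start + window]
        if chunk:
            chunks.append(' '.join(chunk))
        if start + window >= len(words):
            break
    return chunks
```

Drop it in place of the `re.split` line and embed `chunks` exactly as before — the overlap means a sentence straddling a boundary still appears whole in at least one chunk.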
Annotated/highlighted PDF -> markdown blockquotes
Take a PDF marked up with yellow highlights and stickies. The skill extracts the highlighted spans as Markdown blockquotes with a `> [!NOTE]` callout, preserving the surrounding paragraph as context.
For: Lawyers, analysts, and anyone who annotates PDFs and needs the highlights as a clean review doc.
The prompt
Open `contracts/msa-redlined.pdf`. Use the pdf-to-markdown skill (fast mode), then post-process: walk every highlight annotation in the PDF (PyMuPDF exposes `page.annots()` with `type[0] == 8` for highlight), capture the highlighted text plus the sentence it sits inside, and emit each as a Markdown blockquote with a `> [!NOTE]` callout block. Save to `out/msa-redlined-highlights.md`.
What the script looks like
import pymupdf
from pathlib import Path

doc = pymupdf.open('contracts/msa-redlined.pdf')
out = []
for page_num, page in enumerate(doc, 1):
    for annot in page.annots() or []:
        if annot.type[0] == 8:  # highlight
            # Approximate the highlighted span by the annotation's bounding box.
            text = page.get_textbox(annot.rect)
            out.append(f'> [!NOTE] page {page_num}\n> {text.strip()}\n')
Path('out/msa-redlined-highlights.md').write_text('\n'.join(out), encoding='utf-8')
One-line tweak
Filter annotations by author (`annot.info['title']`) so each reviewer's highlights end up in their own .md file — useful for diffing redlines across legal counsel.
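The per-reviewer split reduces to grouping (author, text) pairs before writing one file per reviewer. A sketch with hypothetical tuples standing in for the real `(annot.info['title'], page.get_textbox(annot.rect))` pairs:

```python
from collections import defaultdict

def group_by_author(annots: list[tuple[str, str]]) -> dict[str, list[str]]:
    """Map reviewer name -> that reviewer's highlights, in document order."""
    grouped = defaultdict(list)
    for author, text in annots:
        grouped[author].append(text)
    return dict(grouped)

# Hypothetical extracted pairs; the real loop yields these from page.annots().
groups = group_by_author([
    ('alice', 'clause 4.2'),
    ('bob', 'term sheet'),
    ('alice', 'clause 9.1'),
])
```

Each key then becomes `out/msa-redlined-highlights.{author}.md`, which makes the per-counsel diff a plain file diff.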
OCR-fallback for scanned PDFs (tesseract integration)
PDFs that are images of pages, not text. The skill detects the missing text layer, runs Tesseract on each page, and produces markdown — slower than the native path but it actually works.
For: Teams ingesting old scanned reports, court filings, or 1990s-era manuals.
The prompt
Convert `archive/1998-financial-report.pdf` to Markdown. The file is scanned (no text layer). Detect that automatically by checking page 1 with `page.get_text()` — if it returns less than 50 chars, fall back to OCR via pytesseract. Use `lang='eng'`. Cache OCR results under `~/.cache/pdf-to-markdown/` so re-runs skip re-OCR. Save the markdown to `out/1998-financial-report.md` and warn me if OCR confidence dips below 60%.
What the script looks like
import pymupdf, pytesseract
from PIL import Image
from io import BytesIO
from pathlib import Path

doc = pymupdf.open('archive/1998-financial-report.pdf')
md = []
for i, page in enumerate(doc):
    text = page.get_text().strip()
    if len(text) < 50:
        pix = page.get_pixmap(dpi=300)
        img = Image.open(BytesIO(pix.tobytes('png')))
        text = pytesseract.image_to_string(img, lang='eng')
    md.append(f'## Page {i+1}\n\n{text}\n')
Path('out/1998-financial-report.md').write_text('\n'.join(md), encoding='utf-8')
One-line tweak
Swap `lang='eng'` for `lang='eng+jpn+chi_sim'` to OCR multilingual scans — install the language packs once with `brew install tesseract-lang`.
Bulk-convert a directory of PDFs (glob pattern)
Point the skill at a folder, get a parallel folder of .md files. Skips any PDF whose cache key matches the previous run so re-running on a 500-PDF directory only touches the new ones.
For: Anyone doing one-time corpus migration: archive cleanup, vendor doc dumps, regulator filings.
The prompt
Walk `vendor-docs/**/*.pdf` recursively. For each PDF, convert to Markdown using pdf-to-markdown (fast mode) and write to `markdown-out/<same relative path>.md`. Use a content hash as the cache key so re-runs only re-process files whose bytes changed. Run 4 conversions in parallel via `concurrent.futures.ProcessPoolExecutor`. Print a final summary: total PDFs, converted, skipped (cache hit), failed.
What the script looks like
import pymupdf4llm, hashlib
from pathlib import Path
from concurrent.futures import ProcessPoolExecutor, as_completed

CACHE = Path.home() / '.cache/pdf-to-markdown'
CACHE.mkdir(parents=True, exist_ok=True)

def convert(pdf):
    h = hashlib.sha1(pdf.read_bytes()).hexdigest()[:16]
    cache_hit = CACHE / f'{h}.md'
    if cache_hit.exists():
        return (pdf, 'skip')
    md = pymupdf4llm.to_markdown(str(pdf))
    cache_hit.write_text(md, encoding='utf-8')
    out = Path('markdown-out') / pdf.relative_to('vendor-docs').with_suffix('.md')
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(md, encoding='utf-8')
    return (pdf, 'ok')

if __name__ == '__main__':  # required: worker processes re-import this module
    pdfs = list(Path('vendor-docs').rglob('*.pdf'))
    with ProcessPoolExecutor(max_workers=4) as ex:
        for f in as_completed([ex.submit(convert, p) for p in pdfs]):
            print(*f.result())
One-line tweak
Lower `max_workers=4` to `2` if you're on an M-series Mac — pymupdf4llm is already multi-threaded internally and over-spawning competes with itself.
Extract just the bibliography from a paper
Skip the body, capture only the references. The skill finds the 'References' or 'Bibliography' section and emits a clean numbered list ready to import into Zotero or BibTeX.
For: Researchers building reading lists from a single paper's references.
The prompt
Open `papers/transformer-survey.pdf`. Convert it to Markdown with pdf-to-markdown, then slice out only the section starting from the heading 'References' (case-insensitive) through end-of-document. Normalize each entry to a single line, strip line-wrap hyphens, and write to `out/transformer-survey.refs.md` as a numbered list.
What the script looks like
import pymupdf4llm, re
from pathlib import Path

md = pymupdf4llm.to_markdown('papers/transformer-survey.pdf')
m = re.search(r'(?im)^#+\s*references\s*$', md)
if not m:
    raise SystemExit('No References section found')
refs = md[m.end():]
# Collapse line wraps and strip soft hyphens.
refs = re.sub(r'-\n', '', refs)
refs = re.sub(r'\n(?!\[)', ' ', refs)
entries = re.findall(r'\[\d+\][^\[]+', refs)
out = '\n'.join(f'{i+1}. {e.strip()}' for i, e in enumerate(entries))
Path('out/transformer-survey.refs.md').write_text(out, encoding='utf-8')
One-line tweak
Pipe the cleaned refs into `anystyle` (Ruby) to get proper BibTeX out: `cat refs.md | anystyle parse --format bib > refs.bib`.
PDF tables -> markdown tables -> CSV pivot
Earnings reports and regulator filings hide all the value in tables. The skill extracts each table as Markdown and lets you pivot the same data into a CSV in one follow-up step.
For: Analysts who want the numbers, not the prose.
The prompt
Open `filings/10K-2025.pdf`. Use pdf-to-markdown's accurate mode to extract every table on pages 30-60. For each table, write a Markdown table to `out/10K-tables.md` AND a parallel CSV under `out/csv/table-{page}-{idx}.csv`. Add a one-line index at the top of `10K-tables.md` linking each table to its CSV. Skip any table with fewer than 2 columns or 2 rows.
What the script looks like
import pandas as pd
from docling.document_converter import DocumentConverter
from pathlib import Path

Path('out/csv').mkdir(parents=True, exist_ok=True)
result = DocumentConverter().convert('filings/10K-2025.pdf', page_range=(30, 60))
md_lines = ['# 10-K tables\n']
for i, table in enumerate(result.document.tables):
    df = table.export_to_dataframe()
    if df.shape[0] < 2 or df.shape[1] < 2:
        continue
    csv_name = f'table-{table.prov[0].page_no}-{i}.csv'
    df.to_csv(f'out/csv/{csv_name}', index=False)
    md_lines.append(f'## Table {i} (p.{table.prov[0].page_no}) [{csv_name}](csv/{csv_name})\n')
    md_lines.append(df.to_markdown(index=False))
Path('out/10K-tables.md').write_text('\n\n'.join(md_lines), encoding='utf-8')
One-line tweak
Add `df.transpose()` before `to_csv` for tables where the first column is the time series and rows are metrics — flips it into a tidy long-format frame.
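Transposing is just a row/column flip; with plain lists (no pandas) it is a `zip` one-liner, shown here so the tweak's effect on a metrics-as-rows table is concrete:

```python
def transpose(rows: list[list]) -> list[list]:
    """Flip rows and columns -- equivalent to df.transpose() on a
    rectangular table (ragged rows are truncated by zip)."""
    return [list(col) for col in zip(*rows)]

# Metrics as rows, quarters as columns...
wide = [['metric', 'Q1', 'Q2'],
        ['revenue', 10, 12],
        ['costs', 4, 5]]
# ...becomes quarters as rows, metrics as columns.
long = transpose(wide)
```

In the pandas version the header row becomes the index after `df.transpose()`, so call `reset_index()` before `to_csv` if you want it back as a column.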
Diff two PDF versions (markdown diff for visibility)
Two PDFs of the same contract, six months apart. Convert both to markdown, diff with `git diff --no-index`, and you finally see what actually changed without scrolling page-by-page.
For: Lawyers reviewing redlines, ops teams diffing vendor MSAs, anyone tracking a living spec.
The prompt
Convert `contracts/msa-2024.pdf` and `contracts/msa-2025.pdf` to Markdown using pdf-to-markdown (fast mode). Strip page numbers and running headers from both before diffing — they create false positives. Run `git diff --no-index --word-diff=plain msa-2024.md msa-2025.md` (plain, not color — ANSI escape codes don't belong in a .md file) and write the diff to `out/msa-diff.md` with a header summarizing how many lines were added and removed.
What the script looks like
import pymupdf4llm, re, subprocess
from pathlib import Path

def clean(p):
    md = pymupdf4llm.to_markdown(p)
    md = re.sub(r'(?m)^\s*Page \d+ of \d+\s*$', '', md)
    return md

a = Path('out/msa-2024.md'); a.write_text(clean('contracts/msa-2024.pdf'))
b = Path('out/msa-2025.md'); b.write_text(clean('contracts/msa-2025.pdf'))
r = subprocess.run(['git', 'diff', '--no-index', '--word-diff=plain', str(a), str(b)],
                   capture_output=True, text=True)
Path('out/msa-diff.md').write_text(f'# MSA diff 2024 -> 2025\n\n\u0060\u0060\u0060diff\n{r.stdout}\n\u0060\u0060\u0060')
One-line tweak
Replace `git diff` with `python -m difflib` when you don't have git on the path — same output shape, no external dependency.
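A stdlib-only sketch of that fallback — `difflib.unified_diff` over the two cleaned markdown strings (filenames here are illustrative):

```python
import difflib

def md_diff(old: str, new: str,
            a_name: str = 'msa-2024.md', b_name: str = 'msa-2025.md') -> str:
    """Unified diff of two markdown strings, git-diff-shaped output."""
    lines = difflib.unified_diff(
        old.splitlines(keepends=True),
        new.splitlines(keepends=True),
        fromfile=a_name, tofile=b_name)
    return ''.join(lines)

diff = md_diff('term: 12 months\nfee: $10\n', 'term: 24 months\nfee: $10\n')
```

`unified_diff` returns an empty iterator when the inputs match, so an empty string doubles as the "no changes" signal.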
Community signal
Three voices from people working on PDF-to-markdown for real: two from the Artifex team that maintains the underlying library, and one from a long-time HN commenter who benchmarked Marker against Nougat.
“Turn PDF and other documents into clean, LLM-ready data — in one line of code.”
Artifex / PyMuPDF team · Blog
The PyMuPDF4LLM project tagline. Sets the bar for what 'pdf-to-markdown' should mean in 2026: not page-by-page scraping, but a one-call markdown export.
“It handles multi-column layouts, tables, images, headers, and scanned pages with automatic OCR.”
Artifex / PyMuPDF team · Blog
From the PyMuPDF4LLM README. The skill's fast-mode default delivers exactly this list — use case 2 (multi-column) and use case 6 (scanned OCR) lean on it.
“Marker extracted considerably more text, finished faster, and did not crash on any pdf, while Nougat took much longer and sometimes crashed.”
hashemian · Hacker News
From the original Marker HN launch thread. Useful framing for when to graduate from pymupdf4llm to a learned-layout extractor on adversarial PDFs.
The contrarian take
Not every extractor handles every PDF. The honest critique on the original Marker HN thread came from crotchfire:
“Within the first three paragraphs it hallucinated spurious paragraph breaks, ignored boldfacing, and hallucinated blockquotes into new sections.”
crotchfire · Hacker News
From the Marker HN launch thread — generalizable warning for any layout-reasoning extractor.
Real failure mode for any extractor that reasons about layout. The skill's fast mode (pymupdf4llm) is a deterministic walk of the text layer — it can't hallucinate, only mis-segment. If you need genuine layout reasoning, the accurate mode (docling) and Marker are the upgrade paths. For high-stakes PDFs, paid services like Mathpix or Adobe Extract still win on accuracy — the trade-off is cost per page versus open-source flexibility.
One more comparison worth naming: there are several PDF-related MCP servers — markitdown, pdf-reader, and markdown-to-pdf (the inverse direction) — that wrap conversion as MCP tools. The trade-off is the usual skill-vs-MCP one: the skill is ~110 idle tokens, the MCP’s tool schemas load every turn. Pick the MCP only when multiple AI clients need to share a converter; otherwise the skill in this cookbook is lighter.
Real pipelines shipped with PDF-to-markdown
Concrete examples from public projects. Most of these don’t use the Claude skill specifically — they’re here to show what production-grade PDF-to-markdown pipelines look like, so you have a target shape in mind when you write the prompt.
- PyMuPDF / Artifex — official RAG/LLM walkthrough using PyMuPDF4LLM as the markdown extractor
- Shravan Kumar (Medium) — 'PyMuPDF4LLM is all You Need for Extracting Data from PDFs' production walkthrough
- S. Anand — tools-in-data-science course module 'Convert PDFs to Markdown' (used by 1000+ students)
- DEV Community / m_sea_bass — 'How to Convert PDFs to Markdown Using PyMuPDF4LLM and Its Evaluation'
- Towards Data Science — 'Preparing PDFs for RAGs' end-to-end pipeline
- themenonlab — 'Best Open-Source PDF-to-Markdown Tools in 2026: Marker vs Docling vs MinerU vs pdf-craft vs PyMuPDF4LLM' comparison
Gotchas (the four that bite)
Sourced from the pymupdf4llm issue tracker and the pdf-to-markdown skill repo.
Scanned PDFs need OCR — fast mode silently returns empty
If a page has no text layer (scanned image), pymupdf4llm returns an empty string for that page without warning. Always check `len(md.strip()) > 0` before assuming success — and if not, fall back to use case 6's pytesseract path.
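The guard is cheap enough to inline everywhere — a sketch that flags which pages need the OCR fallback, using the same 50-character threshold as use case 6 (the threshold is a heuristic, not part of the skill):

```python
def needs_ocr(page_texts: list[str], min_chars: int = 50) -> list[int]:
    """Return 0-based indices of pages whose extracted text is
    suspiciously short -- likely scanned pages with no text layer."""
    return [i for i, t in enumerate(page_texts) if len(t.strip()) < min_chars]

# In the real script each entry is page.get_text() from a pymupdf document.
flagged = needs_ocr(['', 'x' * 200, '   \n  '])
```

Run it before trusting any conversion: an empty `flagged` list means the native path covered every page.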
Merged-cell tables flatten in fast mode
pymupdf4llm renders tables as standard pipe-delimited Markdown, which has no merged-cell syntax. Merged cells get duplicated across the merged span. For accurate-mode tables, use docling (use case 9) or accept the duplication and clean up downstream.
The venv must be created before first run
The skill assumes `~/.claude/skills/pdf-to-markdown/.venv/` exists. If you skipped the `uv venv .venv && uv pip install pymupdf pymupdf4llm` step, the skill errors with 'venv not found'. Run the install once and it caches forever.
Cache invalidation is content-hash, not path
If you edit a PDF in place (annotating, signing), the bytes change, the SHA-1 changes, and the cache correctly misses. But if you rename a PDF without changing its bytes, you get a cache hit on the old content — the output is correct, just confusing if you're tracking conversions by filename.
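That behavior falls straight out of hashing bytes rather than paths. A minimal sketch of a content-addressed key (the 16-hex-char truncation mirrors the bulk-convert script in use case 7; treating it as the skill's exact internal scheme is an assumption):

```python
import hashlib

def cache_key(pdf_bytes: bytes) -> str:
    """Content-addressed cache key: identical bytes give an identical
    key regardless of the file's name or path."""
    return hashlib.sha1(pdf_bytes).hexdigest()[:16]

rename_hits = cache_key(b'%PDF-1.7 body') == cache_key(b'%PDF-1.7 body')
edit_misses = cache_key(b'%PDF-1.7 body') != cache_key(b'%PDF-1.7 body signed')
```

If filename-based tracking matters to you, store a `{path: key}` sidecar next to the cache so renames are visible.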
Pairs well with
Curated to match the cookbook’s actual integrations: the document-conversion adjacent skills (markitdown, markdown-converter, pdf-extraction, pdf-ocr-extraction, literature-review, rag-implementation, markdown-tools) plus the MCP servers the longer use cases (3, 4, 5, 7, 9, 10) lean on.
Related skills
Related MCP servers
Two posts that compose well with this cookbook: What are Claude Code skills? covers the underlying mechanism, and 10 PowerPoint decks with the Claude pptx skill covers a sibling Python-tooling skill — the same cookbook shape, a different output format.
Frequently asked questions
What does the pdf-to-markdown skill actually run under the hood?
Two modes. Fast mode (the default) runs PyMuPDF4LLM, a thin wrapper over PyMuPDF's text extraction with reading-order, multi-column, and image-export bolted on. Accurate mode runs IBM's Docling, which uses a learned layout model to recover semantic structure on hard PDFs. The skill picks fast by default and falls back to accurate when you say so or when a table fails to render.
Is this a Claude pdf to markdown skill or a pdf to markdown MCP server?
Both shapes exist on mcp.directory. This page covers the Claude Code skill (loads ~110 tokens at idle, runs as a venv inside `~/.claude/skills/pdf-to-markdown/`). The MCP-server path uses servers like markitdown and pdf-reader, which load tool schemas every turn but expose the same conversion to multiple AI clients on a shared host. Pick the skill for one developer; pick the MCP server when several agents share one machine.
How accurate is pdf to markdown extraction on academic papers?
On native (text-layer) PDFs in two-column layouts, accurate mode (Docling) preserves reading order and reconstructs section hierarchy with very few errors. On scanned papers, you need OCR (use case 6 above). On heavily formatted PDFs with sidebars, watermarks, or rotated text, expect to hand-fix a few headings — that's where the marker comparison in the contrarian section earns its keep.
Can I use the converted markdown directly in a RAG pipeline?
Yes — that's the whole point. Use case 4 walks the corpus, splits the markdown at heading boundaries, embeds, and writes to Chroma. The argument for going PDF -> markdown -> chunks (versus PDF -> chunks directly) is that the markdown intermediate is structured and human-reviewable: you can spot extraction errors before they pollute your vector store.
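The "each chunk carries its parent H1" detail from use case 4 deserves a concrete sketch — a splitter that prefixes every H2 chunk with the H1 it lives under (pure stdlib; an assumption about how you'd structure it, not the skill's own code):

```python
import re

def h2_chunks(md: str) -> list[str]:
    """Split markdown at H2 boundaries, prefixing each chunk with
    its parent H1 so embeddings keep document-level context."""
    chunks, h1 = [], ''
    for block in re.split(r'(?m)^(?=#{1,2} )', md):
        if block.startswith('# '):            # H1: remember it as context
            h1, _, rest = block.partition('\n')
            if rest.strip():                  # keep any intro text under the H1
                chunks.append(f'{h1}\n{rest.strip()}')
        elif block.strip():                   # H2 block (or preamble)
            chunks.append(f'{h1}\n{block.strip()}' if h1 else block.strip())
    return chunks

chunks = h2_chunks('# Paper\nintro\n## Methods\ndetails\n## Results\nnumbers\n')
```

The lookahead split keeps each heading attached to its body, and H3-and-deeper headings stay inside their H2 chunk.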
Does the skill handle scanned PDFs without a text layer?
Not natively. The fast and accurate modes both rely on a text layer. For scanned PDFs, use case 6 above shows the OCR-fallback pattern: detect missing text via `page.get_text()`, render each page to PNG at 300 dpi, run pytesseract, and merge back. For high-volume OCR work, the dedicated `pdf-ocr-extraction` skill on mcp.directory is faster.
Why is 'nano-pdf' getting impressions on Google but no clicks here?
nano-pdf is a separate skill on mcp.directory aimed at minimal page extraction; it ranks for short-form variants like `nano pdf skill`. This cookbook targets the longer-tail PDF-to-markdown queries: `pdf to markdown skill`, `claude pdf to markdown`, `pdf to markdown mcp`, `pdf to md skill`. If you specifically wanted nano-pdf, the right page is `/skills/nano-pdf` — open that one instead.
How does pdf-to-markdown compare to Marker, Mathpix, or Adobe Extract?
Marker is a learned-layout open-source extractor — better than pymupdf4llm on adversarial PDFs, slower per page, requires a GPU for best results. Mathpix and Adobe Extract are paid services with the highest accuracy on math and scanned content but cost per page. The skill's two modes (pymupdf4llm + docling) are the right open-source default; reach for marker or a paid service when use case 6 (scanned PDFs) and use case 9 (table extraction at scale) start to fail.
Will the markdown output render images and tables correctly on GitHub?
Yes for both, with one caveat. Images are written as separate PNGs to a sibling `images/` directory and embedded with relative paths — that renders on GitHub as long as you commit the `images/` directory alongside the .md. Tables come out as standard pipe-delimited Markdown, which GitHub renders natively. If your PDF has merged-cell tables, swap to accurate mode (Docling) — pymupdf4llm flattens merged cells.
Sources
Primary
- aliceisjustplaying/claude-skill-pdf-to-markdown — SKILL.md, scripts/, README
- pymupdf/pymupdf4llm — fast-mode extractor source
- PyMuPDF4LLM official documentation
- docling-project/docling — accurate-mode extractor
- datalab-to/marker — learned-layout fallback (referenced in contrarian section)
Community
- Artifex / PyMuPDF team — Blog
- hashemian — Hacker News
- vikp (Marker maintainer) — Hacker News
- mannycalavera42 — Hacker News