kreuzberg
Extract text, tables, metadata, and images from 75+ document formats (PDF, Office, images, HTML, email, archives, academic) using Kreuzberg. Use when writing code that calls Kreuzberg APIs in Python, Node.js/TypeScript, Rust, or CLI. Covers installation, extraction (sync/async), configuration (OCR, chunking, output format), batch processing, error handling, and plugins.
Install
mkdir -p .claude/skills/kreuzberg && curl -L -o skill.zip "https://mcp.directory/api/skills/download/5869" && unzip -o skill.zip -d .claude/skills/kreuzberg && rm skill.zip
Installs to .claude/skills/kreuzberg
About this skill
Kreuzberg Document Extraction
Kreuzberg is a high-performance document intelligence library with a Rust core and native bindings for Python, Node.js/TypeScript, Ruby, Go, Java, C#, PHP, and Elixir. It extracts text, tables, metadata, and images from 91+ file formats including PDF, Office documents, images (with OCR), HTML, email, archives, and academic formats.
Use this skill when writing code that:
- Extracts text or metadata from documents
- Performs OCR on scanned documents or images
- Batch-processes multiple files
- Configures extraction options (output format, chunking, OCR, language detection)
- Implements custom plugins (post-processors, validators, OCR backends)
Installation
Python
pip install kreuzberg
# Optional OCR backends:
pip install kreuzberg[easyocr] # EasyOCR
pip install kreuzberg[paddleocr] # PaddleOCR
Node.js
npm install @kreuzberg/node
Rust
# Cargo.toml
[dependencies]
kreuzberg = { version = "4", features = ["tokio-runtime"] }
# features: tokio-runtime (required for sync + batch), pdf, ocr, chunking,
# embeddings, language-detection, keywords-yake, keywords-rake
CLI
# Download from GitHub releases, or:
cargo install kreuzberg-cli
Quick Start
Python (Async)
from kreuzberg import extract_file
result = await extract_file("document.pdf")
print(result.content) # extracted text
print(result.metadata) # document metadata
print(result.tables) # extracted tables
Python (Sync)
from kreuzberg import extract_file_sync
result = extract_file_sync("document.pdf")
print(result.content)
Node.js
import { extractFile } from '@kreuzberg/node';
const result = await extractFile('document.pdf');
console.log(result.content);
console.log(result.metadata);
console.log(result.tables);
Node.js (Sync)
import { extractFileSync } from '@kreuzberg/node';
const result = extractFileSync('document.pdf');
Rust (Async)
use kreuzberg::{extract_file, ExtractionConfig};
#[tokio::main]
async fn main() -> kreuzberg::Result<()> {
    let config = ExtractionConfig::default();
    let result = extract_file("document.pdf", None, &config).await?;
    println!("{}", result.content);
    Ok(())
}
Rust (Sync) — requires tokio-runtime feature
use kreuzberg::{extract_file_sync, ExtractionConfig};
fn main() -> kreuzberg::Result<()> {
    let config = ExtractionConfig::default();
    let result = extract_file_sync("document.pdf", None, &config)?;
    println!("{}", result.content);
    Ok(())
}
CLI
kreuzberg extract document.pdf
kreuzberg extract document.pdf --format json
kreuzberg extract document.pdf --output-format markdown
Configuration
All languages use the same configuration structure with language-appropriate naming conventions.
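As a rough illustration of that naming convention, the sketch below converts a Python-style (snake_case) configuration dictionary into the camelCase shape used in the Node.js examples. The helpers `to_camel` and `camelize_keys` are hypothetical and not part of Kreuzberg; the key names come from the examples in this section. Rust field names for chunking differ (`max_characters`, `overlap`), so this mapping applies only between the Python and Node.js forms.

```python
def to_camel(name: str) -> str:
    """Convert a snake_case key to camelCase (hypothetical helper)."""
    head, *rest = name.split("_")
    return head + "".join(part.capitalize() for part in rest)


def camelize_keys(value):
    """Recursively rename dict keys; lists and scalars pass through unchanged."""
    if isinstance(value, dict):
        return {to_camel(key): camelize_keys(item) for key, item in value.items()}
    if isinstance(value, list):
        return [camelize_keys(item) for item in value]
    return value


# Plain dict mirroring the Python configuration example in this section.
python_style = {
    "ocr": {"backend": "tesseract", "language": "eng"},
    "pdf_options": {"passwords": ["secret123"]},
    "chunking": {"max_chars": 1000, "max_overlap": 200},
    "output_format": "markdown",
}

node_style = camelize_keys(python_style)
print(node_style)
```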
Python (snake_case)
from kreuzberg import (
    ExtractionConfig, OcrConfig, TesseractConfig,
    PdfConfig, ChunkingConfig,
)
config = ExtractionConfig(
    ocr=OcrConfig(
        backend="tesseract",
        language="eng",
        tesseract_config=TesseractConfig(psm=6, enable_table_detection=True),
    ),
    pdf_options=PdfConfig(passwords=["secret123"]),
    chunking=ChunkingConfig(max_chars=1000, max_overlap=200),
    output_format="markdown",
)
result = await extract_file("document.pdf", config=config)
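To give a feel for what `max_chars` and `max_overlap` control, here is a minimal sketch of fixed-size character windows that share a trailing overlap. This is an assumption-laden illustration, not Kreuzberg's chunker: the function `chunk_text` is hypothetical, and a real chunker may split on sentence or paragraph boundaries instead of raw character offsets.

```python
def chunk_text(text: str, max_chars: int, max_overlap: int) -> list:
    """Split text into windows of at most max_chars characters.

    Consecutive windows share max_overlap characters.
    Hypothetical helper for illustration only.
    """
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    if not 0 <= max_overlap < max_chars:
        raise ValueError("max_overlap must be in [0, max_chars)")

    step = max_chars - max_overlap  # how far each window advances
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break  # the last window already reaches the end of the text
        start += step
    return chunks


chunks = chunk_text("abcdefghij", max_chars=4, max_overlap=1)
print(chunks)  # ['abcd', 'defg', 'ghij']
```

With `max_chars=1000` and `max_overlap=200`, as in the configuration above, each window in this sketch would advance by 800 characters.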
Node.js (camelCase)
import { extractFile, type ExtractionConfig } from '@kreuzberg/node';
const config: ExtractionConfig = {
  ocr: { backend: 'tesseract', language: 'eng' },
  pdfOptions: { passwords: ['secret123'] },
  chunking: { maxChars: 1000, maxOverlap: 200 },
  outputFormat: 'markdown',
};
const result = await extractFile('document.pdf', null, config);
Rust (snake_case)
use kreuzberg::{ExtractionConfig, OcrConfig, ChunkingConfig, OutputFormat};
let config = ExtractionConfig {
    ocr: Some(OcrConfig {
        backend: "tesseract".into(),
        language: "eng".into(),
        ..Default::default()
    }),
    chunking: Some(ChunkingConfig {
        max_characters: 1000,
        overlap: 200,
        ..Default::default()
    }),
    output_format: OutputFormat::Markdown,
    ..Default::default()
};
let result = extract_file("document.pdf", None, &config).await?;
Config File (TOML)
output_format = "markdown"
[ocr]
backend = "tesseract"
language = "eng"
[chunking]
max_chars = 1000
max_overlap = 200
[pdf_options]
passwords = ["secret123"]
# CLI: auto-discovers kreuzberg.toml in current/parent directories
kreuzberg extract doc.pdf
# or explicit:
kreuzberg extract doc.pdf --config kreuzberg.toml
kreuzberg extract doc.pdf --config-json '{"ocr":{"backend":"tesseract","language":"deu"}}'
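The auto-discovery described above (looking for kreuzberg.toml in the current directory, then its parents) can be pictured with the sketch below. It is only an approximation under stated assumptions: `find_config` is a hypothetical helper, and the "nearest file wins" rule is an assumption, since the exact lookup order the CLI uses is not specified here.

```python
import tempfile
from pathlib import Path
from typing import Optional


def find_config(start: Path, filename: str = "kreuzberg.toml") -> Optional[Path]:
    """Return the nearest config file at or above start, or None.

    Hypothetical helper; assumes the closest file takes precedence.
    """
    start = start.resolve()
    for directory in (start, *start.parents):
        candidate = directory / filename
        if candidate.is_file():
            return candidate
    return None


# Demonstration in a throwaway directory tree.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp).resolve()
    (root / "kreuzberg.toml").write_text('output_format = "markdown"\n')
    nested = root / "docs" / "reports"
    nested.mkdir(parents=True)

    found = find_config(nested)
    found_in_root = found == root / "kreuzberg.toml"
    print(found_in_root)
```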
Batch Processing
Python
from kreuzberg import batch_extract_files, batch_extract_files_sync
# Async
results = await batch_extract_files(["doc1.pdf", "doc2.docx", "doc3.xlsx"])
# Sync
results = batch_extract_files_sync(["doc1.pdf", "doc2.docx"])
for result in results:
    print(f"{len(result.content)} chars extracted")
Node.js
import { batchExtractFiles } from '@kreuzberg/node';
const results = await batchExtractFiles(['doc1.pdf', 'doc2.docx']);
Rust — requires tokio-runtime feature
use kreuzberg::{batch_extract_file, ExtractionConfig};
let config = ExtractionConfig::default();
let paths = vec!["doc1.pdf", "doc2.docx"];
let results = batch_extract_file(paths, &config).await?;
CLI
kreuzberg batch *.pdf --format json
kreuzberg batch docs/*.docx --output-format markdown
OCR
OCR runs automatically for images and scanned PDFs. Tesseract is the default backend (native binding, no external install required).
Backends
- Tesseract (default): Built-in native binding. All Tesseract languages supported.
- EasyOCR (Python only): pip install kreuzberg[easyocr]. Pass easyocr_kwargs={"gpu": True}.
- PaddleOCR (Python only): pip install kreuzberg[paddleocr]. Pass paddleocr_kwargs={"use_angle_cls": True}.
- Guten (Node.js only): Built-in OCR backend via GutenOcrBackend.
Language Codes
config = ExtractionConfig(ocr=OcrConfig(language="eng")) # English
config = ExtractionConfig(ocr=OcrConfig(language="eng+deu")) # Multiple
config = ExtractionConfig(ocr=OcrConfig(language="all")) # All installed
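When the language list comes from user input or settings, the "+"-joined string shown above can be assembled programmatically. The helper `ocr_language` below is hypothetical (not a Kreuzberg function); it simply trims, de-duplicates, and joins Tesseract-style codes, and makes no claim about whether the order of codes matters to the backend.

```python
def ocr_language(*codes: str) -> str:
    """Join language codes into a single 'eng+deu'-style string.

    Hypothetical helper: strips whitespace, drops empty entries,
    and removes duplicates while keeping first-seen order.
    """
    cleaned = [code.strip() for code in codes if code.strip()]
    if not cleaned:
        raise ValueError("at least one language code is required")
    unique = list(dict.fromkeys(cleaned))  # order-preserving de-duplication
    return "+".join(unique)


single = ocr_language("eng")
multiple = ocr_language("eng", " deu ", "eng")
print(single, multiple)  # eng eng+deu
```

The resulting string would then be passed as the `language` value of `OcrConfig`, as in the lines above.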
Force OCR
config = ExtractionConfig(force_ocr=True) # OCR even if text is extractable
ExtractionResult Fields
| Field | Python | Node.js | Rust | Description |
|---|---|---|---|---|
| Text content | result.content | result.content | result.content | Extracted text (str/String) |
| MIME type | result.mime_type | result.mimeType | result.mime_type | Input document MIME type |
| Metadata | result.metadata | result.metadata | result.metadata | Document metadata (dict/object/HashMap) |
| Tables | result.tables | result.tables | result.tables | Extracted tables with cells + markdown |
| Languages | result.detected_languages | result.detectedLanguages | result.detected_languages | Detected languages (if enabled) |
| Chunks | result.chunks | result.chunks | result.chunks | Text chunks (if chunking enabled) |
| Images | result.images | result.images | result.images | Extracted images (if enabled) |
| Elements | result.elements | result.elements | result.elements | Semantic elements (if element_based format) |
| Pages | result.pages | result.pages | result.pages | Per-page content (if page extraction enabled) |
| Keywords | result.keywords | result.keywords | result.keywords | Extracted keywords (if enabled) |
Error Handling
Python
from kreuzberg import (
    extract_file_sync, KreuzbergError, ParsingError,
    OCRError, ValidationError, MissingDependencyError,
)
try:
    result = extract_file_sync("file.pdf")
except ParsingError as e:
    print(f"Failed to parse: {e}")
except OCRError as e:
    print(f"OCR failed: {e}")
except ValidationError as e:
    print(f"Invalid input: {e}")
except MissingDependencyError as e:
    print(f"Missing dependency: {e}")
except KreuzbergError as e:
    print(f"Extraction failed: {e}")
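The order of the except clauses above matters: the specific errors are listed first and KreuzbergError last. The sketch below shows why, using local stand-in classes rather than the real library types. It assumes the specific errors subclass KreuzbergError, which the catch order implies but which is not stated explicitly here.

```python
# Stand-in classes defined locally for illustration; not imported from kreuzberg.
class KreuzbergError(Exception):
    pass


class ParsingError(KreuzbergError):
    pass


class OCRError(KreuzbergError):
    pass


def classify(exc: Exception) -> str:
    """Specific handlers first, base class last."""
    try:
        raise exc
    except ParsingError:
        return "parsing"
    except OCRError:
        return "ocr"
    except KreuzbergError:
        return "generic"


def classify_base_first(exc: Exception) -> str:
    """Base class first: the specific handler below can never run."""
    try:
        raise exc
    except KreuzbergError:
        return "generic"
    except ParsingError:  # unreachable, shadowed by the base-class handler
        return "parsing"


print(classify(ParsingError("bad file")))             # parsing
print(classify_base_first(ParsingError("bad file")))  # generic
```

Python uses the first matching except clause, so a base-class handler placed first would swallow every more specific error.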
Node.js
import {
  extractFile, KreuzbergError, ParsingError,
  OcrError, ValidationError, MissingDependencyError,
} from '@kreuzberg/node';
try {
  const result = await extractFile('file.pdf');
} catch (e) {
  if (e instanceof ParsingError) { /* ... */ }
  else if (e instanceof OcrError) { /* ... */ }
  else if (e instanceof ValidationError) { /* ... */ }
  else if (e instanceof KreuzbergError) { /* ... */ }
}
Rust
use kreuzberg::{extract_file, ExtractionConfig, KreuzbergError};
let config = ExtractionConfig::default();
match extract_file("file.pdf", None, &config).await {
    Ok(result) => println!("{}", result.content),
    Err(KreuzbergError::Parsing(msg)) => eprintln!("Parse error: {msg}"),
    Err(KreuzbergError::Ocr(msg)) => eprintln!("OCR error: {msg}"),
    Err(e) => eprintln!("Error: {e}"),
}
Common Pitfalls
- Python ChunkingConfig fields: Use max_chars and max_overlap, NOT max_characters or overlap.
- Rust extract_file signature: Third argument is &ExtractionConfig (a reference), not Option. Use &ExtractionConfig::default() for defaults.
- Rust feature gates: extract_file_sync, batch_extract_file, and batch_extract_file_sync all require features = ["tokio-runtime"] in Cargo.toml.
- Rust async context: extract_file is async. Use #[tokio::main] or call from an async context.
- CLI --format vs --output-format: --format controls CLI output (text/json). --output-format controls content format (plain/markdown/djot/html).
- Node.js extractFile signature
Content truncated.