# data-extractor

Extract structured data from any document format using `unstructured`, a unified document-processing library.
## Install

```shell
mkdir -p .claude/skills/data-extractor && curl -L -o skill.zip "https://mcp.directory/api/skills/download/6476" && unzip -o skill.zip -d .claude/skills/data-extractor && rm skill.zip
```

Installs to `.claude/skills/data-extractor`.
## About this skill

### Overview

This skill extracts structured data from any document format using `unstructured`, a unified library for processing PDFs, Word docs, emails, HTML, and more. You get consistent, structured output regardless of the input format.
### How to Use

- Provide the document to process
- Optionally specify extraction options
- I'll extract structured elements with metadata

Example prompts:

- "Extract all text and tables from this PDF"
- "Parse this email and get the body, attachments, and metadata"
- "Convert this HTML page to structured elements"
- "Extract data from these mixed-format documents"
### Domain Knowledge

#### unstructured Fundamentals

```python
from unstructured.partition.auto import partition

# Automatically detect and process any document
elements = partition("document.pdf")

# Access extracted elements
for element in elements:
    print(f"Type: {type(element).__name__}")
    print(f"Text: {element.text}")
    print(f"Metadata: {element.metadata}")
```
#### Supported Formats

| Format | Function | Notes |
|---|---|---|
| PDF | `partition_pdf` | Native + scanned |
| Word | `partition_docx` | Full structure |
| PowerPoint | `partition_pptx` | Slides & notes |
| Excel | `partition_xlsx` | Sheets & tables |
| Email | `partition_email` | Body & attachments |
| HTML | `partition_html` | Tags preserved |
| Markdown | `partition_md` | Structure preserved |
| Plain Text | `partition_text` | Basic parsing |
| Images | `partition_image` | OCR extraction |
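The table above can be sketched as a dispatch map. This is a toy illustration of the routing `partition()` performs for you; the real library also sniffs content types, so treat the mapping and the `pick_partitioner` helper as assumptions, not unstructured's actual dispatch table.

```python
from pathlib import Path

# Illustrative extension-to-partitioner map (assumed, not the library's own table)
PARTITIONERS = {
    ".pdf": "partition_pdf",
    ".docx": "partition_docx",
    ".pptx": "partition_pptx",
    ".xlsx": "partition_xlsx",
    ".eml": "partition_email",
    ".html": "partition_html",
    ".md": "partition_md",
    ".txt": "partition_text",
    ".png": "partition_image",
}

def pick_partitioner(filename):
    """Return the partitioner name for a file, mirroring auto-detection."""
    suffix = Path(filename).suffix.lower()
    try:
        return PARTITIONERS[suffix]
    except KeyError:
        raise ValueError(f"Unsupported format: {suffix}")
```

In practice you rarely need this yourself: calling `partition()` does the routing internally.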
#### Element Types

```python
from unstructured.documents.elements import (
    Title,
    NarrativeText,
    Text,
    ListItem,
    Table,
    Image,
    Header,
    Footer,
    PageBreak,
    Address,
    EmailAddress,
)

# Elements have a consistent structure
element.text      # Raw text content
element.metadata  # Rich metadata
element.category  # Element type
element.id        # Unique identifier
```
#### Auto Partition

```python
from unstructured.partition.auto import partition
from unstructured.documents.elements import Table, Title

# Process any file type
elements = partition(
    filename="document.pdf",
    strategy="auto",  # or "fast", "hi_res", "ocr_only"
    include_metadata=True,
    include_page_breaks=True,
)

# Filter by type
titles = [e for e in elements if isinstance(e, Title)]
tables = [e for e in elements if isinstance(e, Table)]
```
#### Format-Specific Partitioning

```python
# PDF with options
from unstructured.partition.pdf import partition_pdf

elements = partition_pdf(
    filename="document.pdf",
    strategy="hi_res",           # High-quality extraction
    infer_table_structure=True,  # Detect tables
    include_page_breaks=True,
    languages=["en"],            # OCR language
)

# Word documents
from unstructured.partition.docx import partition_docx

elements = partition_docx(
    filename="document.docx",
    include_metadata=True,
)

# HTML
from unstructured.partition.html import partition_html

elements = partition_html(
    filename="page.html",
    include_metadata=True,
)
```
#### Working with Tables

```python
from unstructured.partition.auto import partition

elements = partition("report.pdf", infer_table_structure=True)

# Extract tables
for element in elements:
    if element.category == "Table":
        print("Table found:")
        print(element.text)
        # Access structured table data
        if element.metadata.text_as_html:
            print("HTML:", element.metadata.text_as_html)
```
#### Metadata Access

```python
from unstructured.partition.auto import partition

elements = partition("document.pdf")

for element in elements:
    meta = element.metadata
    # Common metadata fields
    print(f"Page: {meta.page_number}")
    print(f"Filename: {meta.filename}")
    print(f"Filetype: {meta.filetype}")
    print(f"Coordinates: {meta.coordinates}")
    print(f"Languages: {meta.languages}")
```
#### Chunking for AI/RAG

```python
from unstructured.partition.auto import partition
from unstructured.chunking.title import chunk_by_title
from unstructured.chunking.basic import chunk_elements

# Partition document
elements = partition("document.pdf")

# Chunk by title (semantic chunks)
chunks = chunk_by_title(
    elements,
    max_characters=1000,
    combine_text_under_n_chars=200,
)

# Or basic chunking
chunks = chunk_elements(
    elements,
    max_characters=500,
    overlap=50,
)

for chunk in chunks:
    print(f"Chunk ({len(chunk.text)} chars):")
    print(chunk.text[:100] + "...")
```
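To build intuition for what `max_characters` and `overlap` control, here is a minimal pure-Python sketch of sliding-window chunking. It is an illustration of the idea, not unstructured's implementation (which chunks element lists, not raw strings):

```python
def sliding_chunks(text, max_characters=500, overlap=50):
    """Split text into windows of at most max_characters, each sharing
    `overlap` trailing characters with the previous window."""
    if overlap >= max_characters:
        raise ValueError("overlap must be smaller than max_characters")
    step = max_characters - overlap  # how far each window advances
    return [text[i:i + max_characters] for i in range(0, len(text), step)]
```

The overlap means consecutive chunks share context at their boundary, which helps a retriever that lands on a chunk edge.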
#### Batch Processing

```python
from unstructured.partition.auto import partition
from pathlib import Path
from concurrent.futures import ThreadPoolExecutor

def process_document(file_path):
    """Process a single document."""
    try:
        elements = partition(str(file_path))
        return {
            'file': str(file_path),
            'status': 'success',
            'elements': len(elements),
            'text': '\n\n'.join(e.text for e in elements),
        }
    except Exception as e:
        return {
            'file': str(file_path),
            'status': 'error',
            'error': str(e),
        }

def batch_process(input_dir, max_workers=4):
    """Process all documents in a directory."""
    files = [p for p in Path(input_dir).glob('*') if p.is_file()]
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(process_document, files))
```
#### Export Formats

```python
from unstructured.partition.auto import partition
from unstructured.staging.base import elements_to_json, elements_to_dicts

elements = partition("document.pdf")

# To JSON string
json_str = elements_to_json(elements)

# To list of dicts
dicts = elements_to_dicts(elements)

# To DataFrame
import pandas as pd
df = pd.DataFrame(dicts)
```
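Element dicts typically nest metadata under a `metadata` key, which makes flat tabular export awkward. A small sketch of one-level flattening, assuming dicts shaped like `{'type': ..., 'text': ..., 'metadata': {...}}` (the exact field names in your output may differ):

```python
def flatten_element(d, sep="."):
    """Flatten one level of nesting, e.g. {'metadata': {'page_number': 1}}
    becomes {'metadata.page_number': 1}."""
    flat = {}
    for key, value in d.items():
        if isinstance(value, dict):
            for sub_key, sub_value in value.items():
                flat[f"{key}{sep}{sub_key}"] = sub_value
        else:
            flat[key] = value
    return flat
```

Feeding flattened dicts to `pd.DataFrame` gives one column per metadata field instead of a single dict-valued column.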
### Best Practices

- **Choose strategy wisely**: "fast" for speed, "hi_res" for accuracy
- **Enable table detection**: pass `infer_table_structure=True` for documents with tables
- **Specify languages**: improves OCR quality on non-English documents
- **Chunk for RAG**: use semantic chunking (`chunk_by_title`) for AI applications
- **Handle errors**: wrap partitioning in try/except so a failing document doesn't halt a batch
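The first and last points above can be combined into a fallback pattern: try the cheap strategy first, then escalate on failure. This is a sketch, not a library feature; `partition_fn` is injected as a parameter so the logic stands alone (in practice you would pass unstructured's `partition`):

```python
def partition_with_fallback(filename, partition_fn,
                            strategies=("fast", "hi_res", "ocr_only")):
    """Try extraction strategies in order, returning the first success."""
    errors = {}
    for strategy in strategies:
        try:
            return partition_fn(filename=filename, strategy=strategy)
        except Exception as e:
            errors[strategy] = e  # remember why this strategy failed
    raise RuntimeError(f"All strategies failed for {filename}: {errors}")
```

Escalating from "fast" keeps throughput high on clean digital PDFs while still recovering scanned documents that need OCR.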
### Common Patterns

#### Document to JSON

```python
import json
from unstructured.partition.auto import partition

def document_to_json(file_path, output_path=None):
    """Convert a document to structured JSON."""
    elements = partition(file_path)
    # Create structured output
    output = {
        'source': file_path,
        'elements': [],
    }
    for element in elements:
        meta = element.metadata
        output['elements'].append({
            'type': type(element).__name__,
            'text': element.text,
            'metadata': {
                'page': meta.page_number,
                'coordinates': meta.coordinates.to_dict() if meta.coordinates else None,
            },
        })
    if output_path:
        with open(output_path, 'w') as f:
            json.dump(output, f, indent=2)
    return output
```
#### Email Parser

```python
from unstructured.partition.email import partition_email

def parse_email(email_path):
    """Extract structured data from an email."""
    elements = partition_email(filename=email_path)
    email_data = {
        'subject': None,
        'from': None,
        'to': [],
        'date': None,
        'body': [],
        'attachments': [],
    }
    for element in elements:
        meta = element.metadata
        # Extract headers from metadata
        if meta.subject:
            email_data['subject'] = meta.subject
        if meta.sent_from:
            email_data['from'] = meta.sent_from
        if meta.sent_to:
            email_data['to'] = meta.sent_to
        # Body content
        email_data['body'].append({
            'type': type(element).__name__,
            'text': element.text,
        })
    return email_data
```
### Examples

#### Example 1: Research Paper Extraction

```python
from unstructured.partition.pdf import partition_pdf

def extract_paper(pdf_path):
    """Extract structured data from a research paper."""
    elements = partition_pdf(
        filename=pdf_path,
        strategy="hi_res",
        infer_table_structure=True,
        include_page_breaks=True,
    )
    paper = {
        'title': None,
        'abstract': None,
        'sections': [],
        'tables': [],
        'references': [],
    }
    # Find the title (usually the first Title element)
    for element in elements:
        if element.category == "Title":
            paper['title'] = element.text
            break
    # Extract tables
    for element in elements:
        if element.category == "Table":
            paper['tables'].append({
                'page': element.metadata.page_number,
                'content': element.text,
                'html': element.metadata.text_as_html,
            })
    # Group elements into sections, starting a new section at each Title
    for element in elements:
        if element.category == "Title":
            paper['sections'].append({
                'title': element.text,
                'content': '',
            })
        elif paper['sections']:
            paper['sections'][-1]['content'] += element.text + '\n'
    return paper

paper = extract_paper('research_paper.pdf')
print(f"Title: {paper['title']}")
print(f"Tables: {len(paper['tables'])}")
print(f"Sections: {len(paper['sections'])}")
```
#### Example 2: Invoice Data Extraction

```python
from unstructured.partition.auto import partition
import re

def extract_invoice_data(file_path):
    """Extract key data from an invoice."""
    elements = partition(file_path, strategy="hi_res")
    # ...
```

*Content truncated.*