data-extractor

Extract structured data from any document format using unstructured - unified document processing

Install

mkdir -p .claude/skills/data-extractor && curl -L -o skill.zip "https://mcp.directory/api/skills/download/6476" && unzip -o skill.zip -d .claude/skills/data-extractor && rm skill.zip

Installs to .claude/skills/data-extractor

About this skill

Data Extractor Skill

Overview

This skill enables extraction of structured data from any document format using unstructured - a unified library for processing PDFs, Word docs, emails, HTML, and more. Get consistent, structured output regardless of input format.

How to Use

Provide the document to process
Optionally specify extraction options
I'll extract structured elements with metadata

Example prompts:

"Extract all text and tables from this PDF"
"Parse this email and get the body, attachments, and metadata"
"Convert this HTML page to structured elements"
"Extract data from these mixed-format documents"

Domain Knowledge

unstructured Fundamentals

from unstructured.partition.auto import partition

# Automatically detect and process any document
elements = partition("document.pdf")

# Access extracted elements
for element in elements:
    print(f"Type: {type(element).__name__}")
    print(f"Text: {element.text}")
    print(f"Metadata: {element.metadata}")

Supported Formats

Format	Function	Notes
PDF	`partition_pdf`	Native + scanned
Word	`partition_docx`	Full structure
PowerPoint	`partition_pptx`	Slides & notes
Excel	`partition_xlsx`	Sheets & tables
Email	`partition_email`	Body & attachments
HTML	`partition_html`	Tags preserved
Markdown	`partition_md`	Structure preserved
Plain Text	`partition_text`	Basic parsing
Images	`partition_image`	OCR extraction
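As a rough illustration of the table above, the lookup below maps a file extension to the name of the matching partition function. The extension list and the `partitioner_name_for` helper are assumptions made for this sketch; `partition` performs its own file-type detection, so this is not how the library dispatches internally.

```python
from pathlib import Path

# Extension -> partition function name, following the table above.
# The choice of extensions is an assumption for this sketch.
PARTITIONER_BY_SUFFIX = {
    ".pdf": "partition_pdf",
    ".docx": "partition_docx",
    ".pptx": "partition_pptx",
    ".xlsx": "partition_xlsx",
    ".eml": "partition_email",
    ".html": "partition_html",
    ".md": "partition_md",
    ".txt": "partition_text",
    ".png": "partition_image",
    ".jpg": "partition_image",
}

def partitioner_name_for(path):
    """Return the partition function name for a path, or None if unknown."""
    return PARTITIONER_BY_SUFFIX.get(Path(path).suffix.lower())

print(partitioner_name_for("report.PDF"))
print(partitioner_name_for("archive.zip"))
```

A lookup like this can be handy for skipping unsupported files before a batch run.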

Element Types

from unstructured.documents.elements import (
    Title,
    NarrativeText,
    Text,
    ListItem,
    Table,
    Image,
    Header,
    Footer,
    PageBreak,
    Address,
    EmailAddress,
)

# Elements have consistent structure
element.text           # Raw text content
element.metadata       # Rich metadata
element.category       # Element type
element.id            # Unique identifier

Auto Partition

from unstructured.documents.elements import Table, Title
from unstructured.partition.auto import partition

# Process any file type
elements = partition(
    filename="document.pdf",
    strategy="auto",          # or "fast", "hi_res", "ocr_only"
    include_metadata=True,
    include_page_breaks=True,
)

# Filter by type
titles = [e for e in elements if isinstance(e, Title)]
tables = [e for e in elements if isinstance(e, Table)]
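Filtering one type at a time works, but grouping every element by its category in a single pass is often more convenient. The sketch below assumes only that each element exposes `.category` and `.text`, as described under Element Types. The `group_by_category` helper and the stand-in objects are hypothetical, used so the example runs without the library installed.

```python
from collections import defaultdict
from types import SimpleNamespace

def group_by_category(elements):
    """Map each category name to the element texts, in document order."""
    groups = defaultdict(list)
    for element in elements:
        groups[element.category].append(element.text)
    return dict(groups)

# Stand-in objects exposing the same .category / .text attributes
sample = [
    SimpleNamespace(category="Title", text="Quarterly Report"),
    SimpleNamespace(category="NarrativeText", text="Revenue grew."),
    SimpleNamespace(category="Title", text="Outlook"),
]

groups = group_by_category(sample)
print(groups["Title"])
```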

Format-Specific Partitioning

# PDF with options
from unstructured.partition.pdf import partition_pdf

elements = partition_pdf(
    filename="document.pdf",
    strategy="hi_res",              # High quality extraction
    infer_table_structure=True,     # Detect tables
    include_page_breaks=True,
    languages=["eng"],              # OCR language(s), Tesseract codes
)

# Word documents
from unstructured.partition.docx import partition_docx

elements = partition_docx(
    filename="document.docx",
    include_metadata=True,
)

# HTML
from unstructured.partition.html import partition_html

elements = partition_html(
    filename="page.html",
    include_metadata=True,
)

Working with Tables

from unstructured.partition.auto import partition

# For PDFs, table structure is only inferred with the "hi_res" strategy
elements = partition("report.pdf", strategy="hi_res", infer_table_structure=True)

# Extract tables
for element in elements:
    if element.category == "Table":
        print("Table found:")
        print(element.text)
        
        # Access structured table data
        if hasattr(element, 'metadata') and element.metadata.text_as_html:
            print("HTML:", element.metadata.text_as_html)
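Once a table's HTML is available, it usually needs to become rows and cells. The following is a minimal sketch using only the standard library. The `TableRows` class, the `html_table_to_rows` helper, and the sample HTML string are hypothetical; real `text_as_html` output may contain extra tags or attributes, and nested tables are not handled here.

```python
from html.parser import HTMLParser

class TableRows(HTMLParser):
    """Collect the text of <td>/<th> cells into a list of rows."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None    # cells of the row being read, or None
        self._cell = None   # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            if self._cell is not None and self._row is not None:
                self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr":
            if self._row is not None:
                self.rows.append(self._row)
            self._row = None
            self._cell = None

def html_table_to_rows(html):
    """Return the table as a list of rows, each a list of cell strings."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return parser.rows

sample_html = (
    "<table><tr><th>Item</th><th>Qty</th></tr>"
    "<tr><td>Widget</td><td>3</td></tr></table>"
)
rows = html_table_to_rows(sample_html)
print(rows)
```

The resulting list of lists can be passed to `csv.writer` or `pandas.DataFrame` for further processing.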

Metadata Access

from unstructured.partition.auto import partition

elements = partition("document.pdf")

for element in elements:
    meta = element.metadata
    
    # Common metadata fields
    print(f"Page: {meta.page_number}")
    print(f"Filename: {meta.filename}")
    print(f"Filetype: {meta.filetype}")
    print(f"Coordinates: {meta.coordinates}")
    print(f"Languages: {meta.languages}")
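A common use of this metadata is reassembling text page by page. The sketch below assumes each element has `.text` and `.metadata.page_number`, and that `page_number` may be `None` for formats without pages. The `text_by_page` helper and the stand-in objects are hypothetical.

```python
from collections import defaultdict
from types import SimpleNamespace

def text_by_page(elements):
    """Join element texts per page; elements without a page go under None."""
    pages = defaultdict(list)
    for element in elements:
        page = getattr(element.metadata, "page_number", None)
        pages[page].append(element.text)
    return {page: "\n".join(texts) for page, texts in pages.items()}

# Stand-in objects exposing the same .text / .metadata attributes
sample = [
    SimpleNamespace(text="Intro", metadata=SimpleNamespace(page_number=1)),
    SimpleNamespace(text="First paragraph.", metadata=SimpleNamespace(page_number=1)),
    SimpleNamespace(text="Second page text.", metadata=SimpleNamespace(page_number=2)),
]

pages = text_by_page(sample)
print(pages[1])
```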

Chunking for AI/RAG

from unstructured.partition.auto import partition
from unstructured.chunking.title import chunk_by_title
from unstructured.chunking.basic import chunk_elements

# Partition document
elements = partition("document.pdf")

# Chunk by title (semantic chunks)
chunks = chunk_by_title(
    elements,
    max_characters=1000,
    combine_text_under_n_chars=200,
)

# Or basic chunking
chunks = chunk_elements(
    elements,
    max_characters=500,
    overlap=50,
)

for chunk in chunks:
    print(f"Chunk ({len(chunk.text)} chars):")
    print(chunk.text[:100] + "...")
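To make the `max_characters` and `overlap` parameters concrete, here is a simplified character-window illustration on a plain string. It is not the library's algorithm, which works on elements rather than raw text; `window_chunks` is a hypothetical helper written only to show what a size limit with overlap means.

```python
def window_chunks(text, max_characters=500, overlap=50):
    """Split text into windows of at most max_characters, where each
    window repeats the last `overlap` characters of the previous one."""
    if max_characters <= 0:
        raise ValueError("max_characters must be positive")
    if not 0 <= overlap < max_characters:
        raise ValueError("overlap must be >= 0 and < max_characters")

    step = max_characters - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_characters])
        if start + max_characters >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks

print(window_chunks("abcdefghij", max_characters=4, overlap=1))
```

With a 10-character string, a window of 4, and an overlap of 1, each chunk starts with the last character of the previous chunk.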

Batch Processing

from unstructured.partition.auto import partition
from pathlib import Path
from concurrent.futures import ThreadPoolExecutor

def process_document(file_path):
    """Process single document."""
    try:
        elements = partition(str(file_path))
        return {
            'file': str(file_path),
            'status': 'success',
            'elements': len(elements),
            'text': '\n\n'.join([e.text for e in elements])
        }
    except Exception as e:
        return {
            'file': str(file_path),
            'status': 'error',
            'error': str(e)
        }

def batch_process(input_dir, max_workers=4):
    """Process all documents in directory."""
    input_path = Path(input_dir)
    # Only pass regular files to partition(); skip subdirectories
    files = [p for p in input_path.glob('*') if p.is_file()]
    
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        results = list(executor.map(process_document, files))
    
    return results
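The result dictionaries returned by `batch_process` can be condensed into a short report. The sketch below relies only on the keys used above (`file`, `status`, `elements`, `error`); the `summarize_results` helper and the sample data are hypothetical.

```python
from collections import Counter

def summarize_results(results):
    """Count successes and errors, and list the files that failed."""
    counts = Counter(r["status"] for r in results)
    return {
        "success": counts.get("success", 0),
        "error": counts.get("error", 0),
        # error entries carry no 'elements' key, so default to 0
        "elements": sum(r.get("elements", 0) for r in results),
        "failures": [(r["file"], r["error"]) for r in results
                     if r["status"] == "error"],
    }

# Sample data in the same shape that process_document returns
sample_results = [
    {"file": "a.pdf", "status": "success", "elements": 12, "text": "..."},
    {"file": "b.docx", "status": "success", "elements": 5, "text": "..."},
    {"file": "c.bin", "status": "error", "error": "unsupported file type"},
]

summary = summarize_results(sample_results)
print(summary)
```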

Export Formats

from unstructured.partition.auto import partition
from unstructured.staging.base import elements_to_json, elements_to_dicts

elements = partition("document.pdf")

# To JSON string
json_str = elements_to_json(elements)

# To list of dicts
dicts = elements_to_dicts(elements)

# To DataFrame
import pandas as pd
df = pd.DataFrame(dicts)
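For pipelines that ingest one record per line, the list of dicts can also be written as JSON Lines with the standard library. This is a minimal sketch; `dicts_to_jsonl` is a hypothetical helper, and the sample records only imitate the general shape of element dicts rather than reproduce actual library output.

```python
import json

def dicts_to_jsonl(records):
    """Serialize each record as one JSON object per line."""
    return "".join(json.dumps(r, ensure_ascii=False) + "\n" for r in records)

# Hypothetical records standing in for element dicts
sample_records = [
    {"type": "Title", "text": "Quarterly Report", "metadata": {"page_number": 1}},
    {"type": "NarrativeText", "text": "Revenue grew.", "metadata": {"page_number": 1}},
]

jsonl = dicts_to_jsonl(sample_records)
print(jsonl, end="")
```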

Best Practices

Choose Strategy Wisely: "fast" for speed, "hi_res" for accuracy
Enable Table Detection: For documents with tables
Specify Language: For better OCR on non-English docs
Chunk for RAG: Use semantic chunking for AI applications
Handle Errors: Some files may fail to parse; catch exceptions so one bad file does not stop a batch
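The strategy and error-handling advice can be combined by trying the most accurate strategy first and falling back to a faster one when it fails. In the sketch below, `partition_with_fallback` is a hypothetical helper that takes the partition function as an argument, and `fake_partition` is a stand-in so the example runs without the library. In real use, `partition` from `unstructured.partition.auto` would be passed in, on the assumption that it accepts the path positionally and `strategy` as a keyword, as in the examples above.

```python
def partition_with_fallback(partition_fn, path, strategies=("hi_res", "fast")):
    """Try each strategy in order; return (strategy_used, elements)."""
    errors = []
    for strategy in strategies:
        try:
            return strategy, partition_fn(path, strategy=strategy)
        except Exception as exc:
            errors.append(f"{strategy}: {exc}")
    raise RuntimeError(f"All strategies failed for {path}: " + "; ".join(errors))

# Stand-in for a partition function whose "hi_res" strategy fails
def fake_partition(path, strategy):
    if strategy == "hi_res":
        raise RuntimeError("model dependencies not installed")
    return ["element-1", "element-2"]

used, elements = partition_with_fallback(fake_partition, "scan.pdf")
print(used, len(elements))
```

Returning the strategy that succeeded makes it possible to log which documents were processed at lower fidelity.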

Common Patterns

Document to JSON

def document_to_json(file_path, output_path=None):
    """Convert document to structured JSON."""
    from unstructured.partition.auto import partition
    import json
    
    elements = partition(str(file_path))
    
    # Create structured output (str() keeps Path inputs JSON-serializable)
    output = {
        'source': str(file_path),
        'elements': []
    }
    
    for element in elements:
        output['elements'].append({
            'type': type(element).__name__,
            'text': element.text,
            'metadata': {
                'page': element.metadata.page_number,
                'coordinates': element.metadata.coordinates.to_dict() if element.metadata.coordinates else None
            }
        })
    
    if output_path:
        with open(output_path, 'w', encoding='utf-8') as f:
            json.dump(output, f, indent=2)
    
    return output

Email Parser

from unstructured.partition.email import partition_email

def parse_email(email_path):
    """Extract structured data from email."""
    
    elements = partition_email(email_path)
    
    email_data = {
        'subject': None,
        'from': None,
        'to': [],
        'date': None,        # not filled in by this example
        'body': [],
        'attachments': []    # not filled in by this example
    }
    
    for element in elements:
        meta = element.metadata
        
        # Extract headers from metadata
        if meta.subject:
            email_data['subject'] = meta.subject
        if meta.sent_from:
            email_data['from'] = meta.sent_from
        if meta.sent_to:
            email_data['to'] = meta.sent_to
        
        # Body content
        email_data['body'].append({
            'type': type(element).__name__,
            'text': element.text
        })
    
    return email_data

Examples

Example 1: Research Paper Extraction

from unstructured.partition.pdf import partition_pdf

def extract_paper(pdf_path):
    """Extract structured data from research paper."""
    
    elements = partition_pdf(
        filename=pdf_path,
        strategy="hi_res",
        infer_table_structure=True,
        include_page_breaks=True
    )
    
    # 'abstract' and 'references' are not filled in by this example
    paper = {
        'title': None,
        'abstract': None,
        'sections': [],
        'tables': [],
        'references': []
    }
    
    # Find title (usually first Title element)
    for element in elements:
        if element.category == "Title":
            paper['title'] = element.text
            break
    
    # Extract tables
    for element in elements:
        if element.category == "Table":
            paper['tables'].append({
                'page': element.metadata.page_number,
                'content': element.text,
                'html': getattr(element.metadata, 'text_as_html', None)
            })
    
    # Group text under the most recent Title element.
    # chunk_by_title returns CompositeElement chunks, so checking a chunk's
    # category for "Title" never matches; read the raw elements instead.
    for element in elements:
        if element.category == "Title":
            paper['sections'].append({
                'title': element.text,
                'content': ''
            })
        elif paper['sections'] and element.text:
            paper['sections'][-1]['content'] += element.text + '\n'
    
    return paper

paper = extract_paper('research_paper.pdf')
print(f"Title: {paper['title']}")
print(f"Tables: {len(paper['tables'])}")
print(f"Sections: {len(paper['sections'])}")

Example 2: Invoice Data Extraction

from unstructured.partition.auto import partition
import re

def extract_invoice_data(file_path):
    """Extract key data from invoice."""
    
    elements = partition(file_path, strategy="hi_res")
    

---

*Content truncated.*

Author

openclaw
