table-extractor

Name: table-extractor
Author: openclaw

by openclaw

50views

10installs

Source

Extract tables from PDFs with high accuracy using camelot - handles complex table structures

Install

mkdir -p .claude/skills/table-extractor && curl -L -o skill.zip "https://mcp.directory/api/skills/download/1857" && unzip -o skill.zip -d .claude/skills/table-extractor && rm skill.zip

Installs to .claude/skills/table-extractor

About this skill

Table Extractor Skill

Overview

This skill enables precise extraction of tables from PDF documents using camelot - the gold standard for PDF table extraction. Handle complex tables with merged cells, borderless tables, and multi-page layouts with high accuracy.

How to Use

Provide the PDF containing tables
Optionally specify pages or table detection method
I'll extract tables as pandas DataFrames

Example prompts:

"Extract all tables from this PDF"
"Get the table on page 5 of this report"
"Extract borderless tables from this document"
"Convert PDF tables to Excel format"

Domain Knowledge

camelot Fundamentals

import camelot

# Extract tables from PDF
tables = camelot.read_pdf('document.pdf')

# Access results
print(f"Found {len(tables)} tables")

# Get first table as DataFrame
df = tables[0].df
print(df)

Extraction Methods

Method	Use Case	Description
`lattice`	Bordered tables	Detects table by lines/borders
`stream`	Borderless tables	Uses text positioning

# Lattice method (default) - for tables with visible borders
tables = camelot.read_pdf('document.pdf', flavor='lattice')

# Stream method - for borderless tables
tables = camelot.read_pdf('document.pdf', flavor='stream')

Page Selection

# Single page
tables = camelot.read_pdf('document.pdf', pages='1')

# Multiple pages
tables = camelot.read_pdf('document.pdf', pages='1,3,5')

# Page range
tables = camelot.read_pdf('document.pdf', pages='1-5')

# All pages
tables = camelot.read_pdf('document.pdf', pages='all')

Advanced Options

Lattice Options

tables = camelot.read_pdf(
    'document.pdf',
    flavor='lattice',
    line_scale=40,              # Line detection sensitivity
    copy_text=['h', 'v'],       # Copy text across merged cells
    shift_text=['l', 't'],      # Shift text alignment
    split_text=True,            # Split text at newlines
    flag_size=True,             # Flag super/subscripts
    strip_text='\n',            # Characters to strip
    process_background=False,   # Process background lines
)

Stream Options

tables = camelot.read_pdf(
    'document.pdf',
    flavor='stream',
    edge_tol=500,               # Edge tolerance
    row_tol=10,                 # Row tolerance
    column_tol=0,               # Column tolerance
    strip_text='\n',            # Characters to strip
)

Table Area Specification

# Extract from specific area (x1, y1, x2, y2)
# Coordinates from bottom-left, in PDF points (72 points = 1 inch)
tables = camelot.read_pdf(
    'document.pdf',
    table_areas=['72,720,540,400'],  # One area
)

# Multiple areas
tables = camelot.read_pdf(
    'document.pdf',
    table_areas=['72,720,540,400', '72,380,540,200'],
)

Column Specification

# Manually specify column positions (for stream method)
tables = camelot.read_pdf(
    'document.pdf',
    flavor='stream',
    columns=['100,200,300,400'],  # X positions of column separators
)

Working with Results

import camelot

tables = camelot.read_pdf('document.pdf')

for i, table in enumerate(tables):
    # Access DataFrame
    df = table.df
    
    # Table metadata
    print(f"Table {i+1}:")
    print(f"  Page: {table.page}")
    print(f"  Accuracy: {table.accuracy}")
    print(f"  Whitespace: {table.whitespace}")
    print(f"  Order: {table.order}")
    print(f"  Shape: {df.shape}")
    
    # Parsing report
    report = table.parsing_report
    print(f"  Report: {report}")

Export Options

import camelot

tables = camelot.read_pdf('document.pdf')

# Export to CSV
tables[0].to_csv('table.csv')

# Export to Excel
tables[0].to_excel('table.xlsx')

# Export to JSON
tables[0].to_json('table.json')

# Export to HTML
tables[0].to_html('table.html')

# Export all tables
for i, table in enumerate(tables):
    table.to_excel(f'table_{i+1}.xlsx')

Visual Debugging

import camelot

# Enable visual debugging
tables = camelot.read_pdf('document.pdf')

# Plot detected table areas
camelot.plot(tables[0], kind='contour').show()

# Plot text on table
camelot.plot(tables[0], kind='text').show()

# Plot detected lines (lattice only)
camelot.plot(tables[0], kind='joint').show()
camelot.plot(tables[0], kind='line').show()

# Save plot
fig = camelot.plot(tables[0])
fig.savefig('debug.png')

Handling Multi-page Tables

import camelot
import pandas as pd

def extract_multipage_table(pdf_path, pages='all'):
    """Extract and combine tables that span multiple pages."""
    
    tables = camelot.read_pdf(pdf_path, pages=pages)
    
    # Group tables by similar structure (columns)
    table_groups = {}
    
    for table in tables:
        cols = tuple(table.df.columns)
        if cols not in table_groups:
            table_groups[cols] = []
        table_groups[cols].append(table.df)
    
    # Combine similar tables
    combined = []
    for cols, dfs in table_groups.items():
        if len(dfs) > 1:
            # Combine and deduplicate header rows
            combined_df = pd.concat(dfs, ignore_index=True)
            combined.append(combined_df)
        else:
            combined.append(dfs[0])
    
    return combined

Best Practices

Try Both Methods: Lattice for bordered, stream for borderless
Check Accuracy Score: Above 90% is usually good
Use Visual Debugging: Understand extraction results
Specify Areas: For PDFs with multiple table types
Handle Headers: First row often needs special treatment

Common Patterns

Batch Table Extraction

import camelot
from pathlib import Path
import pandas as pd

def batch_extract_tables(input_dir, output_dir):
    """Extract tables from all PDFs in directory."""
    
    input_path = Path(input_dir)
    output_path = Path(output_dir)
    output_path.mkdir(exist_ok=True)
    
    results = []
    
    for pdf_file in input_path.glob('*.pdf'):
        try:
            tables = camelot.read_pdf(str(pdf_file), pages='all')
            
            for i, table in enumerate(tables):
                # Skip low accuracy tables
                if table.accuracy < 80:
                    continue
                
                output_file = output_path / f"{pdf_file.stem}_table_{i+1}.xlsx"
                table.to_excel(str(output_file))
                
                results.append({
                    'source': str(pdf_file),
                    'table': i + 1,
                    'page': table.page,
                    'accuracy': table.accuracy,
                    'output': str(output_file)
                })
        
        except Exception as e:
            results.append({
                'source': str(pdf_file),
                'error': str(e)
            })
    
    return results

Auto-detect Table Method

import camelot

def smart_extract_tables(pdf_path, pages='1'):
    """Try both methods and return best results."""
    
    # Try lattice first
    lattice_tables = camelot.read_pdf(pdf_path, pages=pages, flavor='lattice')
    
    # Try stream
    stream_tables = camelot.read_pdf(pdf_path, pages=pages, flavor='stream')
    
    # Compare and return best
    results = []
    
    if lattice_tables and lattice_tables[0].accuracy > 70:
        results.extend(lattice_tables)
    elif stream_tables:
        results.extend(stream_tables)
    
    return results

Examples

Example 1: Financial Statement Extraction

import camelot
import pandas as pd

def extract_financial_tables(pdf_path):
    """Extract financial tables from annual report."""
    
    # Extract all tables
    tables = camelot.read_pdf(pdf_path, pages='all', flavor='lattice')
    
    financial_data = {
        'income_statement': None,
        'balance_sheet': None,
        'cash_flow': None,
        'other_tables': []
    }
    
    for table in tables:
        df = table.df
        text = df.to_string().lower()
        
        # Identify table type
        if 'revenue' in text or 'sales' in text:
            if 'operating income' in text or 'net income' in text:
                financial_data['income_statement'] = df
        elif 'asset' in text and 'liabilities' in text:
            financial_data['balance_sheet'] = df
        elif 'cash flow' in text or 'operating activities' in text:
            financial_data['cash_flow'] = df
        else:
            financial_data['other_tables'].append({
                'page': table.page,
                'data': df,
                'accuracy': table.accuracy
            })
    
    return financial_data

financials = extract_financial_tables('annual_report.pdf')
if financials['income_statement'] is not None:
    print("Income Statement found:")
    print(financials['income_statement'])

Example 2: Scientific Data Extraction

import camelot
import pandas as pd

def extract_research_data(pdf_path, pages='all'):
    """Extract data tables from research paper."""
    
    # Try lattice for bordered tables
    tables = camelot.read_pdf(pdf_path, pages=pages, flavor='lattice')
    
    if not tables or all(t.accuracy < 70 for t in tables):
        # Fall back to stream for borderless
        tables = camelot.read_pdf(pdf_path, pages=pages, flavor='stream')
    
    extracted_data = []
    
    for table in tables:
        df = table.df
        
        # Clean up the DataFrame
        # Set first row as header if it looks like one
        if not df.iloc[0].str.contains(r'\d').any():
            df.columns = df.iloc[0]
            df = df[1:]
            df = df.reset_index(drop=True)
        
        extracted_data.append({
            'page': table.page,
            'accuracy': table.accuracy,
            'data': df
        })
    
    return extracted_data

data = extract_research_data('research_paper.pdf')
for i, item in enumerate(data):
    pri

---

*Content truncated.*

More by openclaw

View all skills by openclaw →

a-stock-analysis

openclaw

A股实时行情与分时量能分析。获取沪深股票实时价格、涨跌、成交量，分析分时量能分布（早盘/尾盘放量）、主力动向（抢筹/出货信号）、涨停封单。支持持仓管理和盈亏分析。Use when: (1) 查询A股实时行情, (2) 分析主力资金动向, (3) 查看分时成交量分布, (4) 管理股票持仓, (5) 分析持仓盈亏。

703267

fivem

openclaw

Fix, create, or validate FiveM server resources for QBCore/ESX (config.lua, fxmanifest.lua, items, housing/furniture, scripts, MLOs). Use when asked to debug resource errors, convert ESX↔QB, update fxmanifest versions, add items, or source scripts from GitHub. Also use for SSH key generation for SFTP access.

329193

research-paper-writer

openclaw

Creates formal academic research papers following IEEE/ACM formatting standards with proper structure, citations, and scholarly writing style. Use when the user asks to write a research paper, academic paper, or conference paper on any topic.

74165

keyword-research

openclaw

Discovers high-value keywords with search intent analysis, difficulty assessment, and content opportunity mapping. Essential for starting any SEO or GEO content strategy.

429102

weread

openclaw

WeChat Reading (微信读书) CLI tool for fetching notes and highlights. Use when: (1) user asks about weread/微信读书 notes or highlights, (2) fetching today's or recent reading notes, (3) exporting book highlights, (4) managing reading bookshelf, (5) any task involving reading notes from WeChat Reading.

10384

html-to-ppt

openclaw

Convert HTML/Markdown to PowerPoint presentations using Marp

29679

ui-ux-pro-max

nextlevelbuilder

"UI/UX design intelligence. 50 styles, 21 palettes, 50 font pairings, 20 charts, 8 stacks (React, Next.js, Vue, Svelte, SwiftUI, React Native, Flutter, Tailwind). Actions: plan, build, create, design, implement, review, fix, improve, optimize, enhance, refactor, check UI/UX code. Projects: website, landing page, dashboard, admin panel, e-commerce, SaaS, portfolio, blog, mobile app, .html, .tsx, .vue, .svelte. Elements: button, modal, navbar, sidebar, card, table, form, chart. Styles: glassmorphism, claymorphism, minimalism, brutalism, neumorphism, bento grid, dark mode, responsive, skeuomorphism, flat design. Topics: color palette, accessibility, animation, layout, typography, font pairing, spacing, hover, shadow, gradient."

2,6092,340

flutter-development

aj-geddes

Build beautiful cross-platform mobile apps with Flutter and Dart. Covers widgets, state management with Provider/BLoC, navigation, API integration, and material design.

2,1111,619

pdf-to-markdown

aliceisjustplaying

Convert entire PDF documents to clean, structured Markdown for full context loading. Use this skill when the user wants to extract ALL text from a PDF into context (not grep/search), when discussing or analyzing PDF content in full, when the user mentions "load the whole PDF", "bring the PDF into context", "read the entire PDF", or when partial extraction/grepping would miss important context. This is the preferred method for PDF text extraction over page-by-page or grep approaches.

3,4341,487

drawio-diagrams-enhanced

jgtolentino

Create professional draw.io (diagrams.net) diagrams in XML format (.drawio files) with integrated PMP/PMBOK methodologies, extensive visual asset libraries, and industry-standard professional templates. Use this skill when users ask to create flowcharts, swimlane diagrams, cross-functional flowcharts, org charts, network diagrams, UML diagrams, BPMN, project management diagrams (WBS, Gantt, PERT, RACI), risk matrices, stakeholder maps, or any other visual diagram in draw.io format. This skill includes access to custom shape libraries for icons, clipart, and professional symbols.

2,1961,420

godot

bfollington

This skill should be used when working on Godot Engine projects. It provides specialized knowledge of Godot's file formats (.gd, .tscn, .tres), architecture patterns (component-based, signal-driven, resource-based), common pitfalls, validation tools, code templates, and CLI workflows. The `godot` command is available for running the game, validating scripts, importing resources, and exporting builds. Use this skill for tasks involving Godot game development, debugging scene/resource files, implementing game systems, or creating new Godot components.

2,3131,173

nano-banana-pro

garg-aayush

Generate and edit images using Google's Nano Banana Pro (Gemini 3 Pro Image) API. Use when the user asks to generate, create, edit, modify, change, alter, or update images. Also use when user references an existing image file and asks to modify it in any way (e.g., "modify this image", "change the background", "replace X with Y"). Supports both text-to-image generation and image-to-image editing with configurable resolution (1K default, 2K, or 4K for high resolution). DO NOT read the image file first - use this skill directly with the --input-image parameter.

1,882941

Related MCP Servers

Browse all servers

PDF Reader

Securely extract text, metadata, & pages from PDFs using Adobe Acrobat PDF editor software for local & remote files.

5421 tools

PDF Reader

Securely extract text and page info from PDFs using pdfjs-dist. Works with local files or remote URLs, like Adobe Acrobat PDF Editor.

5421 tools

PageIndex

PageIndex: a reasoning-based RAG system for fast, accurate analysis of long PDFs — extract insights, cite sources, and navigate documents with ease.

2540 tools

Textin MCP Server

Textin MCP Server: OCR to Markdown for images, PDFs & Word docs — fast document OCR, PDF OCR converter and OCR data extraction to extract key info.

280 tools

AWS S3

Access AWS S3 storage to list buckets, browse objects, and extract text from files like PDFs with ease.

220 tools

Firecrawl

Unlock AI-ready web data with Firecrawl: scrape any website, handle dynamic content, and automate web scraping for research or automation.

89,5930 tools

Install

mkdir -p .claude/skills/table-extractor && curl -L -o skill.zip "https://mcp.directory/api/skills/download/1857" && unzip -o skill.zip -d .claude/skills/table-extractor && rm skill.zip

Installs to .claude/skills/table-extractor

Stats

Views

Installs

Author

openclaw

7 skills published

Links

Source Code

table-extractor

Install

About this skill

Table Extractor Skill

Overview

How to Use

Domain Knowledge

camelot Fundamentals

Extraction Methods

Page Selection

Advanced Options

Lattice Options

Stream Options

Table Area Specification

Column Specification

Working with Results

Export Options

Visual Debugging

Handling Multi-page Tables

Best Practices

Common Patterns

Batch Table Extraction

Auto-detect Table Method

Examples

Example 1: Financial Statement Extraction

Example 2: Scientific Data Extraction

More by openclaw

a-stock-analysis

fivem

research-paper-writer

keyword-research

weread

html-to-ppt

You might also like

ui-ux-pro-max

flutter-development

pdf-to-markdown

drawio-diagrams-enhanced

godot

nano-banana-pro

Related MCP Servers