table-extractor
Extract tables from PDFs with high accuracy using camelot - handles complex table structures
Install
mkdir -p .claude/skills/table-extractor && curl -L -o skill.zip "https://mcp.directory/api/skills/download/1857" && unzip -o skill.zip -d .claude/skills/table-extractor && rm skill.zip
Installs to .claude/skills/table-extractor
About this skill
Table Extractor Skill
Overview
This skill extracts tables from PDF documents using camelot, a Python library built for PDF table extraction. It handles bordered tables with merged cells, borderless tables, and multi-page layouts. camelot reads text-based PDFs; scanned documents need OCR first.
How to Use
- Provide the PDF containing tables
- Optionally specify pages or table detection method
- I'll extract tables as pandas DataFrames
Example prompts:
- "Extract all tables from this PDF"
- "Get the table on page 5 of this report"
- "Extract borderless tables from this document"
- "Convert PDF tables to Excel format"
Domain Knowledge
camelot Fundamentals
import camelot
# Extract tables from PDF
tables = camelot.read_pdf('document.pdf')
# Access results
print(f"Found {len(tables)} tables")
# Get first table as DataFrame
df = tables[0].df
print(df)
Extraction Methods
| Method | Use Case | Description |
|---|---|---|
| lattice | Bordered tables | Detects table by lines/borders |
| stream | Borderless tables | Uses text positioning |
# Lattice method (default) - for tables with visible borders
tables = camelot.read_pdf('document.pdf', flavor='lattice')
# Stream method - for borderless tables
tables = camelot.read_pdf('document.pdf', flavor='stream')
Page Selection
# Single page
tables = camelot.read_pdf('document.pdf', pages='1')
# Multiple pages
tables = camelot.read_pdf('document.pdf', pages='1,3,5')
# Page range
tables = camelot.read_pdf('document.pdf', pages='1-5')
# All pages
tables = camelot.read_pdf('document.pdf', pages='all')
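When page numbers come from code rather than being typed by hand, the `pages` string can be built programmatically. The following is a minimal sketch, assuming a hypothetical helper named `pages_arg` (it is not part of camelot) that collapses a list of integers into the comma-and-range syntax shown above:

```python
def pages_arg(page_numbers):
    """Collapse page numbers into a string such as '1,3-5'.

    Hypothetical helper, not part of camelot.
    """
    pages = sorted(set(page_numbers))
    if not pages:
        raise ValueError("at least one page number is required")
    if pages[0] < 1:
        raise ValueError("page numbers start at 1")

    parts = []
    start = prev = pages[0]
    for page in pages[1:]:
        if page == prev + 1:
            # Still inside a consecutive run
            prev = page
            continue
        # Run ended: emit it as 'n' or 'n-m'
        parts.append(str(start) if start == prev else f"{start}-{prev}")
        start = prev = page
    parts.append(str(start) if start == prev else f"{start}-{prev}")
    return ",".join(parts)
```

The result can be passed straight through, for example `camelot.read_pdf('document.pdf', pages=pages_arg([1, 3, 4, 5]))`.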
Advanced Options
Lattice Options
tables = camelot.read_pdf(
    'document.pdf',
    flavor='lattice',
    line_scale=40,             # Line detection sensitivity (larger values detect shorter lines)
    copy_text=['h', 'v'],      # Copy text across spanning cells, horizontally and vertically
    shift_text=['l', 't'],     # Direction text flows in a spanning cell (left, top)
    split_text=True,           # Split text that spans multiple cells
    flag_size=True,            # Flag super/subscripts by font size
    strip_text='\n',           # Characters to strip
    process_background=False,  # Process background lines
)
Stream Options
tables = camelot.read_pdf(
    'document.pdf',
    flavor='stream',
    edge_tol=500,     # Tolerance for extending text edges vertically
    row_tol=10,       # Tolerance for grouping text into rows
    column_tol=0,     # Tolerance for grouping text into columns
    strip_text='\n',  # Characters to strip
)
Table Area Specification
# Extract from a specific area given as 'x1,y1,x2,y2',
# where (x1, y1) is the top-left corner and (x2, y2) the bottom-right corner
# Coordinates are measured from the page's bottom-left, in PDF points (72 points = 1 inch)
tables = camelot.read_pdf(
    'document.pdf',
    table_areas=['72,720,540,400'],  # One area
)
# Multiple areas
tables = camelot.read_pdf(
    'document.pdf',
    table_areas=['72,720,540,400', '72,380,540,200'],
)
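Many image viewers and screenshot tools report positions from the top-left corner of the page, while `table_areas` expects PDF coordinates measured from the bottom-left. The following is a minimal sketch of the conversion, assuming a hypothetical helper `area_from_top_left` and a US Letter page height of 792 points unless another height is passed:

```python
def area_from_top_left(left, top, right, bottom, page_height=792.0):
    """Convert a top-left-origin box (in points) to a table_areas string.

    Hypothetical helper, not part of camelot. page_height defaults to
    US Letter (792 pt); pass the real page height for other sizes.
    """
    if not (left < right and top < bottom):
        raise ValueError("expected left < right and top < bottom")
    y_top = page_height - top        # flip the vertical axis
    y_bottom = page_height - bottom
    # Order matches 'x1,y1,x2,y2': top-left corner, then bottom-right
    return ",".join(f"{v:g}" for v in (left, y_top, right, y_bottom))


area = area_from_top_left(72, 72, 540, 392)
print(area)  # 72,720,540,400
```

The printed string matches the single-area example above and can be passed as `table_areas=[area]`.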
Column Specification
# Manually specify column positions (for stream method)
tables = camelot.read_pdf(
    'document.pdf',
    flavor='stream',
    columns=['100,200,300,400'],  # X positions of column separators
)
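The `columns` argument is a list of strings, each holding comma-separated x-coordinates. The sketch below assumes one string per table area, which is how the camelot documentation describes it, and uses a hypothetical helper `columns_arg` to build that list from numeric separator positions:

```python
def columns_arg(separators_per_area):
    """Build the `columns` list: one comma-separated string per table area.

    Hypothetical helper, not part of camelot. Each inner sequence holds
    the x-coordinates of the column separators for one area.
    """
    result = []
    for separators in separators_per_area:
        xs = sorted(float(x) for x in separators)
        if len(set(xs)) != len(xs):
            raise ValueError("duplicate column separator")
        result.append(",".join(f"{x:g}" for x in xs))
    return result


print(columns_arg([[300, 100, 200, 400]]))  # ['100,200,300,400']
```

Sorting the separators left to right keeps the string readable and catches accidental duplicates before the PDF is parsed.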
Working with Results
import camelot
tables = camelot.read_pdf('document.pdf')
for i, table in enumerate(tables):
    # Access DataFrame
    df = table.df
    # Table metadata
    print(f"Table {i+1}:")
    print(f" Page: {table.page}")
    print(f" Accuracy: {table.accuracy}")
    print(f" Whitespace: {table.whitespace}")
    print(f" Order: {table.order}")
    print(f" Shape: {df.shape}")
    # Parsing report
    report = table.parsing_report
    print(f" Report: {report}")
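The metadata above can also drive an automatic review step. The following is a minimal sketch, assuming each report is a dict with `accuracy` and `whitespace` keys (the same fields printed above). The helper name `flag_for_review` and both thresholds are illustrative assumptions, not camelot defaults:

```python
def flag_for_review(reports, min_accuracy=90.0, max_whitespace=50.0):
    """Return the reports that fall outside the given thresholds.

    Hypothetical helper. Each report is a dict shaped like
    table.parsing_report; thresholds are illustrative only.
    """
    return [
        r for r in reports
        if r["accuracy"] < min_accuracy or r["whitespace"] > max_whitespace
    ]


sample = [
    {"accuracy": 99.1, "whitespace": 12.0, "order": 1, "page": 1},
    {"accuracy": 71.4, "whitespace": 8.0, "order": 1, "page": 2},
    {"accuracy": 95.0, "whitespace": 64.0, "order": 2, "page": 2},
]
for report in flag_for_review(sample):
    print(report["page"], report["order"])
```

With real results, the input would be `[t.parsing_report for t in tables]`. A high whitespace percentage often means the detected area is larger than the table, which is worth checking visually.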
Export Options
import camelot
tables = camelot.read_pdf('document.pdf')
# Export to CSV
tables[0].to_csv('table.csv')
# Export to Excel
tables[0].to_excel('table.xlsx')
# Export to JSON
tables[0].to_json('table.json')
# Export to HTML
tables[0].to_html('table.html')
# Export all tables
for i, table in enumerate(tables):
    table.to_excel(f'table_{i+1}.xlsx')
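When the output format is chosen at runtime, the export loop can dispatch on a format name. The following is a minimal sketch, assuming a hypothetical helper `export_tables` that relies only on the `to_csv`, `to_excel`, `to_json`, and `to_html` methods shown above:

```python
from pathlib import Path

# Maps the method suffix (to_<key>) to a file extension
SUPPORTED_FORMATS = {"csv": "csv", "excel": "xlsx", "json": "json", "html": "html"}


def export_tables(tables, out_dir, fmt="csv", prefix="table"):
    """Write every table to out_dir using its to_<fmt> method.

    Hypothetical helper, not part of camelot. Returns the written paths.
    """
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported format: {fmt!r}")
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for i, table in enumerate(tables, start=1):
        path = out / f"{prefix}_{i}.{SUPPORTED_FORMATS[fmt]}"
        getattr(table, f"to_{fmt}")(str(path))
        written.append(path)
    return written
```

For example, `export_tables(tables, 'out', fmt='excel')` would write `out/table_1.xlsx`, `out/table_2.xlsx`, and so on.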
Visual Debugging
import camelot
# Plotting requires matplotlib, an optional camelot dependency
tables = camelot.read_pdf('document.pdf')
# Plot detected table areas
camelot.plot(tables[0], kind='contour').show()
# Plot text on table
camelot.plot(tables[0], kind='text').show()
# Plot detected lines (lattice only)
camelot.plot(tables[0], kind='joint').show()
camelot.plot(tables[0], kind='line').show()
# Save plot
fig = camelot.plot(tables[0])
fig.savefig('debug.png')
Handling Multi-page Tables
import camelot
import pandas as pd
def extract_multipage_table(pdf_path, pages='all'):
    """Extract and combine tables that span multiple pages."""
    tables = camelot.read_pdf(pdf_path, pages=pages)
    # Group tables by column labels. camelot labels columns 0..n-1,
    # so this groups tables that have the same number of columns.
    table_groups = {}
    for table in tables:
        cols = tuple(table.df.columns)
        if cols not in table_groups:
            table_groups[cols] = []
        table_groups[cols].append(table.df)
    # Combine similar tables
    combined = []
    for cols, dfs in table_groups.items():
        if len(dfs) > 1:
            # Drop rows on later pages that repeat the first table's header row
            header = dfs[0].iloc[0]
            parts = [dfs[0]] + [d[~d.eq(header).all(axis=1)] for d in dfs[1:]]
            combined_df = pd.concat(parts, ignore_index=True)
            combined.append(combined_df)
        else:
            combined.append(dfs[0])
    return combined
Best Practices
- Try Both Methods: Lattice for bordered, stream for borderless
- Check Accuracy Score: Above 90% is usually good
- Use Visual Debugging: Understand extraction results
- Specify Areas: For PDFs with multiple table types
- Handle Headers: First row often needs special treatment
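The header advice can be illustrated without pandas. The following is a minimal sketch, assuming the heuristic that a header row contains no digits. The helper name `promote_header` is hypothetical, and the rows are plain lists of strings such as those returned by `table.df.values.tolist()`:

```python
import re


def promote_header(rows):
    """Split rows into (header, body) when the first row looks like a header.

    Heuristic assumption: a header row contains no digits. Returns
    (None, rows) when the first row looks like data instead.
    """
    if not rows:
        return None, []
    first = [str(cell).strip() for cell in rows[0]]
    if any(re.search(r"\d", cell) for cell in first):
        return None, rows
    return first, rows[1:]


header, body = promote_header([["Item", "Qty"], ["Bolt", "12"], ["Nut", "40"]])
print(header)  # ['Item', 'Qty']
print(body)    # [['Bolt', '12'], ['Nut', '40']]
```

The digit test misfires on headers such as "Q1 2024", so the result is worth spot-checking on real documents.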
Common Patterns
Batch Table Extraction
import camelot
from pathlib import Path
import pandas as pd
def batch_extract_tables(input_dir, output_dir):
    """Extract tables from all PDFs in directory."""
    input_path = Path(input_dir)
    output_path = Path(output_dir)
    output_path.mkdir(exist_ok=True)
    results = []
    for pdf_file in input_path.glob('*.pdf'):
        try:
            tables = camelot.read_pdf(str(pdf_file), pages='all')
            for i, table in enumerate(tables):
                # Skip low accuracy tables
                if table.accuracy < 80:
                    continue
                output_file = output_path / f"{pdf_file.stem}_table_{i+1}.xlsx"
                table.to_excel(str(output_file))
                results.append({
                    'source': str(pdf_file),
                    'table': i + 1,
                    'page': table.page,
                    'accuracy': table.accuracy,
                    'output': str(output_file)
                })
        except Exception as e:
            results.append({
                'source': str(pdf_file),
                'error': str(e)
            })
    return results
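The list returned by `batch_extract_tables` mixes success and error entries, so a short summary step helps when processing many files. The following is a minimal sketch, assuming the dict keys used above. The helper name `summarize_batch` is hypothetical:

```python
from collections import Counter


def summarize_batch(results):
    """Summarize the list returned by batch_extract_tables.

    Entries with an 'error' key count as failed files; all other
    entries count as exported tables.
    """
    failed = [r["source"] for r in results if "error" in r]
    exported = [r for r in results if "error" not in r]
    per_file = Counter(r["source"] for r in exported)
    mean_accuracy = (
        sum(r["accuracy"] for r in exported) / len(exported) if exported else None
    )
    return {
        "tables_exported": len(exported),
        "files_failed": failed,
        "tables_per_file": dict(per_file),
        "mean_accuracy": mean_accuracy,
    }
```

Note that PDFs whose tables were all skipped for low accuracy leave no entry in the results list, so they appear in neither count.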
Auto-detect Table Method
import camelot
def smart_extract_tables(pdf_path, pages='1'):
    """Try both methods and return best results."""
    # Try lattice first
    lattice_tables = camelot.read_pdf(pdf_path, pages=pages, flavor='lattice')
    # Try stream
    stream_tables = camelot.read_pdf(pdf_path, pages=pages, flavor='stream')
    # Compare and return best
    results = []
    if lattice_tables and lattice_tables[0].accuracy > 70:
        results.extend(lattice_tables)
    elif stream_tables:
        results.extend(stream_tables)
    return results
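The function above judges the lattice run by its first table only. One alternative is to compare the mean accuracy of both result sets. The following is a minimal sketch of that variant, assuming only that each table exposes an `accuracy` attribute. The names `mean_accuracy` and `pick_flavor` and the 70 threshold are illustrative:

```python
def mean_accuracy(tables):
    """Average accuracy over a sequence of tables; 0.0 when empty."""
    scores = [t.accuracy for t in tables]
    return sum(scores) / len(scores) if scores else 0.0


def pick_flavor(lattice_tables, stream_tables, min_accuracy=70.0):
    """Return (flavor_name, tables) for the result set with the higher mean.

    Hypothetical helper. Ties go to lattice. Returns (None, []) when
    neither set reaches min_accuracy.
    """
    candidates = [("lattice", lattice_tables), ("stream", stream_tables)]
    name, tables = max(candidates, key=lambda c: mean_accuracy(c[1]))
    if mean_accuracy(tables) < min_accuracy:
        return None, []
    return name, list(tables)
```

Accuracy scores from the two methods are computed differently, so this comparison is a heuristic and not a guarantee that the chosen result is better.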
Examples
Example 1: Financial Statement Extraction
import camelot
import pandas as pd
def extract_financial_tables(pdf_path):
    """Extract financial tables from annual report."""
    # Extract all tables
    tables = camelot.read_pdf(pdf_path, pages='all', flavor='lattice')
    financial_data = {
        'income_statement': None,
        'balance_sheet': None,
        'cash_flow': None,
        'other_tables': []
    }
    for table in tables:
        df = table.df
        text = df.to_string().lower()
        # Identify table type; anything unmatched goes to other_tables
        has_revenue = 'revenue' in text or 'sales' in text
        has_income = 'operating income' in text or 'net income' in text
        if has_revenue and has_income:
            financial_data['income_statement'] = df
        elif 'asset' in text and 'liabilities' in text:
            financial_data['balance_sheet'] = df
        elif 'cash flow' in text or 'operating activities' in text:
            financial_data['cash_flow'] = df
        else:
            financial_data['other_tables'].append({
                'page': table.page,
                'data': df,
                'accuracy': table.accuracy
            })
    return financial_data

financials = extract_financial_tables('annual_report.pdf')
if financials['income_statement'] is not None:
    print("Income Statement found:")
    print(financials['income_statement'])
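The keyword rules in this example can be pulled into a standalone function so they can be checked without a PDF. The following is a minimal sketch, assuming the same keywords as the example. The function name `classify_statement` is hypothetical, and the rules are simple substring tests, so real reports will likely need a longer keyword list:

```python
def classify_statement(text):
    """Guess the statement type from a table's text using keyword rules.

    Returns 'income_statement', 'balance_sheet', 'cash_flow', or 'other'.
    """
    text = text.lower()

    def has_any(*words):
        return any(word in text for word in words)

    if has_any("revenue", "sales") and has_any("operating income", "net income"):
        return "income_statement"
    if "asset" in text and "liabilities" in text:
        return "balance_sheet"
    if has_any("cash flow", "operating activities"):
        return "cash_flow"
    return "other"


print(classify_statement("Revenue 1,200  Net income 310"))
print(classify_statement("Headcount by region"))
```

With camelot results, the input would be `table.df.to_string()`, as in the example above.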
Example 2: Scientific Data Extraction
import camelot
import pandas as pd
def extract_research_data(pdf_path, pages='all'):
    """Extract data tables from research paper."""
    # Try lattice for bordered tables
    tables = camelot.read_pdf(pdf_path, pages=pages, flavor='lattice')
    if not tables or all(t.accuracy < 70 for t in tables):
        # Fall back to stream for borderless
        tables = camelot.read_pdf(pdf_path, pages=pages, flavor='stream')
    extracted_data = []
    for table in tables:
        df = table.df
        # Clean up the DataFrame
        # Set first row as header if it looks like one
        if not df.iloc[0].str.contains(r'\d').any():
            df.columns = df.iloc[0]
            df = df[1:]
            df = df.reset_index(drop=True)
        extracted_data.append({
            'page': table.page,
            'accuracy': table.accuracy,
            'data': df
        })
    return extracted_data

data = extract_research_data('research_paper.pdf')
for i, item in enumerate(data):
    pri
---
*Content truncated.*