docetl

Name: docetl
Author: ucbepic

9 views

2 installs

Build and run LLM-powered data processing pipelines with DocETL. Use when users say "docetl", want to analyze unstructured data, process documents, extract information, or run ETL tasks on text. Helps with data collection, pipeline creation, execution, and optimization.

Install

mkdir -p .claude/skills/docetl && curl -L -o skill.zip "https://mcp.directory/api/skills/download/2308" && unzip -o skill.zip -d .claude/skills/docetl && rm skill.zip

Installs to .claude/skills/docetl

About this skill

DocETL Pipeline Development

DocETL is a system for creating LLM-powered data processing pipelines. This skill helps you build end-to-end pipelines: from data preparation to execution and optimization.

Workflow Overview: Iterative Data Analysis

Work like a data analyst: write → run → inspect → iterate. Never write every script up front and then run them all together. Complete and validate each phase before moving to the next.

Phase 1: Data Collection

1. Write the data collection script
2. Run it immediately (with user permission)
3. Inspect the dataset and show the user:
   - Total document count
   - Keys/fields in each document
   - Sample documents (first 3-5)
   - Length distribution (avg chars, min/max)
   - Any other relevant statistics
4. Iterate if needed (e.g., collect more data, fix parsing issues)

Phase 2: Pipeline Development

1. Read sample documents to understand the format
2. Write the pipeline YAML with sample: 10-20 for testing
3. Run the test pipeline
4. Inspect intermediate results and show the user:
   - Extraction quality on samples
   - Domain/category distributions
   - Any validation failures
5. Iterate on prompts/schema based on results
6. Remove the sample parameter and run the full pipeline
7. Show final results: distributions, trends, key insights
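
The distribution check in the steps above can be sketched in a few lines of Python. This is only an illustration: the field name `category` and the inline sample records are hypothetical stand-ins for whatever your own output.json actually contains.

```python
from collections import Counter


def field_distribution(records, field):
    """Count how often each value of `field` appears across records.

    List-valued fields are flattened, so every list item is counted.
    Records that lack the field are tallied under "<missing>".
    """
    counts = Counter()
    for record in records:
        value = record.get(field)
        if value is None:
            counts["<missing>"] += 1
        elif isinstance(value, list):
            counts.update(str(v) for v in value)
        else:
            counts[str(value)] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical records; in practice load them with json.load from output.json
    sample = [
        {"id": 1, "category": "billing"},
        {"id": 2, "category": "billing"},
        {"id": 3, "category": "shipping"},
        {"id": 4},
    ]
    for value, n in field_distribution(sample, "category").most_common():
        print(f"{value}: {n}")
```

A large "<missing>" bucket is usually a sign that the prompt or schema needs another iteration before the full run.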

Phase 3: Visualization & Presentation

1. Write the visualization script based on the actual output structure
2. Run it and show the report to the user
3. Iterate on charts/tables if needed

Visualization Aesthetics:

Clean and minimalist - no clutter, generous whitespace
Warm and elegant color theme - 1-2 accent colors max
Subtle borders - not too rounded (border-radius: 8-10px max)
Sans-serif fonts - system fonts like -apple-system, Segoe UI, Roboto
"Created by DocETL" - add subtitle after the main title
Mix of charts and tables - charts for distributions, tables for detailed summaries
Light background - off-white (#f5f5f5) with white cards for content

Report Structure:

Title + "Created by DocETL" subtitle
Key stats cards (document count, categories, etc.)
Distribution charts (bar charts, pie charts)
Summary table with detailed analysis
Minimal footer

Interactive Tables:

All truncated content must be expandable - never use static "..." truncation
Long text: Show first ~250 chars with "(show more)" toggle
Long lists: Show first 4-6 items with "(+N more)" toggle
Use JavaScript to toggle visibility, not page reloads
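
One possible way to generate the expandable text described above from a Python report script is sketched below. The helper name `expandable_text`, the `toggleMore` JavaScript function, and the element id scheme are all made up for this example; the 250-character default follows the guideline above.

```python
import html
from itertools import count

_ids = count(1)  # source of unique element ids within one report

# Include this once in the page; it flips the hidden part on click.
TOGGLE_JS = """
<script>
function toggleMore(link) {
  var el = document.getElementById(link.dataset.target);
  el.hidden = !el.hidden;
  link.textContent = el.hidden ? "(show more)" : "(show less)";
  return false;  // keep the browser from following the "#" link
}
</script>
"""


def expandable_text(text, limit=250):
    """Return HTML showing the first `limit` chars with a "(show more)" toggle.

    Text that already fits is returned escaped, with no toggle at all.
    """
    if len(text) <= limit:
        return html.escape(text)
    uid = f"more-{next(_ids)}"
    head = html.escape(text[:limit])
    tail = html.escape(text[limit:])
    return (
        f'{head}<span id="{uid}" hidden>{tail}</span> '
        f'<a href="#" data-target="{uid}" onclick="return toggleMore(this)">'
        "(show more)</a>"
    )
```

The full text is always present in the page, so nothing is lost to truncation; the toggle only changes visibility.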

Source Document Links:

Link aggregated results to source documents - users should be able to drill down
Clickable links that open a modal/popup with source content
Modal should show: extracted fields + original source text
Original text can be collapsed by default with "Show original" toggle
Embed source data as JSON in the page for JavaScript access
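
Embedding the source data can be sketched as follows. The function name `embed_json` and the element id `source-data` are assumptions chosen for this example, not part of DocETL.

```python
import json


def embed_json(data, element_id="source-data"):
    """Return a <script type="application/json"> tag holding `data`.

    "<" is written as the JSON escape \\u003c so that source text containing
    "</script>" cannot end the tag early. JSON.parse decodes it back to "<".
    """
    payload = json.dumps(data, ensure_ascii=False).replace("<", "\\u003c")
    return f'<script type="application/json" id="{element_id}">{payload}</script>'


if __name__ == "__main__":
    docs = [{"id": 1, "text": "Original text with a </script> inside"}]
    print(embed_json(docs))
```

On the page, the data can then be read with `JSON.parse(document.getElementById("source-data").textContent)` when a modal is opened.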

Key principle: The user should see results at every step. Don't proceed to the next phase until the current phase produces good results.

Step 1: Data Preparation

DocETL datasets must be JSON arrays or CSV files.

JSON Format

[
  {"id": 1, "text": "First document content...", "metadata": "value"},
  {"id": 2, "text": "Second document content...", "metadata": "value"}
]

CSV Format

id,text,metadata
1,"First document content...","value"
2,"Second document content...","value"
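
To check a file against these two formats before building a pipeline on it, a small loader along these lines may help. It is a sketch using only the standard library; `load_dataset` is a hypothetical helper, not a DocETL function. Note that CSV values always load as strings.

```python
import csv
import json
from pathlib import Path


def load_dataset(path):
    """Load a JSON-array or CSV dataset into a list of dicts."""
    path = Path(path)
    suffix = path.suffix.lower()
    if suffix == ".json":
        with path.open(encoding="utf-8") as f:
            data = json.load(f)
        # The top level must be an array whose items are all objects
        if not isinstance(data, list) or not all(isinstance(d, dict) for d in data):
            raise ValueError(f"{path} must contain a JSON array of objects")
        return data
    if suffix == ".csv":
        # newline="" lets the csv module handle quoted line breaks correctly
        with path.open(newline="", encoding="utf-8") as f:
            return list(csv.DictReader(f))
    raise ValueError(f"Unsupported dataset type: {suffix!r}")
```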

Data Collection Scripts

If the user needs to collect data, write a Python script:

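import json

# Placeholder: replace with your own iterable of source records.
# Each record is assumed to expose .id and .content attributes.
sources = []

# Collect/transform data
documents = []
for source in sources:
    documents.append({
        "id": source.id,
        "text": source.content,  # DO NOT truncate text
        # Add relevant fields
    })

# Save as DocETL dataset
with open("dataset.json", "w", encoding="utf-8") as f:
    json.dump(documents, f, indent=2, ensure_ascii=False)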

Important: Never truncate document text in collection scripts. DocETL operations like split handle long documents properly. Truncation loses information.

After Running Data Collection

Always run the collection script and inspect results before proceeding. Show the user:

import json

with open("dataset.json") as f:
    data = json.load(f)

print(f"Total documents: {len(data)}")
print(f"Keys: {list(data[0].keys())}")
# Average size of a whole record (all fields), not only its text
print(f"Avg length: {sum(len(str(d)) for d in data) // len(data)} chars")

# Show sample (first 500 characters of the first record)
print("\nSample document:")
print(json.dumps(data[0], indent=2)[:500])

Only proceed to pipeline development once the data looks correct.
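
The Phase 1 checklist also asks for min/max lengths and the keys present in each document. A slightly fuller summary could look like the sketch below; `describe_dataset` is a hypothetical helper, and the default field name `text` is an assumption about your data.

```python
from collections import Counter
from statistics import mean


def describe_dataset(docs, text_field="text"):
    """Summarize document count, text length spread, and per-key coverage."""
    lengths = [len(str(d.get(text_field, ""))) for d in docs]
    # How many documents contain each key; gaps reveal inconsistent records
    key_counts = Counter(key for d in docs for key in d)
    return {
        "count": len(docs),
        "min_chars": min(lengths, default=0),
        "avg_chars": round(mean(lengths)) if lengths else 0,
        "max_chars": max(lengths, default=0),
        "key_coverage": dict(key_counts),
    }


if __name__ == "__main__":
    # Hypothetical records standing in for the contents of dataset.json
    docs = [{"id": 1, "text": "abcd"}, {"id": 2, "text": "ab", "extra": 1}]
    for name, value in describe_dataset(docs).items():
        print(f"{name}: {value}")
```

A key whose coverage is lower than the document count is worth raising with the user before any prompt refers to it.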

Step 2: Read and Understand the Data

CRITICAL: Before writing any prompts, READ the actual input data to understand:

The structure and format of documents
The vocabulary and terminology used
What information is present vs. absent
Edge cases and variations

import json
with open("dataset.json") as f:
    data = json.load(f)
# Examine several examples
for doc in data[:5]:
    print(doc)

This understanding is essential for writing specific, effective prompts.

Step 3: Pipeline Structure

Create a YAML file with this structure:

default_model: gpt-5-nano

system_prompt:
  dataset_description: <describe the data based on what you observed>
  persona: <role for the LLM to adopt>

datasets:
  input_data:
    type: file
    path: "dataset.json"  # or dataset.csv

operations:
  - name: <operation_name>
    type: <operation_type>
    prompt: |
      <Detailed, specific prompt based on the actual data>
    output:
      schema:
        <field_name>: <type>

pipeline:
  steps:
    - name: process
      input: input_data
      operations:
        - <operation_name>
  output:
    type: file
    path: "output.json"
    intermediate_dir: "intermediates"  # ALWAYS set this for debugging

Key Configuration

default_model: Use gpt-5-nano or gpt-5-mini for extraction/map operations
intermediate_dir: Always set to log intermediate results
system_prompt: Describe the data based on what you actually observed
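
Before spending money on a run, a quick structural check of the parsed YAML can catch simple mistakes. The sketch below works on a plain dict (for example, the result of parsing the YAML file with a YAML library of your choice) and only covers the layout shown above: operations referenced by name, and intermediate_dir under pipeline.output. `check_pipeline_config` is a hypothetical helper, not part of DocETL.

```python
def check_pipeline_config(config):
    """Return a list of human-readable problems found in a parsed config dict."""
    problems = []
    defined = {op.get("name") for op in config.get("operations", [])}
    pipeline = config.get("pipeline", {})

    for step in pipeline.get("steps", []):
        for op_name in step.get("operations", []):
            # Only plain string references are checked in this sketch
            if isinstance(op_name, str) and op_name not in defined:
                problems.append(
                    f"step {step.get('name')!r} uses undefined operation {op_name!r}"
                )

    if not pipeline.get("output", {}).get("intermediate_dir"):
        problems.append("pipeline.output.intermediate_dir is not set")
    return problems


if __name__ == "__main__":
    config = {
        "operations": [{"name": "extract_info", "type": "map"}],
        "pipeline": {
            "steps": [{"name": "process", "input": "input_data",
                       "operations": ["extract_info"]}],
            "output": {"type": "file", "path": "output.json",
                       "intermediate_dir": "intermediates"},
        },
    }
    print(check_pipeline_config(config) or "no problems found")
```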

Model Selection by Operation Type

| Operation Type | Recommended Model | Rationale |
| --- | --- | --- |
| Map (extraction) | `gpt-5-nano` or `gpt-5-mini` | High volume, simple per-doc tasks |
| Filter | `gpt-5-nano` | Simple yes/no decisions |
| Reduce (summarization) | `gpt-4.1` or `gpt-5.1` | Complex synthesis across many docs |
| Resolve (deduplication) | `gpt-5-nano` or `gpt-5-mini` | Simple pairwise comparisons |

Use cheaper models for high-volume extraction, and more capable models for synthesis/summarization where quality matters most.
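
If you generate pipeline configs from a script, the table above can be applied mechanically. The sketch below picks the first model listed for each operation type; `assign_models` and the mapping name are hypothetical, and the model names are simply copied from the table.

```python
# First recommendation per operation type, copied from the table above.
RECOMMENDED_MODELS = {
    "map": "gpt-5-nano",
    "filter": "gpt-5-nano",
    "reduce": "gpt-4.1",
    "resolve": "gpt-5-nano",
}


def assign_models(operations, default_model="gpt-5-nano"):
    """Return copies of the operations with `model` filled in where missing.

    An operation that already names a model is left unchanged.
    """
    result = []
    for op in operations:
        op = dict(op)  # copy so the caller's config is not mutated
        op.setdefault("model", RECOMMENDED_MODELS.get(op.get("type"), default_model))
        result.append(op)
    return result


if __name__ == "__main__":
    ops = [
        {"name": "extract_info", "type": "map"},
        {"name": "summarize_by_category", "type": "reduce"},
    ]
    for op in assign_models(ops):
        print(op["name"], "->", op["model"])
```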

Step 4: Writing Effective Prompts

Prompts must be specific to the data, not generic. Write them only after reading the input data.

Bad (Generic) Prompt

prompt: |
  Extract key information from this document.
  {{ input.text }}

Good (Specific) Prompt

prompt: |
  You are analyzing a medical transcript from a doctor-patient visit.

  The transcript follows this format:
  - Doctor statements are prefixed with "DR:"
  - Patient statements are prefixed with "PT:"
  - Timestamps appear in brackets like [00:05:23]

  From the following transcript, extract:
  1. All medications mentioned (brand names or generic)
  2. Dosages if specified
  3. Patient-reported side effects or concerns

  Transcript:
  {{ input.transcript }}

  Be thorough - patients often mention medication names informally.
  If a medication is unclear, include it with a note.

Prompt Writing Guidelines

Describe the data format you observed
Be specific about what to extract - list exact fields
Mention edge cases you noticed in the data
Provide examples if the task is ambiguous
Set expectations for handling missing/unclear information
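
One mechanical check that complements these guidelines is confirming that every `{{ input.<field> }}` a prompt references actually exists in the data. The sketch below uses a simple regular expression rather than a real Jinja2 parser, so it only recognizes the plain `input.field` form; `missing_prompt_fields` is a hypothetical helper.

```python
import re

# Matches the field name in "{{ input.field }}" (plain attribute access only)
INPUT_FIELD = re.compile(r"\{\{\s*input\.([A-Za-z_]\w*)")


def missing_prompt_fields(prompt, docs):
    """Map each referenced field to the number of documents that lack it.

    Fields present in every document are omitted from the result.
    """
    referenced = sorted(set(INPUT_FIELD.findall(prompt)))
    missing = {}
    for field in referenced:
        n_missing = sum(1 for d in docs if field not in d)
        if n_missing:
            missing[field] = n_missing
    return missing


if __name__ == "__main__":
    prompt = "Transcript:\n{{ input.transcript }}"
    # Hypothetical documents: the first one uses "text" instead of "transcript"
    docs = [{"id": 1, "text": "..."}, {"id": 2, "transcript": "..."}]
    print(missing_prompt_fields(prompt, docs))
```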

Step 5: Choosing Operations

Many tasks need only a single map operation. Pick the simplest approach that fits the task:

| Task | Recommended Approach |
| --- | --- |
| Extract info from each doc | Single `map` |
| Multiple extractions | Multiple `map` operations chained |
| Extract then summarize | `map` → `reduce` |
| Filter then process | `filter` → `map` |
| Split long docs | `split` → `map` → `reduce` |
| Deduplicate entities | `map` → `unnest` → `resolve` |

Operation Reference

Map Operation

Applies an LLM transformation to each document independently.

- name: extract_info
  type: map
  prompt: |
    Analyze this document:
    {{ input.text }}

    Extract the main topic and 3 key points.
  output:
    schema:
      topic: string
      key_points: list[string]
  model: gpt-5-nano  # optional, uses default_model if not set
  skip_on_error: true  # recommended for large-scale runs
  validate:  # optional
    - len(output["key_points"]) == 3
  num_retries_on_validate_failure: 2  # optional

Key parameters:

prompt: Jinja2 template, use {{ input.field }} to reference fields
output.schema: Define output structure
skip_on_error: Set true to continue on LLM errors (recommended at scale)
validate: Python expressions to validate output
sample: Process only N documents (for testing)
limit: Stop after producing N outputs
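
To get a feel for how a validate expression behaves before running the pipeline, you can try it against a hand-written sample output. The sketch below is an illustration only and makes no claim about how DocETL evaluates these expressions internally; it assumes the expression sees the operation's result as a dict named `output`, as in the example above.

```python
def failed_validations(output, expressions):
    """Return the expressions that are falsy, or raise, for this output dict."""
    failures = []
    for expr in expressions:
        try:
            # eval is tolerable here only because the expressions come from
            # your own pipeline file, never from untrusted input.
            ok = bool(eval(expr, {"output": output}))
        except Exception:
            ok = False  # e.g. a missing key counts as a failed validation
        if not ok:
            failures.append(expr)
    return failures


if __name__ == "__main__":
    rules = ['len(output["key_points"]) == 3']
    sample = {"topic": "pricing", "key_points": ["a", "b"]}  # hypothetical output
    print(failed_validations(sample, rules))
```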

Filter Operation

Keeps or removes documents based on LLM criteria. Output schema must have exactly one boolean field.

- name: filter_relevant
  type: filter
  skip_on_error: true
  prompt: |
    Document: {{ input.text }}

    Is this document relevant to climate change?
    Respond true or false.
  output:
    schema:
      is_relevant: boolean
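
The one-boolean-field rule can be checked on a parsed schema with a few lines of Python. This is a sketch: `filter_decision_field` is a hypothetical helper, and accepting `bool` as a spelling alongside `boolean` is an assumption.

```python
def filter_decision_field(schema):
    """Return the name of the single boolean field in a filter output schema.

    Raises ValueError when the schema has zero or several boolean fields.
    """
    boolean_fields = [
        name for name, type_name in schema.items()
        if str(type_name).strip().lower() in {"boolean", "bool"}
    ]
    if len(boolean_fields) != 1:
        raise ValueError(
            f"expected exactly one boolean field, found {len(boolean_fields)}"
        )
    return boolean_fields[0]


if __name__ == "__main__":
    print(filter_decision_field({"is_relevant": "boolean"}))
```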

Reduce Operation

Aggregates documents by a key using an LLM.

Always include fold_prompt and fold_batch_size for reduce operations. This handles cases where the group is too large to fit in context.
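
The idea behind folding can be sketched conceptually: instead of handing the whole group to the model at once, the group is processed a batch at a time, carrying a running result forward. The code below illustrates that pattern only and is not DocETL's implementation; `fold_fn` stands in for an LLM call driven by a fold prompt, and `count_words` is a toy replacement for it.

```python
def fold_reduce(items, fold_fn, batch_size, initial=None):
    """Fold `items` into one result, `batch_size` items at a time.

    `fold_fn(accumulated, batch)` receives the running result (or `initial`
    on the first call) plus the next batch, and returns the updated result.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    accumulated = initial
    for start in range(0, len(items), batch_size):
        batch = items[start:start + batch_size]
        accumulated = fold_fn(accumulated, batch)
    return accumulated


def count_words(accumulated, batch):
    """Toy fold step: add the batch's word count to the running total."""
    return (accumulated or 0) + sum(len(text.split()) for text in batch)


if __name__ == "__main__":
    texts = ["a b", "c", "d e f"]
    print(fold_reduce(texts, count_words, batch_size=2))
```

Each call sees at most `batch_size` new items, which is what keeps a very large group within the model's context window.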

- name: summarize_by_category
  type: reduce
  reduce_key: category  # use "_all" to aggregate everything
  skip_on_error: true
  prompt: |
    Summarize these {{ inputs | length }} items for category "{{ inputs[0].category }}":

    {% for item in inputs %}
    - {{ 

---

*Content truncated.*
