hugging-face-datasets

4views

1installs

Create and manage datasets on Hugging Face Hub. Supports initializing repos, defining configs/system prompts, streaming row updates, and SQL-based dataset querying/transformation. Designed to work alongside HF MCP server for comprehensive dataset workflows.

Install

mkdir -p .claude/skills/hugging-face-datasets && curl -L -o skill.zip "https://mcp.directory/api/skills/download/4416" && unzip -o skill.zip -d .claude/skills/hugging-face-datasets && rm skill.zip

Installs to .claude/skills/hugging-face-datasets

About this skill

Overview

This skill provides tools to manage datasets on the Hugging Face Hub with a focus on creation, configuration, content management, and SQL-based data manipulation. It is designed to complement the existing Hugging Face MCP server by providing dataset editing and querying capabilities.

Integration with HF MCP Server

Use HF MCP Server for: Dataset discovery, search, and metadata retrieval
Use This Skill for: Dataset creation, content editing, SQL queries, data transformation, and structured data formatting

Version

2.1.0

Dependencies

huggingface_hub
duckdb (for SQL queries)
datasets (for pushing query results to Hub)
json (built-in)
time (built-in)

Core Capabilities

1. Dataset Lifecycle Management

Initialize: Create new dataset repositories with proper structure
Configure: Store detailed configuration including system prompts and metadata
Stream Updates: Add rows efficiently without downloading entire datasets

2. SQL-Based Dataset Querying (NEW)

Query any Hugging Face dataset using DuckDB SQL via scripts/sql_manager.py:

Direct Queries: Run SQL on datasets using the hf:// protocol
Schema Discovery: Describe dataset structure and column types
Data Sampling: Get random samples for exploration
Aggregations: Count, histogram, unique values analysis
Transformations: Filter, join, reshape data with SQL
Export & Push: Save results locally or push to new Hub repos

3. Multi-Format Dataset Support

Supports diverse dataset types through template system:

Chat/Conversational: Chat templating, multi-turn dialogues, tool usage examples
Text Classification: Sentiment analysis, intent detection, topic classification
Question-Answering: Reading comprehension, factual QA, knowledge bases
Text Completion: Language modeling, code completion, creative writing
Tabular Data: Structured data for regression/classification tasks
Custom Formats: Flexible schema definition for specialized needs

4. Quality Assurance Features

JSON Validation: Ensures data integrity during uploads
Batch Processing: Efficient handling of large datasets
Error Recovery: Graceful handling of upload failures and conflicts

Usage Instructions

The skill includes two Python scripts:

scripts/dataset_manager.py - Dataset creation and management
scripts/sql_manager.py - SQL-based dataset querying and transformation

Prerequisites

huggingface_hub library: uv add huggingface_hub
duckdb library (for SQL): uv add duckdb
datasets library (for pushing): uv add datasets
HF_TOKEN environment variable must be set with a Write-access token
Activate virtual environment: source .venv/bin/activate

SQL Dataset Querying (sql_manager.py)

Query, transform, and push Hugging Face datasets using DuckDB SQL. The hf:// protocol provides direct access to any public dataset (or private with token).

Quick Start

# Query a dataset
python scripts/sql_manager.py query \
  --dataset "cais/mmlu" \
  --sql "SELECT * FROM data WHERE subject='nutrition' LIMIT 10"

# Get dataset schema
python scripts/sql_manager.py describe --dataset "cais/mmlu"

# Sample random rows
python scripts/sql_manager.py sample --dataset "cais/mmlu" --n 5

# Count rows with filter
python scripts/sql_manager.py count --dataset "cais/mmlu" --where "subject='nutrition'"

SQL Query Syntax

Use data as the table name in your SQL - it gets replaced with the actual hf:// path:

-- Basic select
SELECT * FROM data LIMIT 10

-- Filtering
SELECT * FROM data WHERE subject='nutrition'

-- Aggregations
SELECT subject, COUNT(*) as cnt FROM data GROUP BY subject ORDER BY cnt DESC

-- Column selection and transformation
SELECT question, choices[answer] AS correct_answer FROM data

-- Regex matching
SELECT * FROM data WHERE regexp_matches(question, 'nutrition|diet')

-- String functions
SELECT regexp_replace(question, '\n', '') AS cleaned FROM data

Common Operations

1. Explore Dataset Structure

# Get schema
python scripts/sql_manager.py describe --dataset "cais/mmlu"

# Get unique values in column
python scripts/sql_manager.py unique --dataset "cais/mmlu" --column "subject"

# Get value distribution
python scripts/sql_manager.py histogram --dataset "cais/mmlu" --column "subject" --bins 20

2. Filter and Transform

# Complex filtering with SQL
python scripts/sql_manager.py query \
  --dataset "cais/mmlu" \
  --sql "SELECT subject, COUNT(*) as cnt FROM data GROUP BY subject HAVING cnt > 100"

# Using transform command
python scripts/sql_manager.py transform \
  --dataset "cais/mmlu" \
  --select "subject, COUNT(*) as cnt" \
  --group-by "subject" \
  --order-by "cnt DESC" \
  --limit 10

3. Create Subsets and Push to Hub

# Query and push to new dataset
python scripts/sql_manager.py query \
  --dataset "cais/mmlu" \
  --sql "SELECT * FROM data WHERE subject='nutrition'" \
  --push-to "username/mmlu-nutrition-subset" \
  --private

# Transform and push
python scripts/sql_manager.py transform \
  --dataset "ibm/duorc" \
  --config "ParaphraseRC" \
  --select "question, answers" \
  --where "LENGTH(question) > 50" \
  --push-to "username/duorc-long-questions"

4. Export to Local Files

# Export to Parquet
python scripts/sql_manager.py export \
  --dataset "cais/mmlu" \
  --sql "SELECT * FROM data WHERE subject='nutrition'" \
  --output "nutrition.parquet" \
  --format parquet

# Export to JSONL
python scripts/sql_manager.py export \
  --dataset "cais/mmlu" \
  --sql "SELECT * FROM data LIMIT 100" \
  --output "sample.jsonl" \
  --format jsonl

5. Working with Dataset Configs/Splits

# Specify config (subset)
python scripts/sql_manager.py query \
  --dataset "ibm/duorc" \
  --config "ParaphraseRC" \
  --sql "SELECT * FROM data LIMIT 5"

# Specify split
python scripts/sql_manager.py query \
  --dataset "cais/mmlu" \
  --split "test" \
  --sql "SELECT COUNT(*) FROM data"

# Query all splits
python scripts/sql_manager.py query \
  --dataset "cais/mmlu" \
  --split "*" \
  --sql "SELECT * FROM data LIMIT 10"

6. Raw SQL with Full Paths

For complex queries or joining datasets:

python scripts/sql_manager.py raw --sql "
  SELECT a.*, b.* 
  FROM 'hf://datasets/dataset1@~parquet/default/train/*.parquet' a
  JOIN 'hf://datasets/dataset2@~parquet/default/train/*.parquet' b
  ON a.id = b.id
  LIMIT 100
"

Python API Usage

from sql_manager import HFDatasetSQL

sql = HFDatasetSQL()

# Query
results = sql.query("cais/mmlu", "SELECT * FROM data WHERE subject='nutrition' LIMIT 10")

# Get schema
schema = sql.describe("cais/mmlu")

# Sample
samples = sql.sample("cais/mmlu", n=5, seed=42)

# Count
count = sql.count("cais/mmlu", where="subject='nutrition'")

# Histogram
dist = sql.histogram("cais/mmlu", "subject")

# Filter and transform
results = sql.filter_and_transform(
    "cais/mmlu",
    select="subject, COUNT(*) as cnt",
    group_by="subject",
    order_by="cnt DESC",
    limit=10
)

# Push to Hub
url = sql.push_to_hub(
    "cais/mmlu",
    "username/nutrition-subset",
    sql="SELECT * FROM data WHERE subject='nutrition'",
    private=True
)

# Export locally
sql.export_to_parquet("cais/mmlu", "output.parquet", sql="SELECT * FROM data LIMIT 100")

sql.close()

HF Path Format

DuckDB uses the hf:// protocol to access datasets:

hf://datasets/{dataset_id}@{revision}/{config}/{split}/*.parquet

Examples:

hf://datasets/cais/mmlu@~parquet/default/train/*.parquet
hf://datasets/ibm/duorc@~parquet/ParaphraseRC/test/*.parquet

The @~parquet revision provides auto-converted Parquet files for any dataset format.

Useful DuckDB SQL Functions

-- String functions
LENGTH(column)                    -- String length
regexp_replace(col, '\n', '')     -- Regex replace
regexp_matches(col, 'pattern')    -- Regex match
LOWER(col), UPPER(col)           -- Case conversion

-- Array functions  
choices[0]                        -- Array indexing (0-based)
array_length(choices)             -- Array length
unnest(choices)                   -- Expand array to rows

-- Aggregations
COUNT(*), SUM(col), AVG(col)
GROUP BY col HAVING condition

-- Sampling
USING SAMPLE 10                   -- Random sample
USING SAMPLE 10 (RESERVOIR, 42)   -- Reproducible sample

-- Window functions
ROW_NUMBER() OVER (PARTITION BY col ORDER BY col2)

Dataset Creation (dataset_manager.py)

Recommended Workflow

1. Discovery (Use HF MCP Server):

# Use HF MCP tools to find existing datasets
search_datasets("conversational AI training")
get_dataset_details("username/dataset-name")

2. Creation (Use This Skill):

# Initialize new dataset
python scripts/dataset_manager.py init --repo_id "your-username/dataset-name" [--private]

# Configure with detailed system prompt
python scripts/dataset_manager.py config --repo_id "your-username/dataset-name" --system_prompt "$(cat system_prompt.txt)"

3. Content Management (Use This Skill):

# Quick setup with any template
python scripts/dataset_manager.py quick_setup \
  --repo_id "your-username/dataset-name" \
  --template classification

# Add data with template validation
python scripts/dataset_manager.py add_rows \
  --repo_id "your-username/dataset-name" \
  --template qa \
  --rows_json "$(cat your_qa_data.json)"

Template-Based Data Structures

1. Chat Template (--template chat)

{
  "messages": [
    {"role": "user", "content": "Natural user request"},
    {"role": "a

---

*Content truncated.*

More by patchy631

View all skills by patchy631 →

hugging-face-tool-builder

patchy631

Use this skill when the user wants to build tool/scripts or achieve a task where using data from the Hugging Face API would help. This is especially useful when chaining or combining API calls or the task will be repeated/automated. This Skill creates a reusable script to fetch, enrich or process data.

146

hugging-face-paper-publisher

patchy631

Publish and manage research papers on Hugging Face Hub. Supports creating paper pages, linking papers to models/datasets, claiming authorship, and generating professional markdown-based research articles.

345

brightdata-web-mcp

patchy631

Search the web, scrape websites, extract structured data from URLs, and automate browsers using Bright Data's Web MCP. Use when fetching live web content, bypassing blocks/CAPTCHAs, getting product data from Amazon/eBay, social media posts, or when standard requests fail.

485

hugging-face-cli

patchy631

Execute Hugging Face Hub operations using the `hf` CLI. Use when the user needs to download models/datasets/spaces, upload files to Hub repositories, create repos, manage local cache, or run compute jobs on HF infrastructure. Covers authentication, file transfers, repository creation, cache operations, and cloud compute.

442

hugging-face-jobs

patchy631

This skill should be used when users want to run any workload on Hugging Face Jobs infrastructure. Covers UV scripts, Docker-based jobs, hardware selection, cost estimation, authentication with tokens, secrets management, timeout configuration, and result persistence. Designed for general-purpose compute workloads including data processing, inference, experiments, batch jobs, and any Python-based tasks. Should be invoked for tasks involving cloud compute, GPU workloads, or when users mention running jobs on Hugging Face infrastructure without local setup.

hugging-face-model-trainer

patchy631

This skill should be used when users want to train or fine-tune language models using TRL (Transformer Reinforcement Learning) on Hugging Face Jobs infrastructure. Covers SFT, DPO, GRPO and reward modeling training methods, plus GGUF conversion for local deployment. Includes guidance on the TRL Jobs package, UV scripts with PEP 723 format, dataset preparation and validation, hardware selection, cost estimation, Trackio monitoring, Hub authentication, and model persistence. Should be invoked for tasks involving cloud GPU training, GGUF conversion, or when users mention training on Hugging Face Jobs without local GPU setup.

flutter-development

aj-geddes

Build beautiful cross-platform mobile apps with Flutter and Dart. Covers widgets, state management with Provider/BLoC, navigation, API integration, and material design.

1,5651,368

ui-ux-pro-max

nextlevelbuilder

"UI/UX design intelligence. 50 styles, 21 palettes, 50 font pairings, 20 charts, 8 stacks (React, Next.js, Vue, Svelte, SwiftUI, React Native, Flutter, Tailwind). Actions: plan, build, create, design, implement, review, fix, improve, optimize, enhance, refactor, check UI/UX code. Projects: website, landing page, dashboard, admin panel, e-commerce, SaaS, portfolio, blog, mobile app, .html, .tsx, .vue, .svelte. Elements: button, modal, navbar, sidebar, card, table, form, chart. Styles: glassmorphism, claymorphism, minimalism, brutalism, neumorphism, bento grid, dark mode, responsive, skeuomorphism, flat design. Topics: color palette, accessibility, animation, layout, typography, font pairing, spacing, hover, shadow, gradient."

1,1101,184

drawio-diagrams-enhanced

jgtolentino

Create professional draw.io (diagrams.net) diagrams in XML format (.drawio files) with integrated PMP/PMBOK methodologies, extensive visual asset libraries, and industry-standard professional templates. Use this skill when users ask to create flowcharts, swimlane diagrams, cross-functional flowcharts, org charts, network diagrams, UML diagrams, BPMN, project management diagrams (WBS, Gantt, PERT, RACI), risk matrices, stakeholder maps, or any other visual diagram in draw.io format. This skill includes access to custom shape libraries for icons, clipart, and professional symbols.

1,4141,106

godot

bfollington

This skill should be used when working on Godot Engine projects. It provides specialized knowledge of Godot's file formats (.gd, .tscn, .tres), architecture patterns (component-based, signal-driven, resource-based), common pitfalls, validation tools, code templates, and CLI workflows. The `godot` command is available for running the game, validating scripts, importing resources, and exporting builds. Use this skill for tasks involving Godot game development, debugging scene/resource files, implementing game systems, or creating new Godot components.

1,192746

nano-banana-pro

garg-aayush

Generate and edit images using Google's Nano Banana Pro (Gemini 3 Pro Image) API. Use when the user asks to generate, create, edit, modify, change, alter, or update images. Also use when user references an existing image file and asks to modify it in any way (e.g., "modify this image", "change the background", "replace X with Y"). Supports both text-to-image generation and image-to-image editing with configurable resolution (1K default, 2K, or 4K for high resolution). DO NOT read the image file first - use this skill directly with the --input-image parameter.

1,148683

pdf-to-markdown

aliceisjustplaying

Convert entire PDF documents to clean, structured Markdown for full context loading. Use this skill when the user wants to extract ALL text from a PDF into context (not grep/search), when discussing or analyzing PDF content in full, when the user mentions "load the whole PDF", "bring the PDF into context", "read the entire PDF", or when partial extraction/grepping would miss important context. This is the preferred method for PDF text extraction over page-by-page or grep approaches.

1,304608

Related MCP Servers

Browse all servers

Honeycomb

Use Honeycomb to manage datasets and events with a TypeScript-based interface, offering an alternative to Datadog API an

20 tools

MCP Server Chart

Effortlessly create 25+ chart types with MCP Server Chart. Visualize complex datasets using TypeScript and AntV for powe

3,77726 tools

Hugging Face

Search Hugging Face models, datasets, and papers — connect dynamically to Gradio examples on Hugging Face Spaces for ext

2050 tools

HubSpot

Seamlessly manage your HubSpot CRM system: query, create, and update records through natural language with full HubSpot

300 tools

Zoom

Easily manage Zoom meetings with Zoom API integration—create, update, delete, and fetch events without navigating the Zo

70 tools

Galileo

Galileo: Integrate with Galileo to create datasets, manage prompt templates, run experiments, analyze logs, and monitor

50 tools

Stay ahead of the MCP ecosystem

Get weekly updates on new skills and servers.

Install

mkdir -p .claude/skills/hugging-face-datasets && curl -L -o skill.zip "https://mcp.directory/api/skills/download/4416" && unzip -o skill.zip -d .claude/skills/hugging-face-datasets && rm skill.zip

Installs to .claude/skills/hugging-face-datasets

Stats

Views

Installs

Author

patchy631

7 skills published

Links

Source Code

hugging-face-datasets

Install

About this skill

Overview

Integration with HF MCP Server

Version

Dependencies

Core Capabilities

1. Dataset Lifecycle Management

2. SQL-Based Dataset Querying (NEW)

3. Multi-Format Dataset Support

4. Quality Assurance Features

Usage Instructions

Prerequisites

SQL Dataset Querying (sql_manager.py)

Quick Start

SQL Query Syntax

Common Operations

1. Explore Dataset Structure

2. Filter and Transform

3. Create Subsets and Push to Hub

4. Export to Local Files

5. Working with Dataset Configs/Splits

6. Raw SQL with Full Paths

Python API Usage

HF Path Format

Useful DuckDB SQL Functions

Dataset Creation (dataset_manager.py)

Recommended Workflow

Template-Based Data Structures

More by patchy631

hugging-face-tool-builder

hugging-face-paper-publisher

brightdata-web-mcp

hugging-face-cli

hugging-face-jobs

hugging-face-model-trainer

You might also like

flutter-development

ui-ux-pro-max

drawio-diagrams-enhanced

godot

nano-banana-pro

pdf-to-markdown

Related MCP Servers

Stay ahead of the MCP ecosystem