hugging-face-dataset-creator


Create and manage datasets on the Hugging Face Hub. Supports initializing repos, defining configs and system prompts, and streaming row updates. Designed to work alongside the HF MCP server for complete dataset workflows.

Install

mkdir -p .claude/skills/hugging-face-dataset-creator && curl -L -o skill.zip "https://mcp.directory/api/skills/download/461" && unzip -o skill.zip -d .claude/skills/hugging-face-dataset-creator && rm skill.zip

Installs to .claude/skills/hugging-face-dataset-creator

About this skill

Overview

This skill provides tools to manage datasets on the Hugging Face Hub with a focus on creation, configuration, and content management. It is designed to complement the existing Hugging Face MCP server by providing dataset editing capabilities that the MCP server doesn't offer.

Integration with HF MCP Server

  • Use HF MCP Server for: Dataset discovery, search, and metadata retrieval
  • Use This Skill for: Dataset creation, content editing, configuration management, and structured data formatting

Version

2.0.0

Dependencies

  • huggingface_hub
  • json (built-in)
  • time (built-in)

Core Capabilities

1. Dataset Lifecycle Management

  • Initialize: Create new dataset repositories with proper structure
  • Configure: Store detailed configuration including system prompts and metadata
  • Stream Updates: Add rows efficiently without downloading entire datasets
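One way to add rows without downloading the existing dataset is to upload each batch as a new JSON Lines file. The sketch below illustrates that idea only; it is not taken from scripts/dataset_manager.py. The function names build_chunk and upload_chunk and the data/rows_<timestamp>.jsonl layout are assumptions made for this example. The upload step uses HfApi.upload_file from huggingface_hub and needs network access and a valid token.

```python
import json
import time


def build_chunk(rows, timestamp=None):
    """Serialize rows to JSON Lines and pick a unique file name for them.

    The data/rows_<timestamp>.jsonl layout is a hypothetical convention.
    """
    if not rows:
        raise ValueError("No rows to upload")
    ts = int(time.time()) if timestamp is None else int(timestamp)
    path_in_repo = f"data/rows_{ts}.jsonl"
    payload = "\n".join(json.dumps(row, ensure_ascii=False) for row in rows) + "\n"
    return path_in_repo, payload.encode("utf-8")


def upload_chunk(repo_id, rows, token=None):
    """Upload one new file; existing files in the repo are left untouched."""
    # Imported lazily so build_chunk can be used and tested offline.
    from huggingface_hub import HfApi

    path_in_repo, payload = build_chunk(rows)
    HfApi(token=token).upload_file(
        path_or_fileobj=payload,
        path_in_repo=path_in_repo,
        repo_id=repo_id,
        repo_type="dataset",
        commit_message=f"Add {len(rows)} rows",
    )
    return path_in_repo
```

Because every call writes a new file, earlier uploads never need to be fetched or rewritten.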

2. Multi-Format Dataset Support

Supports diverse dataset types through a template system:

  • Chat/Conversational: Chat templating, multi-turn dialogues, tool usage examples
  • Text Classification: Sentiment analysis, intent detection, topic classification
  • Question-Answering: Reading comprehension, factual QA, knowledge bases
  • Text Completion: Language modeling, code completion, creative writing
  • Tabular Data: Structured data for regression/classification tasks
  • Custom Formats: Flexible schema definition for specialized needs

3. Quality Assurance Features

  • JSON Validation: Ensures data integrity during uploads
  • Batch Processing: Efficient handling of large datasets
  • Error Recovery: Graceful handling of upload failures and conflicts
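The validation and batching features above can be pictured with a small sketch. It assumes rows arrive as a JSON array of objects, as in the --rows_json examples on this page. The helper names parse_rows and batched are hypothetical and are not part of the skill's script.

```python
import json


def parse_rows(rows_json):
    """Parse a JSON string and require an array of objects."""
    try:
        rows = json.loads(rows_json)
    except json.JSONDecodeError as exc:
        # Report where parsing failed so the input can be fixed quickly.
        raise ValueError(
            f"Invalid JSON at line {exc.lineno}, column {exc.colno}: {exc.msg}"
        ) from exc
    if not isinstance(rows, list) or not all(isinstance(r, dict) for r in rows):
        raise ValueError("Expected a JSON array of objects")
    return rows


def batched(rows, batch_size):
    """Yield consecutive slices of at most batch_size rows."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]
```

Validating before batching means a malformed payload is rejected before any upload starts.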

Usage Instructions

The skill includes a Python script, scripts/dataset_manager.py, that performs all of the operations described below.

Prerequisites

  • The huggingface_hub library must be installed, for example with uv add huggingface_hub
  • The HF_TOKEN environment variable must be set to a token with write access
  • The virtual environment must be active: source .venv/bin/activate
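A quick pre-flight check can catch a missing token before any command runs. The sketch below is illustrative only. The function name require_hf_token is hypothetical, and it confirms only that HF_TOKEN is present, not that the token has write access.

```python
import os


def require_hf_token(env=None):
    """Return the HF_TOKEN value, or raise if it is missing or blank."""
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN", "").strip()
    if not token:
        raise RuntimeError(
            "HF_TOKEN is not set; export a write-access token before continuing"
        )
    return token
```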

Recommended Workflow

1. Discovery (Use HF MCP Server):

# Use HF MCP tools to find existing datasets
search_datasets("conversational AI training")
get_dataset_details("username/dataset-name")

2. Creation (Use This Skill):

# Initialize new dataset
python scripts/dataset_manager.py init --repo_id "your-username/dataset-name" [--private]

# Configure with detailed system prompt
python scripts/dataset_manager.py config --repo_id "your-username/dataset-name" --system_prompt "$(cat system_prompt.txt)"

3. Content Management (Use This Skill):

# Quick setup with any template
python scripts/dataset_manager.py quick_setup \
  --repo_id "your-username/dataset-name" \
  --template classification

# Add data with template validation
python scripts/dataset_manager.py add_rows \
  --repo_id "your-username/dataset-name" \
  --template qa \
  --rows_json "$(cat your_qa_data.json)"

Template-Based Data Structures

1. Chat Template (--template chat)

{
  "messages": [
    {"role": "user", "content": "Natural user request"},
    {"role": "assistant", "content": "Response with tool usage"},
    {"role": "tool", "content": "Tool response", "tool_call_id": "call_123"}
  ],
  "scenario": "Description of use case",
  "complexity": "simple|intermediate|advanced"
}
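To show how a row could be checked against the chat structure above, here is a minimal sketch. The rules are assumptions inferred from the example (a fixed set of roles, string content, and a tool_call_id on tool messages). The skill's real validation may differ.

```python
# Assumed role set; the example above shows user, assistant, and tool.
ALLOWED_ROLES = {"system", "user", "assistant", "tool"}


def chat_row_errors(row):
    """Return a list of problems found in one chat-template row."""
    messages = row.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["'messages' must be a non-empty list"]
    errors = []
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            errors.append(f"messages[{i}] must be an object")
            continue
        role = msg.get("role")
        if role not in ALLOWED_ROLES:
            errors.append(f"messages[{i}] has unknown role {role!r}")
        if not isinstance(msg.get("content"), str):
            errors.append(f"messages[{i}] content must be a string")
        if role == "tool" and not msg.get("tool_call_id"):
            errors.append(f"messages[{i}] is a tool message without tool_call_id")
    return errors
```

An empty list means the row passed every check.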

2. Classification Template (--template classification)

{
  "text": "Input text to be classified",
  "label": "classification_label",
  "confidence": 0.95,
  "metadata": {"domain": "technology", "language": "en"}
}
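A classification row could be checked along these lines. The sketch assumes that text and label are required strings and that confidence, when present, is a probability between 0 and 1. None of these rules is confirmed by the skill's documentation.

```python
def classification_row_errors(row):
    """Return a list of problems found in one classification row."""
    errors = []
    for field in ("text", "label"):
        value = row.get(field)
        if not isinstance(value, str) or not value.strip():
            errors.append(f"'{field}' must be a non-empty string")
    confidence = row.get("confidence")
    if confidence is not None:
        # bool is a subclass of int, so reject it explicitly.
        is_number = isinstance(confidence, (int, float)) and not isinstance(confidence, bool)
        if not is_number or not 0.0 <= confidence <= 1.0:
            errors.append("'confidence' must be a number between 0 and 1")
    return errors
```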

3. QA Template (--template qa)

{
  "question": "What is the question being asked?",
  "answer": "The complete answer",
  "context": "Additional context if needed",
  "answer_type": "factual|explanatory|opinion",
  "difficulty": "easy|medium|hard"
}
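The pipe-separated values in the QA template read as lists of allowed options. Under that assumption, and treating question and answer as the only required fields, a check might look like the sketch below. The name qa_row_errors is hypothetical.

```python
# Allowed values taken from the "a|b|c" placeholders in the template above.
QA_ENUMS = {
    "answer_type": {"factual", "explanatory", "opinion"},
    "difficulty": {"easy", "medium", "hard"},
}


def qa_row_errors(row):
    """Return a list of problems found in one QA row."""
    errors = []
    for field in ("question", "answer"):
        value = row.get(field)
        if not isinstance(value, str) or not value.strip():
            errors.append(f"'{field}' must be a non-empty string")
    for field, allowed in QA_ENUMS.items():
        if field in row:
            value = row[field]
            if not isinstance(value, str) or value not in allowed:
                errors.append(f"'{field}' must be one of {sorted(allowed)}")
    return errors
```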

4. Completion Template (--template completion)

{
  "prompt": "The beginning text or context",
  "completion": "The expected continuation",
  "domain": "code|creative|technical|conversational",
  "style": "description of writing style"
}

5. Tabular Template (--template tabular)

{
  "columns": [
    {"name": "feature1", "type": "numeric", "description": "First feature"},
    {"name": "target", "type": "categorical", "description": "Target variable"}
  ],
  "data": [
    {"feature1": 123, "target": "class_a"},
    {"feature1": 456, "target": "class_b"}
  ]
}
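For the tabular structure, a useful consistency check is that every entry in data carries exactly the declared columns and that columns typed numeric hold numbers. The sketch below assumes those two rules. The name tabular_errors is hypothetical.

```python
from numbers import Real


def tabular_errors(table):
    """Compare each data row against the declared column definitions."""
    columns = {col["name"]: col.get("type") for col in table.get("columns", [])}
    errors = []
    for i, row in enumerate(table.get("data", [])):
        missing = columns.keys() - row.keys()
        extra = row.keys() - columns.keys()
        if missing:
            errors.append(f"data[{i}] is missing columns: {sorted(missing)}")
        if extra:
            errors.append(f"data[{i}] has undeclared columns: {sorted(extra)}")
        for name, value in row.items():
            if columns.get(name) == "numeric":
                # bool is a subclass of int, so reject it explicitly.
                if isinstance(value, bool) or not isinstance(value, Real):
                    errors.append(f"data[{i}].{name} must be numeric")
    return errors
```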

Advanced System Prompt Template

For high-quality training data generation:

You are an AI assistant expert at using MCP tools effectively.

## MCP SERVER DEFINITIONS
[Define available servers and tools]

## TRAINING EXAMPLE STRUCTURE
[Specify exact JSON schema for chat templating]

## QUALITY GUIDELINES
[Detail requirements for realistic scenarios, progressive complexity, proper tool usage]

## EXAMPLE CATEGORIES
[List development workflows, debugging scenarios, data management tasks]

Example Categories & Templates

The skill includes diverse training examples beyond just MCP usage:

Available Example Sets:

  • training_examples.json - MCP tool usage examples (debugging, project setup, database analysis)
  • diverse_training_examples.json - Broader scenarios including:
    • Educational Chat - Explaining programming concepts, tutorials
    • Git Workflows - Feature branches, version control guidance
    • Code Analysis - Performance optimization, architecture review
    • Content Generation - Professional writing, creative brainstorming
    • Codebase Navigation - Legacy code exploration, systematic analysis
    • Conversational Support - Problem-solving, technical discussions

Using Different Example Sets:

# Add MCP-focused examples
python scripts/dataset_manager.py add_rows --repo_id "your-username/dataset-name" \
  --rows_json "$(cat examples/training_examples.json)"

# Add diverse conversational examples
python scripts/dataset_manager.py add_rows --repo_id "your-username/dataset-name" \
  --rows_json "$(cat examples/diverse_training_examples.json)"

# Mix both for comprehensive training data
python scripts/dataset_manager.py add_rows --repo_id "your-username/dataset-name" \
  --rows_json "$(jq -s '.[0] + .[1]' examples/training_examples.json examples/diverse_training_examples.json)"
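Where jq is not available, the same merge can be done in Python. The sketch assumes each file holds a top-level JSON array, which is also what the jq expression above requires. The name merge_example_sets is hypothetical.

```python
import json
from pathlib import Path


def merge_example_sets(*paths):
    """Concatenate the JSON arrays stored in the given files, in order."""
    merged = []
    for path in paths:
        rows = json.loads(Path(path).read_text(encoding="utf-8"))
        if not isinstance(rows, list):
            raise ValueError(f"{path} does not contain a JSON array")
        merged.extend(rows)
    return merged
```

Serializing the result with json.dumps gives a string that can be passed to --rows_json.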

Commands Reference

List Available Templates:

python scripts/dataset_manager.py list_templates

Quick Setup (Recommended):

python scripts/dataset_manager.py quick_setup --repo_id "your-username/dataset-name" --template classification

Manual Setup:

# Initialize repository
python scripts/dataset_manager.py init --repo_id "your-username/dataset-name" [--private]

# Configure with system prompt
python scripts/dataset_manager.py config --repo_id "your-username/dataset-name" --system_prompt "Your prompt here"

# Add data with validation
python scripts/dataset_manager.py add_rows \
  --repo_id "your-username/dataset-name" \
  --template qa \
  --rows_json '[{"question": "What is AI?", "answer": "Artificial Intelligence..."}]'

View Dataset Statistics:

python scripts/dataset_manager.py stats --repo_id "your-username/dataset-name"

Error Handling

  • Repository exists: Script will notify and continue with configuration
  • Invalid JSON: Clear error message with parsing details
  • Network issues: Automatic retry for transient failures
  • Token permissions: Validation before operations begin
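Automatic retry for transient failures is commonly implemented with exponential backoff. The sketch below shows that general pattern, not the script's actual code. The function name with_retries, the default of three attempts, and the set of retried exception types are all assumptions.

```python
import time


def with_retries(operation, attempts=3, base_delay=1.0,
                 retry_on=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Call operation(), retrying on transient errors with doubling delays."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** (attempt - 1))
```

Errors outside retry_on, such as a rejected token, are raised immediately because repeating the call would not help.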
