hugging-face-evaluation


Add and manage evaluation results in Hugging Face model cards. Supports extracting eval tables from README content, importing scores from Artificial Analysis API, and running custom model evaluations with vLLM/lighteval. Works with the model-index metadata format.

Install

mkdir -p .claude/skills/hugging-face-evaluation && curl -L -o skill.zip "https://mcp.directory/api/skills/download/2401" && unzip -o skill.zip -d .claude/skills/hugging-face-evaluation && rm skill.zip

Installs to .claude/skills/hugging-face-evaluation

About this skill

Overview

This skill provides tools to add structured evaluation results to Hugging Face model cards. It supports multiple methods for adding evaluation data:

  • Extracting existing evaluation tables from README content
  • Importing benchmark scores from Artificial Analysis
  • Running custom model evaluations with vLLM or accelerate backends (lighteval/inspect-ai)

Integration with HF Ecosystem

  • Model Cards: Updates model-index metadata for leaderboard integration
  • Artificial Analysis: Direct API integration for benchmark imports
  • Papers with Code: Compatible with their model-index specification
  • Jobs: Run evaluations directly on Hugging Face Jobs with uv integration
  • vLLM: Efficient GPU inference for custom model evaluation
  • lighteval: HuggingFace's evaluation library with vLLM/accelerate backends
  • inspect-ai: UK AI Safety Institute's evaluation framework

Version

1.3.0

Dependencies

Core Dependencies

  • huggingface_hub>=0.26.0
  • markdown-it-py>=3.0.0
  • python-dotenv>=1.2.1
  • pyyaml>=6.0.3
  • requests>=2.32.5
  • re (built-in)

Inference Provider Evaluation

  • inspect-ai>=0.3.0
  • inspect-evals
  • openai

vLLM Custom Model Evaluation (GPU required)

  • lighteval[accelerate,vllm]>=0.6.0
  • vllm>=0.4.0
  • torch>=2.0.0
  • transformers>=4.40.0
  • accelerate>=0.30.0

Note: vLLM dependencies are installed automatically via PEP 723 script headers when using uv run.
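A PEP 723 header is a commented TOML block at the very top of the script; `uv run` parses it, resolves the listed packages into an ephemeral environment, and then executes the script. The dependency list below is illustrative, not the skill's actual header:

```python
# /// script
# requires-python = ">=3.10"
# dependencies = [
#   "vllm>=0.4.0",
#   "lighteval[accelerate,vllm]>=0.6.0",
# ]
# ///
# Invoked as `uv run eval_script.py`, uv reads the block above and installs
# these dependencies automatically -- no requirements.txt or manual pip step.
```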

IMPORTANT: Using This Skill

⚠️ CRITICAL: Check for Existing PRs Before Creating New Ones

Before creating ANY pull request with --create-pr, you MUST check for existing open PRs:

uv run scripts/evaluation_manager.py get-prs --repo-id "username/model-name"

If open PRs exist:

  1. DO NOT create a new PR - this creates duplicate work for maintainers
  2. Warn the user that open PRs already exist
  3. Show the user the existing PR URLs so they can review them
  4. Only proceed if the user explicitly confirms they want to create another PR

This prevents spamming model repositories with duplicate evaluation PRs.
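The check itself maps onto huggingface_hub's discussions API; a minimal sketch (the `open_prs` helper name is ours, not part of the skill):

```python
from huggingface_hub import get_repo_discussions


def open_prs(repo_id: str) -> list[str]:
    """Return URLs of open pull requests on a model repo."""
    return [
        d.url
        for d in get_repo_discussions(repo_id=repo_id, repo_type="model")
        if d.is_pull_request and d.status == "open"
    ]


# Example: warn before creating a duplicate PR.
# urls = open_prs("username/model-name")
# if urls:
#     print("Open PRs already exist:", *urls, sep="\n  ")
```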


All paths are relative to the directory containing this SKILL.md file. Before running any script, first cd to that directory or use the full path.

Use --help for the latest workflow guidance. Works with plain Python or uv run:

uv run scripts/evaluation_manager.py --help
uv run scripts/evaluation_manager.py inspect-tables --help
uv run scripts/evaluation_manager.py extract-readme --help

Key workflow (matches CLI help):

  1. get-prs → check for existing open PRs first
  2. inspect-tables → find table numbers/columns
  3. extract-readme --table N → prints YAML by default
  4. add --apply (push) or --create-pr to write changes

Core Capabilities

1. Inspect and Extract Evaluation Tables from README

  • Inspect Tables: Use inspect-tables to see all tables in a README with structure, columns, and sample rows
  • Parse Markdown Tables: Accurate parsing using markdown-it-py (ignores code blocks and examples)
  • Table Selection: Use --table N to extract from a specific table (required when multiple tables exist)
  • Format Detection: Recognize common formats (benchmarks as rows, columns, or comparison tables with multiple models)
  • Column Matching: Automatically identify model columns/rows; prefer --model-column-index (index from inspect output). Use --model-name-override only with exact column header text.
  • YAML Generation: Convert selected table to model-index YAML format
  • Task Typing: --task-type sets the task.type field in model-index output (e.g., text-generation, summarization)
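A minimal sketch of the parsing approach with markdown-it-py (not the skill's actual implementation): tables are a CommonMark extension, so the rule must be enabled explicitly, and each cell's text arrives as an `inline` token between `tr_open` and `tr_close`.

```python
from markdown_it import MarkdownIt

README = """
| Benchmark | Score |
|-----------|-------|
| MMLU      | 68.4  |
| GSM8K     | 54.1  |
"""

# Tables are a CommonMark extension in markdown-it-py, so enable the rule.
md = MarkdownIt().enable("table")

# Collect cell text row by row from the token stream.
rows, current = [], None
for tok in md.parse(README):
    if tok.type == "tr_open":
        current = []
    elif tok.type == "inline" and current is not None:
        current.append(tok.content)
    elif tok.type == "tr_close":
        rows.append(current)
        current = None

print(rows)
```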

2. Import from Artificial Analysis

  • API Integration: Fetch benchmark scores directly from Artificial Analysis
  • Automatic Formatting: Convert API responses to model-index format
  • Metadata Preservation: Maintain source attribution and URLs
  • PR Creation: Automatically create pull requests with evaluation updates
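As a rough sketch of what an import like this involves (the endpoint URL, header name, and response fields below are assumptions based on the public Artificial Analysis v2 API, so verify them against the current API docs before relying on them):

```python
import os

import requests

# Assumed v2 endpoint -- check the Artificial Analysis API docs.
AA_URL = "https://artificialanalysis.ai/api/v2/data/llms/models"


def fetch_models() -> list[dict]:
    """Fetch model records, authenticating with the AA_API_KEY env var."""
    resp = requests.get(
        AA_URL, headers={"x-api-key": os.environ["AA_API_KEY"]}, timeout=30
    )
    resp.raise_for_status()
    return resp.json().get("data", [])


def to_metrics(record: dict) -> list[dict]:
    """Shape one record's numeric scores into model-index metric entries."""
    return [
        {"type": name, "value": value, "name": name.upper()}
        for name, value in (record.get("evaluations") or {}).items()
        if isinstance(value, (int, float))
    ]
```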

3. Model-Index Management

  • YAML Generation: Create properly formatted model-index entries
  • Merge Support: Add evaluations to existing model cards without overwriting
  • Validation: Ensure compliance with Papers with Code specification
  • Batch Operations: Process multiple models efficiently
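For reference, a model-index entry can be emitted with pyyaml; the values below are illustrative, but the field names follow the published model-index specification:

```python
import yaml

# One illustrative result: a text-generation model scored on MMLU.
entry = {
    "model-index": [
        {
            "name": "username/model-name",
            "results": [
                {
                    "task": {"type": "text-generation"},
                    "dataset": {"name": "MMLU (5-shot)", "type": "mmlu"},
                    "metrics": [
                        {"type": "accuracy", "value": 68.4, "name": "MMLU"}
                    ],
                    "source": {
                        "name": "README table",
                        "url": "https://huggingface.co/username/model-name",
                    },
                }
            ],
        }
    ]
}

print(yaml.safe_dump(entry, sort_keys=False))
```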

4. Run Evaluations on HF Jobs (Inference Providers)

  • Inspect-AI Integration: Run standard evaluations using the inspect-ai library
  • UV Integration: Seamlessly run Python scripts with ephemeral dependencies on HF infrastructure
  • Zero-Config: No Dockerfiles or Space management required
  • Hardware Selection: Configure CPU or GPU hardware for the evaluation job
  • Secure Execution: Handles API tokens safely via secrets passed through the CLI

5. Run Custom Model Evaluations with vLLM (NEW)

⚠️ Important: This approach requires a machine with uv installed and enough GPU memory. Benefits: no need for the hf_jobs() MCP tool; scripts run directly in the terminal. When to use: the user is working directly on a local machine with an available GPU.

Before running the script

  • Check the script path
  • Check that uv is installed
  • Check that a GPU is available with nvidia-smi
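These preflight checks can be scripted with the standard library; a small sketch (the helper names are ours, not part of the skill):

```python
import shutil
import subprocess


def uv_installed() -> bool:
    # `uv` must be on PATH for `uv run` to work.
    return shutil.which("uv") is not None


def gpu_available() -> bool:
    # nvidia-smi exits 0 when an NVIDIA GPU and driver are present.
    if shutil.which("nvidia-smi") is None:
        return False
    return subprocess.run(["nvidia-smi"], capture_output=True).returncode == 0


print(f"uv: {uv_installed()}, gpu: {gpu_available()}")
```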

Running the script

uv run scripts/train_sft_example.py

Features

  • vLLM Backend: High-performance GPU inference (5-10x faster than standard HF methods)
  • lighteval Framework: HuggingFace's evaluation library with Open LLM Leaderboard tasks
  • inspect-ai Framework: UK AI Safety Institute's evaluation library
  • Standalone or Jobs: Run locally or submit to HF Jobs infrastructure

Usage Instructions

The skill includes Python scripts in scripts/ to perform operations.

Prerequisites

  • Preferred: use uv run (PEP 723 header auto-installs deps)
  • Optional manual fallback: uv pip install huggingface-hub markdown-it-py python-dotenv pyyaml requests
  • Set HF_TOKEN environment variable with Write-access token
  • For Artificial Analysis: Set AA_API_KEY environment variable
  • .env is loaded automatically if python-dotenv is installed
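If python-dotenv is unavailable, the same behavior can be approximated in a few lines of stdlib code (a sketch only; `load_dotenv()` handles quoting, `export` prefixes, and multiline values more robustly):

```python
import os
from pathlib import Path


def load_env(path: str = ".env") -> None:
    """Stdlib sketch of load_dotenv(): read KEY=VALUE lines into os.environ."""
    p = Path(path)
    if not p.exists():
        return
    for line in p.read_text().splitlines():
        line = line.strip()
        # Skip blanks and comments; keep existing env vars untouched.
        if line and not line.startswith("#") and "=" in line:
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip())


load_env()
print("HF_TOKEN set:", "HF_TOKEN" in os.environ)
```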

Method 1: Extract from README (CLI workflow)

Recommended flow (matches --help):

# 1) Inspect tables to get table numbers and column hints
uv run scripts/evaluation_manager.py inspect-tables --repo-id "username/model"

# 2) Extract a specific table (prints YAML by default)
uv run scripts/evaluation_manager.py extract-readme \
  --repo-id "username/model" \
  --table 1 \
  [--model-column-index <column index shown by inspect-tables>] \
  [--model-name-override "<column header/model name>"]  # use exact header text if you can't use the index

# 3) Apply changes (push or PR)
uv run scripts/evaluation_manager.py extract-readme \
  --repo-id "username/model" \
  --table 1 \
  --apply       # push directly
# or
uv run scripts/evaluation_manager.py extract-readme \
  --repo-id "username/model" \
  --table 1 \
  --create-pr   # open a PR

Validation checklist:

  • YAML is printed by default; compare against the README table before applying.
  • Prefer --model-column-index; if using --model-name-override, the column header text must be exact.
  • For transposed tables (models as rows), ensure only one row is extracted.

Method 2: Import from Artificial Analysis

Fetch benchmark scores from Artificial Analysis API and add them to a model card.

Basic Usage:

AA_API_KEY="your-api-key" uv run scripts/evaluation_manager.py import-aa \
  --creator-slug "anthropic" \
  --model-name "claude-sonnet-4" \
  --repo-id "username/model-name"

With Environment File:

# Create .env file
echo "AA_API_KEY=your-api-key" >> .env
echo "HF_TOKEN=your-hf-token" >> .env

# Run import
uv run scripts/evaluation_manager.py import-aa \
  --creator-slug "anthropic" \
  --model-name "claude-sonnet-4" \
  --repo-id "username/model-name"

Create Pull Request:

uv run scripts/evaluation_manager.py import-aa \
  --creator-slug "anthropic" \
  --model-name "claude-sonnet-4" \
  --repo-id "username/model-name" \
  --create-pr

Method 3: Run Evaluation Job

Submit an evaluation job on Hugging Face infrastructure using the hf jobs uv run CLI.

Direct CLI Usage:

HF_TOKEN=$HF_TOKEN \
hf jobs uv run hf-evaluation/scripts/inspect_eval_uv.py \
  --flavor cpu-basic \
  --secret HF_TOKEN=$HF_TOKEN \
  -- --model "meta-llama/Llama-2-7b-hf" \
     --task "mmlu"

GPU Example (A10G):

HF_TOKEN=$HF_TOKEN \
hf jobs uv run hf-evaluation/scripts/inspect_eval_uv.py \
  --flavor a10g-small \
  --secret HF_TOKEN=$HF_TOKEN \
  -- --model "meta-llama/Llama-2-7b-hf" \
     --task "gsm8k"

Python Helper (optional):

uv run scripts/run_eval_job.py \
  --model "meta-llama/Llama-2-7b-hf" \
  --task "mmlu" \
  --hardware "t4-small"

Method 4: Run Custom Model Evaluation with vLLM

Evaluate custom HuggingFace models directly on GPU using vLLM or accelerate backends. These scripts are separate from inference provider scripts and run models locally on the job's hardware.

When to Use vLLM Evaluation (vs Inference Providers)

| Feature      | vLLM Scripts              | Inference Provider Scripts |
|--------------|---------------------------|----------------------------|
| Model access | Any HF model              | Models with API endpoints  |
| Hardware     | Your GPU (or HF Jobs GPU) | Provider's infrastructure  |
| Cost         | HF Jobs compute cost      | API usage fees             |
| Speed        | vLLM optimized            | Depends on provider        |
| Offline      | Yes (after download)      | No                         |

Option A: lighteval with vLLM Backend

lighteval is HuggingFace's evaluation library, supporting Open LLM Leaderboard tasks.

Standalone (local GPU):

# Run MMLU 5-shot with vLLM
uv run scripts/lighteval_vllm_uv.py \
  --model meta-llama/Llama-3.2-1B \
  --tasks "leaderboard|mmlu|5"

# Run multiple tasks
uv run scripts/lighteval_vllm_uv.py \
  --model meta-llama/Llama-3.2-1B \
  --tasks "leaderboard|mmlu|5,leaderboard|gsm8k|5"

# Use accelerate backe

---

*Content truncated.*
