evaluating-llms-harness
Evaluates LLMs across 60+ academic benchmarks (MMLU, HumanEval, GSM8K, TruthfulQA, HellaSwag). Use when benchmarking model quality, comparing models, reporting academic results, or tracking training progress. Industry standard used by EleutherAI, HuggingFace, and major labs. Supports HuggingFace, vLLM, APIs.
Install
mkdir -p .claude/skills/evaluating-llms-harness && curl -L -o skill.zip "https://mcp.directory/api/skills/download/1519" && unzip -o skill.zip -d .claude/skills/evaluating-llms-harness && rm skill.zipInstalls to .claude/skills/evaluating-llms-harness
About this skill
lm-evaluation-harness - LLM Benchmarking
Quick start
lm-evaluation-harness evaluates LLMs across 60+ academic benchmarks using standardized prompts and metrics.
Installation:
pip install lm-eval
Evaluate any HuggingFace model:
lm_eval --model hf \
--model_args pretrained=meta-llama/Llama-2-7b-hf \
--tasks mmlu,gsm8k,hellaswag \
--device cuda:0 \
--batch_size 8
View available tasks:
lm_eval --tasks list
Common workflows
Workflow 1: Standard benchmark evaluation
Evaluate model on core benchmarks (MMLU, GSM8K, HumanEval).
Copy this checklist:
Benchmark Evaluation:
- [ ] Step 1: Choose benchmark suite
- [ ] Step 2: Configure model
- [ ] Step 3: Run evaluation
- [ ] Step 4: Analyze results
Step 1: Choose benchmark suite
Core reasoning benchmarks:
- MMLU (Massive Multitask Language Understanding) - 57 subjects, multiple choice
- GSM8K - Grade school math word problems
- HellaSwag - Common sense reasoning
- TruthfulQA - Truthfulness and factuality
- ARC (AI2 Reasoning Challenge) - Science questions
Code benchmarks:
- HumanEval - Python code generation (164 problems)
- MBPP (Mostly Basic Python Problems) - Python coding
Standard suite (recommended for model releases):
--tasks mmlu,gsm8k,hellaswag,truthfulqa,arc_challenge
Step 2: Configure model
HuggingFace model:
lm_eval --model hf \
--model_args pretrained=meta-llama/Llama-2-7b-hf,dtype=bfloat16 \
--tasks mmlu \
--device cuda:0 \
--batch_size auto # Auto-detect optimal batch size
Quantized model (4-bit/8-bit):
lm_eval --model hf \
--model_args pretrained=meta-llama/Llama-2-7b-hf,load_in_4bit=True \
--tasks mmlu \
--device cuda:0
Custom checkpoint:
lm_eval --model hf \
--model_args pretrained=/path/to/my-model,tokenizer=/path/to/tokenizer \
--tasks mmlu \
--device cuda:0
Step 3: Run evaluation
# Full MMLU evaluation (57 subjects)
lm_eval --model hf \
--model_args pretrained=meta-llama/Llama-2-7b-hf \
--tasks mmlu \
--num_fewshot 5 \ # 5-shot evaluation (standard)
--batch_size 8 \
--output_path results/ \
--log_samples # Save individual predictions
# Multiple benchmarks at once
lm_eval --model hf \
--model_args pretrained=meta-llama/Llama-2-7b-hf \
--tasks mmlu,gsm8k,hellaswag,truthfulqa,arc_challenge \
--num_fewshot 5 \
--batch_size 8 \
--output_path results/llama2-7b-eval.json
Step 4: Analyze results
Results saved to results/llama2-7b-eval.json:
{
"results": {
"mmlu": {
"acc": 0.459,
"acc_stderr": 0.004
},
"gsm8k": {
"exact_match": 0.142,
"exact_match_stderr": 0.006
},
"hellaswag": {
"acc_norm": 0.765,
"acc_norm_stderr": 0.004
}
},
"config": {
"model": "hf",
"model_args": "pretrained=meta-llama/Llama-2-7b-hf",
"num_fewshot": 5
}
}
Workflow 2: Track training progress
Evaluate checkpoints during training.
Training Progress Tracking:
- [ ] Step 1: Set up periodic evaluation
- [ ] Step 2: Choose quick benchmarks
- [ ] Step 3: Automate evaluation
- [ ] Step 4: Plot learning curves
Step 1: Set up periodic evaluation
Evaluate every N training steps:
#!/bin/bash
# eval_checkpoint.sh
CHECKPOINT_DIR=$1
STEP=$2
lm_eval --model hf \
--model_args pretrained=$CHECKPOINT_DIR/checkpoint-$STEP \
--tasks gsm8k,hellaswag \
--num_fewshot 0 \ # 0-shot for speed
--batch_size 16 \
--output_path results/step-$STEP.json
Step 2: Choose quick benchmarks
Fast benchmarks for frequent evaluation:
- HellaSwag: ~10 minutes on 1 GPU
- GSM8K: ~5 minutes
- PIQA: ~2 minutes
Avoid for frequent eval (too slow):
- MMLU: ~2 hours (57 subjects)
- HumanEval: Requires code execution
Step 3: Automate evaluation
Integrate with training script:
# In training loop
if step % eval_interval == 0:
model.save_pretrained(f"checkpoints/step-{step}")
# Run evaluation
os.system(f"./eval_checkpoint.sh checkpoints step-{step}")
Or use PyTorch Lightning callbacks:
from pytorch_lightning import Callback
class EvalHarnessCallback(Callback):
def on_validation_epoch_end(self, trainer, pl_module):
step = trainer.global_step
checkpoint_path = f"checkpoints/step-{step}"
# Save checkpoint
trainer.save_checkpoint(checkpoint_path)
# Run lm-eval
os.system(f"lm_eval --model hf --model_args pretrained={checkpoint_path} ...")
Step 4: Plot learning curves
import json
import matplotlib.pyplot as plt
# Load all results
steps = []
mmlu_scores = []
for file in sorted(glob.glob("results/step-*.json")):
with open(file) as f:
data = json.load(f)
step = int(file.split("-")[1].split(".")[0])
steps.append(step)
mmlu_scores.append(data["results"]["mmlu"]["acc"])
# Plot
plt.plot(steps, mmlu_scores)
plt.xlabel("Training Step")
plt.ylabel("MMLU Accuracy")
plt.title("Training Progress")
plt.savefig("training_curve.png")
Workflow 3: Compare multiple models
Benchmark suite for model comparison.
Model Comparison:
- [ ] Step 1: Define model list
- [ ] Step 2: Run evaluations
- [ ] Step 3: Generate comparison table
Step 1: Define model list
# models.txt
meta-llama/Llama-2-7b-hf
meta-llama/Llama-2-13b-hf
mistralai/Mistral-7B-v0.1
microsoft/phi-2
Step 2: Run evaluations
#!/bin/bash
# eval_all_models.sh
TASKS="mmlu,gsm8k,hellaswag,truthfulqa"
while read model; do
echo "Evaluating $model"
# Extract model name for output file
model_name=$(echo $model | sed 's/\//-/g')
lm_eval --model hf \
--model_args pretrained=$model,dtype=bfloat16 \
--tasks $TASKS \
--num_fewshot 5 \
--batch_size auto \
--output_path results/$model_name.json
done < models.txt
Step 3: Generate comparison table
import json
import pandas as pd
models = [
"meta-llama-Llama-2-7b-hf",
"meta-llama-Llama-2-13b-hf",
"mistralai-Mistral-7B-v0.1",
"microsoft-phi-2"
]
tasks = ["mmlu", "gsm8k", "hellaswag", "truthfulqa"]
results = []
for model in models:
with open(f"results/{model}.json") as f:
data = json.load(f)
row = {"Model": model.replace("-", "/")}
for task in tasks:
# Get primary metric for each task
metrics = data["results"][task]
if "acc" in metrics:
row[task.upper()] = f"{metrics['acc']:.3f}"
elif "exact_match" in metrics:
row[task.upper()] = f"{metrics['exact_match']:.3f}"
results.append(row)
df = pd.DataFrame(results)
print(df.to_markdown(index=False))
Output:
| Model | MMLU | GSM8K | HELLASWAG | TRUTHFULQA |
|------------------------|-------|-------|-----------|------------|
| meta-llama/Llama-2-7b | 0.459 | 0.142 | 0.765 | 0.391 |
| meta-llama/Llama-2-13b | 0.549 | 0.287 | 0.801 | 0.430 |
| mistralai/Mistral-7B | 0.626 | 0.395 | 0.812 | 0.428 |
| microsoft/phi-2 | 0.560 | 0.613 | 0.682 | 0.447 |
Workflow 4: Evaluate with vLLM (faster inference)
Use vLLM backend for 5-10x faster evaluation.
vLLM Evaluation:
- [ ] Step 1: Install vLLM
- [ ] Step 2: Configure vLLM backend
- [ ] Step 3: Run evaluation
Step 1: Install vLLM
pip install vllm
Step 2: Configure vLLM backend
lm_eval --model vllm \
--model_args pretrained=meta-llama/Llama-2-7b-hf,tensor_parallel_size=1,dtype=auto,gpu_memory_utilization=0.8 \
--tasks mmlu \
--batch_size auto
Step 3: Run evaluation
vLLM is 5-10× faster than standard HuggingFace:
# Standard HF: ~2 hours for MMLU on 7B model
lm_eval --model hf \
--model_args pretrained=meta-llama/Llama-2-7b-hf \
--tasks mmlu \
--batch_size 8
# vLLM: ~15-20 minutes for MMLU on 7B model
lm_eval --model vllm \
--model_args pretrained=meta-llama/Llama-2-7b-hf,tensor_parallel_size=2 \
--tasks mmlu \
--batch_size auto
When to use vs alternatives
Use lm-evaluation-harness when:
- Benchmarking models for academic papers
- Comparing model quality across standard tasks
- Tracking training progress
- Reporting standardized metrics (everyone uses same prompts)
- Need reproducible evaluation
Use alternatives instead:
- HELM (Stanford): Broader evaluation (fairness, efficiency, calibration)
- AlpacaEval: Instruction-following evaluation with LLM judges
- MT-Bench: Conversational multi-turn evaluation
- Custom scripts: Domain-specific evaluation
Common issues
Issue: Evaluation too slow
Use vLLM backend:
lm_eval --model vllm \
--model_args pretrained=model-name,tensor_parallel_size=2
Or reduce fewshot examples:
--num_fewshot 0 # Instead of 5
Or evaluate subset of MMLU:
--tasks mmlu_stem # Only STEM subjects
Issue: Out of memory
Reduce batch size:
--batch_size 1 # Or --batch_size auto
Use quantization:
--model_args pretrained=model-name,load_in_8bit=True
Enable CPU offloading:
--model_args pretrained=model-name,device_map=auto,offload_folder=offload
Issue: Different results than reported
Check fewshot count:
--num_fewshot 5 # Most papers use 5-shot
Check exact task name:
--tasks mmlu # Not mmlu_direct or mmlu_fewshot
Verify model and tokenizer match:
--model_args pretrained=model-name,tokenizer=same-model-name
Issue: HumanEval not executing code
Install execution dependencies:
pip install human-eval
Enable code execution:
lm_eval --model hf \
--model_args pretrained=model-name \
--tasks humaneval \
--allow_code_execution # Required for HumanEval
Advanced topics
Benchmark descriptions: Se
Content truncated.
More by davila7
View all skills by davila7 →You might also like
flutter-development
aj-geddes
Build beautiful cross-platform mobile apps with Flutter and Dart. Covers widgets, state management with Provider/BLoC, navigation, API integration, and material design.
ui-ux-pro-max
nextlevelbuilder
"UI/UX design intelligence. 50 styles, 21 palettes, 50 font pairings, 20 charts, 8 stacks (React, Next.js, Vue, Svelte, SwiftUI, React Native, Flutter, Tailwind). Actions: plan, build, create, design, implement, review, fix, improve, optimize, enhance, refactor, check UI/UX code. Projects: website, landing page, dashboard, admin panel, e-commerce, SaaS, portfolio, blog, mobile app, .html, .tsx, .vue, .svelte. Elements: button, modal, navbar, sidebar, card, table, form, chart. Styles: glassmorphism, claymorphism, minimalism, brutalism, neumorphism, bento grid, dark mode, responsive, skeuomorphism, flat design. Topics: color palette, accessibility, animation, layout, typography, font pairing, spacing, hover, shadow, gradient."
drawio-diagrams-enhanced
jgtolentino
Create professional draw.io (diagrams.net) diagrams in XML format (.drawio files) with integrated PMP/PMBOK methodologies, extensive visual asset libraries, and industry-standard professional templates. Use this skill when users ask to create flowcharts, swimlane diagrams, cross-functional flowcharts, org charts, network diagrams, UML diagrams, BPMN, project management diagrams (WBS, Gantt, PERT, RACI), risk matrices, stakeholder maps, or any other visual diagram in draw.io format. This skill includes access to custom shape libraries for icons, clipart, and professional symbols.
godot
bfollington
This skill should be used when working on Godot Engine projects. It provides specialized knowledge of Godot's file formats (.gd, .tscn, .tres), architecture patterns (component-based, signal-driven, resource-based), common pitfalls, validation tools, code templates, and CLI workflows. The `godot` command is available for running the game, validating scripts, importing resources, and exporting builds. Use this skill for tasks involving Godot game development, debugging scene/resource files, implementing game systems, or creating new Godot components.
nano-banana-pro
garg-aayush
Generate and edit images using Google's Nano Banana Pro (Gemini 3 Pro Image) API. Use when the user asks to generate, create, edit, modify, change, alter, or update images. Also use when user references an existing image file and asks to modify it in any way (e.g., "modify this image", "change the background", "replace X with Y"). Supports both text-to-image generation and image-to-image editing with configurable resolution (1K default, 2K, or 4K for high resolution). DO NOT read the image file first - use this skill directly with the --input-image parameter.
pdf-to-markdown
aliceisjustplaying
Convert entire PDF documents to clean, structured Markdown for full context loading. Use this skill when the user wants to extract ALL text from a PDF into context (not grep/search), when discussing or analyzing PDF content in full, when the user mentions "load the whole PDF", "bring the PDF into context", "read the entire PDF", or when partial extraction/grepping would miss important context. This is the preferred method for PDF text extraction over page-by-page or grep approaches.
Related MCP Servers
Browse all serversEmpower AI with the Exa MCP Server—an AI research tool for real-time web search, academic data, and smarter, up-to-date
Integrate with OpenDAL for unified access to multiple storage systems, enabling LLMs to manage data efficiently across b
Riza offers a secure bridge between LLMs and a sandboxed code interpreter, allowing safe code execution and persistent t
Unlock AI-ready web data with Firecrawl: scrape any website, handle dynamic content, and automate web scraping for resea
Build persistent semantic networks for enterprise & engineering data management. Enable data persistence and memory acro
Browser Use lets LLMs and agents access and scrape any website in real time, making web scraping and web page scraping e
Stay ahead of the MCP ecosystem
Get weekly updates on new skills and servers.