# nemo-evaluator-sdk

Evaluates LLMs across 100+ benchmarks from 18+ harnesses (MMLU, HumanEval, GSM8K, safety, VLM) with multi-backend execution. Use when you need scalable evaluation on local Docker, Slurm HPC, or cloud platforms. NVIDIA's enterprise-grade platform with a container-first architecture for reproducible benchmarking.

## Install

```bash
mkdir -p .claude/skills/nemo-evaluator-sdk && curl -L -o skill.zip "https://mcp.directory/api/skills/download/5196" && unzip -o skill.zip -d .claude/skills/nemo-evaluator-sdk && rm skill.zip
```

Installs to `.claude/skills/nemo-evaluator-sdk`.

## About this skill

NeMo Evaluator SDK - Enterprise LLM Benchmarking
## Quick Start

NeMo Evaluator SDK evaluates LLMs across 100+ benchmarks from 18+ harnesses using containerized, reproducible evaluation with multi-backend execution (local Docker, Slurm HPC, Lepton cloud).

Installation:

```bash
pip install nemo-evaluator-launcher
```

Set your API key and run an evaluation:

```bash
export NGC_API_KEY=nvapi-your-key-here

# Create a minimal config
cat > config.yaml << 'EOF'
defaults:
  - execution: local
  - deployment: none
  - _self_

execution:
  output_dir: ./results

target:
  api_endpoint:
    model_id: meta/llama-3.1-8b-instruct
    url: https://integrate.api.nvidia.com/v1/chat/completions
    api_key_name: NGC_API_KEY

evaluation:
  tasks:
    - name: ifeval
EOF

# Run the evaluation
nemo-evaluator-launcher run --config-dir . --config-name config
```

View available tasks:

```bash
nemo-evaluator-launcher ls tasks
```
## Common Workflows

### Workflow 1: Evaluate a Model on Standard Benchmarks

Run core academic benchmarks (MMLU, GSM8K, IFEval) against any OpenAI-compatible endpoint.

Checklist:

Standard Evaluation:
- [ ] Step 1: Configure the API endpoint
- [ ] Step 2: Select benchmarks
- [ ] Step 3: Run the evaluation
- [ ] Step 4: Check results
#### Step 1: Configure the API endpoint

```yaml
# config.yaml
defaults:
  - execution: local
  - deployment: none
  - _self_

execution:
  output_dir: ./results

target:
  api_endpoint:
    model_id: meta/llama-3.1-8b-instruct
    url: https://integrate.api.nvidia.com/v1/chat/completions
    api_key_name: NGC_API_KEY
```

For self-hosted endpoints (vLLM, TRT-LLM):

```yaml
target:
  api_endpoint:
    model_id: my-model
    url: http://localhost:8000/v1/chat/completions
    api_key_name: ""  # No key needed for local endpoints
```
#### Step 2: Select benchmarks

Add tasks to your config:

```yaml
evaluation:
  tasks:
    - name: ifeval              # Instruction following
    - name: gpqa_diamond        # Graduate-level QA
      env_vars:
        HF_TOKEN: HF_TOKEN      # Some tasks need an HF token
    - name: gsm8k_cot_instruct  # Math reasoning
    - name: humaneval           # Code generation
```
#### Step 3: Run the evaluation

```bash
# Run with the config file
nemo-evaluator-launcher run \
  --config-dir . \
  --config-name config

# Override the output directory
nemo-evaluator-launcher run \
  --config-dir . \
  --config-name config \
  -o execution.output_dir=./my_results

# Limit samples for quick testing
nemo-evaluator-launcher run \
  --config-dir . \
  --config-name config \
  -o +evaluation.nemo_evaluator_config.config.params.limit_samples=10
```
#### Step 4: Check results

```bash
# Check job status
nemo-evaluator-launcher status <invocation_id>

# List all runs
nemo-evaluator-launcher ls runs

# View results
cat results/<invocation_id>/<task>/artifacts/results.yml
```
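Per the layout above, each task's metrics land under `results/<invocation_id>/<task>/artifacts/results.yml`. A minimal sketch of walking that layout programmatically to find every task's results file — the invocation id and task names below are synthetic placeholders, not real launcher output:

```python
import tempfile
from pathlib import Path

# Build a synthetic results tree mimicking the documented layout:
# results/<invocation_id>/<task>/artifacts/results.yml
root = Path(tempfile.mkdtemp()) / "results" / "abc123"  # hypothetical invocation id
for task in ["ifeval", "gsm8k_cot_instruct"]:
    artifacts = root / task / "artifacts"
    artifacts.mkdir(parents=True)
    (artifacts / "results.yml").write_text(f"task: {task}\n")

# Collect every results.yml for this invocation, one per task directory
found = sorted(p.parent.parent.name for p in root.glob("*/artifacts/results.yml"))
print(found)  # ['gsm8k_cot_instruct', 'ifeval']
```

The same glob pattern works against a real output directory by pointing `root` at `results/<invocation_id>`.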
### Workflow 2: Run Evaluation on a Slurm HPC Cluster

Execute large-scale evaluations on HPC infrastructure.

Checklist:

Slurm Evaluation:
- [ ] Step 1: Configure Slurm settings
- [ ] Step 2: Set up model deployment
- [ ] Step 3: Launch the evaluation
- [ ] Step 4: Monitor job status
#### Step 1: Configure Slurm settings

```yaml
# slurm_config.yaml
defaults:
  - execution: slurm
  - deployment: vllm
  - _self_

execution:
  hostname: cluster.example.com
  account: my_slurm_account
  partition: gpu
  output_dir: /shared/results
  walltime: "04:00:00"
  nodes: 1
  gpus_per_node: 8
```
#### Step 2: Set up model deployment

```yaml
deployment:
  checkpoint_path: /shared/models/llama-3.1-8b
  tensor_parallel_size: 2
  data_parallel_size: 4
  max_model_len: 4096

target:
  api_endpoint:
    model_id: llama-3.1-8b
    # URL is auto-generated by the deployment
```
#### Step 3: Launch the evaluation

```bash
nemo-evaluator-launcher run \
  --config-dir . \
  --config-name slurm_config
```

#### Step 4: Monitor job status

```bash
# Check status (queries sacct)
nemo-evaluator-launcher status <invocation_id>

# View detailed info
nemo-evaluator-launcher info <invocation_id>

# Kill if needed
nemo-evaluator-launcher kill <invocation_id>
```
### Workflow 3: Compare Multiple Models

Benchmark multiple models on the same tasks for comparison.

Checklist:

Model Comparison:
- [ ] Step 1: Create a base config
- [ ] Step 2: Run evaluations with model overrides
- [ ] Step 3: Export and compare results
#### Step 1: Create a base config

```yaml
# base_eval.yaml
defaults:
  - execution: local
  - deployment: none
  - _self_

execution:
  output_dir: ./comparison_results

evaluation:
  nemo_evaluator_config:
    config:
      params:
        temperature: 0.01
        parallelism: 4
  tasks:
    - name: mmlu_pro
    - name: gsm8k_cot_instruct
    - name: ifeval
```
#### Step 2: Run evaluations with model overrides

```bash
# Evaluate Llama 3.1 8B
nemo-evaluator-launcher run \
  --config-dir . \
  --config-name base_eval \
  -o target.api_endpoint.model_id=meta/llama-3.1-8b-instruct \
  -o target.api_endpoint.url=https://integrate.api.nvidia.com/v1/chat/completions

# Evaluate Mistral 7B
nemo-evaluator-launcher run \
  --config-dir . \
  --config-name base_eval \
  -o target.api_endpoint.model_id=mistralai/mistral-7b-instruct-v0.3 \
  -o target.api_endpoint.url=https://integrate.api.nvidia.com/v1/chat/completions
```
#### Step 3: Export and compare

```bash
# Export to MLflow
nemo-evaluator-launcher export <invocation_id_1> --dest mlflow
nemo-evaluator-launcher export <invocation_id_2> --dest mlflow

# Export to local JSON
nemo-evaluator-launcher export <invocation_id> --dest local --format json

# Export to Weights & Biases
nemo-evaluator-launcher export <invocation_id> --dest wandb
```
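After a local JSON export, the two runs can be diffed side by side with a few lines of Python. A hedged sketch using synthetic score files — the file names, JSON layout, and metric values below are made up for illustration, and the real export schema may differ:

```python
import json
import tempfile
from pathlib import Path

# Synthetic exports standing in for two locally exported JSON result files
workdir = Path(tempfile.mkdtemp())
runs = {
    "meta/llama-3.1-8b-instruct": {"ifeval": 0.81, "mmlu_pro": 0.47},
    "mistralai/mistral-7b-instruct-v0.3": {"ifeval": 0.74, "mmlu_pro": 0.41},
}
for i, (model, scores) in enumerate(runs.items()):
    (workdir / f"run_{i}.json").write_text(json.dumps({"model": model, "scores": scores}))

# Load both exports and print a per-task comparison
loaded = [json.loads(p.read_text()) for p in sorted(workdir.glob("run_*.json"))]
for task in sorted(loaded[0]["scores"]):
    row = "  ".join(f'{r["model"]}: {r["scores"][task]:.2f}' for r in loaded)
    print(f"{task:10s} {row}")
```

Inspect one real exported file first and adjust the key names to whatever schema the exporter actually writes.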
### Workflow 4: Safety and Vision-Language Evaluation

Evaluate models on safety benchmarks and VLM tasks.

Checklist:

Safety/VLM Evaluation:
- [ ] Step 1: Configure safety tasks
- [ ] Step 2: Set up VLM tasks (if applicable)
- [ ] Step 3: Run the evaluation

#### Step 1: Configure safety tasks

```yaml
evaluation:
  tasks:
    - name: aegis      # Safety harness
    - name: wildguard  # Safety classification
    - name: garak      # Security probing
```

#### Step 2: Configure VLM tasks

```yaml
# For vision-language models
target:
  api_endpoint:
    type: vlm  # Vision-language endpoint
    model_id: nvidia/llama-3.2-90b-vision-instruct
    url: https://integrate.api.nvidia.com/v1/chat/completions

evaluation:
  tasks:
    - name: ocrbench  # OCR evaluation
    - name: chartqa   # Chart understanding
    - name: mmmu      # Multimodal understanding
```
## When to Use vs. Alternatives

Use NeMo Evaluator when:

- You need 100+ benchmarks from 18+ harnesses in one platform
- You are running evaluations on Slurm HPC clusters or in the cloud
- You require reproducible, containerized evaluation
- You are evaluating OpenAI-compatible APIs (vLLM, TRT-LLM, NIMs)
- You need enterprise-grade evaluation with result export (MLflow, W&B)

Use alternatives instead:

- lm-evaluation-harness: simpler setup for quick local evaluation
- bigcode-evaluation-harness: focused solely on code benchmarks
- HELM: Stanford's broader evaluation (fairness, efficiency)
- Custom scripts: highly specialized domain evaluation
## Supported Harnesses and Tasks

| Harness | Task Count | Categories |
|---|---|---|
| lm-evaluation-harness | 60+ | MMLU, GSM8K, HellaSwag, ARC |
| simple-evals | 20+ | GPQA, MATH, AIME |
| bigcode-evaluation-harness | 25+ | HumanEval, MBPP, MultiPL-E |
| safety-harness | 3 | Aegis, WildGuard |
| garak | 1 | Security probing |
| vlmevalkit | 6+ | OCRBench, ChartQA, MMMU |
| bfcl | 6 | Function calling v2/v3 |
| mtbench | 2 | Multi-turn conversation |
| livecodebench | 10+ | Live coding evaluation |
| helm | 15 | Medical domain |
| nemo-skills | 8 | Math, science, agentic |
## Common Issues

### Issue: Container pull fails

Ensure NGC credentials are configured:

```bash
docker login nvcr.io -u '$oauthtoken' -p $NGC_API_KEY
```

### Issue: Task requires an environment variable

Some tasks need `HF_TOKEN` or `JUDGE_API_KEY`:

```yaml
evaluation:
  tasks:
    - name: gpqa_diamond
      env_vars:
        HF_TOKEN: HF_TOKEN  # Maps the task's env var name to your env var
```

### Issue: Evaluation timeout

Increase parallelism or reduce the sample count:

```bash
-o +evaluation.nemo_evaluator_config.config.params.parallelism=8
-o +evaluation.nemo_evaluator_config.config.params.limit_samples=100
```

### Issue: Slurm job not starting

Check the Slurm account and partition:

```yaml
execution:
  account: correct_account
  partition: gpu
  qos: normal  # May need a specific QOS
```

### Issue: Results differ from expectations

Verify the configuration matches the reported settings:

```yaml
evaluation:
  nemo_evaluator_config:
    config:
      params:
        temperature: 0.0  # Deterministic
        num_fewshot: 5    # Check the paper's few-shot count
```
## CLI Reference

| Command | Description |
|---|---|
| run | Execute an evaluation with a config |
| status <id> | Check job status |
| info <id> | View detailed job info |
| ls tasks | List available benchmarks |
| ls runs | List all invocations |
| export <id> | Export results (mlflow/wandb/local) |
| kill <id> | Terminate a running job |
## Configuration Override Examples

```bash
# Override the model endpoint
-o target.api_endpoint.model_id=my-model
-o target.api_endpoint.url=http://localhost:8000/v1/chat/completions

# Add evaluation parameters
-o +evaluation.nemo_evaluator_config.config.params.temperature=0.5
-o +evaluation.nemo_evaluator_config.config.params.parallelism=8
-o +evaluation.nemo_evaluator_config.config.params.limit_samples=50

# Change execution settings
-o execution.output_dir=/custom/path
-o execution.mode=parallel

# Dynamically set tasks
-o 'evaluation.tasks=[{name: ifeval}, {name: gsm8k}]'
```
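These `-o` overrides follow Hydra-style dotted paths, where a leading `+` adds a key that is absent from the base config. A toy sketch of the dotted-path idea on a plain dict — this illustrates the semantics only and is not the launcher's actual implementation (which builds on Hydra/OmegaConf):

```python
def apply_override(cfg: dict, override: str) -> None:
    """Set a dotted-path override like 'target.api_endpoint.model_id=my-model'.

    Values stay strings in this toy version; the real tool parses types.
    """
    path, _, value = override.partition("=")
    keys = path.lstrip("+").split(".")  # leading '+' means "add a new key"
    node = cfg
    for key in keys[:-1]:
        node = node.setdefault(key, {})  # create intermediate sections as needed
    node[keys[-1]] = value

cfg = {"target": {"api_endpoint": {"model_id": "old"}}}
apply_override(cfg, "target.api_endpoint.model_id=my-model")
apply_override(cfg, "+evaluation.nemo_evaluator_config.config.params.limit_samples=50")
print(cfg["target"]["api_endpoint"]["model_id"])  # my-model
```

The takeaway: each dotted segment selects one level of the nested YAML config, so the override paths in the examples above mirror the config files shown earlier.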
## Python API Usage

For programmatic evaluation without the CLI:

```python
from nemo_evaluator.core.evaluate import evaluate
from nemo_evaluat
```

---

*Content truncated.*