nemo-evaluator-sdk

Name: nemo-evaluator-sdk
Author: davila7

0views

1installs

Evaluates LLMs across 100+ benchmarks from 18+ harnesses (MMLU, HumanEval, GSM8K, safety, VLM) with multi-backend execution. Use when needing scalable evaluation on local Docker, Slurm HPC, or cloud platforms. NVIDIA's enterprise-grade platform with container-first architecture for reproducible benchmarking.

Install

mkdir -p .claude/skills/nemo-evaluator-sdk && curl -L -o skill.zip "https://mcp.directory/api/skills/download/5196" && unzip -o skill.zip -d .claude/skills/nemo-evaluator-sdk && rm skill.zip

Installs to .claude/skills/nemo-evaluator-sdk

About this skill

NeMo Evaluator SDK - Enterprise LLM Benchmarking

Quick Start

NeMo Evaluator SDK evaluates LLMs across 100+ benchmarks from 18+ harnesses using containerized, reproducible evaluation with multi-backend execution (local Docker, Slurm HPC, Lepton cloud).

Installation:

pip install nemo-evaluator-launcher

Set API key and run evaluation:

export NGC_API_KEY=nvapi-your-key-here

# Create minimal config
cat > config.yaml << 'EOF'
defaults:
  - execution: local
  - deployment: none
  - _self_

execution:
  output_dir: ./results

target:
  api_endpoint:
    model_id: meta/llama-3.1-8b-instruct
    url: https://integrate.api.nvidia.com/v1/chat/completions
    api_key_name: NGC_API_KEY

evaluation:
  tasks:
    - name: ifeval
EOF

# Run evaluation
nemo-evaluator-launcher run --config-dir . --config-name config

View available tasks:

nemo-evaluator-launcher ls tasks

Common Workflows

Workflow 1: Evaluate Model on Standard Benchmarks

Run core academic benchmarks (MMLU, GSM8K, IFEval) on any OpenAI-compatible endpoint.

Checklist:

Standard Evaluation:
- [ ] Step 1: Configure API endpoint
- [ ] Step 2: Select benchmarks
- [ ] Step 3: Run evaluation
- [ ] Step 4: Check results

Step 1: Configure API endpoint

# config.yaml
defaults:
  - execution: local
  - deployment: none
  - _self_

execution:
  output_dir: ./results

target:
  api_endpoint:
    model_id: meta/llama-3.1-8b-instruct
    url: https://integrate.api.nvidia.com/v1/chat/completions
    api_key_name: NGC_API_KEY

For self-hosted endpoints (vLLM, TRT-LLM):

target:
  api_endpoint:
    model_id: my-model
    url: http://localhost:8000/v1/chat/completions
    api_key_name: ""  # No key needed for local

Step 2: Select benchmarks

Add tasks to your config:

evaluation:
  tasks:
    - name: ifeval           # Instruction following
    - name: gpqa_diamond     # Graduate-level QA
      env_vars:
        HF_TOKEN: HF_TOKEN   # Some tasks need HF token
    - name: gsm8k_cot_instruct  # Math reasoning
    - name: humaneval        # Code generation

Step 3: Run evaluation

# Run with config file
nemo-evaluator-launcher run \
  --config-dir . \
  --config-name config

# Override output directory
nemo-evaluator-launcher run \
  --config-dir . \
  --config-name config \
  -o execution.output_dir=./my_results

# Limit samples for quick testing
nemo-evaluator-launcher run \
  --config-dir . \
  --config-name config \
  -o +evaluation.nemo_evaluator_config.config.params.limit_samples=10

Step 4: Check results

# Check job status
nemo-evaluator-launcher status <invocation_id>

# List all runs
nemo-evaluator-launcher ls runs

# View results
cat results/<invocation_id>/<task>/artifacts/results.yml

Workflow 2: Run Evaluation on Slurm HPC Cluster

Execute large-scale evaluation on HPC infrastructure.

Checklist:

Slurm Evaluation:
- [ ] Step 1: Configure Slurm settings
- [ ] Step 2: Set up model deployment
- [ ] Step 3: Launch evaluation
- [ ] Step 4: Monitor job status

Step 1: Configure Slurm settings

# slurm_config.yaml
defaults:
  - execution: slurm
  - deployment: vllm
  - _self_

execution:
  hostname: cluster.example.com
  account: my_slurm_account
  partition: gpu
  output_dir: /shared/results
  walltime: "04:00:00"
  nodes: 1
  gpus_per_node: 8

Step 2: Set up model deployment

deployment:
  checkpoint_path: /shared/models/llama-3.1-8b
  tensor_parallel_size: 2
  data_parallel_size: 4
  max_model_len: 4096

target:
  api_endpoint:
    model_id: llama-3.1-8b
    # URL auto-generated by deployment

Step 3: Launch evaluation

nemo-evaluator-launcher run \
  --config-dir . \
  --config-name slurm_config

Step 4: Monitor job status

# Check status (queries sacct)
nemo-evaluator-launcher status <invocation_id>

# View detailed info
nemo-evaluator-launcher info <invocation_id>

# Kill if needed
nemo-evaluator-launcher kill <invocation_id>

Workflow 3: Compare Multiple Models

Benchmark multiple models on the same tasks for comparison.

Checklist:

Model Comparison:
- [ ] Step 1: Create base config
- [ ] Step 2: Run evaluations with overrides
- [ ] Step 3: Export and compare results

Step 1: Create base config

# base_eval.yaml
defaults:
  - execution: local
  - deployment: none
  - _self_

execution:
  output_dir: ./comparison_results

evaluation:
  nemo_evaluator_config:
    config:
      params:
        temperature: 0.01
        parallelism: 4
  tasks:
    - name: mmlu_pro
    - name: gsm8k_cot_instruct
    - name: ifeval

Step 2: Run evaluations with model overrides

# Evaluate Llama 3.1 8B
nemo-evaluator-launcher run \
  --config-dir . \
  --config-name base_eval \
  -o target.api_endpoint.model_id=meta/llama-3.1-8b-instruct \
  -o target.api_endpoint.url=https://integrate.api.nvidia.com/v1/chat/completions

# Evaluate Mistral 7B
nemo-evaluator-launcher run \
  --config-dir . \
  --config-name base_eval \
  -o target.api_endpoint.model_id=mistralai/mistral-7b-instruct-v0.3 \
  -o target.api_endpoint.url=https://integrate.api.nvidia.com/v1/chat/completions

Step 3: Export and compare

# Export to MLflow
nemo-evaluator-launcher export <invocation_id_1> --dest mlflow
nemo-evaluator-launcher export <invocation_id_2> --dest mlflow

# Export to local JSON
nemo-evaluator-launcher export <invocation_id> --dest local --format json

# Export to Weights & Biases
nemo-evaluator-launcher export <invocation_id> --dest wandb

Workflow 4: Safety and Vision-Language Evaluation

Evaluate models on safety benchmarks and VLM tasks.

Checklist:

Safety/VLM Evaluation:
- [ ] Step 1: Configure safety tasks
- [ ] Step 2: Set up VLM tasks (if applicable)
- [ ] Step 3: Run evaluation

Step 1: Configure safety tasks

evaluation:
  tasks:
    - name: aegis              # Safety harness
    - name: wildguard          # Safety classification
    - name: garak              # Security probing

Step 2: Configure VLM tasks

# For vision-language models
target:
  api_endpoint:
    type: vlm  # Vision-language endpoint
    model_id: nvidia/llama-3.2-90b-vision-instruct
    url: https://integrate.api.nvidia.com/v1/chat/completions

evaluation:
  tasks:
    - name: ocrbench           # OCR evaluation
    - name: chartqa            # Chart understanding
    - name: mmmu               # Multimodal understanding

When to Use vs Alternatives

Use NeMo Evaluator when:

Need 100+ benchmarks from 18+ harnesses in one platform
Running evaluations on Slurm HPC clusters or cloud
Requiring reproducible containerized evaluation
Evaluating against OpenAI-compatible APIs (vLLM, TRT-LLM, NIMs)
Need enterprise-grade evaluation with result export (MLflow, W&B)

Use alternatives instead:

lm-evaluation-harness: Simpler setup for quick local evaluation
bigcode-evaluation-harness: Focused only on code benchmarks
HELM: Stanford's broader evaluation (fairness, efficiency)
Custom scripts: Highly specialized domain evaluation

Supported Harnesses and Tasks

Harness	Task Count	Categories
`lm-evaluation-harness`	60+	MMLU, GSM8K, HellaSwag, ARC
`simple-evals`	20+	GPQA, MATH, AIME
`bigcode-evaluation-harness`	25+	HumanEval, MBPP, MultiPL-E
`safety-harness`	3	Aegis, WildGuard
`garak`	1	Security probing
`vlmevalkit`	6+	OCRBench, ChartQA, MMMU
`bfcl`	6	Function calling v2/v3
`mtbench`	2	Multi-turn conversation
`livecodebench`	10+	Live coding evaluation
`helm`	15	Medical domain
`nemo-skills`	8	Math, science, agentic

Common Issues

Issue: Container pull fails

Ensure NGC credentials are configured:

docker login nvcr.io -u '$oauthtoken' -p $NGC_API_KEY

Issue: Task requires environment variable

Some tasks need HF_TOKEN or JUDGE_API_KEY:

evaluation:
  tasks:
    - name: gpqa_diamond
      env_vars:
        HF_TOKEN: HF_TOKEN  # Maps env var name to env var

Issue: Evaluation timeout

Increase parallelism or reduce samples:

-o +evaluation.nemo_evaluator_config.config.params.parallelism=8
-o +evaluation.nemo_evaluator_config.config.params.limit_samples=100

Issue: Slurm job not starting

Check Slurm account and partition:

execution:
  account: correct_account
  partition: gpu
  qos: normal  # May need specific QOS

Issue: Different results than expected

Verify configuration matches reported settings:

evaluation:
  nemo_evaluator_config:
    config:
      params:
        temperature: 0.0  # Deterministic
        num_fewshot: 5    # Check paper's fewshot count

CLI Reference

Command	Description
`run`	Execute evaluation with config
`status <id>`	Check job status
`info <id>`	View detailed job info
`ls tasks`	List available benchmarks
`ls runs`	List all invocations
`export <id>`	Export results (mlflow/wandb/local)
`kill <id>`	Terminate running job

Configuration Override Examples

# Override model endpoint
-o target.api_endpoint.model_id=my-model
-o target.api_endpoint.url=http://localhost:8000/v1/chat/completions

# Add evaluation parameters
-o +evaluation.nemo_evaluator_config.config.params.temperature=0.5
-o +evaluation.nemo_evaluator_config.config.params.parallelism=8
-o +evaluation.nemo_evaluator_config.config.params.limit_samples=50

# Change execution settings
-o execution.output_dir=/custom/path
-o execution.mode=parallel

# Dynamically set tasks
-o 'evaluation.tasks=[{name: ifeval}, {name: gsm8k}]'

Python API Usage

For programmatic evaluation without the CLI:

from nemo_evaluator.core.evaluate import evaluate
from nemo_evaluat

---

*Content truncated.*

More by davila7

View all skills by davila7 →

software-architecture

davila7

Guide for quality focused software architecture. This skill should be used when users want to write code, design architecture, analyze code, in any case that relates to software development.

1,254479

planning-with-files

davila7

Implements Manus-style file-based planning for complex tasks. Creates task_plan.md, findings.md, and progress.md. Use when starting complex multi-step tasks, research projects, or any task requiring >5 tool calls.

128311

telegram-bot-builder

davila7

Expert in building Telegram bots that solve real problems - from simple automation to complex AI-powered bots. Covers bot architecture, the Telegram Bot API, user experience, monetization strategies, and scaling bots to thousands of users. Use when: telegram bot, bot api, telegram automation, chat bot telegram, tg bot.

157176

scientific-brainstorming

davila7

Research ideation partner. Generate hypotheses, explore interdisciplinary connections, challenge assumptions, develop methodologies, identify research gaps, for creative scientific problem-solving.

329173

scroll-experience

davila7

Expert in building immersive scroll-driven experiences - parallax storytelling, scroll animations, interactive narratives, and cinematic web experiences. Like NY Times interactives, Apple product pages, and award-winning web experiences. Makes websites feel like experiences, not just pages. Use when: scroll animation, parallax, scroll storytelling, interactive story, cinematic website.

148106

humanizer

davila7

Remove signs of AI-generated writing from text. Use when editing or reviewing text to make it sound more natural and human-written. Based on Wikipedia's comprehensive "Signs of AI writing" guide. Detects and fixes patterns including: inflated symbolism, promotional language, superficial -ing analyses, vague attributions, em dash overuse, rule of three, AI vocabulary words, negative parallelisms, and excessive conjunctive phrases. Credits: Original skill by @blader - https://github.com/blader/humanizer

222105

ui-ux-pro-max

nextlevelbuilder

"UI/UX design intelligence. 50 styles, 21 palettes, 50 font pairings, 20 charts, 8 stacks (React, Next.js, Vue, Svelte, SwiftUI, React Native, Flutter, Tailwind). Actions: plan, build, create, design, implement, review, fix, improve, optimize, enhance, refactor, check UI/UX code. Projects: website, landing page, dashboard, admin panel, e-commerce, SaaS, portfolio, blog, mobile app, .html, .tsx, .vue, .svelte. Elements: button, modal, navbar, sidebar, card, table, form, chart. Styles: glassmorphism, claymorphism, minimalism, brutalism, neumorphism, bento grid, dark mode, responsive, skeuomorphism, flat design. Topics: color palette, accessibility, animation, layout, typography, font pairing, spacing, hover, shadow, gradient."

2,8692,519

pdf-to-markdown

aliceisjustplaying

Convert entire PDF documents to clean, structured Markdown for full context loading. Use this skill when the user wants to extract ALL text from a PDF into context (not grep/search), when discussing or analyzing PDF content in full, when the user mentions "load the whole PDF", "bring the PDF into context", "read the entire PDF", or when partial extraction/grepping would miss important context. This is the preferred method for PDF text extraction over page-by-page or grep approaches.

3,7961,653

flutter-development

aj-geddes

Build beautiful cross-platform mobile apps with Flutter and Dart. Covers widgets, state management with Provider/BLoC, navigation, API integration, and material design.

2,1491,640

drawio-diagrams-enhanced

jgtolentino

Create professional draw.io (diagrams.net) diagrams in XML format (.drawio files) with integrated PMP/PMBOK methodologies, extensive visual asset libraries, and industry-standard professional templates. Use this skill when users ask to create flowcharts, swimlane diagrams, cross-functional flowcharts, org charts, network diagrams, UML diagrams, BPMN, project management diagrams (WBS, Gantt, PERT, RACI), risk matrices, stakeholder maps, or any other visual diagram in draw.io format. This skill includes access to custom shape libraries for icons, clipart, and professional symbols.

2,2651,465

godot

bfollington

This skill should be used when working on Godot Engine projects. It provides specialized knowledge of Godot's file formats (.gd, .tscn, .tres), architecture patterns (component-based, signal-driven, resource-based), common pitfalls, validation tools, code templates, and CLI workflows. The `godot` command is available for running the game, validating scripts, importing resources, and exporting builds. Use this skill for tasks involving Godot game development, debugging scene/resource files, implementing game systems, or creating new Godot components.

2,4611,222

nano-banana-pro

garg-aayush

Generate and edit images using Google's Nano Banana Pro (Gemini 3 Pro Image) API. Use when the user asks to generate, create, edit, modify, change, alter, or update images. Also use when user references an existing image file and asks to modify it in any way (e.g., "modify this image", "change the background", "replace X with Y"). Supports both text-to-image generation and image-to-image editing with configurable resolution (1K default, 2K, or 4K for high resolution). DO NOT read the image file first - use this skill directly with the --input-image parameter.

1,955969

Related MCP Servers

Browse all servers

OpenDAL

Integrate with OpenDAL for unified access to multiple storage systems, enabling LLMs to manage data efficiently across backends.

340 tools

Riza

Riza offers a secure bridge between LLMs and a sandboxed code interpreter, allowing safe code execution and persistent tool management.

120 tools

Firecrawl

Unlock AI-ready web data with Firecrawl: scrape any website, handle dynamic content, and automate web scraping for research or automation.

89,5930 tools

Knowledge Graph Memory

Build persistent semantic networks for enterprise & engineering data management. Enable data persistence and memory across chats efficiently.

80,5279 tools

Browser Use

Browser Use lets LLMs and agents access and scrape any website in real time, making web scraping and web page scraping effortless via API.

79,9420 tools

Task Master

Boost productivity with Task Master: an AI-powered tool for project management and agile development workflows, integrated with popular editors.

25,8320 tools

Install

mkdir -p .claude/skills/nemo-evaluator-sdk && curl -L -o skill.zip "https://mcp.directory/api/skills/download/5196" && unzip -o skill.zip -d .claude/skills/nemo-evaluator-sdk && rm skill.zip

Installs to .claude/skills/nemo-evaluator-sdk

Stats

Views

Installs

Author

davila7

7 skills published

Links

Source Code

nemo-evaluator-sdk

Install

About this skill

NeMo Evaluator SDK - Enterprise LLM Benchmarking

Quick Start

Common Workflows

Workflow 1: Evaluate Model on Standard Benchmarks

Workflow 2: Run Evaluation on Slurm HPC Cluster

Workflow 3: Compare Multiple Models

Workflow 4: Safety and Vision-Language Evaluation

When to Use vs Alternatives

Supported Harnesses and Tasks

Common Issues

CLI Reference

Configuration Override Examples

Python API Usage

More by davila7

software-architecture

planning-with-files

telegram-bot-builder

scientific-brainstorming

scroll-experience

humanizer

You might also like

ui-ux-pro-max

pdf-to-markdown

flutter-development

drawio-diagrams-enhanced

godot

nano-banana-pro

Related MCP Servers