evals

Agent evaluation framework based on Anthropic's best practices. USE WHEN eval, evaluate, test agent, benchmark, verify behavior, regression test, capability test. Includes three grader types (code-based, model-based, human), transcript capture, pass@k/pass^k metrics, and ALGORITHM integration.

Install

mkdir -p .claude/skills/evals && curl -L -o skill.zip "https://mcp.directory/api/skills/download/7825" && unzip -o skill.zip -d .claude/skills/evals && rm skill.zip

Installs to .claude/skills/evals

About this skill

Customization

Before executing, check for user customizations at: ~/.claude/skills/PAI/USER/SKILLCUSTOMIZATIONS/Evals/

If this directory exists, load and apply any PREFERENCES.md, configurations, or resources found there. These override default behavior. If the directory does not exist, proceed with skill defaults.

🚨 MANDATORY: Voice Notification (REQUIRED BEFORE ANY ACTION)

You MUST send this notification BEFORE doing anything else when this skill is invoked.

Send voice notification:

curl -s -X POST http://localhost:8888/notify \
  -H "Content-Type: application/json" \
  -d '{"message": "Running the WORKFLOWNAME workflow in the Evals skill to ACTION"}' \
  > /dev/null 2>&1 &

Output text notification:

Running the **WorkflowName** workflow in the **Evals** skill to ACTION...

This is not optional. Execute this curl command immediately upon skill invocation.

Evals - AI Agent Evaluation Framework

Comprehensive agent evaluation system based on Anthropic's "Demystifying Evals for AI Agents" (Jan 2026).

Key differentiator: Evaluates agent workflows (transcripts, tool calls, multi-turn conversations), not just single outputs.

When to Activate

- "run evals", "test this agent", "evaluate", "check quality", "benchmark"
- "regression test", "capability test"
- Compare agent behaviors across changes
- Validate agent workflows before deployment
- Verify ALGORITHM ISC rows
- Create new evaluation tasks from failures

Core Concepts

Three Grader Types

| Type | Strengths | Weaknesses | Use For |
|---|---|---|---|
| Code-based | Fast, cheap, deterministic, reproducible | Brittle, lacks nuance | Tests, state checks, tool verification |
| Model-based | Flexible, captures nuance, scalable | Non-deterministic, expensive | Quality rubrics, assertions, comparisons |
| Human | Gold standard, handles subjectivity | Expensive, slow | Calibration, spot checks, A/B testing |

Evaluation Types

| Type | Pass Target | Purpose |
|---|---|---|
| Capability | ~70% | Stretch goals, measuring improvement potential |
| Regression | ~99% | Quality gates, detecting backsliding |

Key Metrics

- pass@k: Probability of at least one success in k trials (measures capability)
- pass^k: Probability that all k trials succeed (measures consistency/reliability)
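The two metrics above can be made concrete with a small sketch. It assumes trials are independent and share one per-trial success probability `p`; the function names `passAtK`, `passHatK`, and `successRate` are illustrative and are not taken from `Tools/TrialRunner.ts`.

```typescript
// Sketch under the assumption of independent trials with a fixed
// per-trial success probability p. All names here are hypothetical.

// pass@k: probability that at least one of k trials succeeds.
function passAtK(p: number, k: number): number {
  return 1 - Math.pow(1 - p, k);
}

// pass^k: probability that all k trials succeed.
function passHatK(p: number, k: number): number {
  return Math.pow(p, k);
}

// Estimate p from observed trial outcomes (true = success).
function successRate(outcomes: boolean[]): number {
  if (outcomes.length === 0) return 0;
  return outcomes.filter(Boolean).length / outcomes.length;
}

const observedRate = successRate([true, true, true, false]); // 0.75
console.log(passAtK(observedRate, 3).toFixed(3));  // prints "0.984"
console.log(passHatK(observedRate, 3).toFixed(3)); // prints "0.422"
```

With a 75% per-trial success rate, three trials almost always contain a success, yet all three succeed well under half the time. This gap is why capability and reliability are tracked with different metrics.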

Workflow Routing

| Trigger | Workflow |
|---|---|
| "run evals", "evaluate suite" | Run suite via `Tools/AlgorithmBridge.ts` |
| "log failure" | Log failure via `Tools/FailureToTask.ts log` |
| "convert failures" | Convert to tasks via `Tools/FailureToTask.ts convert-all` |
| "create suite" | Create suite via `Tools/SuiteManager.ts create` |
| "check saturation" | Check via `Tools/SuiteManager.ts check-saturation` |

Quick Reference

CLI Commands

# Run an eval suite
bun run ~/.claude/skills/Evals/Tools/AlgorithmBridge.ts -s <suite>

# Log a failure for later conversion
bun run ~/.claude/skills/Evals/Tools/FailureToTask.ts log "description" -c category -s severity

# Convert failures to test tasks
bun run ~/.claude/skills/Evals/Tools/FailureToTask.ts convert-all

# Manage suites
bun run ~/.claude/skills/Evals/Tools/SuiteManager.ts create <name> -t capability -d "description"
bun run ~/.claude/skills/Evals/Tools/SuiteManager.ts list
bun run ~/.claude/skills/Evals/Tools/SuiteManager.ts check-saturation <name>
bun run ~/.claude/skills/Evals/Tools/SuiteManager.ts graduate <name>

ALGORITHM Integration

Evals is a verification method for THE ALGORITHM ISC rows:

# Run eval and update ISC row
bun run ~/.claude/skills/Evals/Tools/AlgorithmBridge.ts -s regression-core -r 3 -u

ISC rows can specify eval verification:

| # | What Ideal Looks Like | Verify |
|---|----------------------|--------|
| 1 | Auth bypass fixed | eval:auth-security |
| 2 | Tests all pass | eval:regression |

Available Graders

Code-Based (Fast, Deterministic)

| Grader | Use Case |
|---|---|
| `string_match` | Exact substring matching |
| `regex_match` | Pattern matching |
| `binary_tests` | Run test files |
| `static_analysis` | Lint, type-check, security scan |
| `state_check` | Verify system state after execution |
| `tool_calls` | Verify specific tools were called |
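To show what the simpler code-based graders amount to, here is a minimal sketch of three of them. The helper names and signatures are hypothetical; the actual implementations under `Graders/CodeBased/` may differ. In particular, treating a `tool_calls` sequence as an in-order match that tolerates interleaved calls is an assumption.

```typescript
// Hypothetical helpers, not the skill's real grader API.

// string_match: the expected text appears verbatim somewhere in the output.
function stringMatch(output: string, expected: string): boolean {
  return output.includes(expected);
}

// regex_match: the output matches the given pattern source.
function regexMatch(output: string, pattern: string): boolean {
  return new RegExp(pattern).test(output);
}

// tool_calls: every expected tool was called, in the expected order.
// Assumption: other calls may be interleaved between the expected ones.
function toolCallsInOrder(called: string[], expected: string[]): boolean {
  let next = 0;
  for (const name of called) {
    if (next < expected.length && name === expected[next]) next++;
  }
  return next === expected.length;
}

const calls = ["read_file", "list_dir", "edit_file", "run_tests"];
console.log(toolCallsInOrder(calls, ["read_file", "edit_file", "run_tests"])); // prints true
```

Checks like these are deterministic and cheap, which is why they suit tests and tool verification. They are also brittle: a correct answer phrased differently fails `stringMatch`.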

Model-Based (Nuanced)

| Grader | Use Case |
|---|---|
| `llm_rubric` | Score against detailed rubric |
| `natural_language_assert` | Check assertions are true |
| `pairwise_comparison` | Compare to reference with position swap |
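The position swap in `pairwise_comparison` guards against a judge that favors whichever answer it sees first. One way to sketch the idea is below. The `Judge` type and `pairwiseWithSwap` function are hypothetical, and the judge is passed in as a plain synchronous function to keep the sketch self-contained; a real model call would be asynchronous.

```typescript
// Hypothetical types and names; a sketch of the position-swap idea only.
type Verdict = "A" | "B" | "tie";
type Judge = (first: string, second: string) => Verdict;

// Ask the judge twice with the two answers in swapped positions and
// declare a winner only when both orderings agree.
function pairwiseWithSwap(
  judge: Judge,
  candidate: string,
  reference: string
): "candidate" | "reference" | "tie" {
  const forward = judge(candidate, reference);  // candidate shown as A
  const reversed = judge(reference, candidate); // candidate shown as B
  if (forward === "A" && reversed === "B") return "candidate";
  if (forward === "B" && reversed === "A") return "reference";
  return "tie"; // disagreement or explicit tie: no reliable preference
}

// A stub judge with pure position bias always prefers the first answer,
// so the swap exposes the bias and the result collapses to a tie.
const positionBiased: Judge = () => "A";
console.log(pairwiseWithSwap(positionBiased, "answer one", "answer two")); // prints "tie"
```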

Domain Patterns

Pre-configured grader stacks for common agent types:

| Domain | Primary Graders |
|---|---|
| `coding` | binary_tests + static_analysis + tool_calls + llm_rubric |
| `conversational` | llm_rubric + natural_language_assert + state_check |
| `research` | llm_rubric + natural_language_assert + tool_calls |
| `computer_use` | state_check + tool_calls + llm_rubric |

See Data/DomainPatterns.yaml for full configurations.

Task Schema (YAML)

task:
  id: "fix-auth-bypass_1"
  description: "Fix authentication bypass when password is empty"
  type: regression  # or capability
  domain: coding

  graders:
    - type: binary_tests
      required: [test_empty_pw.py]
      weight: 0.30

    - type: tool_calls
      weight: 0.20
      params:
        sequence: [read_file, edit_file, run_tests]

    - type: llm_rubric
      weight: 0.50
      params:
        rubric: prompts/security_review.md

  trials: 3
  pass_threshold: 0.75
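The schema does not say how grader weights and `pass_threshold` combine. A plausible reading, sketched below, is that each trial's score is the weighted mean of its grader scores and the trial passes when that mean reaches the threshold. This is an assumption, and the `GraderResult` type and function names are hypothetical rather than taken from `Types/index.ts`.

```typescript
// Hypothetical shape for one grader's result on one trial.
interface GraderResult {
  type: string;
  weight: number;
  score: number; // assumed to be normalized to the range 0..1
}

// Weighted mean of grader scores; dividing by the total weight keeps the
// result in 0..1 even if the weights do not sum exactly to 1.
function weightedScore(results: GraderResult[]): number {
  const totalWeight = results.reduce((sum, r) => sum + r.weight, 0);
  if (totalWeight === 0) return 0;
  const weighted = results.reduce((sum, r) => sum + r.weight * r.score, 0);
  return weighted / totalWeight;
}

// Assumed rule: a trial passes when its weighted score meets the threshold.
function trialPasses(results: GraderResult[], passThreshold: number): boolean {
  return weightedScore(results) >= passThreshold;
}

// Weights mirror the example task; the scores are made up for illustration.
const exampleTrial: GraderResult[] = [
  { type: "binary_tests", weight: 0.3, score: 1.0 },
  { type: "tool_calls", weight: 0.2, score: 1.0 },
  { type: "llm_rubric", weight: 0.5, score: 0.6 },
];
console.log(trialPasses(exampleTrial, 0.75)); // prints true
```

Under this reading, the rubric carries half the weight, so a rubric score of 0.4 instead of 0.6 would pull the same trial down to 0.70 and fail it even with perfect test and tool-call results.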

Resource Index

| Resource | Purpose |
|---|---|
| `Types/index.ts` | Core type definitions |
| `Graders/CodeBased/` | Deterministic graders |
| `Graders/ModelBased/` | LLM-powered graders |
| `Tools/TranscriptCapture.ts` | Capture agent trajectories |
| `Tools/TrialRunner.ts` | Multi-trial execution with pass@k |
| `Tools/SuiteManager.ts` | Suite management and saturation |
| `Tools/FailureToTask.ts` | Convert failures to test tasks |
| `Tools/AlgorithmBridge.ts` | ALGORITHM integration |
| `Data/DomainPatterns.yaml` | Domain-specific grader configs |

Key Principles (from Anthropic)

- Start with 20-50 real failures: don't overthink; capture what actually broke
- Unambiguous tasks: two experts should reach identical verdicts
- Balanced problem sets: test both "should do" AND "should NOT do"
- Grade outputs, not paths: don't penalize valid creative solutions
- Calibrate LLM judges: against human expert judgment
- Check transcripts regularly: verify graders work correctly
- Monitor saturation: graduate to regression when hitting 95%+
- Build infrastructure early: evals shape how quickly you can adopt new models
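The saturation principle above can be expressed as a simple check over a suite's recent pass rates. The sketch below is one possible rule and does not show the logic of `Tools/SuiteManager.ts check-saturation`. The function name and the three-run window are assumptions; only the 95% threshold comes from the principle itself.

```typescript
// Hypothetical rule: a capability suite counts as saturated when each of
// its most recent runs reaches the threshold. The window of 3 is assumed.
function isSaturated(
  passRates: number[],
  threshold: number = 0.95,
  window: number = 3
): boolean {
  if (passRates.length < window) return false; // not enough history yet
  return passRates.slice(-window).every((rate) => rate >= threshold);
}

// Pass rates from oldest to newest run; the last three are all >= 0.95.
console.log(isSaturated([0.7, 0.96, 0.97, 1.0])); // prints true
```

Requiring several consecutive high runs, rather than one, avoids graduating a suite to regression on the strength of a single lucky run.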

Related

- ALGORITHM: Evals is a verification method
- Science: Evals implements the scientific method
- Browser: For visual verification graders

Author

danielmiessler
