create-eval

This skill should be used when the user asks to "create an eval", "write an eval test", "add a new eval", "create a test case", "write a test for Holmes", or discusses LLM evaluation tests, eval fixtures, or test_case.yaml files for the HolmesGPT project.

Install

mkdir -p .claude/skills/create-eval && curl -L -o skill.zip "https://mcp.directory/api/skills/download/4631" && unzip -o skill.zip -d .claude/skills/create-eval && rm skill.zip

Installs to .claude/skills/create-eval

About this skill

Creating HolmesGPT Eval Tests

This skill provides the complete workflow for creating LLM evaluation tests in the HolmesGPT project. Eval tests validate that Holmes can correctly answer questions by querying real infrastructure and services.

Test Structure

Each eval lives in its own directory under tests/llm/fixtures/test_ask_holmes/:

tests/llm/fixtures/test_ask_holmes/<NNN>_<descriptive_name>/
├── test_case.yaml           # Required: test definition
├── toolsets.yaml            # Optional: enable specific toolsets
├── manifest.yaml            # Optional: Kubernetes manifests
├── generate_*.py            # Optional: data generation scripts
└── other supporting files

Naming convention: <3-digit-number>_<snake_case_description> (e.g., 212_large_configmap_needle).

Creation Workflow

Step 1: Choose Test Number and Namespace

Check existing tests to find the next available number:

ls tests/llm/fixtures/test_ask_holmes/ | sort -n | tail -5

The namespace must be app-<testid> (e.g., app-212). All pod and resource names must be unique across all tests.

Step 2: Validate Tags

Only use tags that already exist in the [tool.pytest.ini_options] markers list in pyproject.toml; invalid tags cause test collection failures. Check that list before assigning tags, and ask the user before adding any new tag.
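
A quick way to see which markers are registered (pytest --markers lists them all; the grep is a rough sketch that assumes the markers list sits under [tool.pytest.ini_options]):

poetry run pytest --markers

# Or inspect the file directly
grep -n -A 30 'markers' pyproject.toml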

Step 3: Write test_case.yaml

Core fields:

user_prompt: "Specific question for Holmes to answer"

expected_output:
  - "Criterion 1: Must report exact value X"
  - "Criterion 2: Must include specific identifier Y"

tags:
  - kubernetes
  - question-answer
  - hard

before_test: |
  set -e
  # Setup infrastructure...

after_test: |
  kubectl delete namespace app-NNN --ignore-not-found

For the complete field reference and all available options, consult references/test-case-format.md.
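
For concreteness, a minimal sketch of what a test like 212_large_configmap_needle might use (the rotation-token wording, tag choice, and 7k3m9x value are illustrative assumptions, not taken from the real test):

user_prompt: "What is the rotation token stored in the platform-config ConfigMap in namespace app-212?"

expected_output:
  - "Must report the exact token 7k3m9x"

tags:
  - kubernetes
  - question-answer

before_test: |
  set -e
  kubectl create namespace app-212 --dry-run=client -o yaml | kubectl apply -f -
  kubectl apply -f manifest.yaml -n app-212

after_test: |
  kubectl delete namespace app-212 --ignore-not-found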

Step 4: Write toolsets.yaml (if needed)

When the test requires specific toolsets (Prometheus, Grafana, Elasticsearch, etc.):

toolsets:
  kubernetes/core:
    enabled: true
  prometheus/metrics:
    enabled: true
    config:
      prometheus_url: http://localhost:10033

When a toolsets.yaml exists, only explicitly enabled toolsets are available to the LLM. All others are disabled.
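
The localhost URL above implies the service has been forwarded to the test runner. A typical sketch (the service name and target port are assumptions):

# Make the in-cluster Prometheus reachable at localhost:10033
kubectl port-forward -n app-212 svc/prometheus 10033:9090 &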

Step 5: Write Setup Scripts

The before_test script runs from the test's directory via /bin/bash. Key rules:

  • Always start with set -e to fail on any error
  • Use kubectl create namespace app-NNN --dry-run=client -o yaml | kubectl apply -f - for idempotent namespace creation
  • Use exit 1 when verification fails to fail the test early
  • Clean up temp files at the end of before_test

Verification focus: verify the needle, not the haystack. The only verification that matters is that Holmes can discover the answer. Run the same kind of query Holmes would run and check that the expected value (the "smoking gun") is present. Do NOT exhaustively verify every piece of infrastructure — if the needle is queryable, the environment is working. Keep setup scripts short and readable.

# GOOD - verify the needle is discoverable (one targeted check)
kubectl get configmap platform-config -n app-212 \
  -o jsonpath='{.data.platform-config\.yaml}' | grep -q '7k3m9x'

# BAD - verifying everything (pod health, service endpoints, API responses, readiness...)
# This bloats the script without adding value

For retry loop patterns and other infrastructure details, consult references/infrastructure-patterns.md.
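
Putting these rules together, a minimal before_test sketch with a retry loop (resource names and the 7k3m9x needle are illustrative):

set -e

# Idempotent namespace creation
kubectl create namespace app-212 --dry-run=client -o yaml | kubectl apply -f -
kubectl apply -f manifest.yaml -n app-212

# Retry until the needle is queryable; fail early if it never appears
found=""
for i in $(seq 1 30); do
  if kubectl get configmap platform-config -n app-212 \
      -o jsonpath='{.data.platform-config\.yaml}' | grep -q '7k3m9x'; then
    found=1
    break
  fi
  sleep 2
done
[ -n "$found" ] || { echo "needle never became queryable" >&2; exit 1; }

# Clean up temp files created during setup (path is illustrative)
rm -f /tmp/app-212-*.tmp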

Step 6: Design Anti-Hallucination Measures

Every eval must be designed so the LLM cannot pass by guessing. This is the most critical aspect of eval design.

Key principles:

  • Embed unique random identifiers that cannot be guessed (e.g., 7k3m9x)
  • Test for specific values discoverable only by querying
  • Use neutral resource names that don't hint at the problem
  • Write prompts that test discovery ability, not domain knowledge

For detailed anti-hallucination patterns and examples, consult references/anti-hallucination.md.
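
In practice the identifier is generated once when authoring the test (e.g., with openssl rand -hex 3) and then hard-coded in both the fixture and expected_output. A sketch reusing the document's 7k3m9x example (the ConfigMap and key names are assumptions):

# Neutral resource name, unguessable value; only a real query reveals it
kubectl create configmap platform-config -n app-212 \
  --from-literal=rotation-token='7k3m9x'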

Mandatory Testing Workflow

Run evals before submitting whenever possible. Follow this sequence:

Phase 1: Verify Collection

poetry run pytest -k "test_name" --collect-only -q --no-cov

Confirm the test appears in the output. If not, check for tag or YAML errors.

Phase 2: Run Setup Only

poetry run pytest -k "test_name" --only-setup --no-cov

Verify setup completes without errors. Check that infrastructure is ready.

Phase 3: Run Full Test

poetry run pytest -k "test_name" --no-cov --skip-setup

Use --skip-setup to reuse the infrastructure from Phase 2. Verify the test passes.

Phase 4: Verify Cleanup

kubectl get namespace app-NNN

Should return NotFound after the test completes (unless --skip-cleanup was used).

Key Rules

  1. Namespace isolation: Every test uses app-<testid> namespace
  2. Unique resource names: Never reuse pod/service names across tests
  3. No :latest tags: Always use specific container image versions
  4. Secrets for scripts: Use Kubernetes Secrets for scripts, not ConfigMaps or inline (see the sketch after this list)
  5. No hints in names: Avoid broken-pod, crashloop-app — use neutral names
  6. Sign commits: Always use git commit -s for DCO compliance
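
For rule 4, a sketch of shipping a generation script as a Secret instead of a ConfigMap (the Secret and file names are illustrative):

kubectl create secret generic generator-script -n app-212 \
  --from-file=generate_logs.py
# The pod in manifest.yaml then mounts this Secret as a volume and runs
# the script from the mount path (mount details are an assumption)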

Quick Reference: Common Patterns

Pattern              Example
Simple K8s test      Deploy pod, ask about status
Log analysis         Generate logs via script, ask Holmes to analyze
Metrics query        Deploy Prometheus + exporters, query metrics
Large data needle    Create large ConfigMap/resource, find specific value
Cloud service        Test against Elasticsearch/external API via env vars

Additional Resources

Reference Files

For detailed documentation, consult:

  • references/test-case-format.md — Complete test_case.yaml field reference with all options
  • references/anti-hallucination.md — Anti-cheat testing patterns and prompt design
  • references/infrastructure-patterns.md — Setup scripts, retry loops, port forwards, shared infra
  • references/running-evals.md — CLI flags, environment variables, model comparison, debugging
