# evaluate-presets
Use when testing Ralph's hat collection presets, validating preset configurations, or auditing the preset library for bugs and UX issues.
## Install

```sh
mkdir -p .claude/skills/evaluate-presets && \
  curl -L -o skill.zip "https://mcp.directory/api/skills/download/4662" && \
  unzip -o skill.zip -d .claude/skills/evaluate-presets && \
  rm skill.zip
```

Installs to `.claude/skills/evaluate-presets`.
## Overview

Systematically test all hat collection presets using shell scripts. Direct CLI invocation, with no meta-orchestration complexity.
## When to Use
- Testing preset configurations after changes
- Auditing the preset library for quality
- Validating new presets work correctly
- After modifying hat routing logic
## Quick Start

Evaluate a single preset:

```sh
./tools/evaluate-preset.sh tdd-red-green claude
```

Evaluate all presets:

```sh
./tools/evaluate-all-presets.sh claude
```
Arguments:

- First arg: preset name (without the `.yml` extension)
- Second arg: backend (`claude` or `kiro`; defaults to `claude`)
## Bash Tool Configuration

IMPORTANT: When invoking these scripts via the Bash tool, use these settings:

- Single preset evaluation: use `timeout: 600000` (10 minutes max) and `run_in_background: true`
- All presets evaluation: use `timeout: 600000` (10 minutes max) and `run_in_background: true`
Since preset evaluations can run for hours (especially the full suite), always run in background mode and use the TaskOutput tool to check progress periodically.
Example invocation pattern:

```
Bash tool with:
  command: "./tools/evaluate-preset.sh tdd-red-green claude"
  timeout: 600000
  run_in_background: true
```

After launching, use TaskOutput with `block: false` to check status without waiting for completion.
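The launch-in-background-then-poll pattern can be sketched in plain shell. Here the `sleep`/`echo` subshell is a stand-in for the long-running evaluation, and `eval.log` is a hypothetical log path:

```sh
# Stand-in for a long-running evaluation; in practice this would be
# ./tools/evaluate-preset.sh <preset> claude
(sleep 1; echo "LOOP_COMPLETE") > eval.log 2>&1 &
pid=$!

# Poll instead of blocking, mirroring TaskOutput with block: false
while kill -0 "$pid" 2>/dev/null; do
  sleep 1
done

grep -q "LOOP_COMPLETE" eval.log && echo "evaluation finished"
```

The point is that the launching process never blocks on the evaluation itself; it only checks liveness and reads the log.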
## What the Scripts Do

### evaluate-preset.sh

- Loads the test task from `tools/preset-test-tasks.yml` (if `yq` is available)
- Creates a merged config with evaluation settings
- Runs Ralph with `--record-session` for metrics capture
- Captures output logs, exit codes, and timing
- Extracts metrics: iterations, hats activated, events published
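A minimal sketch of that flow, with a stand-in `echo` where the real Ralph invocation would go (`demo-preset` and the single extracted metric are illustrative; the real script merges configs and records the full session):

```sh
# Minimal sketch of the evaluation flow (not the real script)
preset="demo-preset"
ts=$(date +%Y%m%d-%H%M%S)
dir=".eval/logs/$preset/$ts"
mkdir -p "$dir"

# Stand-in for: ralph --record-session ... (the real run captures stdout/stderr)
echo "ITERATION 1" > "$dir/output.log"

# Extract a simple metric from the log
iters=$(grep -c "ITERATION" "$dir/output.log")
printf '{"iterations": %s}\n' "$iters" > "$dir/metrics.json"

# Refresh the "latest" symlink so tooling can find the newest run
ln -sfn "$ts" ".eval/logs/$preset/latest"
cat ".eval/logs/$preset/latest/metrics.json"
```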
Output structure:

```
.eval/
├── logs/<preset>/<timestamp>/
│   ├── output.log          # Full stdout/stderr
│   ├── session.jsonl       # Recorded session
│   ├── metrics.json        # Extracted metrics
│   ├── environment.json    # Runtime environment
│   └── merged-config.yml   # Config used
└── logs/<preset>/latest -> <timestamp>
```
### evaluate-all-presets.sh

Runs all 12 presets sequentially and generates a summary:

```
.eval/results/<suite-id>/
├── SUMMARY.md       # Markdown report
├── <preset>.json    # Per-preset metrics
└── latest -> <suite-id>
```
## Presets Under Evaluation

| Preset | Test Task |
|---|---|
| tdd-red-green | Add `is_palindrome()` function |
| adversarial-review | Review user input handler for security |
| socratic-learning | Understand HatRegistry |
| spec-driven | Specify and implement `StringUtils::truncate()` |
| mob-programming | Implement a Stack data structure |
| scientific-method | Debug failing mock test assertion |
| code-archaeology | Understand history of config.rs |
| performance-optimization | Profile hat matching |
| api-design | Design a Cache trait |
| documentation-first | Document RateLimiter |
| incident-response | Respond to "tests failing in CI" |
| migration-safety | Plan v1 to v2 config migration |
## Interpreting Results

Exit codes from `evaluate-preset.sh`:

- `0` — Success (`LOOP_COMPLETE` reached)
- `124` — Timeout (preset hung or took too long)
- Other — Failure (check `output.log`)
Metrics in `metrics.json`:

- `iterations` — How many event loop cycles ran
- `hats_activated` — Which hats were triggered
- `events_published` — Total events emitted
- `completed` — Whether the completion promise was reached
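A quick pass/fail gate over those fields can be sketched with plain `grep` (the sample `metrics.json` here is synthetic; field names follow the list above):

```sh
# Synthetic metrics file for illustration
cat > metrics.json <<'EOF'
{"iterations": 4, "hats_activated": ["red-team", "blue-team"], "events_published": 4, "completed": true}
EOF

# Gate on the completion promise
if grep -q '"completed": true' metrics.json; then
  echo "PASS"
else
  echo "FAIL: completion promise not reached"
fi
```

A fuller check would also compare `iterations` to `events_published`, since a completed run can still hide routing problems.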
## Hat Routing Performance

Critical: Validate that hats get fresh context per Tenet #1 ("Fresh Context Is Reliability").

### What Good Looks Like

Each hat should execute in its own iteration:

```
Iter 1: Ralph → publishes starting event → STOPS
Iter 2: Hat A → does work → publishes next event → STOPS
Iter 3: Hat B → does work → publishes next event → STOPS
Iter 4: Hat C → does work → LOOP_COMPLETE
```

### Red Flags (Same-Iteration Hat Switching)

BAD: Multiple hat personas in one iteration:

```
Iter 2: Ralph does Blue Team + Red Team + Fixer work
        ^^^ All in one bloated context!
```
### How to Check

1. Count iterations vs. events in `session.jsonl`:

```sh
# Count iterations
grep -c "_meta.loop_start\|ITERATION" .eval/logs/<preset>/latest/output.log

# Count events published
grep -c "bus.publish" .eval/logs/<preset>/latest/session.jsonl
```

Expected: iterations ≈ events published (one event per iteration). Bad sign: 2-3 iterations but 5+ events (all work in a single iteration).
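The two counts can be compared mechanically. A sketch with synthetic logs (the 2x threshold is an illustrative cutoff, not something the scripts define; in practice point the paths at `.eval/logs/<preset>/latest/`):

```sh
# Synthetic logs for illustration
printf 'ITERATION 1\nITERATION 2\nITERATION 3\n' > output.log
printf '{"ev":"a"}\n{"ev":"b"}\n{"ev":"c"}\n' > session.jsonl

iters=$(grep -c "ITERATION" output.log)
events=$(grep -c "" session.jsonl)   # one published event per JSONL line (assumed)

if [ "$events" -gt $((iters * 2)) ]; then
  echo "RED FLAG: $events events across $iters iterations"
else
  echo "OK: $iters iterations, $events events"
fi
```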
2. Check for same-iteration hat switching in `output.log`:

```sh
grep -E "ITERATION|Now I need to perform|Let me put on|I'll switch to" \
  .eval/logs/<preset>/latest/output.log
```
Red flag: Hat-switching phrases WITHOUT an ITERATION separator between them.
3. Check event timestamps in `session.jsonl`:

```sh
jq -r '.ts' .eval/logs/<preset>/latest/session.jsonl
```
Red flag: Multiple events with identical timestamps (published in same iteration).
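Duplicate timestamps can be surfaced directly. A jq-free sketch using `sed` and `uniq -d`, on a synthetic session (the `"ts"` field name is assumed from the jq command above):

```sh
# Synthetic session for illustration
printf '{"ts":"10:00:01"}\n{"ts":"10:00:02"}\n{"ts":"10:00:02"}\n' > session.jsonl

# Print only timestamps that occur more than once
sed -n 's/.*"ts":"\([^"]*\)".*/\1/p' session.jsonl | sort | uniq -d
```

Any output at all from this pipeline is a red flag worth investigating.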
### Routing Performance Triage
| Pattern | Diagnosis | Action |
|---|---|---|
| iterations ≈ events | ✅ Good | Hat routing working |
| iterations << events | ⚠️ Same-iteration switching | Check prompt has STOP instruction |
| iterations >> events | ⚠️ Recovery loops | Agent not publishing required events |
| 0 events | ❌ Broken | Events not being read from JSONL |
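The triage table can be encoded as a small shell function for scripted runs. The 2x thresholds below are illustrative stand-ins for "≈", "<<", and ">>"; tune them to the preset suite:

```sh
# classify <iterations> <events> -> one-line diagnosis
classify() {
  iters=$1; events=$2
  if [ "$events" -eq 0 ]; then
    echo "BROKEN: events not being read"
  elif [ "$events" -gt $((iters * 2)) ]; then
    echo "same-iteration switching"
  elif [ "$iters" -gt $((events * 2)) ]; then
    echo "recovery loops"
  else
    echo "good"
  fi
}

classify 4 4   # iterations ≈ events
classify 2 7   # iterations << events
```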
### Root Cause Checklist

If hat routing is broken:

- [ ] Check the workflow prompt in `hatless_ralph.rs`:
  - Does it say "CRITICAL: STOP after publishing"?
  - Is the DELEGATE section clear about yielding control?
- [ ] Check hat instructions propagation:
  - Does `HatInfo` include an `instructions` field?
  - Are instructions rendered in the `## HATS` section?
- [ ] Check events context:
  - Is `build_prompt(context)` using the context parameter?
  - Does the prompt include a `## PENDING EVENTS` section?
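The prompt-side checks can be automated with a marker scan. The snippet below uses a hypothetical inline prompt for illustration; in practice, point it at the workflow prompt in `hatless_ralph.rs` (the path depends on the repo layout):

```sh
# Hypothetical prompt snippet for illustration
cat > prompt_snippet.txt <<'EOF'
CRITICAL: STOP after publishing.
## HATS
## PENDING EVENTS
EOF

for marker in "CRITICAL: STOP" "## HATS" "## PENDING EVENTS"; do
  if grep -qF "$marker" prompt_snippet.txt; then
    echo "present: $marker"
  else
    echo "MISSING: $marker"
  fi
done
```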
## Autonomous Fix Workflow

After evaluation, delegate fixes to subagents:

### Step 1: Triage Results

Read `.eval/results/latest/SUMMARY.md` and identify:

- `❌ FAIL` → Create code tasks for fixes
- `⏱️ TIMEOUT` → Investigate infinite loops
- `⚠️ PARTIAL` → Check for edge cases
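Failing rows can be pulled out of the summary mechanically. The sample `SUMMARY.md` below is synthetic and its row format is an assumption; the real report lives at `.eval/results/latest/SUMMARY.md`:

```sh
# Synthetic summary for illustration
cat > SUMMARY.md <<'EOF'
| tdd-red-green | PASS |
| api-design | FAIL |
| incident-response | TIMEOUT |
EOF

# Rows needing follow-up
grep -E "FAIL|TIMEOUT|PARTIAL" SUMMARY.md
```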
### Step 2: Dispatch Task Creation

For each issue, spawn a Task agent:

```
Use /code-task-generator to create a task for fixing: [issue from evaluation]
Output to: tasks/preset-fixes/
```
### Step 3: Dispatch Implementation

For each created task:

```
Use /code-assist to implement: tasks/preset-fixes/[task-file].code-task.md
Mode: auto
```
### Step 4: Re-evaluate

```sh
./tools/evaluate-preset.sh <fixed-preset> claude
```
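When several presets were fixed, a loop avoids re-running the whole suite. The preset names here are examples, and the real invocation is left commented out so the sketch stays runnable anywhere:

```sh
# Re-run only the presets that needed fixes (names are examples)
for preset in api-design incident-response; do
  echo "re-evaluating: $preset"
  # ./tools/evaluate-preset.sh "$preset" claude
done
```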
## Prerequisites

- yq (optional): for loading test tasks from YAML. Install with `brew install yq`.
- Cargo: must be able to build Ralph.
## Related Files

- `tools/evaluate-preset.sh` — Single preset evaluation
- `tools/evaluate-all-presets.sh` — Full suite evaluation
- `tools/preset-test-tasks.yml` — Test task definitions
- `tools/preset-evaluation-findings.md` — Manual findings doc
- `presets/` — The preset collection being evaluated