# evaluate-presets
Use when testing Ralph's hat collection presets, validating preset configurations, or auditing the preset library for bugs and UX issues.
## Install

```sh
mkdir -p .claude/skills/evaluate-presets && \
  curl -L -o skill.zip "https://mcp.directory/api/skills/download/4662" && \
  unzip -o skill.zip -d .claude/skills/evaluate-presets && \
  rm skill.zip
```

Installs to `.claude/skills/evaluate-presets`.
## Overview

Systematically test all hat collection presets using shell scripts. Direct CLI invocation, with no meta-orchestration complexity.
## When to Use
- Testing preset configurations after changes
- Auditing the preset library for quality
- Validating new presets work correctly
- After modifying hat routing logic
## Quick Start

Evaluate a single preset:

```sh
./tools/evaluate-preset.sh tdd-red-green claude
```

Evaluate all presets:

```sh
./tools/evaluate-all-presets.sh claude
```
Arguments:

- First arg: preset name (without the `.yml` extension)
- Second arg: backend (`claude` or `kiro`; defaults to `claude`)
## Bash Tool Configuration

IMPORTANT: When invoking these scripts via the Bash tool, use these settings:

- Single preset evaluation: use `timeout: 600000` (10 minutes max) and `run_in_background: true`
- All presets evaluation: use `timeout: 600000` (10 minutes max) and `run_in_background: true`
Since preset evaluations can run for hours (especially the full suite), always run in background mode and use the TaskOutput tool to check progress periodically.
Example invocation pattern:

```
Bash tool with:
  command: "./tools/evaluate-preset.sh tdd-red-green claude"
  timeout: 600000
  run_in_background: true
```

After launching, use TaskOutput with `block: false` to check status without waiting for completion.
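The launch-in-background-then-poll pattern can be sketched in plain shell. Here the `sleep`/`echo` subshell is a stand-in for the long-running evaluation, and `eval.log` is a hypothetical log path:

```sh
# Stand-in for a long-running evaluation; in practice this would be
# ./tools/evaluate-preset.sh <preset> claude
(sleep 1; echo "LOOP_COMPLETE") > eval.log 2>&1 &
pid=$!

# Poll instead of blocking, mirroring TaskOutput with block: false
while kill -0 "$pid" 2>/dev/null; do
  sleep 1
done

grep -q "LOOP_COMPLETE" eval.log && echo "evaluation finished"
```

The point is that the launching process never blocks on the evaluation itself; it only checks liveness and reads the log.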
## What the Scripts Do

### evaluate-preset.sh

- Loads the test task from `tools/preset-test-tasks.yml` (if `yq` is available)
- Creates a merged config with evaluation settings
- Runs Ralph with `--record-session` for metrics capture
- Captures output logs, exit codes, and timing
- Extracts metrics: iterations, hats activated, events published
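A minimal sketch of that flow, with a stand-in `echo` where the real Ralph invocation would go (`demo-preset` and the single extracted metric are illustrative; the real script merges configs and records the full session):

```sh
# Minimal sketch of the evaluation flow (not the real script)
preset="demo-preset"
ts=$(date +%Y%m%d-%H%M%S)
dir=".eval/logs/$preset/$ts"
mkdir -p "$dir"

# Stand-in for: ralph --record-session ... (the real run captures stdout/stderr)
echo "ITERATION 1" > "$dir/output.log"

# Extract a simple metric from the log
iters=$(grep -c "ITERATION" "$dir/output.log")
printf '{"iterations": %s}\n' "$iters" > "$dir/metrics.json"

# Refresh the "latest" symlink so tooling can find the newest run
ln -sfn "$ts" ".eval/logs/$preset/latest"
cat ".eval/logs/$preset/latest/metrics.json"
```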
Output structure:

```
.eval/
├── logs/<preset>/<timestamp>/
│   ├── output.log          # Full stdout/stderr
│   ├── session.jsonl       # Recorded session
│   ├── metrics.json        # Extracted metrics
│   ├── environment.json    # Runtime environment
│   └── merged-config.yml   # Config used
└── logs/<preset>/latest -> <timestamp>
```
### evaluate-all-presets.sh

Runs all 12 presets sequentially and generates a summary:

```
.eval/results/<suite-id>/
├── SUMMARY.md       # Markdown report
├── <preset>.json    # Per-preset metrics
└── latest -> <suite-id>
```
## Presets Under Evaluation

| Preset | Test Task |
|---|---|
| tdd-red-green | Add `is_palindrome()` function |
| adversarial-review | Review user input handler for security |
| socratic-learning | Understand HatRegistry |
| spec-driven | Specify and implement `StringUtils::truncate()` |
| mob-programming | Implement a Stack data structure |
| scientific-method | Debug failing mock test assertion |
| code-archaeology | Understand history of config.rs |
| performance-optimization | Profile hat matching |
| api-design | Design a Cache trait |
| documentation-first | Document RateLimiter |
| incident-response | Respond to "tests failing in CI" |
| migration-safety | Plan v1 to v2 config migration |
## Interpreting Results

Exit codes from `evaluate-preset.sh`:

- `0` — Success (`LOOP_COMPLETE` reached)
- `124` — Timeout (preset hung or took too long)
- Other — Failure (check `output.log`)
Metrics in `metrics.json`:

- `iterations` — How many event loop cycles ran
- `hats_activated` — Which hats were triggered
- `events_published` — Total events emitted
- `completed` — Whether the completion promise was reached
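A quick pass/fail gate over those fields can be sketched with plain `grep` (the sample `metrics.json` here is synthetic; field names follow the list above):

```sh
# Synthetic metrics file for illustration
cat > metrics.json <<'EOF'
{"iterations": 4, "hats_activated": ["red-team", "blue-team"], "events_published": 4, "completed": true}
EOF

# Gate on the completion promise
if grep -q '"completed": true' metrics.json; then
  echo "PASS"
else
  echo "FAIL: completion promise not reached"
fi
```

A fuller check would also compare `iterations` to `events_published`, since a completed run can still hide routing problems.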
## Hat Routing Performance

Critical: Validate that hats get fresh context per Tenet #1 ("Fresh Context Is Reliability").

### What Good Looks Like

Each hat should execute in its own iteration:

```
Iter 1: Ralph → publishes starting event → STOPS
Iter 2: Hat A → does work → publishes next event → STOPS
Iter 3: Hat B → does work → publishes next event → STOPS
Iter 4: Hat C → does work → LOOP_COMPLETE
```

### Red Flags (Same-Iteration Hat Switching)

BAD: Multiple hat personas in one iteration:

```
Iter 2: Ralph does Blue Team + Red Team + Fixer work
        ^^^ All in one bloated context!
```
### How to Check

1. Count iterations vs. events in `session.jsonl`:

```sh
# Count iterations
grep -c "_meta.loop_start\|ITERATION" .eval/logs/<preset>/latest/output.log

# Count events published
grep -c "bus.publish" .eval/logs/<preset>/latest/session.jsonl
```

Expected: iterations ≈ events published (one event per iteration). Bad sign: 2-3 iterations but 5+ events (all work in a single iteration).
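The two counts can be compared mechanically. A sketch with synthetic logs (the 2x threshold is an illustrative cutoff, not something the scripts define; in practice point the paths at `.eval/logs/<preset>/latest/`):

```sh
# Synthetic logs for illustration
printf 'ITERATION 1\nITERATION 2\nITERATION 3\n' > output.log
printf '{"ev":"a"}\n{"ev":"b"}\n{"ev":"c"}\n' > session.jsonl

iters=$(grep -c "ITERATION" output.log)
events=$(grep -c "" session.jsonl)   # one published event per JSONL line (assumed)

if [ "$events" -gt $((iters * 2)) ]; then
  echo "RED FLAG: $events events across $iters iterations"
else
  echo "OK: $iters iterations, $events events"
fi
```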
2. Check for same-iteration hat switching in `output.log`:

```sh
grep -E "ITERATION|Now I need to perform|Let me put on|I'll switch to" \
  .eval/logs/<preset>/latest/output.log
```
Red flag: Hat-switching phrases WITHOUT an ITERATION separator between them.
3. Check event timestamps in `session.jsonl`:

```sh
jq -r '.ts' .eval/logs/<preset>/latest/session.jsonl
```
Red flag: Multiple events with identical timestamps (published in same iteration).
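Duplicate timestamps can be surfaced directly. A jq-free sketch using `sed` and `uniq -d`, on a synthetic session (the `"ts"` field name is assumed from the jq command above):

```sh
# Synthetic session for illustration
printf '{"ts":"10:00:01"}\n{"ts":"10:00:02"}\n{"ts":"10:00:02"}\n' > session.jsonl

# Print only timestamps that occur more than once
sed -n 's/.*"ts":"\([^"]*\)".*/\1/p' session.jsonl | sort | uniq -d
```

Any output at all from this pipeline is a red flag worth investigating.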
### Routing Performance Triage
| Pattern | Diagnosis | Action |
|---|---|---|
| iterations ≈ events | ✅ Good | Hat routing working |
| iterations << events | ⚠️ Same-iteration switching | Check prompt has STOP instruction |
| iterations >> events | ⚠️ Recovery loops | Agent not publishing required events |
| 0 events | ❌ Broken | Events not being read from JSONL |
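The triage table can be encoded as a small shell function for scripted runs. The 2x thresholds below are illustrative stand-ins for "≈", "<<", and ">>"; tune them to the preset suite:

```sh
# classify <iterations> <events> -> one-line diagnosis
classify() {
  iters=$1; events=$2
  if [ "$events" -eq 0 ]; then
    echo "BROKEN: events not being read"
  elif [ "$events" -gt $((iters * 2)) ]; then
    echo "same-iteration switching"
  elif [ "$iters" -gt $((events * 2)) ]; then
    echo "recovery loops"
  else
    echo "good"
  fi
}

classify 4 4   # iterations ≈ events
classify 2 7   # iterations << events
```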
### Root Cause Checklist

If hat routing is broken:

- [ ] Check the workflow prompt in `hatless_ralph.rs`:
  - Does it say "CRITICAL: STOP after publishing"?
  - Is the DELEGATE section clear about yielding control?
- [ ] Check hat instructions propagation:
  - Does `HatInfo` include an `instructions` field?
  - Are instructions rendered in the `## HATS` section?
- [ ] Check events context:
  - Is `build_prompt(context)` using the context parameter?
  - Does the prompt include a `## PENDING EVENTS` section?
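The prompt-side checks can be automated with a marker scan. The snippet below uses a hypothetical inline prompt for illustration; in practice, point it at the workflow prompt in `hatless_ralph.rs` (the path depends on the repo layout):

```sh
# Hypothetical prompt snippet for illustration
cat > prompt_snippet.txt <<'EOF'
CRITICAL: STOP after publishing.
## HATS
## PENDING EVENTS
EOF

for marker in "CRITICAL: STOP" "## HATS" "## PENDING EVENTS"; do
  if grep -qF "$marker" prompt_snippet.txt; then
    echo "present: $marker"
  else
    echo "MISSING: $marker"
  fi
done
```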
## Autonomous Fix Workflow

After evaluation, delegate fixes to subagents:

### Step 1: Triage Results

Read `.eval/results/latest/SUMMARY.md` and identify:

- `❌ FAIL` → Create code tasks for fixes
- `⏱️ TIMEOUT` → Investigate infinite loops
- `⚠️ PARTIAL` → Check for edge cases
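Failing rows can be pulled out of the summary mechanically. The sample `SUMMARY.md` below is synthetic and its row format is an assumption; the real report lives at `.eval/results/latest/SUMMARY.md`:

```sh
# Synthetic summary for illustration
cat > SUMMARY.md <<'EOF'
| tdd-red-green | PASS |
| api-design | FAIL |
| incident-response | TIMEOUT |
EOF

# Rows needing follow-up
grep -E "FAIL|TIMEOUT|PARTIAL" SUMMARY.md
```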
### Step 2: Dispatch Task Creation

For each issue, spawn a Task agent:

```
Use /code-task-generator to create a task for fixing: [issue from evaluation]
Output to: tasks/preset-fixes/
```
### Step 3: Dispatch Implementation

For each created task:

```
Use /code-assist to implement: tasks/preset-fixes/[task-file].code-task.md
Mode: auto
```
### Step 4: Re-evaluate

```sh
./tools/evaluate-preset.sh <fixed-preset> claude
```
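When several presets were fixed, a loop avoids re-running the whole suite. The preset names here are examples, and the real invocation is left commented out so the sketch stays runnable anywhere:

```sh
# Re-run only the presets that needed fixes (names are examples)
for preset in api-design incident-response; do
  echo "re-evaluating: $preset"
  # ./tools/evaluate-preset.sh "$preset" claude
done
```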
## Prerequisites

- yq (optional): for loading test tasks from YAML. Install with `brew install yq`.
- Cargo: must be able to build Ralph.
## Related Files

- `tools/evaluate-preset.sh` — Single preset evaluation
- `tools/evaluate-all-presets.sh` — Full suite evaluation
- `tools/preset-test-tasks.yml` — Test task definitions
- `tools/preset-evaluation-findings.md` — Manual findings doc
- `presets/` — The preset collection being evaluated