advanced-evaluation
This skill should be used when the user asks to "implement LLM-as-judge", "compare model outputs", "create evaluation rubrics", "mitigate evaluation bias", or mentions direct scoring, pairwise comparison, position bias, evaluation pipelines, or automated quality assessment.
Install
mkdir -p .claude/skills/advanced-evaluation && curl -L -o skill.zip "https://mcp.directory/api/skills/download/1538" && unzip -o skill.zip -d .claude/skills/advanced-evaluation && rm skill.zipInstalls to .claude/skills/advanced-evaluation
About this skill
Advanced Evaluation
This skill covers production-grade techniques for evaluating LLM outputs using LLMs as judges. It synthesizes research from academic papers, industry practices, and practical implementation experience into actionable patterns for building reliable evaluation systems.
Key insight: LLM-as-a-Judge is not a single technique but a family of approaches, each suited to different evaluation contexts. Choosing the right approach and mitigating known biases is the core competency this skill develops.
When to Activate
Activate this skill when:
- Building automated evaluation pipelines for LLM outputs
- Comparing multiple model responses to select the best one
- Establishing consistent quality standards across evaluation teams
- Debugging evaluation systems that show inconsistent results
- Designing A/B tests for prompt or model changes
- Creating rubrics for human or automated evaluation
- Analyzing correlation between automated and human judgments
Core Concepts
The Evaluation Taxonomy
Evaluation approaches fall into two primary categories with distinct reliability profiles:
Direct Scoring: A single LLM rates one response on a defined scale.
- Best for: Objective criteria (factual accuracy, instruction following, toxicity)
- Reliability: Moderate to high for well-defined criteria
- Failure mode: Score calibration drift, inconsistent scale interpretation
Pairwise Comparison: An LLM compares two responses and selects the better one.
- Best for: Subjective preferences (tone, style, persuasiveness)
- Reliability: Higher than direct scoring for preferences
- Failure mode: Position bias, length bias
Research from the MT-Bench paper (Zheng et al., 2023) establishes that pairwise comparison achieves higher agreement with human judges than direct scoring for preference-based evaluation, while direct scoring remains appropriate for objective criteria with clear ground truth.
The Bias Landscape
LLM judges exhibit systematic biases that must be actively mitigated:
Position Bias: First-position responses receive preferential treatment in pairwise comparison. Mitigation: Evaluate twice with swapped positions, use majority vote or consistency check.
Length Bias: Longer responses are rated higher regardless of quality. Mitigation: Explicit prompting to ignore length, length-normalized scoring.
Self-Enhancement Bias: Models rate their own outputs higher. Mitigation: Use different models for generation and evaluation, or acknowledge limitation.
Verbosity Bias: Detailed explanations receive higher scores even when unnecessary. Mitigation: Criteria-specific rubrics that penalize irrelevant detail.
Authority Bias: Confident, authoritative tone rated higher regardless of accuracy. Mitigation: Require evidence citation, fact-checking layer.
Metric Selection Framework
Choose metrics based on the evaluation task structure:
| Task Type | Primary Metrics | Secondary Metrics |
|---|---|---|
| Binary classification (pass/fail) | Recall, Precision, F1 | Cohen's κ |
| Ordinal scale (1-5 rating) | Spearman's ρ, Kendall's τ | Cohen's κ (weighted) |
| Pairwise preference | Agreement rate, Position consistency | Confidence calibration |
| Multi-label | Macro-F1, Micro-F1 | Per-label precision/recall |
The critical insight: High absolute agreement matters less than systematic disagreement patterns. A judge that consistently disagrees with humans on specific criteria is more problematic than one with random noise.
Evaluation Approaches
Direct Scoring Implementation
Direct scoring requires three components: clear criteria, a calibrated scale, and structured output format.
Criteria Definition Pattern:
Criterion: [Name]
Description: [What this criterion measures]
Weight: [Relative importance, 0-1]
Scale Calibration:
- 1-3 scales: Binary with neutral option, lowest cognitive load
- 1-5 scales: Standard Likert, good balance of granularity and reliability
- 1-10 scales: High granularity but harder to calibrate, use only with detailed rubrics
Prompt Structure for Direct Scoring:
You are an expert evaluator assessing response quality.
## Task
Evaluate the following response against each criterion.
## Original Prompt
{prompt}
## Response to Evaluate
{response}
## Criteria
{for each criterion: name, description, weight}
## Instructions
For each criterion:
1. Find specific evidence in the response
2. Score according to the rubric (1-{max} scale)
3. Justify your score with evidence
4. Suggest one specific improvement
## Output Format
Respond with structured JSON containing scores, justifications, and summary.
Chain-of-Thought Requirement: All scoring prompts must require justification before the score. Research shows this improves reliability by 15-25% compared to score-first approaches.
Pairwise Comparison Implementation
Pairwise comparison is inherently more reliable for preference-based evaluation but requires bias mitigation.
Position Bias Mitigation Protocol:
- First pass: Response A in first position, Response B in second
- Second pass: Response B in first position, Response A in second
- Consistency check: If passes disagree, return TIE with reduced confidence
- Final verdict: Consistent winner with averaged confidence
Prompt Structure for Pairwise Comparison:
You are an expert evaluator comparing two AI responses.
## Critical Instructions
- Do NOT prefer responses because they are longer
- Do NOT prefer responses based on position (first vs second)
- Focus ONLY on quality according to the specified criteria
- Ties are acceptable when responses are genuinely equivalent
## Original Prompt
{prompt}
## Response A
{response_a}
## Response B
{response_b}
## Comparison Criteria
{criteria list}
## Instructions
1. Analyze each response independently first
2. Compare them on each criterion
3. Determine overall winner with confidence level
## Output Format
JSON with per-criterion comparison, overall winner, confidence (0-1), and reasoning.
Confidence Calibration: Confidence scores should reflect position consistency:
- Both passes agree: confidence = average of individual confidences
- Passes disagree: confidence = 0.5, verdict = TIE
Rubric Generation
Well-defined rubrics reduce evaluation variance by 40-60% compared to open-ended scoring.
Rubric Components:
- Level descriptions: Clear boundaries for each score level
- Characteristics: Observable features that define each level
- Examples: Representative text for each level (optional but valuable)
- Edge cases: Guidance for ambiguous situations
- Scoring guidelines: General principles for consistent application
Strictness Calibration:
- Lenient: Lower bar for passing scores, appropriate for encouraging iteration
- Balanced: Fair, typical expectations for production use
- Strict: High standards, appropriate for safety-critical or high-stakes evaluation
Domain Adaptation: Rubrics should use domain-specific terminology. A "code readability" rubric mentions variables, functions, and comments. A "medical accuracy" rubric references clinical terminology and evidence standards.
Practical Guidance
Evaluation Pipeline Design
Production evaluation systems require multiple layers:
┌─────────────────────────────────────────────────┐
│ Evaluation Pipeline │
├─────────────────────────────────────────────────┤
│ │
│ Input: Response + Prompt + Context │
│ │ │
│ ▼ │
│ ┌─────────────────────┐ │
│ │ Criteria Loader │ ◄── Rubrics, weights │
│ └──────────┬──────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────┐ │
│ │ Primary Scorer │ ◄── Direct or Pairwise │
│ └──────────┬──────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────┐ │
│ │ Bias Mitigation │ ◄── Position swap, etc. │
│ └──────────┬──────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────┐ │
│ │ Confidence Scoring │ ◄── Calibration │
│ └──────────┬──────────┘ │
│ │ │
│ ▼ │
│ Output: Scores + Justifications + Confidence │
│ │
└─────────────────────────────────────────────────┘
Common Anti-Patterns
Anti-pattern: Scoring without justification
- Problem: Scores lack grounding, difficult to debug or improve
- Solution: Always require evidence-based justification before score
Anti-pattern: Single-pass pairwise comparison
- Problem: Position bias corrupts results
- Solution: Always swap positions and check consistency
Anti-pattern: Overloaded criteria
- Problem: Criteria measuring multiple things are unreliable
- Solution: One criterion = one measurable aspect
Anti-pattern: Missing edge case guidance
- Problem: Evaluators handle ambiguous cases inconsistently
- Solution: Include edge cases in rubrics with explicit guidance
Anti-pattern: Ignoring confidence calibration
- Problem: High-confidence wrong judgments are worse than low-confidence
- Solution: Calibrate confidence to position consistency and evidence strength
Decision Framework: Direct vs. Pairwise
Use this decision tree:
Is there an objective ground truth?
├── Yes → Dir
---
*Content truncated.*
More by muratcankoylan
View all skills by muratcankoylan →You might also like
flutter-development
aj-geddes
Build beautiful cross-platform mobile apps with Flutter and Dart. Covers widgets, state management with Provider/BLoC, navigation, API integration, and material design.
ui-ux-pro-max
nextlevelbuilder
"UI/UX design intelligence. 50 styles, 21 palettes, 50 font pairings, 20 charts, 8 stacks (React, Next.js, Vue, Svelte, SwiftUI, React Native, Flutter, Tailwind). Actions: plan, build, create, design, implement, review, fix, improve, optimize, enhance, refactor, check UI/UX code. Projects: website, landing page, dashboard, admin panel, e-commerce, SaaS, portfolio, blog, mobile app, .html, .tsx, .vue, .svelte. Elements: button, modal, navbar, sidebar, card, table, form, chart. Styles: glassmorphism, claymorphism, minimalism, brutalism, neumorphism, bento grid, dark mode, responsive, skeuomorphism, flat design. Topics: color palette, accessibility, animation, layout, typography, font pairing, spacing, hover, shadow, gradient."
drawio-diagrams-enhanced
jgtolentino
Create professional draw.io (diagrams.net) diagrams in XML format (.drawio files) with integrated PMP/PMBOK methodologies, extensive visual asset libraries, and industry-standard professional templates. Use this skill when users ask to create flowcharts, swimlane diagrams, cross-functional flowcharts, org charts, network diagrams, UML diagrams, BPMN, project management diagrams (WBS, Gantt, PERT, RACI), risk matrices, stakeholder maps, or any other visual diagram in draw.io format. This skill includes access to custom shape libraries for icons, clipart, and professional symbols.
godot
bfollington
This skill should be used when working on Godot Engine projects. It provides specialized knowledge of Godot's file formats (.gd, .tscn, .tres), architecture patterns (component-based, signal-driven, resource-based), common pitfalls, validation tools, code templates, and CLI workflows. The `godot` command is available for running the game, validating scripts, importing resources, and exporting builds. Use this skill for tasks involving Godot game development, debugging scene/resource files, implementing game systems, or creating new Godot components.
nano-banana-pro
garg-aayush
Generate and edit images using Google's Nano Banana Pro (Gemini 3 Pro Image) API. Use when the user asks to generate, create, edit, modify, change, alter, or update images. Also use when user references an existing image file and asks to modify it in any way (e.g., "modify this image", "change the background", "replace X with Y"). Supports both text-to-image generation and image-to-image editing with configurable resolution (1K default, 2K, or 4K for high resolution). DO NOT read the image file first - use this skill directly with the --input-image parameter.
pdf-to-markdown
aliceisjustplaying
Convert entire PDF documents to clean, structured Markdown for full context loading. Use this skill when the user wants to extract ALL text from a PDF into context (not grep/search), when discussing or analyzing PDF content in full, when the user mentions "load the whole PDF", "bring the PDF into context", "read the entire PDF", or when partial extraction/grepping would miss important context. This is the preferred method for PDF text extraction over page-by-page or grep approaches.
Related MCP Servers
Browse all serversUnlock seamless Figma to code: streamline Figma to HTML with Framelink MCP Server for fast, accurate design-to-code work
Official Laravel-focused MCP server for augmenting AI-powered local development. Provides deep context about your Larave
Safely connect cloud Grafana to AI agents with MCP: query, inspect, and manage Grafana resources using simple, focused o
Empower your workflows with Perplexity Ask MCP Server—seamless integration of AI research tools for real-time, accurate
Boost your productivity by managing Azure DevOps projects, pipelines, and repos in VS Code. Streamline dev workflows wit
Boost AI coding agents with Ref Tools—efficient documentation access for faster, smarter code generation than GitHub Cop
Stay ahead of the MCP ecosystem
Get weekly updates on new skills and servers.