promptfoo-evaluation


Configures and runs LLM evaluations using the Promptfoo framework. Use when setting up prompt testing, creating evaluation configs (promptfooconfig.yaml), writing Python custom assertions, implementing llm-rubric for LLM-as-judge, or managing few-shot examples in prompts. Triggers on keywords like "promptfoo", "eval", "LLM evaluation", "prompt testing", or "model comparison".

Install

mkdir -p .claude/skills/promptfoo-evaluation && curl -L -o skill.zip "https://mcp.directory/api/skills/download/3138" && unzip -o skill.zip -d .claude/skills/promptfoo-evaluation && rm skill.zip

Installs to .claude/skills/promptfoo-evaluation

About this skill

Promptfoo Evaluation

Overview

This skill provides guidance for configuring and running LLM evaluations using Promptfoo, an open-source CLI tool for testing and comparing LLM outputs.

Quick Start

# Initialize a new evaluation project
npx promptfoo@latest init

# Run evaluation
npx promptfoo@latest eval

# View results in browser
npx promptfoo@latest view

Configuration Structure

A typical Promptfoo project structure:

project/
├── promptfooconfig.yaml    # Main configuration
├── prompts/
│   ├── system.md           # System prompt
│   └── chat.json           # Chat format prompt
├── tests/
│   └── cases.yaml          # Test cases
└── scripts/
    └── metrics.py          # Custom Python assertions

Core Configuration (promptfooconfig.yaml)

# yaml-language-server: $schema=https://promptfoo.dev/config-schema.json
description: "My LLM Evaluation"

# Prompts to test
prompts:
  - file://prompts/system.md
  - file://prompts/chat.json

# Models to compare
providers:
  - id: anthropic:messages:claude-sonnet-4-6
    label: Claude-Sonnet-4.6
  - id: openai:gpt-4.1
    label: GPT-4.1

# Test cases
tests: file://tests/cases.yaml

# Concurrency control (MUST be under commandLineOptions, NOT top-level)
commandLineOptions:
  maxConcurrency: 2

# Default assertions for all tests
defaultTest:
  assert:
    - type: python
      value: file://scripts/metrics.py:custom_assert
    - type: llm-rubric
      value: |
        Evaluate the response quality on a 0-1 scale.
      threshold: 0.7

# Output path
outputPath: results/eval-results.json

Prompt Formats

Text Prompt (system.md)

You are a helpful assistant.

Task: {{task}}
Context: {{context}}

Chat Format (chat.json)

[
  {"role": "system", "content": "{{system_prompt}}"},
  {"role": "user", "content": "{{user_input}}"}
]

Few-Shot Pattern

Embed examples directly in the prompt, or use the chat format with assistant messages:

[
  {"role": "system", "content": "{{system_prompt}}"},
  {"role": "user", "content": "Example input: {{example_input}}"},
  {"role": "assistant", "content": "{{example_output}}"},
  {"role": "user", "content": "Now process: {{actual_input}}"}
]

Test Cases (tests/cases.yaml)

- description: "Test case 1"
  vars:
    system_prompt: file://prompts/system.md
    user_input: "Hello world"
    # Load content from files
    context: file://data/context.txt
  assert:
    - type: contains
      value: "expected text"
    - type: python
      value: file://scripts/metrics.py:custom_check
      threshold: 0.8

Python Custom Assertions

Create a Python file for custom assertions (e.g., scripts/metrics.py):

def get_assert(output: str, context: dict) -> dict:
    """Default assertion function."""
    vars_dict = context.get('vars', {})

    # Access test variables
    expected = vars_dict.get('expected', '')
    passed = expected in output

    # Return a result whose score and reason reflect the outcome
    return {
        "pass": passed,
        "score": 1.0 if passed else 0.0,
        "reason": "Contains expected content" if passed else "Missing expected content",
        "named_scores": {"relevance": 1.0 if passed else 0.0}
    }

def custom_check(output: str, context: dict) -> dict:
    """Custom named assertion."""
    word_count = len(output.split())
    passed = 100 <= word_count <= 500

    return {
        "pass": passed,
        "score": min(1.0, word_count / 300),
        "reason": f"Word count: {word_count}"
    }

Key points:

  • Default function name is get_assert
  • Specify function with file://path.py:function_name
  • Return bool, float (score), or dict with pass/score/reason
  • Access variables via context['vars']
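The key points above mention bool and float return values, while both examples return a dict. A minimal sketch of the two simpler forms follows; the function names and the `keywords` test variable are hypothetical and chosen only for illustration.

```python
def is_short(output: str, context: dict) -> bool:
    """Pass/fail only: passes when the output has at most 50 words."""
    return len(output.split()) <= 50


def keyword_coverage(output: str, context: dict) -> float:
    """Score only: fraction of expected keywords found in the output."""
    # 'keywords' is a hypothetical test variable, not something Promptfoo defines
    keywords = context.get("vars", {}).get("keywords", [])
    if not keywords:
        return 1.0
    text = output.lower()
    hits = sum(1 for kw in keywords if kw.lower() in text)
    return hits / len(keywords)
```

Either function could be referenced in the same way as the dict-returning ones, for example file://scripts/metrics.py:keyword_coverage, with threshold setting the minimum score.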

LLM-as-Judge (llm-rubric)

assert:
  - type: llm-rubric
    value: |
      Evaluate the response based on:
      1. Accuracy of information
      2. Clarity of explanation
      3. Completeness

      Score 0.0-1.0 where 0.7+ is passing.
    threshold: 0.7
    provider: openai:gpt-4.1  # Optional: override grader model

When using a relay/proxy API, each llm-rubric assertion needs its own provider config with apiBaseUrl. Otherwise the grader falls back to the default Anthropic/OpenAI endpoint and gets 401 errors:

assert:
  - type: llm-rubric
    value: |
      Evaluate quality on a 0-1 scale.
    threshold: 0.7
    provider:
      id: anthropic:messages:claude-sonnet-4-6
      config:
        apiBaseUrl: https://your-relay.example.com/api

Best practices:

  • Provide clear scoring criteria
  • Use threshold to set minimum passing score
  • Default grader uses available API keys (OpenAI → Anthropic → Google)
  • When using relay/proxy: every llm-rubric must have its own provider with apiBaseUrl — the main provider's apiBaseUrl is NOT inherited
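Because the grader provider has to be repeated on every llm-rubric assertion, test cases that are generated by a script can have it attached programmatically. The following is a sketch under that assumption: `attach_grader` and `GRADER_PROVIDER` are hypothetical names, and the test cases are plain Python dicts that would be written out to the test file afterwards.

```python
import copy

# Hypothetical grader settings; the id and URL mirror the placeholders used above.
GRADER_PROVIDER = {
    "id": "anthropic:messages:claude-sonnet-4-6",
    "config": {"apiBaseUrl": "https://your-relay.example.com/api"},
}


def attach_grader(tests: list, provider: dict = GRADER_PROVIDER) -> list:
    """Return a copy of tests where every llm-rubric assertion has a provider."""
    patched = copy.deepcopy(tests)
    for test in patched:
        for assertion in test.get("assert", []):
            # Leave assertions alone if they already name their own grader
            if assertion.get("type") == "llm-rubric" and "provider" not in assertion:
                assertion["provider"] = copy.deepcopy(provider)
    return patched
```

Other assertion types are left untouched, and the input list is not modified.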

Common Assertion Types

Type          Usage               Example
contains      Check substring     value: "hello"
icontains     Case-insensitive    value: "HELLO"
equals        Exact match         value: "42"
regex         Pattern match       value: "\\d{4}"
python        Custom logic        value: file://script.py
llm-rubric    LLM grading         value: "Is professional"
latency       Response time       threshold: 1000

File References

All file:// paths are resolved relative to promptfooconfig.yaml location (NOT the YAML file containing the reference). This is a common gotcha when tests: references a separate YAML file — the file:// paths inside that test file still resolve from the config root.

# Load file content as variable
vars:
  content: file://data/input.txt

# Load prompt from file
prompts:
  - file://prompts/main.md

# Load test cases from file
tests: file://tests/cases.yaml

# Load Python assertion
assert:
  - type: python
    value: file://scripts/check.py:validate
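The resolution rule described above can be illustrated with a small helper. This is not Promptfoo's own code, only a sketch of the stated behavior: `resolve_file_ref` is a hypothetical function that joins a file:// reference to the directory holding promptfooconfig.yaml and splits off an optional function name.

```python
from pathlib import PurePosixPath
from typing import Optional, Tuple


def resolve_file_ref(config_path: str, ref: str) -> Tuple[PurePosixPath, Optional[str]]:
    """Resolve a file:// reference against the directory holding the config."""
    if not ref.startswith("file://"):
        raise ValueError(f"not a file reference: {ref}")
    target = ref[len("file://"):]
    function_name = None
    # A trailing ':name' after a .py path selects a function,
    # as in file://scripts/check.py:validate
    if ".py:" in target:
        target, function_name = target.rsplit(":", 1)
    base_dir = PurePosixPath(config_path).parent
    return base_dir / target, function_name
```

Under this rule, file://data/context.txt inside tests/cases.yaml points at project/data/context.txt, not project/tests/data/context.txt.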

Running Evaluations

# Basic run
npx promptfoo@latest eval

# With specific config
npx promptfoo@latest eval --config path/to/config.yaml

# Output to file
npx promptfoo@latest eval --output results.json

# Filter tests
npx promptfoo@latest eval --filter-metadata category=math

# View results
npx promptfoo@latest view

Relay / Proxy API Configuration

When using an API relay or proxy instead of direct Anthropic/OpenAI endpoints:

providers:
  - id: anthropic:messages:claude-sonnet-4-6
    label: Claude-Sonnet-4.6
    config:
      max_tokens: 4096
      apiBaseUrl: https://your-relay.example.com/api  # Promptfoo appends /v1/messages

# CRITICAL: maxConcurrency MUST be under commandLineOptions (NOT top-level)
commandLineOptions:
  maxConcurrency: 1  # Respect relay rate limits

Key rules:

  • apiBaseUrl goes in providers[].config — Promptfoo appends /v1/messages automatically
  • maxConcurrency must be under commandLineOptions: — placing it at top level is silently ignored
  • When using relay with LLM-as-judge, set maxConcurrency: 1 to avoid concurrent request limits (generation + grading share the same pool)
  • Pass relay token as ANTHROPIC_API_KEY env var
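Since a misplaced maxConcurrency is silently ignored, a small pre-flight check can catch it before a run. The sketch below assumes the config has already been parsed into a dict (for example with a YAML library); `check_max_concurrency` is a hypothetical helper, not part of Promptfoo.

```python
def check_max_concurrency(config: dict) -> list:
    """Return warnings about where maxConcurrency is set in a parsed config."""
    warnings = []
    if "maxConcurrency" in config:
        warnings.append(
            "maxConcurrency is at the top level; move it under commandLineOptions"
        )
    options = config.get("commandLineOptions") or {}
    if "maxConcurrency" not in options:
        warnings.append("commandLineOptions.maxConcurrency is not set")
    return warnings
```

An empty list means the setting is where the rules above say it should be.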

Troubleshooting

Python not found:

export PROMPTFOO_PYTHON=python3

Large outputs truncated: Outputs over 30000 characters are truncated. Use head_limit in assertions.

File not found errors: All file:// paths resolve relative to promptfooconfig.yaml location.

maxConcurrency ignored (shows "up to N at a time"): maxConcurrency must be under commandLineOptions:, not at the YAML top level. This is a common mistake.

LLM-as-judge returns 401 with relay API: Each llm-rubric assertion must have its own provider with apiBaseUrl. The main provider config is not inherited by grader assertions.

HTML tags in model output inflating metrics: Models may output <br>, <b>, etc. in structured content. Strip HTML in Python assertions before measuring:

import re
clean_text = re.sub(r'<[^>]+>', '', raw_text)
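A slightly fuller sketch of the same idea, wrapped as an assertion function. The names `strip_html` and `clean_word_count` and the `max_words` variable are hypothetical. Unlike the one-liner above, tags are replaced with a space so that words separated only by <br> are not merged, and HTML entities are decoded.

```python
import html
import re

TAG_PATTERN = re.compile(r"<[^>]+>")


def strip_html(raw_text: str) -> str:
    """Remove tags, decode entities such as &amp;, and collapse whitespace."""
    without_tags = TAG_PATTERN.sub(" ", raw_text)
    return " ".join(html.unescape(without_tags).split())


def clean_word_count(output: str, context: dict) -> dict:
    """Assertion that counts words after HTML has been stripped."""
    word_count = len(strip_html(output).split())
    # 'max_words' is a hypothetical test variable with an arbitrary default
    max_words = context.get("vars", {}).get("max_words", 500)
    passed = word_count <= max_words
    return {
        "pass": passed,
        "score": 1.0 if passed else 0.0,
        "reason": f"Word count after stripping HTML: {word_count}",
    }
```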

Echo Provider (Preview Mode)

Use the echo provider to preview rendered prompts without making API calls:

# promptfooconfig-preview.yaml
providers:
  - echo  # Returns prompt as output, no API calls

tests:
  - vars:
      input: "test content"

Use cases:

  • Preview prompt rendering before expensive API calls
  • Verify Few-shot examples are loaded correctly
  • Debug variable substitution issues
  • Validate prompt structure

# Run preview mode
npx promptfoo@latest eval --config promptfooconfig-preview.yaml

Cost: Free - no API tokens consumed.
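Because the echo provider returns the rendered prompt as the output, a Python assertion can check the rendering automatically instead of by eye. The sketch below assumes that context['vars'] holds the variable values as they were substituted; `vars_rendered` is a hypothetical name. Values that are JSON-escaped inside a chat-format prompt (quotes, newlines) may not match verbatim, so this fits plain-text prompts best.

```python
def vars_rendered(output: str, context: dict) -> dict:
    """Check that every non-empty string variable appears in the echoed prompt."""
    vars_dict = context.get("vars", {})
    missing = [
        name
        for name, value in vars_dict.items()
        if isinstance(value, str) and value.strip() and value.strip() not in output
    ]
    return {
        "pass": not missing,
        "score": 0.0 if missing else 1.0,
        "reason": f"Variables not found in prompt: {missing}" if missing
        else "All variables found in prompt",
    }
```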

Advanced Few-Shot Implementation

Multi-turn Conversation Pattern

For complex few-shot learning with full examples, place each example as a user/assistant pair before the actual task. The prompt below has two examples (the second is optional) followed by the actual test input. JSON does not allow comments, so the file contains only the messages:

[
  {"role": "system", "content": "{{system_prompt}}"},

  {"role": "user", "content": "Task: {{example_input_1}}"},
  {"role": "assistant", "content": "{{example_output_1}}"},

  {"role": "user", "content": "Task: {{example_input_2}}"},
  {"role": "assistant", "content": "{{example_output_2}}"},

  {"role": "user", "content": "Task: {{actual_input}}"}
]

Test case configuration:

tests:
  - vars:
      system_prompt: file://prompts/system.md
      # Few-shot examples
      example_input_1: file://data/examples/input1.txt
      example_output_1: file://data/examples/output1.txt
      example_input_2: file://data/examples/input2.txt
      example_output_2: file://data/examples/output2.txt
      # Actual test
      actual_input: file://data/test1.txt

Best practices:

  • Use 1-3 few-shot examples (more may dilute effectiveness)
  • Ensure examples match the task format exactly
  • Load examples from files for better maintainability
  • Use echo provider first to verify structure
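When the number of examples changes often, the chat-format template can be generated rather than edited by hand. This is a sketch only: `build_few_shot_prompt` is a hypothetical helper, and the variable names follow the example_input_N / example_output_N pattern used above.

```python
import json


def build_few_shot_prompt(n_examples: int) -> str:
    """Build a chat-format prompt template with n_examples example turns."""
    messages = [{"role": "system", "content": "{{system_prompt}}"}]
    for i in range(1, n_examples + 1):
        # Doubled braces in the f-string produce literal {{ and }} placeholders
        messages.append({"role": "user", "content": f"Task: {{{{example_input_{i}}}}}"})
        messages.append({"role": "assistant", "content": f"{{{{example_output_{i}}}}}"})
    messages.append({"role": "user", "content": "Task: {{actual_input}}"})
    return json.dumps(messages, indent=2)
```

The returned string can be written to prompts/chat.json; going through json.dumps keeps the file valid JSON.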

Long Text Handling

For Chinese/long-form content evaluations (10k+ characters):

Configuration:

providers:
  - id: anthropic:messages:claude-sonnet-4-6
    config:
      max_tokens: 8192  # Increase for long outputs

defaultTest:
  assert:
    - type: python
      va

---

*Content truncated.*
