promptfoo-evaluation
Configures and runs LLM evaluation using Promptfoo framework. Use when setting up prompt testing, creating evaluation configs (promptfooconfig.yaml), writing Python custom assertions, implementing llm-rubric for LLM-as-judge, or managing few-shot examples in prompts. Triggers on keywords like "promptfoo", "eval", "LLM evaluation", "prompt testing", or "model comparison".
Install
mkdir -p .claude/skills/promptfoo-evaluation && curl -L -o skill.zip "https://mcp.directory/api/skills/download/3138" && unzip -o skill.zip -d .claude/skills/promptfoo-evaluation && rm skill.zipInstalls to .claude/skills/promptfoo-evaluation
About this skill
Promptfoo Evaluation
Overview
This skill provides guidance for configuring and running LLM evaluations using Promptfoo, an open-source CLI tool for testing and comparing LLM outputs.
Quick Start
# Initialize a new evaluation project
npx promptfoo@latest init
# Run evaluation
npx promptfoo@latest eval
# View results in browser
npx promptfoo@latest view
Configuration Structure
A typical Promptfoo project structure:
project/
├── promptfooconfig.yaml # Main configuration
├── prompts/
│ ├── system.md # System prompt
│ └── chat.json # Chat format prompt
├── tests/
│ └── cases.yaml # Test cases
└── scripts/
└── metrics.py # Custom Python assertions
Core Configuration (promptfooconfig.yaml)
# yaml-language-server: $schema=https://promptfoo.dev/config-schema.json
description: "My LLM Evaluation"
# Prompts to test
prompts:
- file://prompts/system.md
- file://prompts/chat.json
# Models to compare
providers:
- id: anthropic:messages:claude-sonnet-4-6
label: Claude-Sonnet-4.6
- id: openai:gpt-4.1
label: GPT-4.1
# Test cases
tests: file://tests/cases.yaml
# Concurrency control (MUST be under commandLineOptions, NOT top-level)
commandLineOptions:
maxConcurrency: 2
# Default assertions for all tests
defaultTest:
assert:
- type: python
value: file://scripts/metrics.py:custom_assert
- type: llm-rubric
value: |
Evaluate the response quality on a 0-1 scale.
threshold: 0.7
# Output path
outputPath: results/eval-results.json
Prompt Formats
Text Prompt (system.md)
You are a helpful assistant.
Task: {{task}}
Context: {{context}}
Chat Format (chat.json)
[
{"role": "system", "content": "{{system_prompt}}"},
{"role": "user", "content": "{{user_input}}"}
]
Few-Shot Pattern
Embed examples directly in prompt or use chat format with assistant messages:
[
{"role": "system", "content": "{{system_prompt}}"},
{"role": "user", "content": "Example input: {{example_input}}"},
{"role": "assistant", "content": "{{example_output}}"},
{"role": "user", "content": "Now process: {{actual_input}}"}
]
Test Cases (tests/cases.yaml)
- description: "Test case 1"
vars:
system_prompt: file://prompts/system.md
user_input: "Hello world"
# Load content from files
context: file://data/context.txt
assert:
- type: contains
value: "expected text"
- type: python
value: file://scripts/metrics.py:custom_check
threshold: 0.8
Python Custom Assertions
Create a Python file for custom assertions (e.g., scripts/metrics.py):
def get_assert(output: str, context: dict) -> dict:
"""Default assertion function."""
vars_dict = context.get('vars', {})
# Access test variables
expected = vars_dict.get('expected', '')
# Return result
return {
"pass": expected in output,
"score": 0.8,
"reason": "Contains expected content",
"named_scores": {"relevance": 0.9}
}
def custom_check(output: str, context: dict) -> dict:
"""Custom named assertion."""
word_count = len(output.split())
passed = 100 <= word_count <= 500
return {
"pass": passed,
"score": min(1.0, word_count / 300),
"reason": f"Word count: {word_count}"
}
Key points:
- Default function name is
get_assert - Specify function with
file://path.py:function_name - Return
bool,float(score), ordictwith pass/score/reason - Access variables via
context['vars']
LLM-as-Judge (llm-rubric)
assert:
- type: llm-rubric
value: |
Evaluate the response based on:
1. Accuracy of information
2. Clarity of explanation
3. Completeness
Score 0.0-1.0 where 0.7+ is passing.
threshold: 0.7
provider: openai:gpt-4.1 # Optional: override grader model
When using a relay/proxy API, each llm-rubric assertion needs its own provider config with apiBaseUrl. Otherwise the grader falls back to the default Anthropic/OpenAI endpoint and gets 401 errors:
assert:
- type: llm-rubric
value: |
Evaluate quality on a 0-1 scale.
threshold: 0.7
provider:
id: anthropic:messages:claude-sonnet-4-6
config:
apiBaseUrl: https://your-relay.example.com/api
Best practices:
- Provide clear scoring criteria
- Use
thresholdto set minimum passing score - Default grader uses available API keys (OpenAI → Anthropic → Google)
- When using relay/proxy: every
llm-rubricmust have its ownproviderwithapiBaseUrl— the main provider'sapiBaseUrlis NOT inherited
Common Assertion Types
| Type | Usage | Example |
|---|---|---|
contains | Check substring | value: "hello" |
icontains | Case-insensitive | value: "HELLO" |
equals | Exact match | value: "42" |
regex | Pattern match | value: "\\d{4}" |
python | Custom logic | value: file://script.py |
llm-rubric | LLM grading | value: "Is professional" |
latency | Response time | threshold: 1000 |
File References
All file:// paths are resolved relative to promptfooconfig.yaml location (NOT the YAML file containing the reference). This is a common gotcha when tests: references a separate YAML file — the file:// paths inside that test file still resolve from the config root.
# Load file content as variable
vars:
content: file://data/input.txt
# Load prompt from file
prompts:
- file://prompts/main.md
# Load test cases from file
tests: file://tests/cases.yaml
# Load Python assertion
assert:
- type: python
value: file://scripts/check.py:validate
Running Evaluations
# Basic run
npx promptfoo@latest eval
# With specific config
npx promptfoo@latest eval --config path/to/config.yaml
# Output to file
npx promptfoo@latest eval --output results.json
# Filter tests
npx promptfoo@latest eval --filter-metadata category=math
# View results
npx promptfoo@latest view
Relay / Proxy API Configuration
When using an API relay or proxy instead of direct Anthropic/OpenAI endpoints:
providers:
- id: anthropic:messages:claude-sonnet-4-6
label: Claude-Sonnet-4.6
config:
max_tokens: 4096
apiBaseUrl: https://your-relay.example.com/api # Promptfoo appends /v1/messages
# CRITICAL: maxConcurrency MUST be under commandLineOptions (NOT top-level)
commandLineOptions:
maxConcurrency: 1 # Respect relay rate limits
Key rules:
apiBaseUrlgoes inproviders[].config— Promptfoo appends/v1/messagesautomaticallymaxConcurrencymust be undercommandLineOptions:— placing it at top level is silently ignored- When using relay with LLM-as-judge, set
maxConcurrency: 1to avoid concurrent request limits (generation + grading share the same pool) - Pass relay token as
ANTHROPIC_API_KEYenv var
Troubleshooting
Python not found:
export PROMPTFOO_PYTHON=python3
Large outputs truncated:
Outputs over 30000 characters are truncated. Use head_limit in assertions.
File not found errors:
All file:// paths resolve relative to promptfooconfig.yaml location.
maxConcurrency ignored (shows "up to N at a time"):
maxConcurrency must be under commandLineOptions:, not at the YAML top level. This is a common mistake.
LLM-as-judge returns 401 with relay API:
Each llm-rubric assertion must have its own provider with apiBaseUrl. The main provider config is not inherited by grader assertions.
HTML tags in model output inflating metrics:
Models may output <br>, <b>, etc. in structured content. Strip HTML in Python assertions before measuring:
import re
clean_text = re.sub(r'<[^>]+>', '', raw_text)
Echo Provider (Preview Mode)
Use the echo provider to preview rendered prompts without making API calls:
# promptfooconfig-preview.yaml
providers:
- echo # Returns prompt as output, no API calls
tests:
- vars:
input: "test content"
Use cases:
- Preview prompt rendering before expensive API calls
- Verify Few-shot examples are loaded correctly
- Debug variable substitution issues
- Validate prompt structure
# Run preview mode
npx promptfoo@latest eval --config promptfooconfig-preview.yaml
Cost: Free - no API tokens consumed.
Advanced Few-Shot Implementation
Multi-turn Conversation Pattern
For complex few-shot learning with full examples:
[
{"role": "system", "content": "{{system_prompt}}"},
// Few-shot Example 1
{"role": "user", "content": "Task: {{example_input_1}}"},
{"role": "assistant", "content": "{{example_output_1}}"},
// Few-shot Example 2 (optional)
{"role": "user", "content": "Task: {{example_input_2}}"},
{"role": "assistant", "content": "{{example_output_2}}"},
// Actual test
{"role": "user", "content": "Task: {{actual_input}}"}
]
Test case configuration:
tests:
- vars:
system_prompt: file://prompts/system.md
# Few-shot examples
example_input_1: file://data/examples/input1.txt
example_output_1: file://data/examples/output1.txt
example_input_2: file://data/examples/input2.txt
example_output_2: file://data/examples/output2.txt
# Actual test
actual_input: file://data/test1.txt
Best practices:
- Use 1-3 few-shot examples (more may dilute effectiveness)
- Ensure examples match the task format exactly
- Load examples from files for better maintainability
- Use echo provider first to verify structure
Long Text Handling
For Chinese/long-form content evaluations (10k+ characters):
Configuration:
providers:
- id: anthropic:messages:claude-sonnet-4-6
config:
max_tokens: 8192 # Increase for long outputs
defaultTest:
assert:
- type: python
va
---
*Content truncated.*
More by daymade
View all skills by daymade →You might also like
flutter-development
aj-geddes
Build beautiful cross-platform mobile apps with Flutter and Dart. Covers widgets, state management with Provider/BLoC, navigation, API integration, and material design.
drawio-diagrams-enhanced
jgtolentino
Create professional draw.io (diagrams.net) diagrams in XML format (.drawio files) with integrated PMP/PMBOK methodologies, extensive visual asset libraries, and industry-standard professional templates. Use this skill when users ask to create flowcharts, swimlane diagrams, cross-functional flowcharts, org charts, network diagrams, UML diagrams, BPMN, project management diagrams (WBS, Gantt, PERT, RACI), risk matrices, stakeholder maps, or any other visual diagram in draw.io format. This skill includes access to custom shape libraries for icons, clipart, and professional symbols.
ui-ux-pro-max
nextlevelbuilder
"UI/UX design intelligence. 50 styles, 21 palettes, 50 font pairings, 20 charts, 8 stacks (React, Next.js, Vue, Svelte, SwiftUI, React Native, Flutter, Tailwind). Actions: plan, build, create, design, implement, review, fix, improve, optimize, enhance, refactor, check UI/UX code. Projects: website, landing page, dashboard, admin panel, e-commerce, SaaS, portfolio, blog, mobile app, .html, .tsx, .vue, .svelte. Elements: button, modal, navbar, sidebar, card, table, form, chart. Styles: glassmorphism, claymorphism, minimalism, brutalism, neumorphism, bento grid, dark mode, responsive, skeuomorphism, flat design. Topics: color palette, accessibility, animation, layout, typography, font pairing, spacing, hover, shadow, gradient."
godot
bfollington
This skill should be used when working on Godot Engine projects. It provides specialized knowledge of Godot's file formats (.gd, .tscn, .tres), architecture patterns (component-based, signal-driven, resource-based), common pitfalls, validation tools, code templates, and CLI workflows. The `godot` command is available for running the game, validating scripts, importing resources, and exporting builds. Use this skill for tasks involving Godot game development, debugging scene/resource files, implementing game systems, or creating new Godot components.
nano-banana-pro
garg-aayush
Generate and edit images using Google's Nano Banana Pro (Gemini 3 Pro Image) API. Use when the user asks to generate, create, edit, modify, change, alter, or update images. Also use when user references an existing image file and asks to modify it in any way (e.g., "modify this image", "change the background", "replace X with Y"). Supports both text-to-image generation and image-to-image editing with configurable resolution (1K default, 2K, or 4K for high resolution). DO NOT read the image file first - use this skill directly with the --input-image parameter.
fastapi-templates
wshobson
Create production-ready FastAPI projects with async patterns, dependency injection, and comprehensive error handling. Use when building new FastAPI applications or setting up backend API projects.
Related MCP Servers
Browse all serversAnalyze lrcx stock in real-time with Investor Agent using yfinance and CNN data for portfolio and market sentiment insig
Use Calculator for quick, secure payroll and percentage calculations. Try our percentage calculator and payroll calculat
Test website accessibility and ensure WCAG compliance with Axe Accessibility, a web accessibility checker with detailed
OpenRouter Image Analysis offers color analyze and image j compatibilities for advanced image analysis using vision mode
Easily convert markdown to PDF using Markitdown MCP server. Supports HTTP, STDIO, and SSE for fast converting markdown t
Unlock AI-ready web data with Firecrawl: scrape any website, handle dynamic content, and automate web scraping for resea
Stay ahead of the MCP ecosystem
Get weekly updates on new skills and servers.