evaluate-environments
Run and analyze evaluations for verifiers environments using prime eval. Use when asked to smoke-test environments, run benchmark sweeps, resume interrupted evaluations, compare models, inspect sample-level outputs, or produce evaluation summaries suitable for deciding next steps.
Install
mkdir -p .claude/skills/evaluate-environments && curl -L -o skill.zip "https://mcp.directory/api/skills/download/4589" && unzip -o skill.zip -d .claude/skills/evaluate-environments && rm skill.zip
Installs to .claude/skills/evaluate-environments
About this skill
Evaluate Environments
Goal
Run reliable environment evaluations and produce actionable summaries, not raw logs.
Canonical Eval Path
- Use prime eval run as the default way to run evaluations.
- Do not add --skip-upload or other opt-out flags unless the user explicitly requests that deviation.
- Standard prime eval run invocations save results automatically, keeping them available in the user's private Evaluations tab and locally in prime eval tui.
Core Loop
- Run a smoke evaluation first (do not require pre-install):
prime eval run my-env -m gpt-4.1-mini -n 5
- Use owner/env slug directly when evaluating Hub environments:
prime eval run owner/my-env -m gpt-4.1-mini -n 5
- Scale only after smoke pass:
prime eval run owner/my-env -m gpt-4.1-mini -n 200 -r 3 -s
- Treat ownerless env ids as local-first. If not found locally, rely on Prime resolution for your remote env where applicable.
Endpoint Shortcuts And Model Family Choice
- Encourage users to define endpoint aliases in configs/endpoints.toml so model, base URL, and key wiring stay reusable.
- Use aliases via -m <endpoint_id> instead of repeating -b and -k.
- Ask users explicitly whether they want an instruct or reasoning model before non-trivial evaluations.
- Instruct go-tos for quick behavior checks: the gpt-4.1 series and the qwen3 instruct series.
- Reasoning go-tos for deeper test coverage: the gpt-5 series, the qwen3 thinking series, and the glm series.
- Example endpoint registry:
[[endpoint]]
endpoint_id = "gpt-4.1-mini"
model = "gpt-4.1-mini"
url = "https://api.openai.com/v1"
key = "OPENAI_API_KEY"
[[endpoint]]
endpoint_id = "qwen3-32b-i"
model = "qwen/qwen3-32b-instruct"
url = "https://api.pinference.ai/api/v1"
key = "PRIME_API_KEY"
Publish Gate Before Large Runs
- After smoke tests pass and results look stable, proactively suggest pushing the environment to Hub before large eval sweeps or RL work.
- Ask the user explicitly: should visibility be PUBLIC or PRIVATE?
- Push with chosen visibility:
prime env push my-env --visibility PUBLIC
or
prime env push my-env --visibility PRIVATE
- For hosted eval workflows, prefer running large jobs against the Hub slug:
prime eval run owner/my-env -m gpt-4.1-mini -n 200 -r 3 -s
Prefer Config-Driven Evals Beyond Smoke Tests
- For anything beyond quick checks, nudge the user to create an eval TOML config.
- Use config files to run multiple evals in one command and keep runs reproducible:
prime eval run configs/eval/my-benchmark.toml
- Make config files the default for benchmark sweeps, multi-model comparisons, and recurring reports.
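As a sketch of what such a config might contain (the exact top-level schema is defined by prime eval, so the keys below are assumptions; the [[ablation]] block shape matches the ablation example shown later in this document):

```toml
# configs/eval/my-benchmark.toml -- hypothetical sketch, keys are assumptions
[[ablation]]
env_id = "owner/my-env"

[ablation.sweep]
# Sweeping the model lets one config file drive a multi-model comparison
model = ["gpt-4.1-mini", "qwen3-32b-i"]
```

Checking the generated run list against expectations before a large sweep is cheaper than discovering a typo after it.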
Common Evaluation Patterns
- Pass args to load_environment():
prime eval run my-env -a '{"difficulty":"hard"}'
- Override constructor kwargs:
prime eval run my-env -x '{"max_turns":20}'
- Save extra state columns:
prime eval run my-env -s -C "judge_response,parsed_answer"
- Resume interrupted runs:
prime eval run my-env -n 1000 -s --resume
- Save results to a custom output directory:
prime eval run my-env -s -o /path/to/output
- Run multi-environment TOML suites:
prime eval run configs/eval/my-benchmark.toml
- Scale worker processes to parallelize env execution:
prime eval run my-env -c 1024 -w 4
- Run ablation sweeps using [[ablation]] blocks in TOML configs:
[[ablation]]
env_id = "my-env"
[ablation.sweep]
temperature = [0.0, 0.5, 1.0]
[ablation.sweep.env_args]
difficulty = ["easy", "hard"]
This generates the Cartesian product of the swept values (6 configs in this example). Use --abbreviated-summary (-A) for compact ablation results.
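Conceptually, the sweep expansion is just a Cartesian product over the listed values. A minimal Python sketch of that expansion (not prime eval's actual implementation):

```python
from itertools import product

# Sweep values from the TOML example above
sweep = {"temperature": [0.0, 0.5, 1.0], "difficulty": ["easy", "hard"]}

# One run config per combination of swept values
configs = [dict(zip(sweep, combo)) for combo in product(*sweep.values())]
print(len(configs))  # 6 configs: 3 temperatures x 2 difficulties
```

This is why sweeps grow multiplicatively: adding one more two-value axis doubles the run count.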
Inspect Saved Results
- Browse locally saved runs:
prime eval tui
- Inspect platform-visible runs when needed:
prime eval list
prime eval get <eval-id>
prime eval samples <eval-id>
Metrics Interpretation
- Treat binary and continuous rewards differently.
- Use pass@k-style interpretation only when rewards are effectively binary.
- For continuous rewards, focus on distribution shifts and per-task means.
- Always inspect samples before concluding regressions.
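For binary rewards, the standard unbiased pass@k estimator can be computed per task from n rollouts with c successes (this is the general formula, not a prime eval built-in):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples
    drawn without replacement from n rollouts (c correct) succeeds."""
    if n - c < k:
        return 1.0  # too few failures to fill k draws without a success
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(10, 3, 1))  # 0.3 -- at k=1 this is just the success rate
```

Averaging this quantity across tasks gives a benchmark-level pass@k; applying it to a continuous reward by thresholding silently changes what is being measured, hence the binary-only caveat above.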
Reliability Rules
- Keep environment/model/config fixed while comparing variants.
- Record exact command lines and key flags in the report.
- Call out missing credentials, endpoint mismatches, and dependency errors directly.
- Do not overinterpret tiny sample runs.
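One low-tech way to satisfy the "record exact command lines" rule is to append each invocation to a log before running it (the log filename here is only an illustration):

```shell
# Record the exact command line, timestamped, before executing it
CMD='prime eval run owner/my-env -m gpt-4.1-mini -n 200 -r 3 -s'
echo "$(date -u +%FT%TZ) $CMD" >> eval_commands.log
# eval "$CMD"   # then run it
```

A log like this makes reruns and report appendices trivial to assemble.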
Output Format
Return:
- Run configuration table.
- Aggregate metrics and key deltas.
- Sample-level failure themes.
- Clear recommendation: proceed, iterate environment, or retune model/sampling.