evaluate-environments
Run and analyze evaluations for verifiers environments using prime eval. Use when asked to smoke-test environments, run benchmark sweeps, resume interrupted evaluations, compare models, inspect sample-level outputs, or produce evaluation summaries suitable for deciding next steps.
Install
mkdir -p .claude/skills/evaluate-environments && curl -L -o skill.zip "https://mcp.directory/api/skills/download/4589" && unzip -o skill.zip -d .claude/skills/evaluate-environments && rm skill.zip
Installs to .claude/skills/evaluate-environments
About this skill
Evaluate Environments
Goal
Run reliable environment evaluations and produce actionable summaries, not raw logs.
Canonical Eval Path
- Use prime eval run as the default way to run evaluations.
- Do not add --skip-upload or other opt-out flags unless the user explicitly requests that deviation.
- Standard prime eval run invocations save results automatically, keeping them available in the user's private Evaluations tab and locally in prime eval tui.
Core Loop
- Run a smoke evaluation first (do not require pre-install):
prime eval run my-env -m gpt-4.1-mini -n 5
- Use owner/env slug directly when evaluating Hub environments:
prime eval run owner/my-env -m gpt-4.1-mini -n 5
- Scale only after smoke pass:
prime eval run owner/my-env -m gpt-4.1-mini -n 200 -r 3 -s
- Treat ownerless env ids as local-first. If not found locally, rely on Prime resolution for your remote env where applicable.
Endpoint Shortcuts And Model Family Choice
- Encourage users to define endpoint aliases in configs/endpoints.toml so model, base URL, and key wiring stay reusable.
- Use aliases via -m <endpoint_id> instead of repeating -b and -k.
- Ask users explicitly whether they want an instruct or reasoning model before non-trivial evaluations.
- Instruct go-tos for quick behavior checks: the gpt-4.1 series and the qwen3 instruct series.
- Reasoning go-tos for deeper test coverage: the gpt-5 series, the qwen3 thinking series, and the glm series.
- Example endpoint registry:
[[endpoint]]
endpoint_id = "gpt-4.1-mini"
model = "gpt-4.1-mini"
url = "https://api.openai.com/v1"
key = "OPENAI_API_KEY"
[[endpoint]]
endpoint_id = "qwen3-32b-i"
model = "qwen/qwen3-32b-instruct"
url = "https://api.pinference.ai/api/v1"
key = "PRIME_API_KEY"
Publish Gate Before Large Runs
- After smoke tests pass and results look stable, proactively suggest pushing the environment to Hub before large eval sweeps or RL work.
- Ask the user explicitly: should visibility be PUBLIC or PRIVATE?
- Push with chosen visibility:
prime env push my-env --visibility PUBLIC
or
prime env push my-env --visibility PRIVATE
- For hosted eval workflows, prefer running large jobs against the Hub slug:
prime eval run owner/my-env -m gpt-4.1-mini -n 200 -r 3 -s
Prefer Config-Driven Evals Beyond Smoke Tests
- For anything beyond quick checks, nudge the user to create an eval TOML config.
- Use config files to run multiple evals in one command and keep runs reproducible:
prime eval run configs/eval/my-benchmark.toml
- Make config files the default for benchmark sweeps, multi-model comparisons, and recurring reports.
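As a sketch of what such a config might contain (the exact top-level schema is defined by prime eval, so the keys below are assumptions; the [[ablation]] block shape matches the ablation example shown later in this document):

```toml
# configs/eval/my-benchmark.toml -- hypothetical sketch, keys are assumptions
[[ablation]]
env_id = "owner/my-env"

[ablation.sweep]
# Sweeping the model lets one config file drive a multi-model comparison
model = ["gpt-4.1-mini", "qwen3-32b-i"]
```

Checking the generated run list against expectations before a large sweep is cheaper than discovering a typo after it.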
Common Evaluation Patterns
- Pass args to load_environment():
prime eval run my-env -a '{"difficulty":"hard"}'
- Override constructor kwargs:
prime eval run my-env -x '{"max_turns":20}'
- Save extra state columns:
prime eval run my-env -s -C "judge_response,parsed_answer"
- Resume interrupted runs:
prime eval run my-env -n 1000 -s --resume
- Save results to a custom output directory:
prime eval run my-env -s -o /path/to/output
- Run multi-environment TOML suites:
prime eval run configs/eval/my-benchmark.toml
- Scale worker processes to parallelize env execution:
prime eval run my-env -c 1024 -w 4
- Run ablation sweeps using [[ablation]] blocks in TOML configs:
[[ablation]]
env_id = "my-env"
[ablation.sweep]
temperature = [0.0, 0.5, 1.0]
[ablation.sweep.env_args]
difficulty = ["easy", "hard"]
This generates the Cartesian product of the swept values (6 configs in this example). Use --abbreviated-summary (-A) for compact ablation results.
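Conceptually, the sweep expansion is just a Cartesian product over the listed values. A minimal Python sketch of that expansion (not prime eval's actual implementation):

```python
from itertools import product

# Sweep values from the TOML example above
sweep = {"temperature": [0.0, 0.5, 1.0], "difficulty": ["easy", "hard"]}

# One run config per combination of swept values
configs = [dict(zip(sweep, combo)) for combo in product(*sweep.values())]
print(len(configs))  # 6 configs: 3 temperatures x 2 difficulties
```

This is why sweeps grow multiplicatively: adding one more two-value axis doubles the run count.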
Inspect Saved Results
- Browse locally saved runs:
prime eval tui
- Inspect platform-visible runs when needed:
prime eval list
prime eval get <eval-id>
prime eval samples <eval-id>
Metrics Interpretation
- Treat binary and continuous rewards differently.
- Use pass@k-style interpretation only when rewards are effectively binary.
- For continuous rewards, focus on distribution shifts and per-task means.
- Always inspect samples before concluding regressions.
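For binary rewards, the standard unbiased pass@k estimator can be computed per task from n rollouts with c successes (this is the general formula, not a prime eval built-in):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples
    drawn without replacement from n rollouts (c correct) succeeds."""
    if n - c < k:
        return 1.0  # too few failures to fill k draws without a success
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(10, 3, 1))  # 0.3 -- at k=1 this is just the success rate
```

Averaging this quantity across tasks gives a benchmark-level pass@k; applying it to a continuous reward by thresholding silently changes what is being measured, hence the binary-only caveat above.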
Reliability Rules
- Keep environment/model/config fixed while comparing variants.
- Record exact command lines and key flags in the report.
- Call out missing credentials, endpoint mismatches, and dependency errors directly.
- Do not overinterpret tiny sample runs.
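One low-tech way to satisfy the "record exact command lines" rule is to append each invocation to a log before running it (the log filename here is only an illustration):

```shell
# Record the exact command line, timestamped, before executing it
CMD='prime eval run owner/my-env -m gpt-4.1-mini -n 200 -r 3 -s'
echo "$(date -u +%FT%TZ) $CMD" >> eval_commands.log
# eval "$CMD"   # then run it
```

A log like this makes reruns and report appendices trivial to assemble.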
Output Format
Return:
- Run configuration table.
- Aggregate metrics and key deltas.
- Sample-level failure themes.
- Clear recommendation: proceed, iterate environment, or retune model/sampling.