arize-phoenix
Open-source AI observability platform for tracing, evaluating, and improving LLM applications with OpenTelemetry integration
Install
```bash
mkdir -p .claude/skills/arize-phoenix && curl -L -o skill.zip "https://mcp.directory/api/skills/download/1057" && unzip -o skill.zip -d .claude/skills/arize-phoenix && rm skill.zip
```
Installs to .claude/skills/arize-phoenix
About this skill
Arize Phoenix
Phoenix is an open-source AI observability platform built on OpenTelemetry that helps developers understand, debug, and improve AI applications. It provides comprehensive tracing, evaluation, prompt engineering, and experimentation capabilities for LLM-based systems. Phoenix captures detailed execution information from AI applications, measures output quality with evaluators, enables systematic prompt iteration, and supports data-driven experimentation to optimize AI performance.
When to Use This Skill
- Debugging AI application failures by inspecting LLM calls, tool executions, and retrieval operations
- Measuring and improving AI output quality using LLM-based or code-based evaluators
- Iterating on prompts using real production examples and testing variations systematically
- Comparing different versions of AI applications (prompts, models, architectures) using experiments
- Monitoring LLM costs, token usage, latency, and error rates in production
- Building datasets from production traces for evaluation and fine-tuning
- Tracking multi-turn conversations and maintaining context across interactions
- Optimizing RAG systems by analyzing retrieval quality and document relevance
- Evaluating agent performance including tool call accuracy and actionability
- Managing prompt versions and deploying them across different environments
Capabilities
Agents can leverage Phoenix to:
- Trace AI application execution with detailed visibility into LLM calls, tool executions, retrieval operations, embeddings, and prompt templates
- Evaluate output quality using pre-built or custom evaluators with LLM-as-a-judge or code-based evaluation logic
- Annotate traces with human feedback, scores, labels, and quality signals for continuous improvement
- Experiment systematically by comparing different versions of applications using datasets and evaluators
- Monitor performance metrics including latency, token usage, costs, and error rates across projects
- Iterate on prompts using the playground, span replay, and dataset-based testing
- Organize traces into projects and sessions for better management and analysis
- Integrate with 20+ AI frameworks and LLM providers via OpenTelemetry instrumentation
Skills
Tracing
- Capture traces via OpenTelemetry (OTLP) protocol with automatic instrumentation for major frameworks
- View execution flow showing every LLM call, tool execution, retrieval operation, embedding generation, and response generation
- Inspect LLM parameters including temperature, system prompts, function calls, and invocation parameters
- Analyze retrieval operations with document scores, order, and embedding text for RAG systems
- Track token usage with detailed breakdowns by token type (input/output) and model
- Monitor latency at trace, span, and component levels with quantile analysis
- Organize with projects to separate traces by environment, application, team, or use case
- Group with sessions to track multi-turn conversations and maintain context across interactions
- Add metadata to traces with custom attributes, tags, and structured data for filtering and analysis
- Annotate traces with scores, labels, human feedback, and LLM evaluations for quality measurement
- Export and import traces for backup, migration, or analysis in external tools
- Track costs with automatic calculation based on token usage and model pricing
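As a minimal sketch of how tracing is typically wired up with the `arize-phoenix-otel` package (the project name and endpoint below are placeholders for your own deployment):

```python
# pip install arize-phoenix-otel
from phoenix.otel import register

# Point the OTLP exporter at a Phoenix instance; the default local
# endpoint is shown here, and the project name is a placeholder.
tracer_provider = register(
    project_name="my-llm-app",
    endpoint="http://localhost:6006/v1/traces",
)
```

Once registered, instrumented spans from the application flow to the named project in Phoenix.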
Evaluation
- Run LLM-as-a-judge evaluations using any LLM provider (OpenAI, Anthropic, Gemini, custom endpoints) to assess output quality
- Build custom evaluators with Python or TypeScript using custom prompts, scoring logic, and evaluation criteria
- Use pre-built evaluators for common tasks including faithfulness, relevance, toxicity, summarization, agent evaluation, and RAG quality
- Write code-based evaluators for deterministic checks like exact match, regex patterns, or custom Python/TypeScript logic
- Execute evaluations at scale with automatic concurrency, rate limit handling, error management, and batching via executors
- Map complex inputs using input schemas and mappings to transform nested data structures for evaluators
- View evaluator traces with complete transparency into prompts, model reasoning, scores, and execution metadata
- Run batch evaluations on traces, datasets, or custom data sources with automatic retry and error handling
- Integrate evaluations into workflows by running evals on production traces or test datasets
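For instance, a minimal LLM-as-a-judge sketch using `llm_classify` from `arize-phoenix-evals` with the pre-built hallucination template (the data is toy data, and exact parameter names may vary by package version):

```python
# pip install arize-phoenix-evals openai
import pandas as pd
from phoenix.evals import (
    HALLUCINATION_PROMPT_RAILS_MAP,
    HALLUCINATION_PROMPT_TEMPLATE,
    OpenAIModel,
    llm_classify,
)

# Toy data; in practice this would come from traces or a dataset.
# Column names match the template's expected variables.
df = pd.DataFrame(
    {
        "input": ["What is Phoenix?"],
        "reference": ["Phoenix is an open-source AI observability platform."],
        "output": ["Phoenix is a closed-source database."],
    }
)

results = llm_classify(
    dataframe=df,
    template=HALLUCINATION_PROMPT_TEMPLATE,
    model=OpenAIModel(model="gpt-4o"),
    rails=list(HALLUCINATION_PROMPT_RAILS_MAP.values()),
    provide_explanation=True,  # the judge explains each label
)
print(results[["label", "explanation"]])
```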
Datasets & Experiments
- Create datasets from traces, code, CSV files, or manually curated examples with inputs and optional reference outputs
- Build golden datasets with reference outputs (ground truth) for objective evaluation using code-based evaluators
- Version datasets with automatic tracking of inserts, updates, and deletes for reproducibility
- Run experiments by executing task functions against datasets with evaluators to compare different versions
- Compare experiments side-by-side in the UI to see performance differences, score distributions, and individual example results
- Use repetitions to run experiments multiple times for statistical confidence and to account for LLM output variability
- Organize with splits to separate datasets into train/test/validation splits for proper evaluation workflows
- Export datasets in JSONL or CSV formats for fine-tuning, analysis, or sharing
- View experiment results in the Phoenix UI with task function traces, scores per example, and aggregate performance metrics
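A hedged sketch of the experiment loop using `phoenix.experiments` (the dataset name and `my_app` are hypothetical stand-ins; Phoenix binds task and evaluator arguments by parameter name, e.g. `input`, `output`, `expected`):

```python
# pip install arize-phoenix
import phoenix as px
from phoenix.experiments import run_experiment

client = px.Client()

# Hypothetical golden dataset with inputs and reference outputs.
dataset = client.upload_dataset(
    dataset_name="qa-golden-set",
    inputs=[{"question": "What does OTLP stand for?"}],
    outputs=[{"answer": "OpenTelemetry Protocol"}],
)

def task(input):
    # Wrap your application logic; my_app is a stand-in for your code.
    return my_app(input["question"])

def exact_match(output, expected):
    # Code-based evaluator: compare against the reference output.
    return output == expected["answer"]

experiment = run_experiment(dataset, task, evaluators=[exact_match])
```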
Prompt Engineering
- Manage prompts with versioning, storage, and deployment across different environments
- Test prompts interactively in the Prompt Playground with various models, parameters, and tools
- Replay LLM spans from production traces in the playground to debug failures and test improvements
- Test at scale by running prompts against datasets to evaluate performance systematically
- Compare prompt versions side-by-side to see which performs better on your data
- Optimize prompts automatically using built-in prompt optimization features
- Sync prompts via SDK to keep prompts in sync across applications and environments programmatically
- Tag prompts for deployment control across development, staging, and production environments
- Track prompt changes with version history, author information, and timestamps
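Pulling a tagged prompt via the client might look like the following sketch (the prompt identifier, tag, and template variable are placeholders):

```python
# pip install arize-phoenix-client
from phoenix.client import Client

client = Client()  # reads PHOENIX_* env vars for endpoint and API key

# Fetch the version of a prompt tagged for production.
prompt = client.prompts.get(prompt_identifier="support-agent", tag="production")

# Render the stored template, parameters, and tools into provider kwargs,
# e.g. for openai_client.chat.completions.create(**formatted).
formatted = prompt.format(variables={"customer_name": "Ada"})
```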
Projects & Organization
- Create projects to organize traces by environment (development, staging, production), application, or team
- Set up sessions to track multi-turn conversations with chatbot-like UI showing conversation history
- View metrics dashboards with pre-defined metrics including latency, errors, token usage, costs, and model performance
- Filter and search traces by metadata, attributes, annotations, or custom tags
- Configure data retention policies to control how long trace and evaluation data is stored
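Session grouping can also be applied in code; a sketch using the `using_session` context manager from the OpenInference instrumentation package (the session ID and `chat_agent` are placeholders):

```python
# pip install openinference-instrumentation
from openinference.instrumentation import using_session

with using_session(session_id="conversation-123"):
    # Spans created in this block are grouped into one Phoenix session,
    # so multi-turn exchanges appear as a single conversation thread.
    reply = chat_agent.respond(user_message)  # chat_agent is a stand-in
```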
API & Programmatic Access
- Use Python SDK (arize-phoenix-client, arize-phoenix-evals, arize-phoenix-otel) for programmatic access
- Use TypeScript SDK (@arizeai/phoenix-client, @arizeai/phoenix-evals, @arizeai/phoenix-otel) for JavaScript/TypeScript applications
- Access REST API for annotations, datasets, experiments, traces, spans, prompts, projects, and users
- Instrument manually using OpenTelemetry decorators, wrappers, or direct OpenInference SDKs
- Generate API keys for programmatic access with role-based permissions
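Manual instrumentation with the decorators exposed by a Phoenix-configured tracer might look like this sketch (building on the `register` call shown earlier; the function body is a placeholder):

```python
from phoenix.otel import register

tracer_provider = register(project_name="my-llm-app")
tracer = tracer_provider.get_tracer(__name__)

@tracer.chain  # records a CHAIN span capturing inputs and outputs
def summarize(text: str) -> str:
    # Application logic goes here; the decorator traces the call.
    return text[:100]
```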
Authentication & Security
- Configure RBAC with role-based access control for user permissions and project access
- Set up authentication including SSO and user management for self-hosted instances
- Manage API keys for secure programmatic access to Phoenix APIs and SDKs
- Control data privacy with self-hosting options for VPC deployment or local execution
Workflows
Workflow 1: Instrument and Trace an AI Application
- Choose integration - Select appropriate Phoenix integration for your framework (LangChain, LlamaIndex, OpenAI, etc.)
- Install package - Install Phoenix client and OpenTelemetry packages for your language (Python or TypeScript)
- Configure endpoint - Set Phoenix endpoint URL and optionally configure project name and session tracking
- Instrument application - Add auto-instrumentation or manual instrumentation to capture LLM calls, tool executions, and retrievals
- View traces - Open Phoenix UI to see execution flow, latency, token usage, and detailed span information
- Add annotations - Add scores, labels, or human feedback to traces for quality measurement
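As a concrete sketch of the steps above for the OpenAI integration (other frameworks follow the same pattern with their own OpenInference instrumentor):

```python
# pip install arize-phoenix-otel openinference-instrumentation-openai openai
from openinference.instrumentation.openai import OpenAIInstrumentor
from phoenix.otel import register

# Configure the endpoint and project, then attach the instrumentor.
tracer_provider = register(project_name="my-llm-app")
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)

# From here on, OpenAI SDK calls are traced automatically and
# appear in the Phoenix UI with latency and token usage.
```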
Workflow 2: Evaluate AI Output Quality
- Choose evaluator type - Select LLM-as-a-judge for subjective quality or code-based for objective checks
- Configure LLM provider - Set up evaluator LLM (OpenAI, Anthropic, Gemini, or custom endpoint)
- Define evaluation logic - Use pre-built evaluator or create custom evaluator with prompts/scoring logic
- Run evaluation - Execute evaluator on traces, datasets, or custom data with automatic batching and concurrency
- Review results - View evaluator traces, scores, explanations, and labels in Phoenix UI
- Iterate - Adjust evaluator prompts or logic based on results and human feedback
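A sketch of this run-and-review loop on production traces (assumes a running Phoenix instance and that the span dataframe carries the columns the evaluator expects; APIs may vary by version):

```python
import phoenix as px
from phoenix.evals import HallucinationEvaluator, OpenAIModel, run_evals
from phoenix.trace import SpanEvaluations
from phoenix.trace.dsl import SpanQuery

client = px.Client()

# Pull LLM spans from Phoenix for evaluation.
spans_df = client.query_spans(SpanQuery().where("span_kind == 'LLM'"))

# Judge each span; the evaluator expects input/reference/output columns.
[hallucination_df] = run_evals(
    dataframe=spans_df,
    evaluators=[HallucinationEvaluator(OpenAIModel(model="gpt-4o"))],
    provide_explanation=True,
)

# Log the scores back to Phoenix so they appear on the traces in the UI.
client.log_evaluations(
    SpanEvaluations(eval_name="Hallucination", dataframe=hallucination_df)
)
```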
Workflow 3: Run Experiments to Compare Versions
- Create dataset - Build dataset with inputs and optional reference outputs from traces, code, or CSV
- Define task function - Create Python function that wraps your AI application logic and returns outputs
- Select evaluators - Choose code-based evaluators for ground truth comparison or LLM-as-a-judge for subjective quality
- Run experiment - Execute task function against dataset with evaluators to generate scores
- Compare results - View experiment results in UI with aggregate metrics, score distributions, and per-example analysis
- Iterate - Make changes to prompts, models, or architecture and run new experiment to compare performance
Workflow 4: Optimize Prompts with Playground
- Identify prompt - Find prompt in traces or load existing prompt from prompt management
- Open playground - Load prompt into Prompt Playground with current parameters and tools
- Test variations - Modify prompt text, model parameters, tools, or response format and test with real inputs
- View traces - All playground runs are automatically recorded as traces for analysis
- Test at scale - Run prompt variations against dataset examples to evaluate performance systematically
- Save and deploy - Save best-performing prompt version, tag for environment, and deploy via SDK
Workflow 5: Debug Production Issues
- Identify problematic trace - Search or filter traces to find failed or low-quality executions
- Inspect execution flow - View detailed span information including LLM calls, tool executions, and retrievals
- Replay span - Load problematic LLM span into Prompt Playground to test fixes
- Test improvements - Modify prompts, parameters, or tools in playground and compare outputs
- Add to dataset - Add problematic examples to dataset for future testing
- Run experiment - Test improved version against dataset to verify fix before deployment
Integrations
LLM Providers
OpenAI, Anthropic, Amazon Bedrock, Google (Gemini), Groq, MistralAI, VertexAI, LiteLLM, OpenRouter, Together, Vercel AI
Python Frameworks
Agno, AutoGen, BeeAI, CrewAI, DSPy, Google ADK, Graphite, Guardrails AI, Haystack, Hugging Face smolagents, Instructor, LlamaIndex, LangChain, LangGraph, MCP, NVIDIA, Portkey, Pydantic AI
TypeScript Frameworks
BeeAI, LangChain.js, Mastra, MCP, Vercel AI SDK
Java Frameworks
LangChain4j, Spring AI, Arconia
Platforms
Dify, Flowise, LangFlow, Prompt Flow
Vector Databases
MongoDB, OpenSearch, Pinecone, Qdrant, Weaviate, Zilliz/Milvus, Couchbase
Evaluation Integrations
Cleanlab, Ragas, UQLM
Observability Protocols
OpenTelemetry (OTLP), OpenInference
Developer Tools
Claude Code, Cursor, Phoenix MCP Server
Cloud Platforms
AWS (CloudFormation), Kubernetes (Helm), Docker, Railway
Context
OpenTelemetry: Phoenix tracing is built on OpenTelemetry (OTLP), an industry-standard observability protocol. This means instrumentation code written for Phoenix can be reused with other observability platforms, avoiding vendor lock-in.
OpenInference: Phoenix uses OpenInference instrumentation, an extension of OpenTelemetry specifically designed for AI/LLM applications. OpenInference adds semantic conventions for LLM spans, retrieval operations, and embeddings.
Traces and Spans: A trace represents the complete execution path of a request through an AI application. Spans are individual units of work within a trace (e.g., a single LLM call, tool execution, or retrieval operation). Spans can be nested to show hierarchical execution flow.
Projects: Projects provide organizational structure for traces, allowing separation by environment, application, or team. Each project has its own metrics dashboard and data isolation.
Sessions: Sessions group related traces into conversational threads, enabling tracking of multi-turn conversations with context maintained across interactions.
Evaluators: Evaluators measure the quality of AI outputs. LLM-based evaluators use LLMs as judges to assess subjective quality. Code-based evaluators use deterministic logic for objective checks. All evaluators return scores with optional labels, explanations, and metadata.
Datasets: Datasets are collections of examples with inputs and optional reference outputs. Golden datasets contain reference outputs (ground truth) for objective evaluation. Datasets are versioned automatically.
Experiments: Experiments run task functions (wrapped AI application logic) against datasets with evaluators to systematically compare different versions. Experiments track scores per example and aggregate metrics.
Prompts: In Phoenix, a prompt includes the prompt template, invocation parameters (temperature, etc.), tools, and response format. Prompts are versioned and can be tagged for deployment across environments.
Executors: Executors handle evaluation execution with automatic concurrency, rate limit management, error handling, and batching. They can achieve up to 20x speedup compared to direct API calls.
Self-Hosting: Phoenix can be self-hosted on Docker, Kubernetes, AWS, Railway, or locally. Self-hosted instances support authentication, email configuration, and data retention policies.
Phoenix Cloud: Managed Phoenix hosting service with automatic updates, scaling, and maintenance handled by Arize team.
For additional documentation: https://arize.com/docs/phoenix/llms.txt