arize-phoenix
Open-source AI observability platform for tracing, evaluating, and improving LLM applications with OpenTelemetry integration
Install
mkdir -p .claude/skills/arize-phoenix && curl -L -o skill.zip "https://mcp.directory/api/skills/download/1057" && unzip -o skill.zip -d .claude/skills/arize-phoenix && rm skill.zip
Installs to .claude/skills/arize-phoenix
About this skill
Arize Phoenix
Phoenix is an open-source AI observability platform built on OpenTelemetry that helps developers understand, debug, and improve AI applications. It provides comprehensive tracing, evaluation, prompt engineering, and experimentation capabilities for LLM-based systems. Phoenix captures detailed execution information from AI applications, measures output quality with evaluators, enables systematic prompt iteration, and supports data-driven experimentation to optimize AI performance.
When to Use This Skill
- Debugging AI application failures by inspecting LLM calls, tool executions, and retrieval operations
- Measuring and improving AI output quality using LLM-based or code-based evaluators
- Iterating on prompts using real production examples and testing variations systematically
- Comparing different versions of AI applications (prompts, models, architectures) using experiments
- Monitoring LLM costs, token usage, latency, and error rates in production
- Building datasets from production traces for evaluation and fine-tuning
- Tracking multi-turn conversations and maintaining context across interactions
- Optimizing RAG systems by analyzing retrieval quality and document relevance
- Evaluating agent performance including tool call accuracy and actionability
- Managing prompt versions and deploying them across different environments
Capabilities
Agents can leverage Phoenix to:
- Trace AI application execution with detailed visibility into LLM calls, tool executions, retrieval operations, embeddings, and prompt templates
- Evaluate output quality using pre-built or custom evaluators with LLM-as-a-judge or code-based evaluation logic
- Annotate traces with human feedback, scores, labels, and quality signals for continuous improvement
- Experiment systematically by comparing different versions of applications using datasets and evaluators
- Monitor performance metrics including latency, token usage, costs, and error rates across projects
- Iterate on prompts using the playground, span replay, and dataset-based testing
- Organize traces into projects and sessions for better management and analysis
- Integrate with 20+ AI frameworks and LLM providers via OpenTelemetry instrumentation
Skills
Tracing
- Capture traces via OpenTelemetry (OTLP) protocol with automatic instrumentation for major frameworks
- View execution flow showing every LLM call, tool execution, retrieval operation, embedding generation, and response generation
- Inspect LLM parameters including temperature, system prompts, function calls, and invocation parameters
- Analyze retrieval operations with document scores, order, and embedding text for RAG systems
- Track token usage with detailed breakdowns by token type (input/output) and model
- Monitor latency at trace, span, and component levels with quantile analysis
- Organize with projects to separate traces by environment, application, team, or use case
- Group with sessions to track multi-turn conversations and maintain context across interactions
- Add metadata to traces with custom attributes, tags, and structured data for filtering and analysis
- Annotate traces with scores, labels, human feedback, and LLM evaluations for quality measurement
- Export and import traces for backup, migration, or analysis in external tools
- Track costs with automatic calculation based on token usage and model pricing
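The tracing data described above can be sketched with a stdlib-only stand-in. This is not the Phoenix SDK — it only illustrates the kind of information a traced LLM span carries. The attribute keys follow OpenInference semantic conventions; the span structure itself is a simplification.

```python
import time
from contextlib import contextmanager

# Illustrative sketch (not the Phoenix SDK): the data an OpenTelemetry
# span records when an LLM call is traced. Attribute names follow
# OpenInference conventions; everything else here is a stand-in.
SPANS = []

@contextmanager
def llm_span(name, model):
    span = {
        "name": name,
        "attributes": {
            "llm.model_name": model,
            "openinference.span.kind": "LLM",
        },
        "start": time.time(),
    }
    try:
        yield span
    finally:
        # Latency is derived per span, enabling quantile analysis later.
        span["latency_ms"] = (time.time() - span["start"]) * 1000
        SPANS.append(span)

with llm_span("chat_completion", model="gpt-4o-mini") as span:
    # ...the real LLM call would happen here; record token usage on the span...
    span["attributes"]["llm.token_count.prompt"] = 42
    span["attributes"]["llm.token_count.completion"] = 7

print(SPANS[0]["attributes"]["llm.token_count.prompt"])  # 42
```

Token counts recorded this way are what drive the per-model usage breakdowns and automatic cost calculation mentioned above.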
Evaluation
- Run LLM-as-a-judge evaluations using any LLM provider (OpenAI, Anthropic, Gemini, custom endpoints) to assess output quality
- Build custom evaluators with Python or TypeScript using custom prompts, scoring logic, and evaluation criteria
- Use pre-built evaluators for common tasks including faithfulness, relevance, toxicity, summarization, agent evaluation, and RAG quality
- Write code-based evaluators for deterministic checks like exact match, regex patterns, or custom Python/TypeScript logic
- Execute evaluations at scale with automatic concurrency, rate limit handling, error management, and batching via executors
- Map complex inputs using input schemas and mappings to transform nested data structures for evaluators
- View evaluator traces with complete transparency into prompts, model reasoning, scores, and execution metadata
- Run batch evaluations on traces, datasets, or custom data sources with automatic retry and error handling
- Integrate evaluations into workflows by running evals on production traces or test datasets
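A code-based evaluator from the list above can be as simple as a plain function that returns a score per example. The function names below are illustrative, not Phoenix APIs; this is the general shape of deterministic checks like exact match and regex patterns.

```python
import re

# Sketch of code-based evaluators: plain functions returning a numeric
# score per example. Names are illustrative, not Phoenix APIs.

def exact_match(output: str, expected: str) -> float:
    """1.0 if the output matches the reference exactly (ignoring
    surrounding whitespace), else 0.0."""
    return float(output.strip() == expected.strip())

def contains_citation(output: str) -> float:
    """Deterministic regex check: does the answer cite a source like [1]?"""
    return float(bool(re.search(r"\[\d+\]", output)))

print(exact_match("Paris", "Paris "))             # 1.0
print(contains_citation("See [2] for details."))  # 1.0
print(contains_citation("No sources given."))     # 0.0
```

Evaluators like these are the "ground truth" half of the story; LLM-as-a-judge evaluators cover subjective qualities the same scoring interface can't capture deterministically.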
Datasets & Experiments
- Create datasets from traces, code, CSV files, or manually curated examples with inputs and optional reference outputs
- Build golden datasets with reference outputs (ground truth) for objective evaluation using code-based evaluators
- Version datasets with automatic tracking of inserts, updates, and deletes for reproducibility
- Run experiments by executing task functions against datasets with evaluators to compare different versions
- Compare experiments side-by-side in the UI to see performance differences, score distributions, and individual example results
- Use repetitions to run experiments multiple times for statistical confidence and to account for LLM variability
- Organize with splits to separate datasets into train/test/validation splits for proper evaluation workflows
- Export datasets in JSONL or CSV formats for fine-tuning, analysis, or sharing
- View experiment results in the Phoenix UI with task function traces, scores per example, and aggregate performance metrics
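The experiment loop above can be sketched in plain Python: a task function applied to each dataset example, then evaluators scoring each output. Phoenix's `run_experiment` wraps this pattern with tracing, retries, and UI reporting; the dataset contents and function names below are illustrative.

```python
# Sketch of what an experiment runs: task(example) per dataset row,
# then evaluators scoring each output against the reference. Phoenix's
# run_experiment adds tracing, concurrency, and reporting on top.
dataset = [
    {"input": {"question": "2 + 2?"}, "expected": {"answer": "4"}},
    {"input": {"question": "Capital of France?"}, "expected": {"answer": "Paris"}},
]

def task(example):
    # Stand-in for your AI application (an LLM call in practice).
    answers = {"2 + 2?": "4", "Capital of France?": "Paris"}
    return answers[example["input"]["question"]]

def exact_match(output, expected):
    # Code-based evaluator comparing against the golden reference.
    return float(output == expected["answer"])

scores = [exact_match(task(ex), ex["expected"]) for ex in dataset]
print(sum(scores) / len(scores))  # 1.0 -- the aggregate score per experiment
```

Comparing two experiments then amounts to running two task functions (two prompts, models, or architectures) over the same versioned dataset and comparing these aggregate and per-example scores.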
Prompt Engineering
- Manage prompts with versioning, storage, and deployment across different environments
- Test prompts interactively in the Prompt Playground with various models, parameters, and tools
- Replay LLM spans from production traces in the playground to debug failures and test improvements
- Test at scale by running prompts against datasets to evaluate performance systematically
- Compare prompt versions side-by-side to see which performs better on your data
- Optimize automatically using automated prompt optimization features
- Sync prompts via SDK to keep prompts in sync across applications and environments programmatically
- Tag prompts for deployment control across development, staging, and production environments
- Track prompt changes with version history, author information, and timestamps
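The versioning-and-tagging model above can be sketched with a tiny in-memory registry. This is not the Phoenix SDK — it only illustrates the pattern Phoenix implements: immutable prompt versions plus movable tags (like "production") that select which version each environment loads.

```python
# Illustrative sketch (not the Phoenix SDK) of prompt versioning with
# environment tags: versions are immutable, tags are movable pointers.
registry = {"summarize": []}   # prompt name -> ordered list of versions
tags = {}                      # (name, tag) -> version index

def push_version(name, template):
    registry[name].append(template)
    return len(registry[name]) - 1   # new version id

def tag_version(name, tag, version):
    tags[(name, tag)] = version      # move the tag, versions stay intact

def get(name, tag="production"):
    return registry[name][tags[(name, tag)]]

v0 = push_version("summarize", "Summarize: {text}")
v1 = push_version("summarize", "Summarize in one sentence: {text}")
tag_version("summarize", "production", v0)
tag_version("summarize", "staging", v1)

print(get("summarize", "staging"))  # Summarize in one sentence: {text}
```

Promoting the staging prompt to production is then just retagging — no application redeploy — which is the deployment-control workflow the bullets above describe.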
Projects & Organization
- Create projects to organize traces by environment (development, staging, production), application, or team
- Set up sessions to track multi-turn conversations with chatbot-like UI showing conversation history
- View metrics dashboards with pre-defined metrics including latency, errors, token usage, costs, and model performance
- Filter and search traces by metadata, attributes, annotations, or custom tags
- Configure data retention policies to control how long trace and evaluation data is stored
API & Programmatic Access
- Use Python SDK (arize-phoenix-client, arize-phoenix-evals, arize-phoenix-otel) for programmatic access
- Use TypeScript SDK (@arizeai/phoenix-client, @arizeai/phoenix-evals, @arizeai/phoenix-otel) for JavaScript/TypeScript applications
- Access REST API for annotations, datasets, experiments, traces, spans, prompts, projects, and users
- Instrument manually using OpenTelemetry decorators, wrappers, or direct OpenInference SDKs
- Generate API keys for programmatic access with role-based permissions
Authentication & Security
- Configure RBAC with role-based access control for user permissions and project access
- Set up authentication including SSO and user management for self-hosted instances
- Manage API keys for secure programmatic access to Phoenix APIs and SDKs
- Control data privacy with self-hosting options for VPC deployment or local execution
Workflows
Workflow 1: Instrument and Trace an AI Application
- Choose integration - Select appropriate Phoenix integration for your framework (LangChain, LlamaIndex, OpenAI, etc.)
- Install package - Install Phoenix client and OpenTelemetry packages for your language (Python or TypeScript)
- Configure endpoint - Set Phoenix endpoint URL and optionally configure project name and session tracking
- Instrument application - Add auto-instrumentation or manual instrumentation to capture LLM calls, tool executions, and retrievals
- View traces - Open Phoenix UI to see execution flow, latency, token usage, and detailed span information
- Add annotations - Add scores, labels, or human feedback to traces for quality measurement
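For a local setup, the first steps of this workflow might look like the following shell sketch. Package names come from the sections above; the `phoenix serve` command, port, and environment variable are assumptions for a default local deployment — adjust for self-hosted or cloud instances.

```shell
# Assumed local setup; adjust packages to your framework integration.
pip install arize-phoenix arize-phoenix-otel openinference-instrumentation-openai

# Start a local Phoenix server (UI and OTLP collector, assumed port 6006).
phoenix serve

# Point instrumented applications at it (assumed default endpoint).
export PHOENIX_COLLECTOR_ENDPOINT="http://localhost:6006"
```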
Workflow 2: Evaluate AI Output Quality
- Choose evaluator type - Select LLM-as-a-judge for subjective quality or code-based for objective checks
- Configure LLM provider - Set up evaluator LLM (OpenAI, Anthropic, Gemini, or custom endpoint)
- Define evaluation logic - Use pre-built evaluator or create custom evaluator with prompts/scoring logic
- Run evaluation - Execute evaluator on traces, datasets, or custom data with automatic batching and concurrency
- Review results - View evaluator traces, scores, explanations, and labels in Phoenix UI
- Iterate - Adjust evaluator prompts or logic based on results and human feedback
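The LLM-as-a-judge steps above follow a common pattern: a grading prompt plus "rails" that snap the judge's free-text verdict onto a fixed label set. Phoenix's evals libraries implement this with batching and retries; the names below are illustrative, and `call_llm` is a hypothetical stub for the evaluator model.

```python
# Sketch of the LLM-as-a-judge pattern: grading template + rails.
# `call_llm` is a hypothetical stub; names are illustrative.
JUDGE_TEMPLATE = """You are grading an answer for relevance.
Question: {question}
Answer: {answer}
Reply with exactly one word: "relevant" or "unrelated"."""

RAILS = ["relevant", "unrelated"]

def snap_to_rails(raw: str, rails=RAILS, default="NOT_PARSABLE"):
    """Normalize the judge's free-text reply onto the allowed labels."""
    verdict = raw.strip().lower().strip('."')
    return verdict if verdict in rails else default

def judge(question, answer, call_llm):
    raw = call_llm(JUDGE_TEMPLATE.format(question=question, answer=answer))
    return snap_to_rails(raw)

# Stubbed judge model for illustration:
print(judge("Capital of France?", "Paris.", lambda p: "Relevant."))  # relevant
```

Reviewing the raw judge outputs that fail to parse (the `NOT_PARSABLE` cases) is one concrete way to do the "iterate on evaluator prompts" step above.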
Workflow 3: Run Experiments to Compare Versions
- Create dataset - Build dataset with inputs and optional reference outputs from traces, code, or CSV
- Define task function - Create Python function that wraps your AI application logic and returns outputs
- Select evaluators - Choose code-based evaluators for ground truth comparison or LLM-as-a-judge for subjective quality
- Run experiment - Execute task function against dataset with evaluators to generate scores
- Compare results - View experiment results in UI with aggregate metrics and scores per example