arize-phoenix
Open-source AI observability platform for tracing, evaluating, and improving LLM applications with OpenTelemetry integration
Install
```bash
mkdir -p .claude/skills/arize-phoenix && curl -L -o skill.zip "https://mcp.directory/api/skills/download/1057" && unzip -o skill.zip -d .claude/skills/arize-phoenix && rm skill.zip
```
Installs to .claude/skills/arize-phoenix
About this skill
Arize Phoenix
Phoenix is an open-source AI observability platform built on OpenTelemetry that helps developers understand, debug, and improve AI applications. It provides comprehensive tracing, evaluation, prompt engineering, and experimentation capabilities for LLM-based systems. Phoenix captures detailed execution information from AI applications, measures output quality with evaluators, enables systematic prompt iteration, and supports data-driven experimentation to optimize AI performance.
When to Use This Skill
- Debugging AI application failures by inspecting LLM calls, tool executions, and retrieval operations
- Measuring and improving AI output quality using LLM-based or code-based evaluators
- Iterating on prompts using real production examples and testing variations systematically
- Comparing different versions of AI applications (prompts, models, architectures) using experiments
- Monitoring LLM costs, token usage, latency, and error rates in production
- Building datasets from production traces for evaluation and fine-tuning
- Tracking multi-turn conversations and maintaining context across interactions
- Optimizing RAG systems by analyzing retrieval quality and document relevance
- Evaluating agent performance including tool call accuracy and actionability
- Managing prompt versions and deploying them across different environments
Capabilities
Agents can leverage Phoenix to:
- Trace AI application execution with detailed visibility into LLM calls, tool executions, retrieval operations, embeddings, and prompt templates
- Evaluate output quality using pre-built or custom evaluators with LLM-as-a-judge or code-based evaluation logic
- Annotate traces with human feedback, scores, labels, and quality signals for continuous improvement
- Experiment systematically by comparing different versions of applications using datasets and evaluators
- Monitor performance metrics including latency, token usage, costs, and error rates across projects
- Iterate on prompts using the playground, span replay, and dataset-based testing
- Organize traces into projects and sessions for better management and analysis
- Integrate with 20+ AI frameworks and LLM providers via OpenTelemetry instrumentation
Skills
Tracing
- Capture traces via OpenTelemetry (OTLP) protocol with automatic instrumentation for major frameworks
- View execution flow showing every LLM call, tool execution, retrieval operation, embedding generation, and response generation
- Inspect LLM parameters including temperature, system prompts, function calls, and invocation parameters
- Analyze retrieval operations with document scores, order, and embedding text for RAG systems
- Track token usage with detailed breakdowns by token type (input/output) and model
- Monitor latency at trace, span, and component levels with quantile analysis
- Organize with projects to separate traces by environment, application, team, or use case
- Group with sessions to track multi-turn conversations and maintain context across interactions
- Add metadata to traces with custom attributes, tags, and structured data for filtering and analysis
- Annotate traces with scores, labels, human feedback, and LLM evaluations for quality measurement
- Export and import traces for backup, migration, or analysis in external tools
- Track costs with automatic calculation based on token usage and model pricing
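As a minimal sketch of how tracing is typically wired up with the `arize-phoenix-otel` package (the project name and endpoint below are placeholders for your own deployment):

```python
# pip install arize-phoenix-otel
from phoenix.otel import register

# Point the OTLP exporter at a Phoenix instance; the default local
# endpoint is shown here, and the project name is a placeholder.
tracer_provider = register(
    project_name="my-llm-app",
    endpoint="http://localhost:6006/v1/traces",
)
```

Once registered, instrumented spans from the application flow to the named project in Phoenix.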
Evaluation
- Run LLM-as-a-judge evaluations using any LLM provider (OpenAI, Anthropic, Gemini, custom endpoints) to assess output quality
- Build custom evaluators with Python or TypeScript using custom prompts, scoring logic, and evaluation criteria
- Use pre-built evaluators for common tasks including faithfulness, relevance, toxicity, summarization, agent evaluation, and RAG quality
- Write code-based evaluators for deterministic checks like exact match, regex patterns, or custom Python/TypeScript logic
- Execute evaluations at scale with automatic concurrency, rate limit handling, error management, and batching via executors
- Map complex inputs using input schemas and mappings to transform nested data structures for evaluators
- View evaluator traces with complete transparency into prompts, model reasoning, scores, and execution metadata
- Run batch evaluations on traces, datasets, or custom data sources with automatic retry and error handling
- Integrate evaluations into workflows by running evals on production traces or test datasets
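For instance, a minimal LLM-as-a-judge sketch using `llm_classify` from `arize-phoenix-evals` with the pre-built hallucination template (the data is toy data, and exact parameter names may vary by package version):

```python
# pip install arize-phoenix-evals openai
import pandas as pd
from phoenix.evals import (
    HALLUCINATION_PROMPT_RAILS_MAP,
    HALLUCINATION_PROMPT_TEMPLATE,
    OpenAIModel,
    llm_classify,
)

# Toy data; in practice this would come from traces or a dataset.
# Column names match the template's expected variables.
df = pd.DataFrame(
    {
        "input": ["What is Phoenix?"],
        "reference": ["Phoenix is an open-source AI observability platform."],
        "output": ["Phoenix is a closed-source database."],
    }
)

results = llm_classify(
    dataframe=df,
    template=HALLUCINATION_PROMPT_TEMPLATE,
    model=OpenAIModel(model="gpt-4o"),
    rails=list(HALLUCINATION_PROMPT_RAILS_MAP.values()),
    provide_explanation=True,  # the judge explains each label
)
print(results[["label", "explanation"]])
```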
Datasets & Experiments
- Create datasets from traces, code, CSV files, or manually curated examples with inputs and optional reference outputs
- Build golden datasets with reference outputs (ground truth) for objective evaluation using code-based evaluators
- Version datasets with automatic tracking of inserts, updates, and deletes for reproducibility
- Run experiments by executing task functions against datasets with evaluators to compare different versions
- Compare experiments side-by-side in the UI to see performance differences, score distributions, and individual example results
- Use repetitions to run experiments multiple times for statistical confidence and to account for LLM output variability
- Organize with splits to separate datasets into train/test/validation splits for proper evaluation workflows
- Export datasets in JSONL or CSV formats for fine-tuning, analysis, or sharing
- View experiment results in the Phoenix UI with task function traces, scores per example, and aggregate performance metrics
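A hedged sketch of the experiment loop using `phoenix.experiments` (the dataset name and `my_app` are hypothetical stand-ins; Phoenix binds task and evaluator arguments by parameter name, e.g. `input`, `output`, `expected`):

```python
# pip install arize-phoenix
import phoenix as px
from phoenix.experiments import run_experiment

client = px.Client()

# Hypothetical golden dataset with inputs and reference outputs.
dataset = client.upload_dataset(
    dataset_name="qa-golden-set",
    inputs=[{"question": "What does OTLP stand for?"}],
    outputs=[{"answer": "OpenTelemetry Protocol"}],
)

def task(input):
    # Wrap your application logic; my_app is a stand-in for your code.
    return my_app(input["question"])

def exact_match(output, expected):
    # Code-based evaluator: compare against the reference output.
    return output == expected["answer"]

experiment = run_experiment(dataset, task, evaluators=[exact_match])
```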
Prompt Engineering
- Manage prompts with versioning, storage, and deployment across different environments
- Test prompts interactively in the Prompt Playground with various models, parameters, and tools
- Replay LLM spans from production traces in the playground to debug failures and test improvements
- Test at scale by running prompts against datasets to evaluate performance systematically
- Compare prompt versions side-by-side to see which performs better on your data
- Optimize prompts automatically using built-in prompt optimization features
- Sync prompts via SDK to keep prompts in sync across applications and environments programmatically
- Tag prompts for deployment control across development, staging, and production environments
- Track prompt changes with version history, author information, and timestamps
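Pulling a tagged prompt via the client might look like the following sketch (the prompt identifier, tag, and template variable are placeholders):

```python
# pip install arize-phoenix-client
from phoenix.client import Client

client = Client()  # reads PHOENIX_* env vars for endpoint and API key

# Fetch the version of a prompt tagged for production.
prompt = client.prompts.get(prompt_identifier="support-agent", tag="production")

# Render the stored template, parameters, and tools into provider kwargs,
# e.g. for openai_client.chat.completions.create(**formatted).
formatted = prompt.format(variables={"customer_name": "Ada"})
```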
Projects & Organization
- Create projects to organize traces by environment (development, staging, production), application, or team
- Set up sessions to track multi-turn conversations with chatbot-like UI showing conversation history
- View metrics dashboards with pre-defined metrics including latency, errors, token usage, costs, and model performance
- Filter and search traces by metadata, attributes, annotations, or custom tags
- Configure data retention policies to control how long trace and evaluation data is stored
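Session grouping can also be applied in code; a sketch using the `using_session` context manager from the OpenInference instrumentation package (the session ID and `chat_agent` are placeholders):

```python
# pip install openinference-instrumentation
from openinference.instrumentation import using_session

with using_session(session_id="conversation-123"):
    # Spans created in this block are grouped into one Phoenix session,
    # so multi-turn exchanges appear as a single conversation thread.
    reply = chat_agent.respond(user_message)  # chat_agent is a stand-in
```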
API & Programmatic Access
- Use Python SDK (arize-phoenix-client, arize-phoenix-evals, arize-phoenix-otel) for programmatic access
- Use TypeScript SDK (@arizeai/phoenix-client, @arizeai/phoenix-evals, @arizeai/phoenix-otel) for JavaScript/TypeScript applications
- Access REST API for annotations, datasets, experiments, traces, spans, prompts, projects, and users
- Instrument manually using OpenTelemetry decorators, wrappers, or direct OpenInference SDKs
- Generate API keys for programmatic access with role-based permissions
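Manual instrumentation with the decorators exposed by a Phoenix-configured tracer might look like this sketch (building on the `register` call shown earlier; the function body is a placeholder):

```python
from phoenix.otel import register

tracer_provider = register(project_name="my-llm-app")
tracer = tracer_provider.get_tracer(__name__)

@tracer.chain  # records a CHAIN span capturing inputs and outputs
def summarize(text: str) -> str:
    # Application logic goes here; the decorator traces the call.
    return text[:100]
```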
Authentication & Security
- Configure RBAC with role-based access control for user permissions and project access
- Set up authentication including SSO and user management for self-hosted instances
- Manage API keys for secure programmatic access to Phoenix APIs and SDKs
- Control data privacy with self-hosting options for VPC deployment or local execution
Workflows
Workflow 1: Instrument and Trace an AI Application
- Choose integration - Select appropriate Phoenix integration for your framework (LangChain, LlamaIndex, OpenAI, etc.)
- Install package - Install Phoenix client and OpenTelemetry packages for your language (Python or TypeScript)
- Configure endpoint - Set Phoenix endpoint URL and optionally configure project name and session tracking
- Instrument application - Add auto-instrumentation or manual instrumentation to capture LLM calls, tool executions, and retrievals
- View traces - Open Phoenix UI to see execution flow, latency, token usage, and detailed span information
- Add annotations - Add scores, labels, or human feedback to traces for quality measurement
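As a concrete sketch of the steps above for the OpenAI integration (other frameworks follow the same pattern with their own OpenInference instrumentor):

```python
# pip install arize-phoenix-otel openinference-instrumentation-openai openai
from openinference.instrumentation.openai import OpenAIInstrumentor
from phoenix.otel import register

# Configure the endpoint and project, then attach the instrumentor.
tracer_provider = register(project_name="my-llm-app")
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)

# From here on, OpenAI SDK calls are traced automatically and
# appear in the Phoenix UI with latency and token usage.
```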
Workflow 2: Evaluate AI Output Quality
- Choose evaluator type - Select LLM-as-a-judge for subjective quality or code-based for objective checks
- Configure LLM provider - Set up evaluator LLM (OpenAI, Anthropic, Gemini, or custom endpoint)
- Define evaluation logic - Use pre-built evaluator or create custom evaluator with prompts/scoring logic
- Run evaluation - Execute evaluator on traces, datasets, or custom data with automatic batching and concurrency
- Review results - View evaluator traces, scores, explanations, and labels in Phoenix UI
- Iterate - Adjust evaluator prompts or logic based on results and human feedback
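A sketch of this run-and-review loop on production traces (assumes a running Phoenix instance and that the span dataframe carries the columns the evaluator expects; APIs may vary by version):

```python
import phoenix as px
from phoenix.evals import HallucinationEvaluator, OpenAIModel, run_evals
from phoenix.trace import SpanEvaluations
from phoenix.trace.dsl import SpanQuery

client = px.Client()

# Pull LLM spans from Phoenix for evaluation.
spans_df = client.query_spans(SpanQuery().where("span_kind == 'LLM'"))

# Judge each span; the evaluator expects input/reference/output columns.
[hallucination_df] = run_evals(
    dataframe=spans_df,
    evaluators=[HallucinationEvaluator(OpenAIModel(model="gpt-4o"))],
    provide_explanation=True,
)

# Log the scores back to Phoenix so they appear on the traces in the UI.
client.log_evaluations(
    SpanEvaluations(eval_name="Hallucination", dataframe=hallucination_df)
)
```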
Workflow 3: Run Experiments to Compare Versions
- Create dataset - Build dataset with inputs and optional reference outputs from traces, code, or CSV
- Define task function - Create Python function that wraps your AI application logic and returns outputs
- Select evaluators - Choose code-based evaluators for ground truth comparison or LLM-as-a-judge for subjective quality
- Run experiment - Execute task function against dataset with evaluators to generate scores
- Compare results - View experiment results in UI with aggregate metrics, score distributions, and per-example analysis
- Iterate - Make changes to prompts, models, or architecture and run new experiment to compare performance
Workflow 4: Optimize Prompts with Playground
- Identify prompt - Find prompt in traces or load existing prompt from prompt management
- Open playground - Load prompt into Prompt Playground with current parameters and tools
- Test variations - Modify prompt text, model parameters, tools, or response format and test with real inputs
- View traces - All playground runs are automatically recorded as traces for analysis
- Test at scale - Run prompt variations against dataset examples to evaluate performance systematically
- Save and deploy - Save best-performing prompt version, tag for environment, and deploy via SDK
Workflow 5: Debug Production Issues
- Identify problematic trace - Search or filter traces to find failed or low-quality executions
- Inspect execution flow - View detailed span information including LLM calls, tool executions, and retrievals
- Replay span - Load problematic LLM span into Prompt Playground to test fixes
- Test improvements - Modify prompts, parameters, or tools in playground and compare outputs
- Add to dataset - Add problematic examples to dataset for future testing
- Run experiment - Test improved version against dataset to verify fix before deployment
Integrations
LLM Providers
OpenAI, Anthropic, Amazon Bedrock, Google (Gemini), Groq, MistralAI, VertexAI, LiteLLM, OpenRouter, Together, Vercel AI
Python Frameworks
Agno, AutoGen, BeeAI, CrewAI, DSPy, Google ADK, Graphite, Guardrails AI, Haystack, Hugging Face smolagents, Instructor, LlamaIndex, LangChain, LangGraph, MCP, NVIDIA, Portkey, Pydantic AI
TypeScript Frameworks
BeeAI, LangChain.js, Mastra, MCP, Vercel AI SDK
Java Frameworks
LangChain4j, Spring AI, Arconia
Platforms
Dify, Flowise, LangFlow, Prompt Flow
Vector Databases
MongoDB, OpenSearch, Pinecone, Qdrant, Weaviate, Zilliz/Milvus, Couchbase
Evaluation Integrations
Cleanlab, Ragas, UQLM
Observability Protocols
OpenTelemetry (OTLP), OpenInference
Developer Tools
Claude Code, Cursor, Phoenix MCP Server
Cloud Platforms
AWS (CloudFormation), Kubernetes (Helm), Docker, Railway
Context
OpenTelemetry: Phoenix tracing is built on OpenTelemetry (OTLP), an industry-standard observability protocol. This means instrumentation code written for Phoenix can be reused with other observability platforms, avoiding vendor lock-in.
OpenInference: Phoenix uses OpenInference instrumentation, an extension of OpenTelemetry specifically designed for AI/LLM applications. OpenInference adds semantic conventions for LLM spans, retrieval operations, and embeddings.
Traces and Spans: A trace represents the complete execution path of a request through an AI application. Spans are individual units of work within a trace (e.g., a single LLM call, tool execution, or retrieval operation). Spans can be nested to show hierarchical execution flow.
Projects: Projects provide organizational structure for traces, allowing separation by environment, application, or team. Each project has its own metrics dashboard and data isolation.
Sessions: Sessions group related traces into conversational threads, enabling tracking of multi-turn conversations with context maintained across interactions.
Evaluators: Evaluators measure the quality of AI outputs. LLM-based evaluators use LLMs as judges to assess subjective quality. Code-based evaluators use deterministic logic for objective checks. All evaluators return scores with optional labels, explanations, and metadata.
Datasets: Datasets are collections of examples with inputs and optional reference outputs. Golden datasets contain reference outputs (ground truth) for objective evaluation. Datasets are versioned automatically.
Experiments: Experiments run task functions (wrapped AI application logic) against datasets with evaluators to systematically compare different versions. Experiments track scores per example and aggregate metrics.
Prompts: In Phoenix, a prompt includes the prompt template, invocation parameters (temperature, etc.), tools, and response format. Prompts are versioned and can be tagged for deployment across environments.
Executors: Executors handle evaluation execution with automatic concurrency, rate limit management, error handling, and batching. They can achieve up to 20x speedup compared to direct API calls.
Self-Hosting: Phoenix can be self-hosted on Docker, Kubernetes, AWS, Railway, or locally. Self-hosted instances support authentication, email configuration, and data retention policies.
Phoenix Cloud: Managed Phoenix hosting service with automatic updates, scaling, and maintenance handled by Arize team.
For additional documentation: https://arize.com/docs/phoenix/llms.txt