phoenix-observability

21views

2installs

Open-source AI observability platform for LLM tracing, evaluation, and monitoring. Use when debugging LLM applications with detailed traces, running evaluations on datasets, or monitoring production AI systems with real-time insights.

Install

mkdir -p .claude/skills/phoenix-observability && curl -L -o skill.zip "https://mcp.directory/api/skills/download/1025" && unzip -o skill.zip -d .claude/skills/phoenix-observability && rm skill.zip

Installs to .claude/skills/phoenix-observability

About this skill

Phoenix - AI Observability Platform

Open-source AI observability and evaluation platform for LLM applications with tracing, evaluation, datasets, experiments, and real-time monitoring.

When to use Phoenix

Use Phoenix when:

Debugging LLM application issues with detailed traces
Running systematic evaluations on datasets
Monitoring production LLM systems in real-time
Building experiment pipelines for prompt/model comparison
Self-hosted observability without vendor lock-in

Key features:

Tracing: OpenTelemetry-based trace collection for any LLM framework
Evaluation: LLM-as-judge evaluators for quality assessment
Datasets: Versioned test sets for regression testing
Experiments: Compare prompts, models, and configurations
Playground: Interactive prompt testing with multiple models
Open-source: Self-hosted with PostgreSQL or SQLite

Use alternatives instead:

LangSmith: Managed platform with LangChain-first integration
Weights & Biases: Deep learning experiment tracking focus
Arize Cloud: Managed Phoenix with enterprise features
MLflow: General ML lifecycle, model registry focus

Quick start

Installation

pip install arize-phoenix

# With specific backends
pip install arize-phoenix[embeddings]  # Embedding analysis
pip install arize-phoenix-otel         # OpenTelemetry config
pip install arize-phoenix-evals        # Evaluation framework
pip install arize-phoenix-client       # Lightweight REST client

Launch Phoenix server

import phoenix as px

# Launch in notebook (ThreadServer mode)
session = px.launch_app()

# View UI
session.view()  # Embedded iframe
print(session.url)  # http://localhost:6006

Command-line server (production)

# Start Phoenix server
phoenix serve

# With PostgreSQL
export PHOENIX_SQL_DATABASE_URL="postgresql://user:pass@host/db"
phoenix serve --port 6006

Basic tracing

from phoenix.otel import register
from openinference.instrumentation.openai import OpenAIInstrumentor

# Configure OpenTelemetry with Phoenix
tracer_provider = register(
    project_name="my-llm-app",
    endpoint="http://localhost:6006/v1/traces"
)

# Instrument OpenAI SDK
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)

# All OpenAI calls are now traced
from openai import OpenAI
client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello!"}]
)

Core concepts

Traces and spans

A trace represents a complete execution flow, while spans are individual operations within that trace.

from phoenix.otel import register
from opentelemetry import trace

# Setup tracing
tracer_provider = register(project_name="my-app")
tracer = trace.get_tracer(__name__)

# Create custom spans
with tracer.start_as_current_span("process_query") as span:
    span.set_attribute("input.value", query)

    # Child spans are automatically nested
    with tracer.start_as_current_span("retrieve_context"):
        context = retriever.search(query)

    with tracer.start_as_current_span("generate_response"):
        response = llm.generate(query, context)

    span.set_attribute("output.value", response)

Projects

Projects organize related traces:

import os
os.environ["PHOENIX_PROJECT_NAME"] = "production-chatbot"

# Or per-trace
from phoenix.otel import register
tracer_provider = register(project_name="experiment-v2")

Framework instrumentation

OpenAI

from phoenix.otel import register
from openinference.instrumentation.openai import OpenAIInstrumentor

tracer_provider = register()
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)

LangChain

from phoenix.otel import register
from openinference.instrumentation.langchain import LangChainInstrumentor

tracer_provider = register()
LangChainInstrumentor().instrument(tracer_provider=tracer_provider)

# All LangChain operations traced
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(model="gpt-4o")
response = llm.invoke("Hello!")

LlamaIndex

from phoenix.otel import register
from openinference.instrumentation.llama_index import LlamaIndexInstrumentor

tracer_provider = register()
LlamaIndexInstrumentor().instrument(tracer_provider=tracer_provider)

Anthropic

from phoenix.otel import register
from openinference.instrumentation.anthropic import AnthropicInstrumentor

tracer_provider = register()
AnthropicInstrumentor().instrument(tracer_provider=tracer_provider)

Evaluation framework

Built-in evaluators

from phoenix.evals import (
    OpenAIModel,
    HallucinationEvaluator,
    RelevanceEvaluator,
    ToxicityEvaluator,
    llm_classify
)

# Setup model for evaluation
eval_model = OpenAIModel(model="gpt-4o")

# Evaluate hallucination
hallucination_eval = HallucinationEvaluator(eval_model)
results = hallucination_eval.evaluate(
    input="What is the capital of France?",
    output="The capital of France is Paris.",
    reference="Paris is the capital of France."
)

Custom evaluators

from phoenix.evals import llm_classify

# Define custom evaluation
def evaluate_helpfulness(input_text, output_text):
    template = """
    Evaluate if the response is helpful for the given question.

    Question: {input}
    Response: {output}

    Is this response helpful? Answer 'helpful' or 'not_helpful'.
    """

    result = llm_classify(
        model=eval_model,
        template=template,
        input=input_text,
        output=output_text,
        rails=["helpful", "not_helpful"]
    )
    return result

Run evaluations on dataset

from phoenix import Client
from phoenix.evals import run_evals

client = Client()

# Get spans to evaluate
spans_df = client.get_spans_dataframe(
    project_name="my-app",
    filter_condition="span_kind == 'LLM'"
)

# Run evaluations
eval_results = run_evals(
    dataframe=spans_df,
    evaluators=[
        HallucinationEvaluator(eval_model),
        RelevanceEvaluator(eval_model)
    ],
    provide_explanation=True
)

# Log results back to Phoenix
client.log_evaluations(eval_results)

Datasets and experiments

Create dataset

from phoenix import Client

client = Client()

# Create dataset
dataset = client.create_dataset(
    name="qa-test-set",
    description="QA evaluation dataset"
)

# Add examples
client.add_examples_to_dataset(
    dataset_name="qa-test-set",
    examples=[
        {
            "input": {"question": "What is Python?"},
            "output": {"answer": "A programming language"}
        },
        {
            "input": {"question": "What is ML?"},
            "output": {"answer": "Machine learning"}
        }
    ]
)

Run experiment

from phoenix import Client
from phoenix.experiments import run_experiment

client = Client()

def my_model(input_data):
    """Your model function."""
    question = input_data["question"]
    return {"answer": generate_answer(question)}

def accuracy_evaluator(input_data, output, expected):
    """Custom evaluator."""
    return {
        "score": 1.0 if expected["answer"].lower() in output["answer"].lower() else 0.0,
        "label": "correct" if expected["answer"].lower() in output["answer"].lower() else "incorrect"
    }

# Run experiment
results = run_experiment(
    dataset_name="qa-test-set",
    task=my_model,
    evaluators=[accuracy_evaluator],
    experiment_name="baseline-v1"
)

print(f"Average accuracy: {results.aggregate_metrics['accuracy']}")

Client API

Query traces and spans

from phoenix import Client

client = Client(endpoint="http://localhost:6006")

# Get spans as DataFrame
spans_df = client.get_spans_dataframe(
    project_name="my-app",
    filter_condition="span_kind == 'LLM'",
    limit=1000
)

# Get specific span
span = client.get_span(span_id="abc123")

# Get trace
trace = client.get_trace(trace_id="xyz789")

Log feedback

from phoenix import Client

client = Client()

# Log user feedback
client.log_annotation(
    span_id="abc123",
    name="user_rating",
    annotator_kind="HUMAN",
    score=0.8,
    label="helpful",
    metadata={"comment": "Good response"}
)

Export data

# Export to pandas
df = client.get_spans_dataframe(project_name="my-app")

# Export traces
traces = client.list_traces(project_name="my-app")

Production deployment

Docker

docker run -p 6006:6006 arizephoenix/phoenix:latest

With PostgreSQL

# Set database URL
export PHOENIX_SQL_DATABASE_URL="postgresql://user:pass@host:5432/phoenix"

# Start server
phoenix serve --host 0.0.0.0 --port 6006

Environment variables

Variable	Description	Default
`PHOENIX_PORT`	HTTP server port	`6006`
`PHOENIX_HOST`	Server bind address	`127.0.0.1`
`PHOENIX_GRPC_PORT`	gRPC/OTLP port	`4317`
`PHOENIX_SQL_DATABASE_URL`	Database connection	SQLite temp
`PHOENIX_WORKING_DIR`	Data storage directory	OS temp
`PHOENIX_ENABLE_AUTH`	Enable authentication	`false`
`PHOENIX_SECRET`	JWT signing secret	Required if auth enabled

With authentication

export PHOENIX_ENABLE_AUTH=true
export PHOENIX_SECRET="your-secret-key-min-32-chars"
export PHOENIX_ADMIN_SECRET="admin-bootstrap-token"

phoenix serve

Best practices

Use projects: Separate traces by environment (dev/staging/prod)
Add metadata: Include user IDs, session IDs for debugging
Evaluate regularly: Run automated evaluations in CI/CD
Version datasets: Track test set changes over time
Monitor costs: Track token usage via Phoenix dashboards
Self-host: Use PostgreSQL for production deployments

Common issues

Traces not appearing:

from phoenix.otel import register

# Verify en

---

*Content truncated.*

More by davila7

View all skills by davila7 →

software-architecture

davila7

Guide for quality focused software architecture. This skill should be used when users want to write code, design architecture, analyze code, in any case that relates to software development.

461159

scroll-experience

davila7

Expert in building immersive scroll-driven experiences - parallax storytelling, scroll animations, interactive narratives, and cinematic web experiences. Like NY Times interactives, Apple product pages, and award-winning web experiences. Makes websites feel like experiences, not just pages. Use when: scroll animation, parallax, scroll storytelling, interactive story, cinematic website.

12480

planning-with-files

davila7

Implements Manus-style file-based planning for complex tasks. Creates task_plan.md, findings.md, and progress.md. Use when starting complex multi-step tasks, research projects, or any task requiring >5 tool calls.

7863

game-development

davila7

Game development orchestrator. Routes to platform-specific skills based on project needs.

14549

humanizer

davila7

Remove signs of AI-generated writing from text. Use when editing or reviewing text to make it sound more natural and human-written. Based on Wikipedia's comprehensive "Signs of AI writing" guide. Detects and fixes patterns including: inflated symbolism, promotional language, superficial -ing analyses, vague attributions, em dash overuse, rule of three, AI vocabulary words, negative parallelisms, and excessive conjunctive phrases. Credits: Original skill by @blader - https://github.com/blader/humanizer

10048

2d-games

davila7

2D game development principles. Sprites, tilemaps, physics, camera.

12744

flutter-development

aj-geddes

Build beautiful cross-platform mobile apps with Flutter and Dart. Covers widgets, state management with Provider/BLoC, navigation, API integration, and material design.

1,5541,368

ui-ux-pro-max

nextlevelbuilder

"UI/UX design intelligence. 50 styles, 21 palettes, 50 font pairings, 20 charts, 8 stacks (React, Next.js, Vue, Svelte, SwiftUI, React Native, Flutter, Tailwind). Actions: plan, build, create, design, implement, review, fix, improve, optimize, enhance, refactor, check UI/UX code. Projects: website, landing page, dashboard, admin panel, e-commerce, SaaS, portfolio, blog, mobile app, .html, .tsx, .vue, .svelte. Elements: button, modal, navbar, sidebar, card, table, form, chart. Styles: glassmorphism, claymorphism, minimalism, brutalism, neumorphism, bento grid, dark mode, responsive, skeuomorphism, flat design. Topics: color palette, accessibility, animation, layout, typography, font pairing, spacing, hover, shadow, gradient."

1,0741,165

drawio-diagrams-enhanced

jgtolentino

Create professional draw.io (diagrams.net) diagrams in XML format (.drawio files) with integrated PMP/PMBOK methodologies, extensive visual asset libraries, and industry-standard professional templates. Use this skill when users ask to create flowcharts, swimlane diagrams, cross-functional flowcharts, org charts, network diagrams, UML diagrams, BPMN, project management diagrams (WBS, Gantt, PERT, RACI), risk matrices, stakeholder maps, or any other visual diagram in draw.io format. This skill includes access to custom shape libraries for icons, clipart, and professional symbols.

1,4001,102

godot

bfollington

This skill should be used when working on Godot Engine projects. It provides specialized knowledge of Godot's file formats (.gd, .tscn, .tres), architecture patterns (component-based, signal-driven, resource-based), common pitfalls, validation tools, code templates, and CLI workflows. The `godot` command is available for running the game, validating scripts, importing resources, and exporting builds. Use this skill for tasks involving Godot game development, debugging scene/resource files, implementing game systems, or creating new Godot components.

1,172738

nano-banana-pro

garg-aayush

Generate and edit images using Google's Nano Banana Pro (Gemini 3 Pro Image) API. Use when the user asks to generate, create, edit, modify, change, alter, or update images. Also use when user references an existing image file and asks to modify it in any way (e.g., "modify this image", "change the background", "replace X with Y"). Supports both text-to-image generation and image-to-image editing with configurable resolution (1K default, 2K, or 4K for high resolution). DO NOT read the image file first - use this skill directly with the --input-image parameter.

1,128677

pdf-to-markdown

aliceisjustplaying

Convert entire PDF documents to clean, structured Markdown for full context loading. Use this skill when the user wants to extract ALL text from a PDF into context (not grep/search), when discussing or analyzing PDF content in full, when the user mentions "load the whole PDF", "bring the PDF into context", "read the entire PDF", or when partial extraction/grepping would miss important context. This is the preferred method for PDF text extraction over page-by-page or grep approaches.

1,272601

Related MCP Servers

Browse all servers

Galileo

Galileo: Integrate with Galileo to create datasets, manage prompt templates, run experiments, analyze logs, and monitor

50 tools

Google Cloud

Effortlessly manage Google Cloud with this user-friendly multi cloud management platform—simplify operations, automate t

7010 tools

Logfire

Logfire is a data observability platform for querying, analyzing, and monitoring OpenTelemetry traces, errors, and metri

1530 tools

Dynatrace

Integrate Dynatrace, a leading data observability platform and APM tool, to monitor metrics, security, and network perfo

850 tools

Dynatrace Managed MCP Server

Dynatrace Managed MCP Server delivers AI-driven access to self-hosted monitoring and observability platform, AIOps insig

160 tools

AgentOps

Access AgentOps data for agent debugging: retrieve project info, trace details, span metrics, and execution traces via a

140 tools

Stay ahead of the MCP ecosystem

Get weekly updates on new skills and servers.

Install

mkdir -p .claude/skills/phoenix-observability && curl -L -o skill.zip "https://mcp.directory/api/skills/download/1025" && unzip -o skill.zip -d .claude/skills/phoenix-observability && rm skill.zip

Installs to .claude/skills/phoenix-observability

Stats

Views

Installs

Author

davila7

7 skills published

Links

Source Code

phoenix-observability

Install

About this skill

Phoenix - AI Observability Platform

When to use Phoenix

Quick start

Installation

Launch Phoenix server

Command-line server (production)

Basic tracing

Core concepts

Traces and spans

Projects

Framework instrumentation

OpenAI

LangChain

LlamaIndex

Anthropic

Evaluation framework

Built-in evaluators

Custom evaluators

Run evaluations on dataset

Datasets and experiments

Create dataset

Run experiment

Client API

Query traces and spans

Log feedback

Export data

Production deployment

Docker

With PostgreSQL

Environment variables

With authentication

Best practices

Common issues

More by davila7

software-architecture

scroll-experience

planning-with-files

game-development

humanizer

2d-games

You might also like

flutter-development

ui-ux-pro-max

drawio-diagrams-enhanced

godot

nano-banana-pro

pdf-to-markdown

Related MCP Servers

Stay ahead of the MCP ecosystem