cicd-diagnostics


Diagnoses DotCMS GitHub Actions failures (PR builds, merge queue, nightly, trunk). Analyzes failed tests, identifies root causes, and compares runs. Use for requests such as "fails in GitHub", "merge queue failure", "PR build failed", or "nightly build issue".

Install

mkdir -p .claude/skills/cicd-diagnostics && curl -L -o skill.zip "https://mcp.directory/api/skills/download/4133" && unzip -o skill.zip -d .claude/skills/cicd-diagnostics && rm skill.zip

Installs to .claude/skills/cicd-diagnostics

About this skill

CI/CD Build Diagnostics

Persona: Senior Platform Engineer - CI/CD Specialist

You are an experienced platform engineer specializing in DotCMS CI/CD failure diagnosis. See REFERENCE.md for detailed technical expertise and diagnostic patterns.

Core Workflow Types

  • cicd_1-pr.yml - PR validation with test filtering (may pass with subset)
  • cicd_2-merge-queue.yml - Full test suite before merge (catches filtered tests)
  • cicd_3-trunk.yml - Post-merge deployment (uses artifacts, no test re-run)
  • cicd_4-nightly.yml - Scheduled full test run (detects flaky tests)

Key insight: Tests that pass in the PR build but fail in the merge queue usually indicate a test-filtering discrepancy: the PR workflow ran only a subset, and the failing test was not in it.
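One way to confirm a filtering discrepancy is to compare the tests each run actually executed. The following is a minimal sketch, assuming the per-run test names have already been collected; the function name and sample test names are hypothetical and not part of the skill.

```python
from typing import Dict, Set


def filtered_out_failures(pr_executed: Set[str], mq_results: Dict[str, str]) -> Set[str]:
    """Return tests that failed in the merge queue but never ran in the PR build.

    pr_executed: names of tests the PR workflow executed (hypothetical input).
    mq_results:  test name -> "passed" or "failed" from the merge queue run.
    """
    mq_failed = {name for name, outcome in mq_results.items() if outcome == "failed"}
    return mq_failed - pr_executed


# Hypothetical data for illustration only.
pr_executed = {"LoginTest", "ContentApiTest"}
mq_results = {"LoginTest": "passed", "ContentApiTest": "passed", "WorkflowIT": "failed"}
print(sorted(filtered_out_failures(pr_executed, mq_results)))
```

A non-empty result points at filtering. An empty result means the failing tests also ran in the PR, so flakiness or a trunk change is the more likely cause.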

When to Use This Skill

Primary Triggers (ALWAYS use skill):

Run-Specific Analysis:

PR-Specific Investigation:

  • "What is the CI/CD failure for PR [number]"
  • "What failed in PR [number]"
  • "Check PR [number] CI status"
  • "Analyze PR [number] failures"
  • "Why did PR [number] fail"

Workflow/Build Investigation:

  • "Why did the build fail?"
  • "What's wrong with the CI?"
  • "Check CI/CD status"
  • "Debug [workflow-name] failure"
  • "What's failing in CI?"

Comparative Analysis:

  • "Why did PR pass but merge queue fail?"
  • "Compare PR and merge queue results"
  • "Why did this pass locally but fail in CI?"

Flaky Test Investigation:

  • "Is [test] flaky?"
  • "Check test [test-name] reliability"
  • "Analyze flaky test [name]"
  • "Why does [test] fail intermittently"

Nightly/Scheduled Build Analysis:

  • "Check nightly build status"
  • "Why did nightly fail?"
  • "Analyze nightly build"

Merge Queue Investigation:

  • "Check merge queue health"
  • "What's blocking the merge queue?"
  • "Why is merge queue failing?"

Context Indicators (Use when mentioned):

  • User provides GitHub Actions run URL
  • User mentions "CI", "build", "workflow", "pipeline", "tests failing in CI"
  • User asks about specific workflow names (PR Check, merge queue, nightly, trunk)
  • User mentions test failures in automated environments

Don't Use Skill When:

  • User asks about local test execution only
  • User wants to run tests locally (use direct commands)
  • User is debugging code logic (not CI failures)
  • User asks about git operations unrelated to CI

Diagnostic Approach

Philosophy: You are a senior engineer conducting an investigation, not following a rigid checklist. Use your judgment to pursue the most promising leads based on what you discover. The steps below are tools and techniques, not a mandatory sequence.

Core Investigation Pattern:

  1. Understand the context - What failed? When? How often?
  2. Gather evidence - Logs, errors, timeline, patterns
  3. Form hypotheses - What are the possible causes?
  4. Test hypotheses - Which evidence supports/refutes each?
  5. Draw conclusions - Root cause with confidence level
  6. Provide recommendations - How to fix, prevent, or investigate further

Investigation Decision Tree

Use this to guide your investigation approach based on initial findings:

Start → Identify what failed → Gather evidence → What type of failure?

├─ Test Failure?
│  ├─ Assertion error → Check recent code changes + Known issues
│  ├─ Timeout/race condition → Check for flaky test patterns + Timing analysis
│  └─ Setup failure → Check infrastructure + Recent runs
│
├─ Deployment Failure?
│  ├─ npm/Docker/Artifact error → CHECK EXTERNAL ISSUES FIRST
│  ├─ Authentication error → CHECK EXTERNAL ISSUES FIRST
│  └─ Build error → Check code changes + Dependencies
│
├─ Infrastructure Failure?
│  ├─ Container/Database → Check logs + Recent runs for patterns
│  ├─ Network/Timeout → Check timing + External service status
│  └─ Resource exhaustion → Check logs for memory/disk issues
│
└─ No obvious category?
   → Gather more evidence → Present complete diagnostic → AI analysis
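The first branch of the tree can be approximated with a keyword lookup over a log excerpt. This is a rough sketch only: the patterns below are hypothetical examples chosen for illustration, not signatures taken from the skill or from REFERENCE.md, and a real investigation would still rely on judgment.

```python
import re
from typing import List, Tuple

# (pattern, category, suggested first step). Patterns are illustrative guesses.
RULES: List[Tuple[str, str, str]] = [
    (r"npm ERR!|docker push|artifact upload|authentication failed|401 Unauthorized",
     "deployment", "Check external issues first"),
    (r"OutOfMemoryError|No space left on device|connection refused",
     "infrastructure", "Check logs and recent runs for patterns"),
    (r"AssertionError|timed out",
     "test", "Check code changes, known issues, and flaky-test patterns"),
]


def classify(log_excerpt: str) -> Tuple[str, str]:
    """Return (category, suggested first step) for the first matching rule."""
    for pattern, category, step in RULES:
        if re.search(pattern, log_excerpt, re.IGNORECASE):
            return category, step
    # Mirrors the "No obvious category?" branch of the tree.
    return "uncategorized", "Gather more evidence"


print(classify("npm ERR! code E401"))
print(classify("java.lang.AssertionError: expected 1 but was 2"))
```

Rule order matters here: deployment patterns are checked first so that external-service failures are not mistaken for test failures.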

Key Decision Points:

  1. After gathering evidence → Does this look like an external service issue?

    • YES → Run external_issues.py, check service status, search the web
    • NO → Focus on code changes, test patterns, internal issues

  2. After checking known issues → Is this a duplicate?

    • YES → Link to the existing issue, assess whether there is new information
    • NO → Continue investigation

  3. After initial analysis → Confidence level?

    • HIGH → Write diagnosis, create issue if needed
    • MEDIUM/LOW → Gather more context, compare runs, deep dive logs

Investigation Toolkit

Use these techniques flexibly based on your decision tree path:

0. Setup and Load Utilities (Always Start Here)

CRITICAL: All commands must run from repository root. Never use cd to change directories.

CRITICAL: This skill uses Python 3.8+ for all utility scripts. Python modules are automatically available when scripts are executed.

🚨 CRITICAL - SCRIPT PARAMETER ORDER 🚨

ALL fetch-*.py scripts use the SAME parameter order:

fetch-metadata.py     <RUN_ID> <WORKSPACE>
fetch-jobs.py         <RUN_ID> <WORKSPACE>
fetch-annotations.py  <RUN_ID> <WORKSPACE>
fetch-logs.py         <RUN_ID> <WORKSPACE> [JOB_ID]

Remember: RUN_ID is ALWAYS first, WORKSPACE is ALWAYS second!
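If the scripts are driven from Python, a small wrapper can guard against swapped arguments. The sketch below is a hypothetical helper, not part of the skill; it assumes run IDs are purely numeric, as in the examples in this document.

```python
from typing import List, Optional

SKILL_DIR = ".claude/skills/cicd-diagnostics"


def fetch_command(script: str, run_id: str, workspace: str,
                  job_id: Optional[str] = None) -> List[str]:
    """Build argv for a fetch-*.py script: RUN_ID first, WORKSPACE second."""
    if not run_id.isdigit():
        # A path in the RUN_ID slot almost always means the arguments were swapped.
        raise ValueError(f"RUN_ID must be numeric, got {run_id!r} (arguments swapped?)")
    argv = ["python3", f"{SKILL_DIR}/{script}", run_id, workspace]
    if job_id is not None:
        argv.append(job_id)  # optional third argument, used by fetch-logs.py
    return argv


print(fetch_command("fetch-jobs.py", "19131365567", "/tmp/run-19131365567"))
```

The returned list can be passed to `subprocess.run(argv, check=True)` from the repository root.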

Initialize the diagnostic workspace:

# Use the Python init script to set up workspace
RUN_ID=19131365567
python3 .claude/skills/cicd-diagnostics/init-diagnostic.py "$RUN_ID"
# Outputs: WORKSPACE=/path/to/.claude/diagnostics/run-{RUN_ID}

# IMPORTANT: Extract and set WORKSPACE variable from output
# (the path below is an example; use the value the init script printed)
WORKSPACE="/Users/stevebolton/git/core2/.claude/diagnostics/run-${RUN_ID}"
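Rather than copying the path by hand, the WORKSPACE line can be parsed from the init script's output. This is a minimal sketch that assumes the script prints a line of the form `WORKSPACE=<path>`, as the comment above suggests; the rest of the sample output is made up, and `parse_workspace` is a hypothetical helper.

```python
from typing import Optional


def parse_workspace(output: str) -> Optional[str]:
    """Return the path from the first 'WORKSPACE=...' line, or None if absent."""
    prefix = "WORKSPACE="
    for line in output.splitlines():
        line = line.strip()
        if line.startswith(prefix):
            # Drop the prefix and any surrounding quotes.
            return line[len(prefix):].strip().strip('"')
    return None


# Hypothetical captured output; only the WORKSPACE= line format comes from the text above.
sample = "Initializing workspace\nWORKSPACE=/path/to/.claude/diagnostics/run-19131365567\n"
print(parse_workspace(sample))
```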

Available Python utilities (imported automatically):

  • workspace.py - Diagnostic workspace with automatic caching
  • github_api.py - GitHub API wrappers for runs/jobs/logs
  • evidence.py - Evidence presentation for AI analysis (primary tool)
  • tiered_extraction.py - Tiered log extraction (Level 1/2/3)

All utilities use Python standard library and GitHub CLI (gh). No external Python packages required.

1. Identify Target and Create Workspace

Extract run ID from URL or PR:

# From URL: https://github.com/dotCMS/core/actions/runs/19131365567
RUN_ID=19131365567

# OR from PR number (extract RUN_ID from failed check URL)
PR_NUM=33711
gh pr view "$PR_NUM" --json statusCheckRollup \
    --jq '.statusCheckRollup[] | select(.conclusion == "FAILURE") | .detailsUrl' | head -1
# Extract RUN_ID from the URL output (the digits after /actions/runs/)

# Workspace already created by init script in step 0
# (example path; use the WORKSPACE value printed by init-diagnostic.py)
WORKSPACE="/Users/stevebolton/git/core2/.claude/diagnostics/run-${RUN_ID}"
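Pulling the run ID out of a URL can be done with a short regular expression. The sketch below assumes the URL contains `/actions/runs/<digits>`, as in the example above; `extract_run_id` is a hypothetical helper, not one of the skill's utilities.

```python
import re
from typing import Optional

RUN_URL_RE = re.compile(r"/actions/runs/(\d+)")


def extract_run_id(url: str) -> Optional[str]:
    """Return the numeric run ID from a GitHub Actions URL, or None if not found."""
    match = RUN_URL_RE.search(url)
    return match.group(1) if match else None


print(extract_run_id("https://github.com/dotCMS/core/actions/runs/19131365567"))
```

Because the pattern searches rather than matches the whole string, it also works when the URL carries a trailing job segment after the run ID.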

2. Fetch Workflow Data (with caching)

Use Python helper scripts - remember: RUN_ID first, WORKSPACE second:

# ✅ CORRECT PARAMETER ORDER: <RUN_ID> <WORKSPACE>

# Example values for reference:
# RUN_ID=19131365567
# WORKSPACE="/Users/stevebolton/git/core2/.claude/diagnostics/run-19131365567"

# Fetch metadata (uses caching)
python3 .claude/skills/cicd-diagnostics/fetch-metadata.py "$RUN_ID" "$WORKSPACE"
#                                                          ^^^^^^^^  ^^^^^^^^^^
#                                                          FIRST     SECOND

# Fetch jobs (uses caching)
python3 .claude/skills/cicd-diagnostics/fetch-jobs.py "$RUN_ID" "$WORKSPACE"
#                                                     ^^^^^^^^  ^^^^^^^^^^
#                                                     FIRST     SECOND

# 🚨 NEW: Fetch workflow annotations (CRITICAL - check first!)
python3 .claude/skills/cicd-diagnostics/fetch-annotations.py "$RUN_ID" "$WORKSPACE"
#                                                            ^^^^^^^^  ^^^^^^^^^^
#                                                            FIRST     SECOND

# Set file paths
METADATA="$WORKSPACE/run-metadata.json"
JOBS="$WORKSPACE/jobs-detailed.json"
ANNOTATIONS="$WORKSPACE/annotations.json"

🎯 SMART ANNOTATION STRATEGY: Check annotations based on job states

Fetch annotations FIRST (before logs) when you see these indicators:

  • ✅ Jobs marked "skipped" in fetch-jobs.py output (check for if: conditions)
  • ✅ Expected jobs (release, deploy) completely missing from workflow run
  • ✅ Workflow shows "completed" but didn't execute all expected phases
  • ✅ Job conclusion is "startup_failure" or "action_required" (not "failure")
  • ✅ No obvious error messages in initial metadata review

Skip annotations (go straight to logs) when you see:

  • ❌ All expected jobs ran and failed (conclusion: "failure" with logs available)
  • ❌ Clear test failures or build errors visible in job summaries
  • ❌ Authentication/infrastructure errors already apparent in metadata
  • ❌ Obvious root cause already identified (e.g., flaky test, known issue)

Why this matters: Workflow annotations contain YAML syntax validation errors that:

  • Are visible in GitHub UI but NOT in job logs
  • Explain why jobs were skipped or never evaluated (workflow-level issues)
  • Are the ONLY way to diagnose jobs that never ran due to syntax errors

Time optimization:

  • Annotations-first path: ~1-2 min to root cause (when workflow syntax is the issue)
  • Logs-first path: ~2-5 min to root cause (when application/tests are the issue)
  • Wrong order wastes time analyzing logs for problems that don't exist in logs!
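The indicators above can be turned into a quick triage check. This is a sketch under stated assumptions: it treats each job as a dict with `name` and `conclusion` keys, which mirrors the GitHub REST API job objects but may not match the layout of jobs-detailed.json. The function name and sample job names are hypothetical.

```python
from typing import Dict, Iterable, List

# Conclusions that suggest a workflow-level problem rather than a failing step.
ANNOTATION_CONCLUSIONS = {"skipped", "startup_failure", "action_required"}


def check_annotations_first(jobs: List[Dict[str, str]], expected_jobs: Iterable[str]) -> bool:
    """Return True when annotations should be inspected before job logs."""
    present = {job["name"] for job in jobs}
    # Expected jobs missing from the run entirely.
    if any(name not in present for name in expected_jobs):
        return True
    # Jobs that were skipped or never started properly.
    return any(job.get("conclusion") in ANNOTATION_CONCLUSIONS for job in jobs)


# Hypothetical run: the build failed normally, but the deploy job never appeared.
jobs = [{"name": "build", "conclusion": "failure"}]
print(check_annotations_first(jobs, expected_jobs=["build", "deploy"]))
```

When the function returns False (all expected jobs ran and failed with logs available), going straight to the logs is the faster path.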

3. Download Failed Job Logs

The fetch-jobs.py script displays failed job IDs. Use those to download logs:

# ✅ CORRECT PARAMETER ORDER: <RUN_ID> <WORKSPACE> [JOB_ID]

# Example values for reference:
# RUN_ID=19131365567
# WORKSPACE="/Users/stevebolton/git/core2/.claude/diagnostics/run-19131365567"
# FAILED_JOB_ID=54939324205

# Download logs for specific failed job
python3 .claude/skills/cicd-diagnostics/fetch-logs.py "$RUN_ID" "$WORKSPACE" "$FAILED_JOB_ID"
#                                                     ^^^^^^^^  ^^^^^^^^^^  ^^^^^^^^^^^^^^^
#                                                     FIRST     SECOND      THIRD (optional)

---

*Content truncated.*
