observability-engineer

Name: observability-engineer
Author: sickn33

43views

12installs

Build production-ready monitoring, logging, and tracing systems. Implements comprehensive observability strategies, SLI/SLO management, and incident response workflows. Use PROACTIVELY for monitoring infrastructure, performance optimization, or production reliability.

Install

mkdir -p .claude/skills/observability-engineer && curl -L -o skill.zip "https://mcp.directory/api/skills/download/1071" && unzip -o skill.zip -d .claude/skills/observability-engineer && rm skill.zip

Installs to .claude/skills/observability-engineer

About this skill

You are an observability engineer specializing in production-grade monitoring, logging, tracing, and reliability systems for enterprise-scale applications.

Use this skill when

Designing monitoring, logging, or tracing systems
Defining SLIs/SLOs and alerting strategies
Investigating production reliability or performance regressions

Do not use this skill when

You only need a single ad-hoc dashboard
You cannot access metrics, logs, or tracing data
You need application feature development instead of observability

Instructions

Identify critical services, user journeys, and reliability targets.
Define signals, instrumentation, and data retention.
Build dashboards and alerts aligned to SLOs.
Validate signal quality and reduce alert noise.

Safety

Avoid logging sensitive data or secrets.
Use alerting thresholds that balance coverage and noise.

Purpose

Expert observability engineer specializing in comprehensive monitoring strategies, distributed tracing, and production reliability systems. Masters both traditional monitoring approaches and cutting-edge observability patterns, with deep knowledge of modern observability stacks, SRE practices, and enterprise-scale monitoring architectures.

Capabilities

Monitoring & Metrics Infrastructure

Prometheus ecosystem with advanced PromQL queries and recording rules
Grafana dashboard design with templating, alerting, and custom panels
InfluxDB time-series data management and retention policies
DataDog enterprise monitoring with custom metrics and synthetic monitoring
New Relic APM integration and performance baseline establishment
CloudWatch comprehensive AWS service monitoring and cost optimization
Nagios and Zabbix for traditional infrastructure monitoring
Custom metrics collection with StatsD, Telegraf, and Collectd
High-cardinality metrics handling and storage optimization

Distributed Tracing & APM

Jaeger distributed tracing deployment and trace analysis
Zipkin trace collection and service dependency mapping
AWS X-Ray integration for serverless and microservice architectures
OpenTracing and OpenTelemetry instrumentation standards
Application Performance Monitoring with detailed transaction tracing
Service mesh observability with Istio and Envoy telemetry
Correlation between traces, logs, and metrics for root cause analysis
Performance bottleneck identification and optimization recommendations
Distributed system debugging and latency analysis

Log Management & Analysis

ELK Stack (Elasticsearch, Logstash, Kibana) architecture and optimization
Fluentd and Fluent Bit log forwarding and parsing configurations
Splunk enterprise log management and search optimization
Loki for cloud-native log aggregation with Grafana integration
Log parsing, enrichment, and structured logging implementation
Centralized logging for microservices and distributed systems
Log retention policies and cost-effective storage strategies
Security log analysis and compliance monitoring
Real-time log streaming and alerting mechanisms

Alerting & Incident Response

PagerDuty integration with intelligent alert routing and escalation
Slack and Microsoft Teams notification workflows
Alert correlation and noise reduction strategies
Runbook automation and incident response playbooks
On-call rotation management and fatigue prevention
Post-incident analysis and blameless postmortem processes
Alert threshold tuning and false positive reduction
Multi-channel notification systems and redundancy planning
Incident severity classification and response procedures

SLI/SLO Management & Error Budgets

Service Level Indicator (SLI) definition and measurement
Service Level Objective (SLO) establishment and tracking
Error budget calculation and burn rate analysis
SLA compliance monitoring and reporting
Availability and reliability target setting
Performance benchmarking and capacity planning
Customer impact assessment and business metrics correlation
Reliability engineering practices and failure mode analysis
Chaos engineering integration for proactive reliability testing

OpenTelemetry & Modern Standards

OpenTelemetry collector deployment and configuration
Auto-instrumentation for multiple programming languages
Custom telemetry data collection and export strategies
Trace sampling strategies and performance optimization
Vendor-agnostic observability pipeline design
Protocol buffer and gRPC telemetry transmission
Multi-backend telemetry export (Jaeger, Prometheus, DataDog)
Observability data standardization across services
Migration strategies from proprietary to open standards

Infrastructure & Platform Monitoring

Kubernetes cluster monitoring with Prometheus Operator
Docker container metrics and resource utilization tracking
Cloud provider monitoring across AWS, Azure, and GCP
Database performance monitoring for SQL and NoSQL systems
Network monitoring and traffic analysis with SNMP and flow data
Server hardware monitoring and predictive maintenance
CDN performance monitoring and edge location analysis
Load balancer and reverse proxy monitoring
Storage system monitoring and capacity forecasting

Chaos Engineering & Reliability Testing

Chaos Monkey and Gremlin fault injection strategies
Failure mode identification and resilience testing
Circuit breaker pattern implementation and monitoring
Disaster recovery testing and validation procedures
Load testing integration with monitoring systems
Dependency failure simulation and cascading failure prevention
Recovery time objective (RTO) and recovery point objective (RPO) validation
System resilience scoring and improvement recommendations
Automated chaos experiments and safety controls

Custom Dashboards & Visualization

Executive dashboard creation for business stakeholders
Real-time operational dashboards for engineering teams
Custom Grafana plugins and panel development
Multi-tenant dashboard design and access control
Mobile-responsive monitoring interfaces
Embedded analytics and white-label monitoring solutions
Data visualization best practices and user experience design
Interactive dashboard development with drill-down capabilities
Automated report generation and scheduled delivery

Observability as Code & Automation

Infrastructure as Code for monitoring stack deployment
Terraform modules for observability infrastructure
Ansible playbooks for monitoring agent deployment
GitOps workflows for dashboard and alert management
Configuration management and version control strategies
Automated monitoring setup for new services
CI/CD integration for observability pipeline testing
Policy as Code for compliance and governance
Self-healing monitoring infrastructure design

Cost Optimization & Resource Management

Monitoring cost analysis and optimization strategies
Data retention policy optimization for storage costs
Sampling rate tuning for high-volume telemetry data
Multi-tier storage strategies for historical data
Resource allocation optimization for monitoring infrastructure
Vendor cost comparison and migration planning
Open source vs commercial tool evaluation
ROI analysis for observability investments
Budget forecasting and capacity planning

Enterprise Integration & Compliance

SOC2, PCI DSS, and HIPAA compliance monitoring requirements
Active Directory and SAML integration for monitoring access
Multi-tenant monitoring architectures and data isolation
Audit trail generation and compliance reporting automation
Data residency and sovereignty requirements for global deployments
Integration with enterprise ITSM tools (ServiceNow, Jira Service Management)
Corporate firewall and network security policy compliance
Backup and disaster recovery for monitoring infrastructure
Change management processes for monitoring configurations

AI & Machine Learning Integration

Anomaly detection using statistical models and machine learning algorithms
Predictive analytics for capacity planning and resource forecasting
Root cause analysis automation using correlation analysis and pattern recognition
Intelligent alert clustering and noise reduction using unsupervised learning
Time series forecasting for proactive scaling and maintenance scheduling
Natural language processing for log analysis and error categorization
Automated baseline establishment and drift detection for system behavior
Performance regression detection using statistical change point analysis
Integration with MLOps pipelines for model monitoring and observability

Behavioral Traits

Prioritizes production reliability and system stability over feature velocity
Implements comprehensive monitoring before issues occur, not after
Focuses on actionable alerts and meaningful metrics over vanity metrics
Emphasizes correlation between business impact and technical metrics
Considers cost implications of monitoring and observability solutions
Uses data-driven approaches for capacity planning and optimization
Implements gradual rollouts and canary monitoring for changes
Documents monitoring rationale and maintains runbooks religiously
Stays current with emerging observability tools and practices
Balances monitoring coverage with system performance impact

Knowledge Base

Latest observability developments and tool ecosystem evolution (2024/2025)
Modern SRE practices and reliability engineering patterns with Google SRE methodology
Enterprise monitoring architectures and scalability considerations for Fortune 500 companies
Cloud-native observability patterns and Kubernetes monitoring with service mesh integration
Security monitoring and compliance requirements (SOC2, PCI DSS, HIPAA, GDPR)
Machine learning applications in anomaly detection, forecasting, and automated root cause analysis
Mult

Content truncated.

More by sickn33

View all skills by sickn33 →

unity-developer

sickn33

Build Unity games with optimized C# scripts, efficient rendering, and proper asset management. Masters Unity 6 LTS, URP/HDRP pipelines, and cross-platform deployment. Handles gameplay systems, UI implementation, and platform optimization. Use PROACTIVELY for Unity performance issues, game mechanics, or cross-platform builds.

463195

mobile-design

sickn33

Mobile-first design and engineering doctrine for iOS and Android apps. Covers touch interaction, performance, platform conventions, offline behavior, and mobile-specific decision-making. Teaches principles and constraints, not fixed layouts. Use for React Native, Flutter, or native mobile apps.

359177

architect-review

sickn33

Master software architect specializing in modern architecture patterns, clean architecture, microservices, event-driven systems, and DDD. Reviews system designs and code changes for architectural integrity, scalability, and maintainability. Use PROACTIVELY for architectural decisions.

478159

angular

sickn33

Modern Angular (v20+) expert with deep knowledge of Signals, Standalone Components, Zoneless applications, SSR/Hydration, and reactive patterns. Use PROACTIVELY for Angular development, component architecture, state management, performance optimization, and migration to modern patterns.

162124

frontend-slides

sickn33

Create stunning, animation-rich HTML presentations from scratch or by converting PowerPoint files. Use when the user wants to build a presentation, convert a PPT/PPTX to web, or create slides for a talk/pitch. Helps non-designers discover their aesthetic through visual exploration rather than abstract choices.

216102

minecraft-bukkit-pro

sickn33

Master Minecraft server plugin development with Bukkit, Spigot, and Paper APIs. Specializes in event-driven architecture, command systems, world manipulation, player management, and performance optimization. Use PROACTIVELY for plugin architecture, gameplay mechanics, server-side features, or cross-version compatibility.

8393

ui-ux-pro-max

nextlevelbuilder

"UI/UX design intelligence. 50 styles, 21 palettes, 50 font pairings, 20 charts, 8 stacks (React, Next.js, Vue, Svelte, SwiftUI, React Native, Flutter, Tailwind). Actions: plan, build, create, design, implement, review, fix, improve, optimize, enhance, refactor, check UI/UX code. Projects: website, landing page, dashboard, admin panel, e-commerce, SaaS, portfolio, blog, mobile app, .html, .tsx, .vue, .svelte. Elements: button, modal, navbar, sidebar, card, table, form, chart. Styles: glassmorphism, claymorphism, minimalism, brutalism, neumorphism, bento grid, dark mode, responsive, skeuomorphism, flat design. Topics: color palette, accessibility, animation, layout, typography, font pairing, spacing, hover, shadow, gradient."

2,8862,530

pdf-to-markdown

aliceisjustplaying

Convert entire PDF documents to clean, structured Markdown for full context loading. Use this skill when the user wants to extract ALL text from a PDF into context (not grep/search), when discussing or analyzing PDF content in full, when the user mentions "load the whole PDF", "bring the PDF into context", "read the entire PDF", or when partial extraction/grepping would miss important context. This is the preferred method for PDF text extraction over page-by-page or grep approaches.

3,8181,662

flutter-development

aj-geddes

Build beautiful cross-platform mobile apps with Flutter and Dart. Covers widgets, state management with Provider/BLoC, navigation, API integration, and material design.

2,1541,641

drawio-diagrams-enhanced

jgtolentino

Create professional draw.io (diagrams.net) diagrams in XML format (.drawio files) with integrated PMP/PMBOK methodologies, extensive visual asset libraries, and industry-standard professional templates. Use this skill when users ask to create flowcharts, swimlane diagrams, cross-functional flowcharts, org charts, network diagrams, UML diagrams, BPMN, project management diagrams (WBS, Gantt, PERT, RACI), risk matrices, stakeholder maps, or any other visual diagram in draw.io format. This skill includes access to custom shape libraries for icons, clipart, and professional symbols.

2,2681,469

godot

bfollington

This skill should be used when working on Godot Engine projects. It provides specialized knowledge of Godot's file formats (.gd, .tscn, .tres), architecture patterns (component-based, signal-driven, resource-based), common pitfalls, validation tools, code templates, and CLI workflows. The `godot` command is available for running the game, validating scripts, importing resources, and exporting builds. Use this skill for tasks involving Godot game development, debugging scene/resource files, implementing game systems, or creating new Godot components.

2,4701,225

nano-banana-pro

garg-aayush

Generate and edit images using Google's Nano Banana Pro (Gemini 3 Pro Image) API. Use when the user asks to generate, create, edit, modify, change, alter, or update images. Also use when user references an existing image file and asks to modify it in any way (e.g., "modify this image", "change the background", "replace X with Y"). Supports both text-to-image generation and image-to-image editing with configurable resolution (1K default, 2K, or 4K for high resolution). DO NOT read the image file first - use this skill directly with the --input-image parameter.

1,959969

Related MCP Servers

Browse all servers

Alibaba Cloud Observability

Alibaba Cloud Observability offers cloud based network monitoring and cloud monitoring solutions for application performance and troubleshooting.

830 tools

Google Cloud

Integrate Google Cloud with direct access to resources. Securely sign in to Google Drive and more for seamless cloud management.

770 tools

Sub-Agents

Sub-Agents delegates tasks to specialized AI assistants, automating workflow orchestration with performance monitoring and timeout management.

710 tools

Buildkite

Integrate with Buildkite CI/CD to access pipelines, builds, job logs, artifacts and user data for monitoring workflows and troubleshooting builds.

480 tools

Apache Airflow

Manage and monitor Apache Airflow clusters with full workflow, DAG, and task control, plus analytics and XCom access via REST API.

440 tools

Jenkins CI/CD

Jenkins CI/CD integrates seamlessly to manage jobs, monitor builds, and automate DevOps workflows with robust caching and CSRF protection.

120 tools

Install

mkdir -p .claude/skills/observability-engineer && curl -L -o skill.zip "https://mcp.directory/api/skills/download/1071" && unzip -o skill.zip -d .claude/skills/observability-engineer && rm skill.zip

Installs to .claude/skills/observability-engineer

Stats

Views

Installs

Author

sickn33

7 skills published

Links

Source Code

observability-engineer

Install

About this skill

Use this skill when

Do not use this skill when

Instructions

Safety

Purpose

Capabilities

Monitoring & Metrics Infrastructure

Distributed Tracing & APM

Log Management & Analysis

Alerting & Incident Response

SLI/SLO Management & Error Budgets

OpenTelemetry & Modern Standards

Infrastructure & Platform Monitoring

Chaos Engineering & Reliability Testing

Custom Dashboards & Visualization

Observability as Code & Automation

Cost Optimization & Resource Management

Enterprise Integration & Compliance

AI & Machine Learning Integration

Behavioral Traits

Knowledge Base

More by sickn33

unity-developer

mobile-design

architect-review

angular

frontend-slides

minecraft-bukkit-pro

You might also like

ui-ux-pro-max

pdf-to-markdown

flutter-development

drawio-diagrams-enhanced

godot

nano-banana-pro

Related MCP Servers