Install
mkdir -p .claude/skills/observability-designer && curl -L -o skill.zip "https://mcp.directory/api/skills/download/2699" && unzip -o skill.zip -d .claude/skills/observability-designer && rm skill.zip
Installs to .claude/skills/observability-designer
About this skill
Observability Designer (POWERFUL)
Category: Engineering
Tier: POWERFUL
Description: Design comprehensive observability strategies for production systems including SLI/SLO frameworks, alerting optimization, and dashboard generation.
Overview
Observability Designer enables you to create production-ready observability strategies that provide deep insights into system behavior, performance, and reliability. This skill combines the three pillars of observability (metrics, logs, traces) with proven frameworks like SLI/SLO design, golden signals monitoring, and alert optimization to create comprehensive observability solutions.
Core Competencies
SLI/SLO/SLA Framework Design
- Service Level Indicators (SLI): Define measurable signals that indicate service health
- Service Level Objectives (SLO): Set reliability targets based on user experience
- Service Level Agreements (SLA): Establish customer-facing commitments with consequences
- Error Budget Management: Calculate and track error budget consumption
- Burn Rate Alerting: Multi-window burn rate alerts for proactive SLO protection
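The relationship between an SLO target, its error budget, and a multi-window burn rate check can be sketched as follows. This is a minimal illustration, not the skill's own implementation: the function names are hypothetical, and the 14.4 threshold is only an illustrative default (it corresponds to spending 2% of a 30-day budget in one hour).

```python
def error_budget(slo_target: float) -> float:
    """Fraction of requests allowed to fail under the SLO (0.999 -> about 0.001)."""
    return 1.0 - slo_target


def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' the budget is being spent."""
    return error_ratio / error_budget(slo_target)


def should_page(long_window_errors: float, short_window_errors: float,
                slo_target: float, threshold: float = 14.4) -> bool:
    """Multi-window check: page only if BOTH windows exceed the threshold.

    The long window shows the burn is significant; the short window shows
    it is still happening, so the alert resolves quickly after recovery.
    """
    return (burn_rate(long_window_errors, slo_target) >= threshold
            and burn_rate(short_window_errors, slo_target) >= threshold)
```

For a 99.9% SLO, an observed error ratio of 2% in both windows gives a burn rate of about 20, which would page; if the short window has already recovered, it would not.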
Three Pillars of Observability
Metrics
- Golden Signals: Latency, traffic, errors, and saturation monitoring
- RED Method: Rate, Errors, and Duration for request-driven services
- USE Method: Utilization, Saturation, and Errors for resource monitoring
- Business Metrics: Revenue, user engagement, and feature adoption tracking
- Infrastructure Metrics: CPU, memory, disk, network, and custom resource metrics
Logs
- Structured Logging: JSON-based log formats with consistent fields
- Log Aggregation: Centralized log collection and indexing strategies
- Log Levels: Appropriate use of DEBUG, INFO, WARN, ERROR, FATAL levels
- Correlation IDs: Request tracing through distributed systems
- Log Sampling: Volume management for high-throughput systems
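Structured logging with a correlation ID could look like the sketch below, which uses only the Python standard library. The field names (`timestamp`, `level`, `logger`, `message`, `correlation_id`) and the helper `build_logger` are assumptions chosen for illustration; the skill does not prescribe a specific schema.

```python
import json
import logging
import uuid


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object with a fixed set of fields."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Populated through the `extra` argument of the logging call.
            "correlation_id": getattr(record, "correlation_id", None),
        }
        return json.dumps(payload)


def build_logger(name: str, stream) -> logging.Logger:
    """Return a logger that writes JSON lines to the given stream."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def new_correlation_id() -> str:
    """Generate an ID to attach to every log line of a single request."""
    return uuid.uuid4().hex
```

A caller would create one ID per incoming request and pass it on every call, for example `logger.info("payment authorized", extra={"correlation_id": cid})`, so that all lines for that request can be found with a single query.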
Traces
- Distributed Tracing: End-to-end request flow visualization
- Span Design: Meaningful span boundaries and metadata
- Trace Sampling: Intelligent sampling strategies for performance and cost
- Service Maps: Automatic dependency discovery through traces
- Root Cause Analysis: Trace-driven debugging workflows
Dashboard Design Principles
Information Architecture
- Hierarchy: Overview → Service → Component → Instance drill-down paths
- Golden Ratio: 80% operational metrics, 20% exploratory metrics
- Cognitive Load: Limit each dashboard screen to roughly 7±2 panels
- User Journey: Role-based dashboard personas (SRE, Developer, Executive)
Visualization Best Practices
- Chart Selection: Time series for trends, heatmaps for distributions, gauges for status
- Color Theory: Red for critical, amber for warning, green for healthy states
- Reference Lines: SLO targets, capacity thresholds, and historical baselines
- Time Ranges: Default to meaningful windows (4h for incidents, 7d for trends)
Panel Design
- Metric Queries: Efficient Prometheus/InfluxDB queries with proper aggregation
- Alerting Integration: Visual alert state indicators on relevant panels
- Interactive Elements: Template variables, drill-down links, and annotation overlays
- Performance: Sub-second render times through query optimization
Alert Design and Optimization
Alert Classification
- Severity Levels:
  - Critical: Service down, SLO burn rate high
  - Warning: Approaching thresholds, non-user-facing issues
  - Info: Deployment notifications, capacity planning alerts
- Actionability: Every alert must have a clear response action
- Alert Routing: Escalation policies based on severity and team ownership
Alert Fatigue Prevention
- Signal vs Noise: High precision (few false positives) over high recall
- Hysteresis: Different thresholds for firing and resolving alerts
- Suppression: Dependent alert suppression during known outages
- Grouping: Related alerts grouped into single notifications
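Hysteresis, as listed above, means an alert fires at one threshold and resolves at a lower one, so a value hovering near the limit does not flap. A minimal sketch, with the class name and thresholds chosen purely for illustration:

```python
class HysteresisAlert:
    """Alert that fires above one threshold and resolves below a lower one."""

    def __init__(self, fire_above: float, resolve_below: float):
        if resolve_below >= fire_above:
            raise ValueError("resolve_below must be lower than fire_above")
        self.fire_above = fire_above
        self.resolve_below = resolve_below
        self.firing = False

    def update(self, value: float) -> bool:
        """Feed one sample and return whether the alert is firing afterwards."""
        if not self.firing and value > self.fire_above:
            self.firing = True
        elif self.firing and value < self.resolve_below:
            self.firing = False
        return self.firing
```

With `HysteresisAlert(fire_above=90, resolve_below=80)`, a CPU series of 85, 91, 88, 85, 79, 85 fires at 91, stays firing through 88 and 85, and resolves only at 79.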
Alert Rule Design
- Threshold Selection: Statistical methods for threshold determination
- Window Functions: Appropriate averaging windows and percentile calculations
- Alert Lifecycle: Clear firing conditions and automatic resolution criteria
- Testing: Alert rule validation against historical data
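One simple statistical method for threshold selection is mean plus k standard deviations over a historical window, followed by a replay of that history to see how often the rule would have fired. The sketch below assumes this approach; the skill may use other methods, and both function names are hypothetical.

```python
import statistics
from typing import Sequence


def suggest_threshold(history: Sequence[float], k: float = 3.0) -> float:
    """Suggest a threshold at mean + k population standard deviations."""
    return statistics.fmean(history) + k * statistics.pstdev(history)


def count_firings(history: Sequence[float], threshold: float) -> int:
    """Replay historical samples and count how many exceed the threshold."""
    return sum(1 for value in history if value > threshold)
```

Replaying a candidate threshold against a period that includes known incidents gives a rough check of both noise (firings outside incidents) and coverage (firings during them). Mean and standard deviation assume roughly symmetric data; for skewed metrics such as latency, a percentile-based threshold may fit better.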
Runbook Generation and Incident Response
Runbook Structure
- Alert Context: What the alert means and why it fired
- Impact Assessment: User-facing vs internal impact evaluation
- Investigation Steps: Ordered troubleshooting procedures with time estimates
- Resolution Actions: Common fixes and escalation procedures
- Post-Incident: Follow-up tasks and prevention measures
Incident Detection Patterns
- Anomaly Detection: Statistical methods for detecting unusual patterns
- Composite Alerts: Multi-signal alerts for complex failure modes
- Predictive Alerts: Capacity and trend-based forward-looking alerts
- Canary Monitoring: Early detection through progressive deployment monitoring
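A predictive, trend-based alert can be sketched as a least-squares line fitted to recent usage samples and extrapolated to the capacity limit. This is an illustrative assumption about how such an alert might work, not the skill's implementation; `hours_until_full` is a hypothetical name, and a linear trend is the simplest possible model.

```python
from typing import Optional, Sequence, Tuple


def hours_until_full(samples: Sequence[Tuple[float, float]],
                     capacity: float) -> Optional[float]:
    """Estimate hours from the last sample until usage reaches capacity.

    `samples` is a sequence of (hour, used) pairs. Returns None when the
    fitted trend is flat or decreasing, or when it cannot be fitted.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_u = sum(u for _, u in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None
    # Ordinary least-squares slope and intercept.
    slope = sum((t - mean_t) * (u - mean_u) for t, u in samples) / var_t
    if slope <= 0:
        return None
    intercept = mean_u - slope * mean_t
    t_full = (capacity - intercept) / slope
    return max(0.0, t_full - samples[-1][0])
```

An alert rule built on this might warn when the estimate drops below a lead time long enough to add capacity, for example 72 hours.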
Golden Signals Framework
Latency Monitoring
- Request Latency: P50, P95, P99 response time tracking
- Queue Latency: Time spent waiting in processing queues
- Network Latency: Inter-service communication delays
- Database Latency: Query execution and connection pool metrics
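The P50, P95, and P99 figures mentioned above are percentiles of the observed latency distribution. A minimal sketch using the nearest-rank definition follows; monitoring backends often estimate percentiles from histogram buckets instead, so their results can differ slightly from this exact calculation.

```python
import math
from typing import Sequence


def percentile(values: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range (0, 100]")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]
```

Tracking P95 and P99 alongside P50 matters because the mean and median can look healthy while a small share of requests is very slow.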
Traffic Monitoring
- Request Rate: Requests per second with burst detection
- Bandwidth Usage: Network throughput and capacity utilization
- User Sessions: Active user tracking and session duration
- Feature Usage: API endpoint and feature adoption metrics
Error Monitoring
- Error Rate: 4xx and 5xx HTTP response code tracking
- Error Budget: SLO-based error rate targets and consumption
- Error Distribution: Error type classification and trending
- Silent Failures: Detection of processing failures without HTTP errors
Saturation Monitoring
- Resource Utilization: CPU, memory, disk, and network usage
- Queue Depth: Processing queue length and wait times
- Connection Pools: Database and service connection saturation
- Rate Limiting: API throttling and quota exhaustion tracking
Distributed Tracing Strategies
Trace Architecture
- Sampling Strategy: Head-based, tail-based, and adaptive sampling
- Trace Propagation: Context propagation across service boundaries
- Span Correlation: Parent-child relationship modeling
- Trace Storage: Retention policies and storage optimization
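Head-based sampling decides whether to keep a trace when it starts. One common way to keep that decision consistent across services is to derive it from a hash of the trace ID, so every service reaches the same verdict without coordination. The sketch below assumes string trace IDs and SHA-256; `keep_trace` is a hypothetical helper, not part of any tracing library.

```python
import hashlib


def keep_trace(trace_id: str, sample_rate: float) -> bool:
    """Deterministically keep roughly `sample_rate` of all traces.

    The first 8 bytes of the hash are read as an integer in [0, 2**64)
    and compared against the matching fraction of that range.
    """
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")
    return bucket < sample_rate * 2**64
```

Tail-based sampling, by contrast, waits until a trace is complete and can keep it based on its outcome (for example, errors or high latency), at the cost of buffering spans first.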
Service Instrumentation
- Auto-Instrumentation: Framework-based automatic trace generation
- Manual Instrumentation: Custom span creation for business logic
- Baggage Handling: Cross-cutting concern propagation
- Performance Impact: Instrumentation overhead measurement and optimization
Log Aggregation Patterns
Collection Architecture
- Agent Deployment: Log shipping agent strategies (push vs pull)
- Log Routing: Topic-based routing and filtering
- Parsing Strategies: Structured vs unstructured log handling
- Schema Evolution: Log format versioning and migration
Storage and Indexing
- Index Design: Optimized field indexing for common query patterns
- Retention Policies: Time and volume-based log retention
- Compression: Log data compression and archival strategies
- Search Performance: Query optimization and result caching
Cost Optimization for Observability
Data Management
- Metric Retention: Tiered retention based on metric importance
- Log Sampling: Intelligent sampling to reduce ingestion costs
- Trace Sampling: Cost-effective trace collection strategies
- Data Archival: Cold storage for historical observability data
Resource Optimization
- Query Efficiency: Optimized metric and log queries
- Storage Costs: Appropriate storage tiers for different data types
- Ingestion Rate Limiting: Controlled data ingestion to manage costs
- Cardinality Management: High-cardinality metric detection and mitigation
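High-cardinality detection amounts to counting the distinct label combinations (time series) per metric and flagging metrics above a limit. A minimal sketch, assuming samples arrive as `(metric_name, labels)` pairs; both function names and the input shape are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Sample = Tuple[str, Dict[str, str]]


def series_per_metric(samples: Iterable[Sample]) -> Dict[str, int]:
    """Count distinct label sets (time series) observed for each metric."""
    seen = defaultdict(set)
    for name, labels in samples:
        seen[name].add(frozenset(labels.items()))
    return {name: len(label_sets) for name, label_sets in seen.items()}


def high_cardinality(samples: Iterable[Sample], limit: int) -> List[str]:
    """Return the sorted names of metrics with more than `limit` series."""
    counts = series_per_metric(samples)
    return sorted(name for name, count in counts.items() if count > limit)
```

Labels carrying unbounded values such as user IDs or request IDs are the usual cause; mitigation typically means dropping or bucketing those labels before ingestion.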
Scripts Overview
This skill includes three Python scripts for observability design:
1. SLO Designer (slo_designer.py)
Generates complete SLI/SLO frameworks based on service characteristics:
- Input: Service description JSON (type, criticality, dependencies)
- Output: SLI definitions, SLO targets, error budgets, burn rate alerts, SLA recommendations
- Features: Multi-window burn rate calculations, error budget policies, alert rule generation
2. Alert Optimizer (alert_optimizer.py)
Analyzes and optimizes existing alert configurations:
- Input: Alert configuration JSON with rules, thresholds, and routing
- Output: Optimization report and improved alert configuration
- Features: Noise detection, coverage gaps, duplicate identification, threshold optimization
3. Dashboard Generator (dashboard_generator.py)
Creates comprehensive dashboard specifications:
- Input: Service/system description JSON
- Output: Grafana-compatible dashboard JSON and documentation
- Features: Golden signals coverage, RED/USE methods, drill-down paths, role-based views
Integration Patterns
Monitoring Stack Integration
- Prometheus: Metric collection and alerting rule generation
- Grafana: Dashboard creation and visualization configuration
- Elasticsearch/Kibana: Log analysis and dashboard integration
- Jaeger/Zipkin: Distributed tracing configuration and analysis
CI/CD Integration
- Pipeline Monitoring: Build, test, and deployment observability
- Deployment Correlation: Release impact tracking and rollback triggers
- Feature Flag Monitoring: A/B test and feature rollout observability
- Performance Regression: Automated performance monitoring in pipelines
Incident Management Integration
- **PagerDuty/V
Content truncated.