it-operations

Manages IT infrastructure, monitoring, incident response, and service reliability. Provides frameworks for ITIL service management, observability strategies, automation, backup/recovery, capacity planning, and operational excellence practices.

Install

mkdir -p .claude/skills/it-operations && curl -L -o skill.zip "https://mcp.directory/api/skills/download/8629" && unzip -o skill.zip -d .claude/skills/it-operations && rm skill.zip

Installs to .claude/skills/it-operations

About this skill

IT Operations Expert

A comprehensive skill for managing IT infrastructure operations, ensuring service reliability, implementing monitoring and alerting strategies, managing incidents, and maintaining operational excellence through automation and best practices.

Core Principles

1. Service Reliability First

Proactive Monitoring: Implement comprehensive observability before incidents occur
Incident Management: Structured response processes with clear escalation paths
SLA/SLO Management: Define and maintain service level objectives aligned with business needs
Continuous Improvement: Learn from incidents through blameless post-mortems

2. Automation Over Manual Processes

Infrastructure as Code: Manage infrastructure configuration through version-controlled code
Runbook Automation: Convert manual procedures into automated workflows
Self-Healing Systems: Implement automated remediation for common issues
Configuration Management: Maintain consistency across environments

3. ITIL Service Management

Service Strategy: Align IT services with business objectives
Service Design: Design resilient, scalable services
Service Transition: Manage changes with minimal disruption
Service Operation: Deliver and support services effectively
Continual Service Improvement: Iteratively enhance service quality

4. Operational Excellence

Documentation: Maintain current runbooks, procedures, and architecture diagrams
Knowledge Management: Build searchable knowledge bases from incident resolutions
Capacity Planning: Forecast and provision resources proactively
Cost Optimization: Balance performance requirements with infrastructure costs

Core Workflow

Infrastructure Operations Workflow

1. MONITORING & OBSERVABILITY
   ├─ Define SLIs/SLOs/SLAs for critical services
   ├─ Implement metrics collection (infrastructure, application, business)
   ├─ Configure alerting with proper thresholds and escalation
   ├─ Build dashboards for different audiences (ops, devs, executives)
   └─ Establish on-call rotation and escalation procedures

2. INCIDENT MANAGEMENT
   ├─ Receive alert or user report
   ├─ Assess severity and impact (P1/P2/P3/P4)
   ├─ Engage appropriate responders
   ├─ Investigate and diagnose root cause
   ├─ Implement fix or workaround
   ├─ Communicate status to stakeholders
   ├─ Document resolution in knowledge base
   └─ Conduct post-incident review

3. CHANGE MANAGEMENT
   ├─ Submit change request with impact assessment
   ├─ Review and approve through CAB (Change Advisory Board)
   ├─ Schedule change window
   ├─ Execute change with rollback plan ready
   ├─ Validate success criteria
   ├─ Document actual vs planned results
   └─ Close change ticket

4. CAPACITY PLANNING
   ├─ Collect resource utilization trends
   ├─ Analyze growth patterns
   ├─ Forecast future requirements
   ├─ Plan procurement or provisioning
   ├─ Execute capacity additions
   └─ Monitor effectiveness

5. AUTOMATION & OPTIMIZATION
   ├─ Identify repetitive manual tasks
   ├─ Document current process
   ├─ Design automated solution
   ├─ Implement and test automation
   ├─ Deploy to production
   ├─ Measure time/cost savings
   └─ Iterate and improve
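
The capacity planning step (collect trends, analyze growth, forecast requirements) can be sketched with a straight-line projection. This is a minimal illustration, not a recommended forecasting method: the function names `linear_trend` and `periods_until` are hypothetical, the samples are assumed to be evenly spaced utilization percentages, and real workloads often need seasonal or non-linear models.

```python
from typing import Optional, Sequence, Tuple


def linear_trend(values: Sequence[float]) -> Tuple[float, float]:
    """Fit y = slope * x + intercept by least squares, with x = 0, 1, 2, ..."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def periods_until(values: Sequence[float], limit: float) -> Optional[float]:
    """Sampling periods after the last sample until the trend reaches `limit`.

    Returns None when utilization is flat or falling (no crossing expected).
    """
    slope, intercept = linear_trend(values)
    if slope <= 0:
        return None
    crossing = (limit - intercept) / slope
    return max(0.0, crossing - (len(values) - 1))


# Hypothetical monthly disk utilization (%), checked against an 80% limit.
print(periods_until([50, 55, 60, 65], limit=80))
```

With monthly samples, the result reads as the number of months of headroom left before the limit, which can feed the procurement or provisioning step.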

Decision Frameworks

Alert Configuration Decision Matrix

Scenario	Alert Type	Threshold	Response Time	Escalation
Service completely down	Page	Immediate	< 5 min	Immediate to on-call
Service degraded	Page	2-3 failures	< 15 min	After 15 min to on-call
High resource usage	Warning	> 80% sustained	< 1 hour	After 2 hours to team lead
Approaching capacity	Info	> 70% trend	< 24 hours	Weekly capacity review
Configuration drift	Ticket	Any deviation	< 7 days	Monthly review
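
As a rough sketch of how the two utilization rows of the matrix might be evaluated, the function below takes a window of utilization samples and returns an alert type. The name `utilization_alert` is hypothetical, and both interpretations are assumptions: "sustained" is read as every sample in the window exceeding 80%, and "trend" as the window mean exceeding 70%.

```python
from typing import Optional, Sequence


def utilization_alert(samples: Sequence[float]) -> Optional[str]:
    """Map a window of utilization percentages to an alert type from the matrix.

    Assumptions (not from the source):
    - "> 80% sustained" means every sample in the window is above 80.
    - "> 70% trend" means the mean of the window is above 70.
    """
    if not samples:
        return None
    if min(samples) > 80.0:
        return "Warning"   # High resource usage: respond within 1 hour
    if sum(samples) / len(samples) > 70.0:
        return "Info"      # Approaching capacity: respond within 24 hours
    return None


print(utilization_alert([85, 90, 82]))
print(utilization_alert([75, 72, 78]))
```

A single spike does not count as sustained under this reading, which keeps short bursts from generating warnings.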

Incident Severity Classification

Priority 1 (Critical)

Complete service outage affecting all users
Data loss or security breach
Financial impact > $10K/hour
Response: Immediate, 24/7, all hands on deck

Priority 2 (High)

Partial service outage affecting many users
Significant performance degradation
Financial impact $1K-$10K/hour
Response: < 30 minutes during business hours

Priority 3 (Medium)

Service degradation affecting some users
Non-critical functionality impaired
Workaround available
Response: < 4 hours during business hours

Priority 4 (Low)

Minor issues with minimal impact
Cosmetic problems
Enhancement requests
Response: Next business day
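
The classification above could be encoded along these lines. This is a sketch under stated assumptions: the `Incident` fields and the `classify` function are hypothetical names, any single matching criterion is treated as sufficient for a priority level, and the "workaround available" criterion is left out because the source does not say how it combines with the others.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    full_outage: bool = False              # complete outage affecting all users
    data_loss_or_breach: bool = False
    partial_outage: bool = False           # partial outage affecting many users
    significant_degradation: bool = False
    some_users_degraded: bool = False
    noncritical_impaired: bool = False
    cost_per_hour: float = 0.0             # estimated financial impact, USD/hour


def classify(incident: Incident) -> str:
    """Return P1-P4, checking the most severe criteria first."""
    if (incident.full_outage or incident.data_loss_or_breach
            or incident.cost_per_hour > 10_000):
        return "P1"
    if (incident.partial_outage or incident.significant_degradation
            or incident.cost_per_hour >= 1_000):
        return "P2"
    if incident.some_users_degraded or incident.noncritical_impaired:
        return "P3"
    return "P4"


print(classify(Incident(partial_outage=True, cost_per_hour=2_500)))
```

Checking from P1 downward means an incident that matches several levels always receives the highest applicable priority.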

Change Management Risk Assessment

Risk Score = Impact × Likelihood × Complexity

Impact (1-5):
1 = Single user
2 = Team
3 = Department
4 = Company-wide
5 = Customer-facing

Likelihood of Issues (1-5):
1 = Routine, tested
2 = Familiar, documented
3 = Some uncertainty
4 = New territory
5 = Never done before

Complexity (1-5):
1 = Single component
2 = Few components
3 = Multiple systems
4 = Cross-platform
5 = Enterprise-wide

Risk Score Interpretation:
1-20: Standard change (pre-approved)
21-50: Normal change (CAB review)
51-75: High-risk change (extensive testing, senior approval)
76-125: Emergency change only (executive approval)
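
A minimal sketch of the scoring above, assuming each factor is an integer from 1 to 5. The function names `risk_score` and `change_category` are illustrative and not part of the skill.

```python
def risk_score(impact: int, likelihood: int, complexity: int) -> int:
    """Multiply the three factors; each must be on the 1-5 scale."""
    for name, value in (("impact", impact),
                        ("likelihood", likelihood),
                        ("complexity", complexity)):
        if not 1 <= value <= 5:
            raise ValueError(f"{name} must be between 1 and 5, got {value}")
    return impact * likelihood * complexity


def change_category(score: int) -> str:
    """Map a score (1-125) to the interpretation bands listed above."""
    if not 1 <= score <= 125:
        raise ValueError(f"score must be between 1 and 125, got {score}")
    if score <= 20:
        return "Standard change (pre-approved)"
    if score <= 50:
        return "Normal change (CAB review)"
    if score <= 75:
        return "High-risk change (extensive testing, senior approval)"
    return "Emergency change only (executive approval)"


# Company-wide impact, some uncertainty, multiple systems: 4 x 3 x 3 = 36.
score = risk_score(impact=4, likelihood=3, complexity=3)
print(score, change_category(score))
```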

Monitoring Tool Selection

Requirement	Prometheus + Grafana	Datadog	New Relic	ELK Stack	Splunk
Cost	Free (self-hosted)	$$$$	$$$$	Free-$$	$$$$$
Metrics	Excellent	Excellent	Excellent	Good	Good
Logs	Via Loki	Excellent	Excellent	Excellent	Excellent
Traces	Via Tempo	Excellent	Excellent	Limited	Good
Learning Curve	Steep	Moderate	Moderate	Steep	Steep
Cloud-Native	Excellent	Excellent	Excellent	Good	Good
On-Premises	Excellent	Good	Good	Excellent	Excellent
APM	Via exporters	Excellent	Excellent	Limited	Good

Common Operational Challenges

Challenge 1: Alert Fatigue

Problem: Too many false-positive alerts, leading to team burnout

Solution:

Alert Tuning Process:
1. Measure baseline alert volume and false positive rate
2. Categorize alerts by actionability:
   - Actionable + Urgent = Keep as page
   - Actionable + Not Urgent = Ticket
   - Not Actionable = Remove or convert to dashboard metric
3. Implement alert aggregation (group similar alerts)
4. Add context to alerts (runbook links, relevant metrics)
5. Regular review meetings (weekly) to tune thresholds
6. Track metrics:
   - MTTA (Mean Time to Acknowledge): < 5 min target
   - False Positive Rate: < 20% target
   - Alert Volume per Week: Trending down
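
Steps 1 and 2 of the tuning process can be sketched as follows. The `AlertRecord` fields and function names are hypothetical; in practice the actionable, urgent, and false-positive flags would come from whatever review process or tooling the team uses.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class AlertRecord:
    name: str
    actionable: bool
    urgent: bool
    false_positive: bool


def disposition(alert: AlertRecord) -> str:
    """Apply the actionability rules from step 2."""
    if not alert.actionable:
        return "remove or convert to dashboard metric"
    return "page" if alert.urgent else "ticket"


def false_positive_rate(alerts: Sequence[AlertRecord]) -> float:
    """Percentage of alerts flagged as false positives (0.0 for no alerts)."""
    if not alerts:
        return 0.0
    return 100.0 * sum(a.false_positive for a in alerts) / len(alerts)


# Hypothetical week of alerts.
week = [
    AlertRecord("api-5xx", actionable=True, urgent=True, false_positive=False),
    AlertRecord("disk-70pct", actionable=True, urgent=False, false_positive=False),
    AlertRecord("cpu-blip", actionable=False, urgent=False, false_positive=True),
    AlertRecord("cert-expiry", actionable=True, urgent=False, false_positive=False),
]
for alert in week:
    print(alert.name, "->", disposition(alert))
print(f"false positive rate: {false_positive_rate(week):.0f}%")
```

Comparing the computed rate against the 20% target each week gives the review meeting a concrete number to track.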

Challenge 2: Incident Documentation During Crisis

Problem: Teams skip documentation during high-pressure incidents

Solution:

Assign dedicated scribe role (not the incident commander)
Use incident management tools (PagerDuty, Opsgenie) with automatic timeline
Use template-based incident reports with required fields
Schedule the post-incident review automatically (within 48 hours)
Gamify documentation (track and recognize thorough documentation)

Challenge 3: Knowledge Silos

Problem: Critical knowledge trapped in individual team members' heads

Solution:

Knowledge Transfer Strategy:
- Pair Programming/Shadowing: 20% of sprint capacity
- Runbook Requirements: Every system must have a runbook
- Lunch & Learn Sessions: Weekly 30-min knowledge sharing
- Cross-Training Matrix: Track who knows what, identify gaps
- On-Call Rotation: Everyone rotates to spread knowledge
- Post-Incident Reviews: Mandatory team sharing
- Documentation Sprints: Quarterly focus on doc completion

Challenge 4: Balancing Stability vs Innovation

Problem: Operations team resists change to maintain stability

Solution:

Implement change windows (planned maintenance periods)
Use blue-green or canary deployments for lower risk
Establish "innovation time" (Google 20% time model)
Create sandbox environments for experimentation
Measure and reward both stability AND improvement metrics
Include "toil reduction" as an OKR target

Key Metrics & KPIs

Service Reliability Metrics

Availability:
  Formula: (Total Time - Downtime) / Total Time × 100
  Target: 99.9% (about 43.8 min of downtime per month)
  Measurement: Per service, monthly

MTTR (Mean Time to Recovery):
  Formula: Sum of recovery times / Number of incidents
  Target: < 30 minutes for P1, < 4 hours for P2
  Measurement: Per severity level, monthly

MTBF (Mean Time Between Failures):
  Formula: Total operational time / Number of failures
  Target: > 720 hours (30 days)
  Measurement: Per service, quarterly

MTTA (Mean Time to Acknowledge):
  Formula: Sum of acknowledgment times / Number of alerts
  Target: < 5 minutes for pages
  Measurement: Per on-call engineer, weekly

Change Success Rate:
  Formula: Successful changes / Total changes × 100
  Target: > 95%
  Measurement: Monthly

Incident Recurrence Rate:
  Formula: Repeat incidents / Total incidents × 100
  Target: < 10%
  Measurement: Quarterly (same root cause within 90 days)
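
The formulas above translate directly into code. The sketch below uses hypothetical function names and assumes an average calendar month of 43,830 minutes (365.25 days / 12), which is the month length that matches the downtime figure quoted for the 99.9% target.

```python
from typing import Sequence

# Assumption: an average calendar month, 365.25 days / 12 = 43,830 minutes.
AVG_MONTH_MINUTES = 365.25 * 24 * 60 / 12


def availability_pct(total_minutes: float, downtime_minutes: float) -> float:
    """(Total Time - Downtime) / Total Time x 100."""
    if total_minutes <= 0:
        raise ValueError("total_minutes must be positive")
    return (total_minutes - downtime_minutes) / total_minutes * 100


def allowed_downtime_minutes(target_pct: float,
                             period_minutes: float = AVG_MONTH_MINUTES) -> float:
    """Downtime budget implied by an availability target over a period."""
    return (100 - target_pct) / 100 * period_minutes


def mttr_minutes(recovery_minutes: Sequence[float]) -> float:
    """Sum of recovery times / number of incidents."""
    if not recovery_minutes:
        raise ValueError("no incidents recorded")
    return sum(recovery_minutes) / len(recovery_minutes)


def mtbf_hours(operational_hours: float, failures: int) -> float:
    """Total operational time / number of failures."""
    if failures <= 0:
        raise ValueError("failures must be positive")
    return operational_hours / failures


print(f"{allowed_downtime_minutes(99.9):.1f} min/month allowed at 99.9%")
print(f"MTTR: {mttr_minutes([20, 40]):.0f} min")
print(f"MTBF: {mtbf_hours(2160, 3):.0f} hours")
```

With a 30-day month instead of the average month, the 99.9% budget comes to 43.2 minutes, so the period length should be stated alongside any downtime target.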

Operational Efficiency Metrics

Toil Percentage:
  Definition: Share of team time spent on manual, repetitive tasks
  Target: < 30% of team capacity
  Measurement: Weekly time tracking

Automation Coverage:
  Formula: Automated tasks / Total repetitive tasks × 100
  Target: > 70%
  Measurement: Quarterly audit

On-Call Load:
  Formula: Actionable alerts per on-call shift
  Target: < 5 actionable alerts per shift
  Measurement: Per engineer, weekly

Runbook Coverage:
  Formula: Services with runbooks / Total services × 100
  Target: 100%
  Measurement: Monthly audit

Knowledge Base Utilization:
  Formula: Incidents resolved via KB / Total incidents × 100
  Target: > 40%
  Measurement: Monthly
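
One way to check measurements against the targets above is a small scorecard. The metric keys, the `TARGETS` table, and the function names are hypothetical; the thresholds are copied from the list above, with the 100% runbook target treated as "at least 100".

```python
import operator
from typing import Dict, List

_OPS = {"<": operator.lt, ">": operator.gt, ">=": operator.ge}

# Thresholds from the metrics above; the key names are illustrative.
TARGETS = {
    "toil_pct": ("<", 30.0),
    "automation_coverage_pct": (">", 70.0),
    "actionable_alerts_per_shift": ("<", 5.0),
    "runbook_coverage_pct": (">=", 100.0),
    "kb_utilization_pct": (">", 40.0),
}


def pct(part: float, whole: float) -> float:
    """part / whole x 100, as used by the coverage and utilization formulas."""
    if whole <= 0:
        raise ValueError("whole must be positive")
    return part / whole * 100


def missed_targets(measured: Dict[str, float]) -> List[str]:
    """Return the names of measured metrics that do not meet their target."""
    missed = []
    for name, value in measured.items():
        comparison, threshold = TARGETS[name]
        if not _OPS[comparison](value, threshold):
            missed.append(name)
    return sorted(missed)


# Hypothetical quarter: 8 of 10 repetitive tasks automated, 9 of 10 services
# with runbooks, 35% of team time spent on toil.
measured = {
    "toil_pct": 35.0,
    "automation_coverage_pct": pct(8, 10),
    "runbook_coverage_pct": pct(9, 10),
}
print(missed_targets(measured))
```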

Integration Points

With Development Teams

Participate in design reviews for operational requirements
Provide deplo

Content truncated.

Author

davila7
