replit-incident-runbook
Execute Replit incident response procedures with triage, mitigation, and postmortem. Use when responding to Replit-related outages, investigating errors, or running post-incident reviews for Replit integration failures. Trigger with phrases like "replit incident", "replit outage", "replit down", "replit on-call", "replit emergency", "replit broken".
Install
mkdir -p .claude/skills/replit-incident-runbook && curl -L -o skill.zip "https://mcp.directory/api/skills/download/8585" && unzip -o skill.zip -d .claude/skills/replit-incident-runbook && rm skill.zip
Installs to .claude/skills/replit-incident-runbook
Replit Incident Runbook
Overview
Rapid incident response for Replit deployment failures, database issues, and platform outages. Covers triage, diagnosis, remediation, rollback, and communication.
Prerequisites
- Access to Replit Workspace and Deployment settings
- Deployment URL for health checks
- Communication channel (Slack, email)
- Familiarity with rollback via Deployment History
Severity Levels
| Level | Definition | Response Time | Examples |
|---|---|---|---|
| P1 | Complete outage | < 15 min | App returns 5xx, DB down |
| P2 | Degraded service | < 1 hour | Slow responses, intermittent errors |
| P3 | Minor impact | < 4 hours | Non-critical feature broken |
| P4 | No user impact | Next business day | Monitoring gap |
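The response-time targets in the table can be turned into a concrete deadline for the on-call responder. A minimal sketch, assuming "next business day" means 09:00 on the next weekday (Monday to Friday, holidays ignored); `RESPONSE_TARGETS` and `response_deadline` are hypothetical names, not part of any Replit tooling:

```python
from datetime import datetime, timedelta

# Response targets copied from the severity table above.
RESPONSE_TARGETS = {
    "P1": timedelta(minutes=15),
    "P2": timedelta(hours=1),
    "P3": timedelta(hours=4),
}


def response_deadline(severity: str, detected_at: datetime) -> datetime:
    """Return the latest acceptable time to start responding."""
    if severity in RESPONSE_TARGETS:
        return detected_at + RESPONSE_TARGETS[severity]
    if severity == "P4":
        # Assumption: "next business day" = 09:00 on the next Mon-Fri day.
        day = detected_at + timedelta(days=1)
        while day.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
            day += timedelta(days=1)
        return day.replace(hour=9, minute=0, second=0, microsecond=0)
    raise ValueError(f"unknown severity: {severity!r}")
```

For example, a P4 detected on a Friday morning would be due the following Monday at 09:00 under this assumption.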
Quick Triage (First 5 Minutes)
set -euo pipefail
DEPLOY_URL="https://your-app.replit.app"  # replace with your deployment URL
echo "=== TRIAGE ==="
# 1. Check Replit platform status
echo -n "Replit Status: "
curl -s --max-time 10 https://status.replit.com/api/v2/summary.json | \
python3 -c "import sys,json;print(json.load(sys.stdin)['status']['description'])" 2>/dev/null || \
echo "Check https://status.replit.com"
# 2. Check your deployment health
echo -n "App Health: "
curl -s --max-time 30 -o /dev/null -w "HTTP %{http_code} (%{time_total}s)" "$DEPLOY_URL/health" 2>/dev/null || echo " UNREACHABLE"
echo ""
# 3. Get health details
echo "Health Response:"
curl -s --max-time 30 "$DEPLOY_URL/health" 2>/dev/null | python3 -m json.tool 2>/dev/null || echo "No response"
# 4. Check if it's a cold start issue (Autoscale): a much faster repeat request suggests the first one hit a cold start
echo -n "Second request: "
curl -s --max-time 30 -o /dev/null -w "HTTP %{http_code} (%{time_total}s)\n" "$DEPLOY_URL/health" || echo " UNREACHABLE"
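Comparing the first and second request timings from the triage output is a quick way to spot a cold start. A minimal sketch of that comparison; the function name `looks_like_cold_start` and both thresholds (3x ratio, 1 second gap) are assumptions chosen for illustration, not values from Replit:

```python
def looks_like_cold_start(first_s: float, second_s: float,
                          ratio: float = 3.0, min_gap_s: float = 1.0) -> bool:
    """Return True when the first request was much slower than the repeat.

    Both thresholds are illustrative assumptions: the first request must be
    at least `ratio` times slower AND at least `min_gap_s` seconds slower.
    """
    gap = first_s - second_s
    return gap >= min_gap_s and first_s >= ratio * second_s
```

A result of True points at the "Cold Start Too Slow" remediation; a slow second request as well suggests a different cause, such as the database or memory pressure.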
Decision Tree
App not responding?
├─ YES: Is status.replit.com reporting an incident?
│   ├─ YES → Platform issue. Wait for Replit. Communicate to users.
│   └─ NO → Your deployment issue. Continue below.
│
│   Can you access the Replit Workspace?
│   ├─ YES → Check deployment logs:
│   │   ├─ Build error → Fix code, redeploy
│   │   ├─ Runtime crash → Check logs, fix, redeploy
│   │   └─ Secret missing → Add to Secrets tab, redeploy
│   └─ NO → Network/browser issue. Try incognito window.
│
└─ App responds but with errors?
    ├─ 5xx errors → Check logs for crash/exception
    ├─ Slow responses → Check database, cold start, memory
    └─ Auth not working → Verify deployment domain, not dev URL
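The decision tree above can also be written as a small function, which is handy for on-call tooling or a chat bot. This is only a sketch: `triage`, its parameters, and the symptom keys (`"5xx"`, `"slow"`, `"auth"`) are hypothetical names, and the returned strings paraphrase the tree.

```python
def triage(app_responds: bool, platform_incident: bool = False,
           workspace_accessible: bool = True, symptom: str = "") -> str:
    """Walk the decision tree and return the suggested next action."""
    if not app_responds:
        if platform_incident:
            return "Platform issue: wait for Replit, communicate to users"
        if not workspace_accessible:
            return "Network/browser issue: try an incognito window"
        return "Deployment issue: check deployment logs"
    # App responds, but with errors: branch on the observed symptom.
    actions = {
        "5xx": "Check logs for crash/exception",
        "slow": "Check database, cold start, memory",
        "auth": "Verify deployment domain, not dev URL",
    }
    return actions.get(symptom, "No mapped action: investigate manually")
```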
Remediation by Error Type
Deployment Crash (5xx / App Unreachable)
1. Open Replit Workspace
2. Go to Deployment Settings > Logs
3. Look for the crash reason:
   - "Error: Cannot find module..." → Missing dependency
   - "FATAL: Missing secrets..." → Add to Secrets tab
   - "EADDRINUSE" → Port conflict in .replit config
   - "JavaScript heap out of memory" → Increase VM size or fix memory leak
4. Fix the issue in code
5. Click "Deploy" to redeploy
6. If fix is unclear, ROLLBACK:
   - Deployment Settings > History
   - Click "Rollback" on last known-good version
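Step 3 amounts to matching known substrings in the deployment logs. A minimal sketch, assuming the log text has been copied out of Deployment Settings > Logs into a string; `CRASH_PATTERNS` and `diagnose` are hypothetical names, and the patterns are only the four listed above:

```python
# (substring to look for, remediation hint), taken from the list above.
CRASH_PATTERNS = [
    ("Cannot find module", "Missing dependency"),
    ("Missing secrets", "Add to Secrets tab"),
    ("EADDRINUSE", "Port conflict in .replit config"),
    ("JavaScript heap out of memory", "Increase VM size or fix memory leak"),
]


def diagnose(log_text: str) -> list[str]:
    """Return the remediation hints whose pattern appears in the logs."""
    return [hint for needle, hint in CRASH_PATTERNS if needle in log_text]
```

An empty result means none of the known crash signatures matched, in which case rollback (step 6) is the safer path while the logs are read by hand.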
Database Connection Failure
1. Check database status in Database pane
2. Verify DATABASE_URL is set in Secrets
3. Test connection:
# From Replit Shell (requires the pg package: npm install pg)
node -e "
const {Pool} = require('pg');
// rejectUnauthorized:false skips TLS certificate verification; acceptable for a quick diagnostic only
const pool = new Pool({connectionString: process.env.DATABASE_URL, ssl:{rejectUnauthorized:false}});
pool.query('SELECT NOW()').then(r => console.log('OK:', r.rows[0])).catch(e => console.error('FAIL:', e.message)).finally(() => pool.end());
"
4. If connection fails:
   - Check if PostgreSQL is provisioned (Database pane)
   - Try creating a new database
   - Check for connection pool exhaustion (max connections)
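Before testing the connection, it can help to confirm that `DATABASE_URL` is well-formed and points at the expected host, without printing the password. A minimal sketch using only the Python standard library; `describe_database_url` is a hypothetical helper and the default port 5432 is the usual PostgreSQL default, assumed here:

```python
from urllib.parse import urlsplit


def describe_database_url(url: str) -> dict:
    """Summarize a PostgreSQL connection URL without exposing the password."""
    parts = urlsplit(url)
    if parts.scheme not in ("postgres", "postgresql"):
        # Do not echo the URL itself: it may contain credentials.
        raise ValueError("DATABASE_URL does not use a postgres scheme")
    return {
        "host": parts.hostname,
        "port": parts.port or 5432,  # assumption: default PostgreSQL port
        "database": parts.path.lstrip("/"),
        "user": parts.username,
        "has_password": parts.password is not None,
    }
```

In the Replit Shell this could be called as `describe_database_url(os.environ["DATABASE_URL"])` after `import os`.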
Cold Start Too Slow (Autoscale)
If cold starts exceed acceptable latency:
1. Check the deployment type: Autoscale scales to zero when idle, so the first request after an idle period pays the cold-start cost
2. Options:
   a. Switch to Reserved VM (always-on, no cold starts)
   b. Set up external keep-alive (ping /health every 4 min)
   c. Optimize startup: lazy imports, defer DB connection
3. To switch to Reserved VM:
   - Update .replit: under [deployment], set deploymentTarget = "gce" (the value "cloudrun" selects Autoscale)
   - Redeploy
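Option (b), the external keep-alive, could look roughly like the following when run from a machine outside Replit. This is a sketch under assumptions: `ping` and `keep_alive` are hypothetical names, the 240-second interval mirrors the "every 4 min" suggestion, and the `fetch` and `sleep` parameters exist only so the loop can be exercised without network access.

```python
import time
import urllib.request


def ping(url: str, timeout_s: float = 10.0) -> int:
    """Fetch the URL once and return the HTTP status code."""
    with urllib.request.urlopen(url, timeout=timeout_s) as resp:
        return resp.status


def keep_alive(url: str, interval_s: float = 240.0, max_pings=None,
               fetch=ping, sleep=time.sleep) -> int:
    """Ping `url` every `interval_s` seconds; return the number of pings.

    With max_pings=None the loop runs until the process is stopped.
    """
    count = 0
    while max_pings is None or count < max_pings:
        try:
            print(f"{url} -> HTTP {fetch(url)}")
        except OSError as exc:  # URLError, HTTPError and timeouts are OSErrors
            print(f"{url} -> FAILED ({exc})")
        count += 1
        if max_pings is None or count < max_pings:
            sleep(interval_s)
    return count
```

A keep-alive keeps an Autoscale instance warm at the cost of continuous usage, so it is worth comparing against the price of a Reserved VM.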
Secrets Missing After Deploy
1. Open Secrets tab (lock icon in sidebar)
2. Verify all required secrets are present
3. Check Deployment Settings > Environment Variables
4. Workspace secrets should sync to deployments automatically (2025 and later); if they do not:
   - Remove and re-add the secret
   - Redeploy
5. For Account-level secrets:
   - Account Settings > Secrets
   - These apply to ALL Repls
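Because secrets reach the app as environment variables, a startup check can fail fast with a clear message instead of crashing later. A minimal sketch; `missing_secrets` is a hypothetical helper, and the list of required names is whatever your app actually needs:

```python
import os


def missing_secrets(required, env=None):
    """Return the names in `required` that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]
```

At startup, something like `missing = missing_secrets(["DATABASE_URL"])` followed by an explicit exit when `missing` is non-empty produces a log line that is easy to find in Deployment Settings > Logs.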
Rollback Procedure
Replit supports one-click rollback to any previous deployment:
1. Deployment Settings > History
2. Find the last successful deployment
3. Click "Rollback to this version"
4. Verify health endpoint
5. Investigate root cause before redeploying fix
Rollback restores:
- Code at that deployment's commit
- Deployment configuration at that time
Rollback does NOT revert database changes; undo a bad migration separately.
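Step 4 of the rollback procedure (verify the health endpoint) usually needs a few retries while the rolled-back deployment comes up. A minimal sketch of that retry loop; `wait_until_healthy` is a hypothetical name, `probe` is any zero-argument callable returning an HTTP status code, and the attempt count and delay are illustrative assumptions:

```python
import time


def wait_until_healthy(probe, attempts: int = 5, delay_s: float = 5.0,
                       sleep=time.sleep) -> bool:
    """Call `probe` until it returns 200 or the attempts run out."""
    for attempt in range(1, attempts + 1):
        try:
            if probe() == 200:
                return True
        except OSError:
            pass  # connection errors count as "not healthy yet"
        if attempt < attempts:
            sleep(delay_s)
    return False
```

A False result after rollback means the previous version is also unhealthy, which points away from the latest code change and toward configuration, secrets, or the database.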
Communication Templates
Internal (Slack)
P[1-4] INCIDENT: [App Name] on Replit
Status: INVESTIGATING / IDENTIFIED / MONITORING / RESOLVED
Impact: [What users are experiencing]
Cause: [If known]
Action: [What we're doing]
ETA: [When we expect resolution]
Next update: [Time]
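Filling the internal template from code keeps updates consistent across responders. A minimal sketch that renders the fields above as plain text; `format_internal_update` and `VALID_STATUSES` are hypothetical names, and posting the result to Slack is left out:

```python
VALID_STATUSES = ("INVESTIGATING", "IDENTIFIED", "MONITORING", "RESOLVED")


def format_internal_update(severity: int, app_name: str, status: str,
                           impact: str, action: str, cause: str = "Unknown",
                           eta: str = "TBD", next_update: str = "TBD") -> str:
    """Fill in the internal incident template; rejects unknown values."""
    if severity not in (1, 2, 3, 4):
        raise ValueError("severity must be 1-4")
    if status not in VALID_STATUSES:
        raise ValueError(f"status must be one of {VALID_STATUSES}")
    return "\n".join([
        f"P{severity} INCIDENT: {app_name} on Replit",
        f"Status: {status}",
        f"Impact: {impact}",
        f"Cause: {cause}",
        f"Action: {action}",
        f"ETA: {eta}",
        f"Next update: {next_update}",
    ])
```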
External (Status Page)
[App Name] Service Disruption
We are experiencing issues with [specific feature/service].
[Describe user impact].
We have identified the cause and are working on a fix.
Estimated resolution: [time].
Last updated: [timestamp]
Post-Incident
Evidence Collection
set -euo pipefail
# Capture deployment logs
# Go to Deployment Settings > Logs > Copy relevant entries
# Capture timeline
echo "Timeline of events:" > incident-report.md
echo "- [time] Issue detected" >> incident-report.md
echo "- [time] Investigation started" >> incident-report.md
echo "- [time] Root cause identified" >> incident-report.md
echo "- [time] Fix deployed / rollback executed" >> incident-report.md
echo "- [time] Service restored" >> incident-report.md
Postmortem Template
## Incident: [Title]
**Date:** YYYY-MM-DD
**Duration:** X hours Y minutes
**Severity:** P[1-4]
### Summary
[1-2 sentence description of what happened]
### Root Cause
[Technical explanation]
### Timeline
- HH:MM — First alert
- HH:MM — Investigation started
- HH:MM — Root cause found
- HH:MM — Fix deployed / rollback
- HH:MM — Service restored
### Impact
- Users affected: [N]
- Downtime: [duration]
### Action Items
- [ ] [Prevention measure] — Owner — Due date
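The Duration field in the template can be computed from the first and last timeline entries rather than estimated. A minimal sketch, assuming both timestamps are in the same time zone; `format_duration` is a hypothetical helper:

```python
from datetime import datetime


def format_duration(first_alert: datetime, restored: datetime) -> str:
    """Format the incident length as 'X hours Y minutes'."""
    if restored < first_alert:
        raise ValueError("restored must not be earlier than first_alert")
    total_minutes = int((restored - first_alert).total_seconds() // 60)
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours} hours {minutes} minutes"
```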
Error Handling
| Issue | Cause | Solution |
|---|---|---|
| Can't access Workspace | Replit outage | Check status.replit.com and wait for resolution |
| Rollback not available | No previous deployments | Fix forward, deploy fix |
| Logs too short | Container restarted | Set up external log aggregator |
| DB rollback needed | Bad migration | Restore from Replit DB snapshot |
Next Steps
For data handling patterns, see replit-data-handling.