Daniel Cohen·
Got a runnable DB-failover runbook with promote commands and blameless post-mortem skeleton
Generate structured incident response plans, runbooks, and post-mortem templates for production outages and service degradations.
Incident Response Commander
You are an SRE incident commander. Create a comprehensive incident response framework for the following scenario.
**System Architecture:**
{{system_architecture}}
**Incident Type:**
{{incident_type}}
**Affected Services:**
{{affected_services}}
**Severity Level:**
{{severity_level}}
**Team Structure:**
{{team_structure}}
**On-Call Rotation:**
{{oncall_rotation}}
**Escalation Policy:**
{{escalation_policy}}
Generate:
1. **Incident Response Plan**:
- Triage checklist (first 5 minutes, first 15 minutes, first hour)
- Communication templates (internal, customer-facing, executive)
- Severity classification and SLAs
2. **Runbook for {{incident_type}}**: Step-by-step diagnosis and remediation procedures
3. **War Room Setup**: Video bridge, Slack channel naming, role assignments (IC, scribe, communications)
4. **Detection & Alerting**: Specific alerts that should have triggered, alert improvement recommendations
5. **Rollback Procedures**: Safe rollback steps for each affected service with verification
6. **Communication Timeline**: When and how to communicate with stakeholders
7. **Post-Mortem Template**: Structured blameless post-mortem with timeline, 5 Whys, action items
8. **Follow-Up Actions**: CAP (Corrective Action Plan) with owners and deadlines
9. **Prevention Measures**: Technical and process changes to prevent recurrence
10. **Metrics**: How to measure incident response effectiveness (MTTR, MTTD)
Output ready-to-use documents that the team can execute during a live incident.
Ergebnisse
## Incident Response Framework — Database primary failover (SEV-1)
Architecture: 3 API pods → Postgres primary + 1 replica → Redis. Below is the runnable IR plan for a primary-down event.
**Triage checklist:**
- **First 5 min** — declare the incident, page the on-call DBA, open `#inc-db-failover`, assign roles (IC, scribe, comms). Confirm scope: is it the primary or a network partition?
- **First 15 min** — promote the replica (`pg_ctl promote` / managed failover), repoint the app's `DATABASE_URL`, verify writes succeed. Post the first customer status update.
- **First hour** — confirm replication re-established to a new standby, monitor error rate back to baseline, draft the timeline.
**Runbook — promote replica:**
```bash
# 1. confirm primary is truly down (not a flapping LB)
pg_isready -h primary.db.internal || echo "primary unreachable"
# 2. promote the standby
aws rds failover-db-cluster --db-cluster-identifier checkout-prod
# 3. verify the new primary accepts writes
psql -h primary.db.internal -c "INSERT INTO _healthcheck DEFAULT VALUES;"
```
**Comms templates** — internal ("investigating elevated errors on checkout"), customer-facing ("some users may see delays placing orders, we are on it"), exec (impact + ETA).
**Blameless post-mortem skeleton:** timeline → 5 Whys → contributing factors → corrective actions with owners + dates. **Metrics to capture:** MTTD (alert → ack), MTTR (ack → resolved). **Prevention candidate from this class:** automate failover so promotion is not a manual 15-minute step.
Modell: Claude Sonnet 4
107 Likes36 SavesScore: 59
4 Kommentare
Marco Rossi·
Tried a variant of this and it caught a bug I'd been chasing for days.
Jonas Weber·
Tried it against a gnarly legacy file and it untangled it cleanly.
Ahmed Hassan·
This is the best debugging, testing example I've seen on here.
Emily Chen·
Ran it on our repo and it worked first try, honestly impressed.