Luca Brunner
A region-failover runbook with per-step timing and the exact rollback trigger - drill-ready
Creates a comprehensive DR plan with RPO/RTO definitions, backup strategies, failover runbooks, and chaos engineering tests.
Disaster Recovery & Business Continuity Plan
Act as a Site Reliability Engineering Director responsible for business continuity at a critical infrastructure company. Create a disaster recovery plan.
**System Architecture**: {{system_architecture}} (brief description of current architecture, cloud providers, critical components)
**Criticality Tiers**: {{criticality_tiers}} (Tier 1: revenue-impacting, Tier 2: customer-facing non-revenue, Tier 3: internal tools)
**RPO/RTO Targets**: {{rpo_rto_targets}} (Recovery Point Objective and Recovery Time Objective per tier)
**Compliance Requirements**: {{compliance_requirements}} (industry regulations requiring DR, audit requirements)
Create the complete DR plan:
1. **DR Strategy Selection** - Backup & Restore, Pilot Light, Warm Standby, Hot Standby, Active-Active for each tier
2. **RPO/RTO Matrix** - Per service/component: current RPO/RTO, target RPO/RTO, gap analysis, mitigation plan
3. **Backup Strategy** - Database backups (full, incremental, continuous), file/object storage replication, configuration backup, code/artifact backup
4. **Infrastructure as Code Recovery** - Terraform/CloudFormation state backup, full environment recreation time, state versioning
5. **Database Recovery** - Point-in-time recovery procedures, cross-region read replica promotion, manual failover steps with timing
6. **Failover Procedures** - DNS failover (Route53 health checks), load balancer reconfiguration, auto-scaling group adjustments
7. **Runbook Library** - Step-by-step runbooks for: database failover, complete region failover, single service recovery, data corruption recovery
8. **Communication Plan** - Incident commander rotation, customer communication templates, status page updates, stakeholder notification tree
9. **Chaos Engineering** - Monthly chaos experiments (region shutdown, AZ failure, network partition), Game Day schedule, failure injection tools (Gremlin/Litmus)
10. **Testing Schedule** - Quarterly DR drills, tabletop exercises, full failover tests, automated DR testing in staging
11. **Dependencies Mapping** - External dependency failure impact (payment gateway, auth provider, CDN), vendor DR commitments
12. **Cost Analysis** - DR infrastructure costs per strategy, cost of downtime per hour, ROI of DR investment
13. **Post-Incident Recovery** - Root cause analysis template, lessons learned process, plan updates, readiness score tracking
Include the actual runbook for complete region failover with timing estimates for each step.
Results
## DR plan — region failover runbook
**Architecture:** AWS, primary `eu-central-1`, warm standby in `eu-west-1`. Tier 1 (checkout) RPO 1min / RTO 15min; Tier 2 RPO 15min / RTO 1h.
**Strategy per tier:** Tier 1 = warm standby (replica running, scaled down). Tier 2 = pilot light. Tier 3 = backup & restore.
**Region failover runbook (with timing):**
| Step | Action | Validation | Est. |
|------|--------|-----------|------|
| 1 | Declare incident, page on-call IC | status page → "investigating" | 1m |
| 2 | Promote `eu-west-1` read replica to primary | `pg_is_in_recovery()` = false | 4m |
| 3 | Scale standby ECS service 2 → 20 | target group healthy count = 20 | 5m |
| 4 | Flip Route53 weighted record to west | `dig` resolves to west ALB | 2m (+TTL 60s) |
| 5 | Run post-failover smoke suite | checkout E2E green | 3m |
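
The per-step estimates in the table can be totalled and compared with the Tier 1 RTO of 15 minutes. The sketch below is a minimal check and rests on two assumptions that the runbook does not state: the steps run strictly one after another, and the 60-second DNS TTL is waited out in full rather than overlapping with the smoke suite. The names `Step`, `total_minutes`, and `headroom` are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Step:
    number: int
    action: str
    minutes: float


# Estimates copied from the runbook table.
STEPS = [
    Step(1, "Declare incident, page on-call IC", 1),
    Step(2, "Promote eu-west-1 read replica to primary", 4),
    Step(3, "Scale standby ECS service 2 -> 20", 5),
    Step(4, "Flip Route53 weighted record to west", 2),
    Step(5, "Run post-failover smoke suite", 3),
]

DNS_TTL_MINUTES = 1.0     # the "+TTL 60s" noted on step 4
TIER1_RTO_MINUTES = 15.0  # Tier 1 RTO from the architecture summary


def total_minutes(steps: Sequence[Step], extra: float = 0.0) -> float:
    """Sum the step estimates, assuming strictly sequential execution."""
    return sum(s.minutes for s in steps) + extra


def headroom(steps: Sequence[Step], rto: float, extra: float = 0.0) -> float:
    """Minutes left inside the RTO; a negative value means the budget is exceeded."""
    return rto - total_minutes(steps, extra)


if __name__ == "__main__":
    print("steps only:", total_minutes(STEPS), "min")
    print("with full TTL wait:", total_minutes(STEPS, DNS_TTL_MINUTES), "min")
    print("headroom vs RTO:", headroom(STEPS, TIER1_RTO_MINUTES, DNS_TTL_MINUTES), "min")
```

Under these assumptions the steps alone use the whole 15-minute budget, so a drill is a good place to measure whether the TTL wait really adds to the total or overlaps with step 5.
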
**Rollback trigger:** if the step 5 smoke suite fails, or the error rate stays above 5% for 5 minutes, fail back to the primary region once it has recovered. **Data reconciliation:** replay the dual-write outbox to close the gap of up to 1 minute allowed by the Tier 1 RPO.
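
The rollback trigger can be written down as a small decision function so that a drill evaluates it the same way every time. This is a sketch under assumptions: error-rate samples arrive as `(unix_seconds, rate)` pairs, "stays above 5% for 5 minutes" is read as "every sample in the last 300 seconds exceeds 0.05 and the samples reach back at least that far", and the function names and return values are hypothetical.

```python
from typing import Sequence, Tuple

ERROR_RATE_THRESHOLD = 0.05  # 5%, from the rollback trigger
SUSTAINED_SECONDS = 300      # 5 minutes, from the rollback trigger

Sample = Tuple[float, float]  # (unix_seconds, error_rate as a fraction)


def error_rate_sustained(
    samples: Sequence[Sample],
    now: float,
    threshold: float = ERROR_RATE_THRESHOLD,
    window: float = SUSTAINED_SECONDS,
) -> bool:
    """True if the error rate exceeded `threshold` for the whole last `window` seconds."""
    if not samples:
        return False
    # Require observations covering the entire window, not just its tail.
    if min(t for t, _ in samples) > now - window:
        return False
    recent = [rate for t, rate in samples if t >= now - window]
    return bool(recent) and all(rate > threshold for rate in recent)


def rollback_decision(
    smoke_passed: bool,
    samples: Sequence[Sample],
    now: float,
    primary_recovered: bool,
) -> str:
    """Return 'stay', 'wait_for_primary', or 'fail_back'."""
    triggered = (not smoke_passed) or error_rate_sustained(samples, now)
    if not triggered:
        return "stay"
    # The runbook only fails back once the primary region has recovered.
    return "fail_back" if primary_recovered else "wait_for_primary"
```

For example, `rollback_decision(False, [], now=0, primary_recovered=False)` yields `"wait_for_primary"`: the smoke failure fires the trigger, but traffic stays on the standby until the primary is healthy again.
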
**Testing:** a monthly Game Day injects an AZ failure via the AWS Fault Injection Simulator; a quarterly full-region failover drill runs against staging. **Dependencies mapped:** the payment gateway and auth provider each have documented SLAs. If Stripe is down, checkout degrades to "pay later" rather than hard-failing. Each drill updates a readiness score that is tracked over time.
Model: Claude Opus 4