Luca Brunner

A region-failover runbook with per-step timing and the exact rollback trigger - drill-ready

Creates a comprehensive DR plan with RPO/RTO definitions, backup strategies, failover runbooks, and chaos engineering tests.

Disaster Recovery & Business Continuity Plan

Act as a Site Reliability Engineering Director responsible for business continuity at a critical infrastructure company. Create a disaster recovery plan.

**System Architecture**: {{system_architecture}} (brief description of current architecture, cloud providers, critical components)
**Criticality Tiers**: {{criticality_tiers}} (Tier 1: revenue-impacting, Tier 2: customer-facing non-revenue, Tier 3: internal tools)
**RPO/RTO Targets**: {{rpo_rto_targets}} (Recovery Point Objective and Recovery Time Objective per tier)
**Compliance Requirements**: {{compliance_requirements}} (industry regulations requiring DR, audit requirements)

Create the complete DR plan:

1. **DR Strategy Selection** - Backup & Restore, Pilot Light, Warm Standby, Hot Standby, Active-Active for each tier
2. **RPO/RTO Matrix** - Per service/component: current RPO/RTO, target RPO/RTO, gap analysis, mitigation plan
3. **Backup Strategy** - Database backups (full, incremental, continuous), file/object storage replication, configuration backup, code/artifact backup
4. **Infrastructure as Code Recovery** - Terraform/CloudFormation state backup, full environment recreation time, state versioning
5. **Database Recovery** - Point-in-time recovery procedures, cross-region read replica promotion, manual failover steps with timing
6. **Failover Procedures** - DNS failover (Route53 health checks), load balancer reconfiguration, auto-scaling group adjustments
7. **Runbook Library** - Step-by-step runbooks for: database failover, complete region failover, single service recovery, data corruption recovery
8. **Communication Plan** - Incident commander rotation, customer communication templates, status page updates, stakeholder notification tree
9. **Chaos Engineering** - Monthly chaos experiments (region shutdown, AZ failure, network partition), Game Day schedule, failure injection tools (Gremlin/Litmus)
10. **Testing Schedule** - Quarterly DR drills, tabletop exercises, full failover tests, automated DR testing in staging
11. **Dependencies Mapping** - External dependency failure impact (payment gateway, auth provider, CDN), vendor DR commitments
12. **Cost Analysis** - DR infrastructure costs per strategy, cost of downtime per hour, ROI of DR investment
13. **Post-Incident Recovery** - Root cause analysis template, lessons learned process, plan updates, readiness score tracking

Include the actual runbook for complete region failover with timing estimates for each step.

Results

## DR plan: region failover runbook

**Architecture:** AWS, primary `eu-central-1`, warm standby in `eu-west-1`. Tier 1 (checkout) RPO 1min / RTO 15min; Tier 2 RPO 15min / RTO 1h.

**Strategy per tier:** Tier 1 = warm standby (replica running, scaled down). Tier 2 = pilot light. Tier 3 = backup & restore.

**Region failover runbook (with timing):**

| Step | Action | Validation | Est. |
|------|--------|-----------|------|
| 1 | Declare incident, page on-call IC | status page → "investigating" | 1m |
| 2 | Promote `eu-west-1` read replica to primary | `pg_is_in_recovery()` = false | 4m |
| 3 | Scale standby ECS service 2 → 20 | target group healthy count = 20 | 5m |
| 4 | Flip Route53 weighted record to west | `dig` resolves to west ALB | 2m (+TTL 60s) |
| 5 | Run post-failover smoke suite | checkout E2E green | 3m |

**Rollback trigger:** if step 5 smoke fails or error rate stays > 5% for 5min, fail back to primary once it recovers.

**Data reconciliation:** the dual-write outbox is replayed to catch the < 1min RPO gap.

**Testing:** monthly Game Day injects an AZ failure via the AWS Fault Injection Simulator; quarterly full-region failover drill against staging.

**Dependencies mapped:** payment gateway and auth provider have their own SLAs documented. If Stripe is down, checkout degrades to "pay later" rather than hard-failing.

Each drill updates a readiness score tracked over time.
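The timing column and the rollback trigger can be checked mechanically before a drill. The following is a minimal sketch, not part of the generated plan: the names `Step`, `serial_budget`, and `should_roll_back` are hypothetical, the step estimates are copied from the table above, and it assumes that the steps run strictly one after another, that the 60 s TTL adds a full minute to step 4, and that error-rate samples arrive as `(timestamp_seconds, rate)` pairs in ascending order.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    number: int
    action: str
    est_minutes: int


# Estimates copied from the runbook table.
# Assumption: the "+TTL 60s" on step 4 is counted as one extra minute.
STEPS = [
    Step(1, "Declare incident, page on-call IC", 1),
    Step(2, "Promote eu-west-1 read replica to primary", 4),
    Step(3, "Scale standby ECS service 2 -> 20", 5),
    Step(4, "Flip Route53 weighted record to west (+TTL)", 2 + 1),
    Step(5, "Run post-failover smoke suite", 3),
]

RTO_MINUTES = 15  # Tier 1 target from the plan


def serial_budget(steps):
    """Total minutes if every step waits for the previous one to finish."""
    return sum(s.est_minutes for s in steps)


def should_roll_back(smoke_passed, error_samples, threshold=0.05, window_s=300):
    """Evaluate the rollback trigger from the runbook.

    Returns True if the smoke suite failed, or if the error rate stayed
    above `threshold` continuously for at least `window_s` seconds.
    `error_samples` is a list of (timestamp_seconds, error_rate) pairs,
    sorted by timestamp.
    """
    if not smoke_passed:
        return True
    breach_start = None  # timestamp of the first sample in the current breach
    for ts, rate in error_samples:
        if rate > threshold:
            if breach_start is None:
                breach_start = ts
            if ts - breach_start >= window_s:
                return True
        else:
            breach_start = None  # any healthy sample resets the window
    return False


if __name__ == "__main__":
    total = serial_budget(STEPS)
    print(f"serial estimate: {total} min, RTO: {RTO_MINUTES} min")
    print("within RTO" if total <= RTO_MINUTES else "over RTO")
```

Under these assumptions the serial estimate is 16 minutes against a 15 minute Tier 1 RTO, and exactly 15 minutes if the TTL is ignored, so the table leaves no slack as written. If the replica promotion and the ECS scale-up can overlap (the runbook does not say), the critical path would be shorter; a drill is the place to measure that.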

Model: Claude Opus 4
