Ahmed Hassan

Litmus pod-delete experiment with an httpProbe that auto-aborts if checkout health drops

Design controlled chaos engineering experiments using Chaos Monkey, Litmus, or Gremlin to validate system resilience and failure recovery.

Chaos Engineering Experiment Designer

You are a chaos engineering specialist. Design a controlled chaos experiment plan for the following system.

- **System Architecture:** {{system_architecture}}
- **Critical User Journeys:** {{critical_journeys}}
- **Tech Stack:** {{tech_stack}}
- **Current Reliability Concerns:** {{reliability_concerns}}
- **Chaos Tool:** {{chaos_tool}}
- **Safety Requirements:** {{safety_requirements}}

Design:

1. **Steady-State Hypothesis**: Define normal system behavior metrics for each journey
2. **Experiment Portfolio**: 8-10 specific experiments:
   - Instance termination (random pod/node failure)
   - Network latency injection (100ms, 500ms, 1000ms)
   - Network partition (split-brain scenarios)
   - CPU exhaustion (stress testing with 80%, 95%, 100% CPU)
   - Memory pressure (OOM conditions)
   - Disk failure / I/O throttling
   - DNS failure simulation
   - Database connection pool exhaustion
   - Dependency failure (simulate 3rd party API down)
   - Time drift / clock skew
3. **Blast Radius**: Controlled scope with automatic abort conditions
4. **Abort Conditions**: Metrics-based automatic rollback triggers
5. **Chaos Scripts**: Complete executable scripts for {{chaos_tool}}
6. **Monitoring Setup**: What dashboards to watch during each experiment
7. **Safety Measures**: Feature flags, circuit breakers, automatic abort
8. **Game Day Schedule**: Planned execution timeline with observers and decision makers
9. **Success Criteria**: How to measure if the system passed the experiment
10. **Remediation Plan**: Expected fixes for likely failures discovered

Output complete experiment definitions and safety procedures.

Results

## Chaos Experiment Plan: payment service resilience (Litmus on EKS)

**Steady-state hypothesis:** during normal operation, checkout success rate ≥ 99.5% and p99 latency < 400ms. Every experiment aborts automatically if the success rate drops below 98%.

**Experiment 1: kill a payment pod (instance termination):**

```yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: checkout-pod-delete
  namespace: payments
spec:
  engineState: active
  appinfo:
    appns: payments
    applabel: app=checkout-api
    appkind: deployment
  # Assumption: a service account with pod-delete RBAC exists under this name.
  chaosServiceAccount: pod-delete-sa
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION   # seconds
              value: "60"
            - name: PODS_AFFECTED_PERC     # one of three pods
              value: "33"
        probe:
          - name: checkout-healthy
            type: httpProbe
            mode: Continuous               # keep polling while chaos runs
            httpProbe/inputs:
              url: http://checkout-api/api/ready
              responseTimeout: 2000        # milliseconds
              method:
                get:
                  criteria: ==
                  responseCode: "200"
            runProperties:
              # Plain integers as in Litmus 2.x; newer releases may expect "5s"-style values.
              probeTimeout: 5
              interval: 2
              retry: 3
              stopOnFailure: true          # halt the experiment once retries are exhausted
```

**Experiment 2: inject 500ms latency** on the Postgres call to prove that the timeout and circuit breaker engage rather than letting the failure cascade.

**Blast radius control:** experiments run in staging first, scoped to one deployment, with the `httpProbe` acting as an automatic abort: if `/api/ready` fails 3 times in a row, Litmus halts the experiment and the pods recover.

**Expected finding:** if killing one of three pods spikes latency, the dying pod is probably still receiving traffic because it is deregistered from the Service too slowly. Fix by tuning `terminationGracePeriodSeconds` and the readiness probe interval.

**Game day:** run experiment 1 live with the on-call team watching the Grafana SLO panel.
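Experiment 2 is described above without a manifest, and the 98% success-rate abort is not something a readiness `httpProbe` can measure on its own. The following is a minimal sketch of how both might be expressed, assuming the stock Litmus `pod-network-latency` experiment and a `promProbe`. The engine name, service account, Postgres hostname, Prometheus endpoint, and the `http_requests_total` metric with its labels are all hypothetical placeholders, and field formats can differ between Litmus versions.

```yaml
# Sketch only: every value marked "assumed" is a placeholder, not part of the plan.
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: checkout-db-latency                      # assumed name
  namespace: payments
spec:
  engineState: active
  appinfo:
    appns: payments
    applabel: app=checkout-api
    appkind: deployment
  chaosServiceAccount: pod-network-latency-sa    # assumed service account
  experiments:
    - name: pod-network-latency
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION         # seconds
              value: "60"
            - name: NETWORK_LATENCY              # milliseconds
              value: "500"
            - name: DESTINATION_HOSTS            # delay only traffic to the database
              value: "postgres.payments.svc.cluster.local"   # assumed hostname
            - name: PODS_AFFECTED_PERC
              value: "33"
            - name: CONTAINER_RUNTIME            # assumed: containerd nodes
              value: "containerd"
            - name: SOCKET_PATH
              value: "/run/containerd/containerd.sock"
        probe:
          - name: checkout-success-rate
            type: promProbe
            mode: Continuous
            promProbe/inputs:
              endpoint: http://prometheus.monitoring.svc.cluster.local:9090   # assumed
              # Assumed metric and labels: percentage of non-5xx checkout requests.
              query: >-
                100 * sum(rate(http_requests_total{app="checkout-api",code!~"5.."}[1m]))
                / sum(rate(http_requests_total{app="checkout-api"}[1m]))
              comparator:
                criteria: ">="
                value: "98"
            runProperties:
              probeTimeout: 5
              interval: 2
              retry: 3
              stopOnFailure: true                # stop chaos when the rate stays below 98%
```

Scoping the delay with `DESTINATION_HOSTS` keeps the blast radius to the database path, so a latency spike elsewhere in checkout points at a missing timeout or circuit breaker rather than at the injection itself.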

Model: Claude Sonnet 4
