Ahmed Hassan
Litmus pod-delete experiment with an httpProbe that auto-aborts if checkout health drops
Design controlled chaos engineering experiments using Chaos Monkey, Litmus, or Gremlin to validate system resilience and failure recovery.
Chaos Engineering Experiment Designer
You are a chaos engineering specialist. Design a controlled chaos experiment plan for the following system.
**System Architecture:**
{{system_architecture}}
**Critical User Journeys:**
{{critical_journeys}}
**Tech Stack:**
{{tech_stack}}
**Current Reliability Concerns:**
{{reliability_concerns}}
**Chaos Tool:**
{{chaos_tool}}
**Safety Requirements:**
{{safety_requirements}}
Design:
1. **Steady-State Hypothesis**: Define normal system behavior metrics for each journey
2. **Experiment Portfolio**: 8-10 specific experiments:
- Instance termination (random pod/node failure)
- Network latency injection (100ms, 500ms, 1000ms)
- Network partition (split-brain scenarios)
- CPU exhaustion (stress testing with 80%, 95%, 100% CPU)
- Memory pressure (OOM conditions)
- Disk failure / I/O throttling
- DNS failure simulation
- Database connection pool exhaustion
- Dependency failure (simulate 3rd party API down)
- Time drift / clock skew
3. **Blast Radius**: Controlled scope with automatic abort conditions
4. **Abort Conditions**: Metrics-based automatic rollback triggers
5. **Chaos Scripts**: Complete executable scripts for {{chaos_tool}}
6. **Monitoring Setup**: What dashboards to watch during each experiment
7. **Safety Measures**: Feature flags, circuit breakers, automatic abort
8. **Game Day Schedule**: Planned execution timeline with observers and decision makers
9. **Success Criteria**: How to measure if the system passed the experiment
10. **Remediation Plan**: Expected fixes for likely failures discovered
Output complete experiment definitions and safety procedures.
Results
## Chaos Experiment Plan — payment service resilience (Litmus on EKS)
**Steady-state hypothesis:** during normal operation, checkout success rate ≥ 99.5% and p99 latency < 400 ms. Every experiment aborts automatically if the success rate drops below 98%.
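To make the hypothesis checkable, the thresholds can be turned into a small evaluation function. The following is a minimal sketch, not part of any Litmus API: the `JourneySample` shape and the `evaluate` name are hypothetical, and treating a window with no traffic as healthy is an assumption.

```python
from dataclasses import dataclass

# Thresholds taken from the steady-state hypothesis.
STEADY_SUCCESS_RATE = 99.5  # percent
STEADY_P99_MS = 400.0       # milliseconds
ABORT_SUCCESS_RATE = 98.0   # percent


@dataclass(frozen=True)
class JourneySample:
    """One observation window for the checkout journey (hypothetical shape)."""
    total_requests: int
    failed_requests: int
    p99_ms: float

    @property
    def success_rate(self) -> float:
        if self.total_requests == 0:
            return 100.0  # assumption: no traffic means nothing has failed
        ok = self.total_requests - self.failed_requests
        return 100.0 * ok / self.total_requests


def evaluate(sample: JourneySample) -> str:
    """Classify one window as 'abort', 'violated', or 'steady'."""
    if sample.success_rate < ABORT_SUCCESS_RATE:
        return "abort"      # below the hard abort line
    if sample.success_rate < STEADY_SUCCESS_RATE or sample.p99_ms >= STEADY_P99_MS:
        return "violated"   # hypothesis broken, but not yet an abort
    return "steady"
```

Keeping "violated" separate from "abort" lets an experiment record a failed hypothesis without necessarily stopping the run.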
**Experiment 1 — kill a payment pod (instance termination):**
```yaml
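apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata: { name: checkout-pod-delete, namespace: payments }
spec:
  engineState: active
  chaosServiceAccount: pod-delete-sa  # assumed name; use the service account set up for this experiment
  appinfo: { appns: payments, applabel: "app=checkout-api", appkind: deployment }
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - { name: TOTAL_CHAOS_DURATION, value: "60" }  # seconds
            - { name: PODS_AFFECTED_PERC, value: "33" }    # one of three pods
        probe:
          - name: checkout-healthy
            type: httpProbe
            mode: Continuous  # keep checking for the whole chaos window
            httpProbe/inputs:
              url: "http://checkout-api/api/ready"
              responseTimeout: 2000  # milliseconds
              method:
                get:
                  criteria: "=="
                  responseCode: "200"
            runProperties:
              probeTimeout: 5  # integer seconds here; some Litmus versions expect strings such as "5s"
              interval: 2
              retry: 3
              stopOnFailure: true  # stop the experiment when the probe fails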
```
**Experiment 2 — inject 500ms latency** on the Postgres call to prove the timeout + circuit breaker engage rather than cascading.
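What "engage rather than cascading" means can be illustrated with a minimal circuit breaker sketch. This is an illustration under assumptions, not the payment service's actual code: the `CircuitBreaker` class, its default thresholds, and the injected clock are hypothetical, and the wrapped call is assumed to raise (for example `TimeoutError`) when the database query exceeds its own timeout.

```python
import time
from typing import Callable, Optional


class CircuitOpenError(RuntimeError):
    """Raised when the breaker rejects a call without trying the dependency."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_after_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self._clock = clock  # injectable so tests do not need to sleep
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, fn: Callable[[], object]) -> object:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after_s:
                # Open: fail fast instead of queueing behind a slow dependency.
                raise CircuitOpenError("dependency marked unhealthy")
            # Cool-down elapsed: half-open, let one trial call through.
        try:
            result = fn()
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open, or re-open after a failed trial
            raise
        # A success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

During the latency experiment, the signal to look for is that callers start receiving fast `CircuitOpenError`-style rejections after a few timeouts, instead of every request waiting out the full 500 ms plus timeout.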
**Blast radius control:** experiments run in staging first and are scoped to one deployment, with the `httpProbe` acting as the automatic abort: if `/api/ready` fails 3 times in a row, Litmus halts the experiment and the deployment restores the deleted pods.
**Expected finding:** if killing one of three pods spikes latency, the dying pod is probably still receiving traffic because it is deregistered from the service too slowly. Fix by tuning `terminationGracePeriodSeconds` and the readiness probe interval.
**Game day:** run experiment 1 live with the on-call team watching the Grafana SLO panel.
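The abort rule from the blast radius control ("fails 3 times, then halt") amounts to counting consecutive probe failures. The sketch below only illustrates that rule and is not how Litmus implements its probes: `first_abort_tick` and its parameters are hypothetical names, and the probe is passed in as a plain callable so the logic can be exercised without a cluster.

```python
from typing import Callable, Optional


def first_abort_tick(probe: Callable[[], bool], ticks: int,
                     max_consecutive_failures: int = 3) -> Optional[int]:
    """Poll `probe` once per tick and return the tick index at which the
    experiment would be aborted, or None if it runs to completion."""
    consecutive = 0
    for tick in range(ticks):
        if probe():
            consecutive = 0  # any healthy response resets the streak
        else:
            consecutive += 1
            if consecutive >= max_consecutive_failures:
                return tick
    return None
```

Resetting the streak on every healthy response means a single slow reply during pod rescheduling does not abort the run, while a sustained outage does.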
Model: Claude Sonnet 4