Tobias Keller·
Per-dependency resilience matrix - aggressive breaker on the optional tax API, none on payments
Implements comprehensive resilience patterns including circuit breakers, bulkheads, retries, timeouts, and fallback strategies for fault-tolerant distributed systems.
Circuit Breaker & Resilience Patterns Design
You are a Distributed Systems Engineer who implemented resilience patterns at Netflix (Hystrix) and Amazon. Design a comprehensive resilience architecture.
**System Architecture**: {{system_architecture}} (microservices, dependencies, external APIs, critical user journeys)
**Failure Modes**: {{failure_modes}} (common failures: DB timeout, third-party API down, network partition, memory pressure)
**SLA Requirements**: {{sla_requirements}} (availability target, max degradation under partial failure, acceptable error rate)
**Dependency Criticality**: {{dependency_criticality}} (which services are critical vs optional, external vs internal)
Design the resilience patterns:
1. **Circuit Breaker Configuration** - Per-dependency CB settings: failure threshold (%), slow call threshold (ms), wait duration in open state (s), half-open max calls
2. **Retry Strategy** - Exponential backoff formula: base * 2^attempt + jitter, max retry count per service type, idempotency requirements
3. **Timeout Hierarchy** - Global timeout, per-service timeout, per-operation timeout, timeout budget allocation across call chain
4. **Bulkhead Pattern** - Thread pool isolation per dependency, semaphore isolation, max concurrent calls per service, queue size limits
5. **Fallback Strategies** - Cache fallback, degraded response, static response, graceful degradation per feature, emergency mode
6. **Rate Limiting** - Token bucket per client, sliding window, distributed rate limiting with Redis, different limits per tier
7. **Load Shedding** - Prioritization matrix (critical vs best-effort), admission control, graceful request rejection with Retry-After headers
8. **Health Check Design** - Deep vs shallow health checks, dependency health aggregation, Kubernetes readiness/liveness probes configuration
9. **Chaos Engineering** - Failure injection schedule, Game Day scenarios, automated chaos tests in staging (CPU, memory, network, latency)
10. **Cascade Failure Prevention** - Root cause analysis of cascade patterns, prevention measures, bulkhead sizing formula
11. **Dashboard & Alerting** - Circuit breaker state visualization, error rate by dependency, retry rate, fallback activation rate, latency by call
12. **Testing Resilience** - Unit tests for each pattern, integration tests with WireMock/TestContainers simulating failures, chaos tests
13. **Framework Implementation** - Resilience4j (Java), Polly (.NET), resilience patterns in Go/TypeScript/Rust with configuration code
Include actual code/configuration for circuit breakers with specific threshold values, not generic guidance. Provide a dependency matrix with specific settings per dependency.
Ergebnisse
Resilience design for a **checkout flow** calling payment, inventory, and a flaky third-party tax API. Settings are per-dependency — a blunt one-size config either trips too early or too late.
**Dependency matrix**
| Dependency | Failure rate budget | CB threshold | Timeout | Fallback |
|-----------|---------------------|--------------|---------|----------|
| Payment (critical) | 0.1% | 50% over 20 calls | 4s | none — fail the order |
| Inventory (critical) | 0.5% | 50% over 20 calls | 2s | none |
| Tax API (optional) | 5% | 30% over 10 calls | 800ms | cached/estimated rate |
The key call: tax is **optional**, so its breaker is aggressive and it has a graceful fallback; payment is critical with no fallback.
```typescript
const taxBreaker = new CircuitBreaker(callTaxApi, {
timeout: 800,
errorThresholdPercentage: 30,
resetTimeout: 10_000,
volumeThreshold: 10,
});
taxBreaker.fallback(() => estimatedTaxFromCache()); // degrade, don't fail checkout
async function getTax(cart: Cart): Promise<TaxResult> {
return taxBreaker.fire(cart); // returns estimate if the breaker is open
}
```
**Timeout budget:** the checkout request has a 6s total budget — payment (4s) + inventory (2s) run with the tax call (800ms) folded in via the fallback so it never blows the budget. **Bulkhead:** each dependency gets an isolated connection pool so a tax-API stall can't starve payment connections. **Retry:** only `payment` retries (3x, jittered backoff) and only on transient errors, gated by idempotency keys. **Load shedding:** under CPU pressure, best-effort traffic (recommendations) is rejected with `Retry-After` before critical paths degrade.
Modell: Claude Opus 4
19 Likes8 SavesScore: 18
2 Kommentare
Ahmed Hassan·
I was skeptical but the output is genuinely production-grade.
Jonas Weber·
Works in TS strict mode with no complaints, which is rare.