Daniel Cohen

Got a phased 2x/10x/100x roadmap with the exact bottlenecks ranked and a load-test gate per phase

Creates a phased scalability roadmap with load testing strategy, bottleneck identification, and infrastructure scaling triggers for predictable growth.

Scalability Planning & Capacity Roadmap

You are a Site Reliability Engineer and Scalability Architect at a hyper-growth startup. Create a comprehensive scalability plan.

**Current System**: {{current_system}} (architecture, current traffic, known bottlenecks)

**Growth Trajectory**: {{growth_trajectory}} (user growth % per month, expected traffic milestones)

**Peak Traffic Pattern**: {{peak_traffic_pattern}} (daily/weekly/seasonal spikes, flash sale events)

**Budget Constraints**: {{budget_constraints}} (monthly infrastructure budget, cost-per-user target)

Provide:

1. **Scalability Assessment** - Current bottlenecks ranked by severity with concrete metrics (CPU, memory, I/O, network)
2. **Scaling Phases** - Phase 1 (immediate 2x), Phase 2 (10x), Phase 3 (100x) with specific architectural changes at each
3. **Horizontal Scaling Strategy** - Stateless design requirements, session externalization, sticky session elimination
4. **Database Scaling Path** - Read replicas, connection pooling, query optimization, sharding timeline
5. **Caching Hierarchy** - L1 (in-process), L2 (Redis), L3 (CDN) with cache invalidation strategy and hit ratio targets
6. **Load Testing Plan** - k6/Artillery/JMeter test scenarios, RPS targets, soak test duration, chaos engineering tests
7. **Auto-scaling Policies** - HPA/VPA/Cluster-autoscaler configuration with scale-up/scale-down triggers and cooldowns
8. **Traffic Management** - Rate limiting tiers, load shedding strategy, graceful degradation plan, bulkhead pattern
9. **CDN & Edge Strategy** - Static asset caching, dynamic content at edge, edge functions/compute deployment
10. **Monitoring & Alerting** - SLIs/SLOs/SLAs definition, latency/error rate/throughput dashboards, paging thresholds
11. **Cost Projections** - Infrastructure cost at 1x, 10x, 100x scale with cost optimization strategies
12. **Runbook** - Step-by-step manual scaling procedures if auto-scaling fails

Include specific configuration values (e.g., Redis maxmemory-policy allkeys-lru, nginx worker_connections 4096).

Results

## Scalability roadmap: order-management monolith

**Current:** single Express process + one Postgres primary, ~800 RPS, p95 280ms. Growth: +30%/mo, flash-sale spikes to 6x baseline. Budget ceiling 4k EUR/mo.

**Bottlenecks, ranked**

| # | Bottleneck | Signal | Fix |
|---|-----------|--------|-----|
| 1 | Single Node event loop | One CPU core pinned at peak | `cluster` module, 1 worker/core |
| 2 | DB write contention | Lock waits on `orders` | Move reads to replicas |
| 3 | No edge cache | Catalog re-rendered per request | CDN + 60s ISR |

**Phase 1 (2x, this sprint):** enable Node `cluster` (≈40 LOC), add `pgbouncer` in transaction mode, cache the product catalog in Redis (`maxmemory-policy allkeys-lru`, 4GB). Sessions move to Redis, so workers stay stateless and any worker can serve any request.

**Phase 2 (10x):** two read replicas behind a routing layer (writes → primary, reads → replicas), CDN in front of all GET endpoints, HPA on the API deployment scaling 4→24 pods at 65% CPU with a 120s cooldown.

**Phase 3 (100x):** shard `orders` by `customer_id` hash, make checkout asynchronous via a queue so LLM/fraud calls don't hold DB connections, multi-region active-active with geo-DNS.

**Load test before each phase:** k6 ramp to 3x target RPS, 30-min soak, then a chaos run killing one replica mid-test. SLO gates: p95 < 500ms, error rate < 0.5%.
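The growth figures in the roadmap can be turned into rough deadlines for each phase. The sketch below is only an illustration: it assumes the +30%/mo growth compounds at a constant rate from the ~800 RPS baseline, that flash-sale spikes stay at 6x, and that "3x target RPS" means three times the phase's steady-state RPS. The function names are made up for this example.

```javascript
// Rough capacity arithmetic for the 2x / 10x / 100x phases.
// Assumptions (not stated in the roadmap): constant compounding growth,
// and load tests ramp to 3x the steady-state RPS of the phase.

const BASELINE_RPS = 800;     // current steady-state traffic
const MONTHLY_GROWTH = 0.3;   // +30% per month
const FLASH_SALE_FACTOR = 6;  // spikes to 6x baseline

// Months until traffic reaches `multiple` times today's baseline:
// solve (1 + g)^m = multiple for m.
function monthsToReach(multiple, monthlyGrowth = MONTHLY_GROWTH) {
  return Math.log(multiple) / Math.log(1 + monthlyGrowth);
}

// Steady RPS, flash-sale peak, and load-test ceiling for one phase.
function phaseTargets(multiple) {
  const steadyRps = BASELINE_RPS * multiple;
  return {
    multiple,
    monthsAway: Number(monthsToReach(multiple).toFixed(1)),
    steadyRps,
    flashSalePeakRps: steadyRps * FLASH_SALE_FACTOR,
    loadTestRps: steadyRps * 3,
  };
}

for (const multiple of [2, 10, 100]) {
  console.log(phaseTargets(multiple));
}
```

Under these assumptions 2x arrives in under three months, 10x in roughly nine, and 100x in about a year and a half, which gives a sense of how much lead time each phase has.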
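The Phase 3 step "shard `orders` by `customer_id` hash" could look roughly like the following. This is a minimal sketch, not the roadmap's actual design: the shard count of 8 and the `shardFor` helper are hypothetical, and plain modulo hashing is used for brevity.

```javascript
const crypto = require("node:crypto");

// Hypothetical shard count; the roadmap does not specify one.
const SHARD_COUNT = 8;

// Map a customer ID to a shard index in [0, shardCount).
// SHA-256 is used only to spread IDs evenly, not for security.
function shardFor(customerId, shardCount = SHARD_COUNT) {
  const digest = crypto
    .createHash("sha256")
    .update(String(customerId))
    .digest();
  // Interpret the first 4 bytes as an unsigned 32-bit integer.
  return digest.readUInt32BE(0) % shardCount;
}

// All orders for one customer land on the same shard.
console.log(shardFor("customer-42"), shardFor("customer-42"));
```

One trade-off worth noting: with modulo hashing, changing the shard count remaps most customers, so a scheme such as consistent hashing or a lookup table may be preferable if the number of shards is expected to grow.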
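The SLO gates (p95 < 500ms, error rate < 0.5%) can be checked mechanically after each load-test run. The sketch below assumes the test results are available as an array of per-request latencies in milliseconds plus a count of failed requests; `percentile` and `passesSloGate` are illustrative names, and the nearest-rank percentile method is one of several reasonable choices.

```javascript
// Nearest-rank percentile: the smallest sample such that at least
// p percent of all samples are less than or equal to it.
function percentile(values, p) {
  if (values.length === 0) throw new Error("no samples");
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(rank, 1) - 1];
}

// Evaluate a run against the roadmap's gates.
// Assumes latenciesMs holds one entry per request, failed ones included.
function passesSloGate(
  latenciesMs,
  errorCount,
  { maxP95Ms = 500, maxErrorRate = 0.005 } = {}
) {
  const p95 = percentile(latenciesMs, 95);
  const errorRate = errorCount / latenciesMs.length;
  return { p95, errorRate, pass: p95 < maxP95Ms && errorRate < maxErrorRate };
}

// Example: 1000 synthetic requests between 100ms and 399ms, 3 failures.
const samples = Array.from({ length: 1000 }, (_, i) => 100 + (i % 300));
console.log(passesSloGate(samples, 3));
```

When the test runs in k6 itself, the same gate can usually be expressed with its built-in `thresholds` option on `http_req_duration` and `http_req_failed`, so a helper like this is mainly useful for post-processing exported results.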

Model: Claude Opus 4

55 Likes · 11 Saves · Score: 26

2 Comments

Priya Nair

Okay this system design output just saved me an afternoon.

Emily Chen

The system design details here are spot on.