Jonas Weber

Traced a prod 500 to a race in our deploy from Tuesday and handed me the hotfix

Deep-dive analysis of production bugs using error logs, stack traces, and context to produce root cause analysis and fix recommendations.

Production Bug Forensics

Act as a senior software engineer specializing in production incident forensics. I have encountered a production bug that needs immediate diagnosis.

**Error Message:** {{error_message}}
**Stack Trace:** {{stack_trace}}
**Affected Service/Component:** {{affected_service}}
**Recent Changes:** {{recent_changes}}
**Environment:** {{environment}}

Perform a systematic root cause analysis following these steps:

1. Parse the stack trace and identify the exact failure point
2. Correlate with recent changes and deployments
3. Identify the class of bug (race condition, null pointer, resource leak, logic error, etc.)
4. Determine blast radius and affected users/data
5. Propose an immediate hotfix and a long-term proper fix
6. Suggest monitoring/alerting to prevent recurrence

Output in structured markdown with clear sections. Prioritize actionable fixes over explanations.

Results

## Root Cause Analysis: `checkout-api` 500 spike

**Classification:** race condition (lost update on inventory decrement).

**Failure point:** `reserveStock()` reads `stock`, then writes `stock - qty` in two statements. Under concurrent orders for the same SKU, both read the same value and one decrement is lost, driving `stock` negative and tripping the `CHECK (stock >= 0)` constraint, which surfaces as the `23514` error in the trace.

**Correlation:** lines up exactly with Tuesday's deploy that removed the `SELECT ... FOR UPDATE` row lock during a "performance" refactor.

**Blast radius:** only oversold SKUs during traffic spikes, ~0.3% of orders in the last 4h, all recoverable (no charges captured before the constraint fired).

**Immediate hotfix:** make the decrement atomic:

```sql
UPDATE inventory
SET stock = stock - $1
WHERE sku = $2 AND stock >= $1; -- single atomic statement, no read-then-write
```

If `rowCount === 0`, reject with "out of stock."

**Long-term:** restore explicit locking or move to the atomic conditional update above as the permanent pattern; add a contract test that fires 50 concurrent reservations for one SKU.

**Prevent recurrence:** alert on `rate(pg_errors{code="23514"}[5m]) > 0` and add a dashboard panel for negative-stock attempts.
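The atomic conditional update and the 50-reservation contract test suggested in the analysis can be sketched roughly as follows. This is an illustration under stated assumptions, not the service's actual code: SQLite (Python's standard `sqlite3` module) stands in for PostgreSQL so the example runs without a server, and the table layout, the SKU value `SKU-1`, and the helper names `reserve_stock` and `run_concurrent_reservations` are all hypothetical.

```python
import os
import sqlite3
import tempfile
from concurrent.futures import ThreadPoolExecutor


def reserve_stock(db_path: str, sku: str, qty: int) -> bool:
    """Try to reserve `qty` units; True only if the conditional UPDATE changed a row."""
    # isolation_level=None puts the connection in autocommit mode, so the
    # UPDATE below is its own transaction. timeout lets SQLite wait for
    # competing writers instead of failing immediately.
    conn = sqlite3.connect(db_path, timeout=30, isolation_level=None)
    try:
        cur = conn.execute(
            # Check and decrement happen in one statement: no read-then-write gap.
            "UPDATE inventory SET stock = stock - ? WHERE sku = ? AND stock >= ?",
            (qty, sku, qty),
        )
        # rowcount == 0 means there was not enough stock: treat as "out of stock".
        return cur.rowcount == 1
    finally:
        conn.close()


def run_concurrent_reservations(initial_stock: int = 10, attempts: int = 50):
    """Fire `attempts` single-unit reservations at one SKU from separate threads.

    Returns (number of successful reservations, remaining stock).
    """
    with tempfile.TemporaryDirectory() as tmp:
        db_path = os.path.join(tmp, "inventory.db")

        # Hypothetical schema mirroring the CHECK constraint named in the analysis.
        setup = sqlite3.connect(db_path, isolation_level=None)
        setup.execute(
            "CREATE TABLE inventory ("
            "sku TEXT PRIMARY KEY, "
            "stock INTEGER NOT NULL CHECK (stock >= 0))"
        )
        setup.execute("INSERT INTO inventory VALUES (?, ?)", ("SKU-1", initial_stock))
        setup.close()

        # Each worker opens its own connection, like independent request handlers.
        with ThreadPoolExecutor(max_workers=attempts) as pool:
            results = list(
                pool.map(lambda _: reserve_stock(db_path, "SKU-1", 1), range(attempts))
            )

        check = sqlite3.connect(db_path)
        remaining = check.execute(
            "SELECT stock FROM inventory WHERE sku = ?", ("SKU-1",)
        ).fetchone()[0]
        check.close()
        return sum(results), remaining


if __name__ == "__main__":
    succeeded, remaining = run_concurrent_reservations()
    print(f"succeeded={succeeded} remaining={remaining}")
```

Because the availability check lives in the `WHERE` clause, a reservation either decrements the row or changes nothing, so the number of successes cannot exceed the initial stock and the `CHECK` constraint is never reached. Against PostgreSQL the same idea would use the `$1`/`$2` placeholders from the hotfix and the driver's row count.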

Model: Claude Sonnet 4

28 Likes · 12 Saves · Score: 15

1 Comment

Daniel Cohen

The abort/cleanup detail is underrated. Most people forget it and leak memory.
