Luca Brunner

Our once-a-week phantom crash got a ranked root-cause list and a stress-test plan

Diagnose and create a systematic plan for tracking down intermittent, non-deterministic bugs that are hard to reproduce.

Heisenbug Hunter

Act as a debugging specialist who hunts heisenbugs (intermittent, non-deterministic bugs). I have a bug that only appears sporadically and is impossible to reproduce reliably.

**Bug Description:** {{bug_description}}
**Occurrence Pattern:** {{occurrence_pattern}}
**Tech Stack:** {{tech_stack}}
**Known Context:** {{known_context}}
**Code Context:** {{code_snippet}}

Create a comprehensive heisenbug hunting plan:

1. Classify the bug type (race condition, timing issue, uninitialized memory, external dependency, hardware-sensitive, etc.)
2. List the top 5 most likely root causes with probability scores
3. Design targeted instrumentation/logging to capture the bug when it strikes
4. Create a stress-test strategy to increase reproduction rate
5. Suggest code changes to add defensive programming and deterministic behavior
6. Propose a systematic elimination process to narrow down the cause

Format as an action plan with prioritized steps I can execute immediately.

Results

## Heisenbug Hunting Plan: intermittent 502s under load

**Bug class:** connection-pool exhaustion (resource leak). Not random: it correlates with traffic and hides at low load.

## Top 5 root causes (likelihood)

| # | Hypothesis | Probability |
|---|-----------|-------------|
| 1 | DB connection not released on the error path in `getUser()` | 0.40 |
| 2 | Pool max (10) too low for peak concurrency | 0.25 |
| 3 | A slow query holds connections past the 30s idle timeout | 0.15 |
| 4 | Keep-alive mismatch between LB and Node | 0.12 |
| 5 | Event-loop block from sync crypto stalls release | 0.08 |

## Instrumentation (do first)

- Log `pool.totalCount` / `pool.idleCount` every 5s and watch idle trend toward 0.
- Add a `queryId` + duration log around every `client.query`, and flag any held > 5s.

## Stress strategy

Replay production traffic at 3× with `k6`; the leak that takes hours in prod should surface in minutes.

## Defensive fixes

- Wrap every checkout in `try/finally { client.release() }` so errors can't leak a connection.
- Set a `statement_timeout` so a stuck query frees its slot.

## Elimination

Bisect by disabling the suspect endpoint behind a flag. If the 502s vanish, hypothesis #1 is confirmed.
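The instrumentation and `try/finally` steps in the plan could look roughly like the sketch below. It is not taken from the original output. It assumes `pool` behaves like a node-postgres (`pg`) `Pool`, that is, it exposes `connect()`, `totalCount`, and `idleCount`. The helper names `withClient`, `timedQuery`, and `logPoolStats` are hypothetical, and the 5s thresholds simply mirror the numbers in the plan.

```javascript
// Sketch under assumptions: `pool` is a pg-style Pool; all helper names are
// made up for illustration and are not part of any library.

// Check out a client, run `work`, and always release the client,
// even when `work` throws (the suspected leak in hypothesis #1).
async function withClient(pool, work) {
  const client = await pool.connect();
  try {
    return await work(client);
  } finally {
    client.release();
  }
}

// Wrap client.query with a queryId + duration log and flag slow queries.
let nextQueryId = 0;
async function timedQuery(client, text, params, options = {}) {
  const { slowMs = 5000, log = console.log } = options;
  const queryId = ++nextQueryId;
  const start = Date.now();
  try {
    return await client.query(text, params);
  } finally {
    const durationMs = Date.now() - start;
    log({ queryId, durationMs, slow: durationMs > slowMs });
  }
}

// Log pool counters every `intervalMs`; returns a function that stops logging.
function logPoolStats(pool, intervalMs = 5000, log = console.log) {
  const timer = setInterval(() => {
    log({ total: pool.totalCount, idle: pool.idleCount });
  }, intervalMs);
  timer.unref(); // do not keep the process alive just for logging
  return () => clearInterval(timer);
}
```

A call site would then read something like `withClient(pool, (c) => timedQuery(c, "SELECT ...", [id]))`, so the release no longer depends on each handler remembering it.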

Model: Claude Sonnet 4


1 comment

Marco Rossi

Pasted, tweaked two lines, shipped. Love it.
