Luca Brunner
Our once-a-week phantom crash got a ranked root-cause list and a stress-test plan
Diagnose intermittent, non-deterministic bugs that are hard to reproduce, and build a systematic plan for tracking them down.
Heisenbug Hunter
Act as a debugging specialist who hunts heisenbugs (intermittent, non-deterministic bugs). I have a bug that only appears sporadically and is impossible to reproduce reliably.
**Bug Description:**
{{bug_description}}
**Occurrence Pattern:**
{{occurrence_pattern}}
**Tech Stack:**
{{tech_stack}}
**Known Context:**
{{known_context}}
**Code Context:**
{{code_snippet}}
Create a comprehensive heisenbug hunting plan:
1. Classify the bug type (race condition, timing issue, uninitialized memory, external dependency, hardware-sensitive, etc.)
2. List the top 5 most likely root causes with probability scores
3. Design targeted instrumentation/logging to capture the bug when it strikes
4. Create a stress-test strategy to increase reproduction rate
5. Suggest code changes to add defensive programming and deterministic behavior
6. Propose a systematic elimination process to narrow down the cause
Format as an action plan with prioritized steps I can execute immediately.
Results
## Heisenbug Hunting Plan — intermittent 502s under load
**Bug class:** connection-pool exhaustion (resource leak). The failures are not random: they correlate with traffic and stay hidden at low load.
## Top 5 root causes (likelihood)
| # | Hypothesis | Probability |
|---|-----------|---|
| 1 | DB connection not released on the error path in `getUser()` | 0.40 |
| 2 | Pool max (10) too low for peak concurrency | 0.25 |
| 3 | A slow query holds connections past the 30s idle timeout | 0.15 |
| 4 | Keep-alive mismatch between LB and Node | 0.12 |
| 5 | Event-loop block from sync crypto stalls release | 0.08 |
## Instrumentation (do first)
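- Log `pool.totalCount` and `pool.idleCount` every 5s, and watch for `idleCount` trending toward 0 while `totalCount` sits at the pool max.
- Log a `queryId` and duration around every `client.query` call, and flag any query that holds its connection for more than 5s.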
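
The two instrumentation steps above could be wired up roughly as follows. This is a minimal sketch that assumes a node-postgres-style pool exposing `totalCount`, `idleCount`, and `waitingCount`; the helper names (`snapshotPool`, `startPoolMonitor`, `timedQuery`) and the `log` sink are hypothetical and not part of any library.

```javascript
// Sketch only: `pool` and `client` are assumed to look like node-postgres objects.

// Read the pool gauges in one place so they can be logged or exported.
function snapshotPool(pool) {
  return {
    total: pool.totalCount,     // connections currently open
    idle: pool.idleCount,       // open connections not checked out
    waiting: pool.waitingCount, // callers queued for a connection
  };
}

// Log the gauges on a fixed interval (5s, as in the plan). Returns a stop function.
function startPoolMonitor(pool, log = console.log, intervalMs = 5000) {
  const timer = setInterval(() => {
    log({ event: 'pool', ...snapshotPool(pool) });
  }, intervalMs);
  timer.unref(); // monitoring alone should not keep the process alive
  return () => clearInterval(timer);
}

let nextQueryId = 0;

// Run one query and log its id and duration, whether it succeeds or throws.
async function timedQuery(client, text, params = [], log = console.log, slowMs = 5000) {
  const queryId = ++nextQueryId;
  const startedAt = Date.now();
  try {
    return await client.query(text, params);
  } finally {
    const durationMs = Date.now() - startedAt;
    log({ event: 'query', queryId, durationMs, slow: durationMs > slowMs });
  }
}
```

A steadily falling `idle` value alongside a growing `waiting` value would point toward a leak rather than a burst, since a burst should recover once traffic drops.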
## Stress strategy
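Replay production traffic at 3× with `k6`; a leak that takes hours to exhaust the pool in prod should surface at least three times sooner, and faster still if the leaking error path fires more often under load.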
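
To get a feel for how much the replay shortens the wait, the arithmetic can be sketched as a toy model. It assumes one connection leaks every `leakEvery` requests and never comes back; only the pool size of 10 comes from the plan, and the request rates and leak frequency are invented for illustration.

```javascript
// Back-of-the-envelope model, not a load test: assumes one connection leaks
// every `leakEvery` requests and that leaked connections are never returned.
function minutesUntilExhaustion({ poolMax, leakEvery, requestsPerMinute }) {
  const requestsToDrainPool = poolMax * leakEvery;
  return requestsToDrainPool / requestsPerMinute;
}

// poolMax: 10 comes from the plan; the other numbers are made up.
const prod = minutesUntilExhaustion({ poolMax: 10, leakEvery: 1000, requestsPerMinute: 50 });
const replay3x = minutesUntilExhaustion({ poolMax: 10, leakEvery: 1000, requestsPerMinute: 150 });
const smallPool = minutesUntilExhaustion({ poolMax: 2, leakEvery: 1000, requestsPerMinute: 150 });

console.log({ prod, replay3x, smallPool });
```

Under these assumptions the pool drains in 200 minutes at the baseline rate and in about 67 minutes at 3×. Shrinking the pool to 2 in the test environment, an extra lever not in the original plan, brings that to about 13 minutes.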
## Defensive fixes
- Wrap every checkout in `try/finally { client.release() }` so errors can't leak a connection.
- Set a `statement_timeout` so a stuck query frees its slot.
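
The `try/finally` fix can be centralized in one helper so no call site can forget the release. This is a minimal sketch assuming `pool.connect()` resolves to a client with `query()` and `release()`, as in node-postgres; `withClient` is a hypothetical helper, and the body of `getUser()` is invented here because the original code is not shown.

```javascript
// Sketch only: `pool.connect()` is assumed to resolve to a client with
// `query()` and `release()`, as in node-postgres.
async function withClient(pool, work) {
  const client = await pool.connect();
  try {
    return await work(client);
  } finally {
    client.release(); // runs on success and on error, so the slot always returns
  }
}

// Hypothetical version of getUser(); the table and column names are invented.
async function getUser(pool, id) {
  return withClient(pool, async (client) => {
    const { rows } = await client.query('SELECT * FROM users WHERE id = $1', [id]);
    return rows[0];
  });
}

// With node-postgres, a per-statement limit can be set in the pool config,
// for example: new Pool({ max: 10, statement_timeout: 5000 }) // values illustrative
```

Routing every checkout through a single wrapper also gives one place to add timing or logging later, instead of auditing each endpoint separately.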
## Elimination
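Bisect by disabling the suspect endpoint behind a flag. If the 502s vanish, the endpoint is implicated, which supports hypothesis #1 but does not yet prove it: a slow query in the same endpoint (#3) would look identical. If the 502s persist, move on to hypotheses #2, #4, and #5.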
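
One way to put an endpoint behind a flag without a feature-flag service is an environment variable checked in a wrapper. This is a sketch under assumptions: `DISABLED_ROUTES` is a hypothetical variable name, and the handler is assumed to use Node's plain `(req, res)` signature.

```javascript
// Sketch only: DISABLED_ROUTES is a hypothetical environment variable holding a
// comma-separated list of routes to switch off during the elimination run.
function isRouteDisabled(route, env = process.env) {
  return (env.DISABLED_ROUTES || '')
    .split(',')
    .map((entry) => entry.trim())
    .filter(Boolean)
    .includes(route);
}

// Wrap a Node-style (req, res) handler so the flag can take it out of rotation.
function guardRoute(route, handler, env = process.env) {
  return (req, res) => {
    if (isRouteDisabled(route, env)) {
      res.statusCode = 503; // distinct from the 502s under investigation
      res.end('temporarily disabled');
      return;
    }
    handler(req, res);
  };
}
```

Returning 503 rather than 502 keeps the disabled endpoint from polluting the error count being measured.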
Model: Claude Sonnet 4
1 Comment
Marco Rossi
Pasted, tweaked two lines, shipped. Love it.