Skip to main content
Marco Rossi·

Fed it 22 minutes of JSON logs and it traced a cache OOM cascading into our DB pool

Analyze application logs to detect error patterns, anomaly clusters, performance degradation signatures, and generate actionable insights.

Log Analysis & Pattern Detector

Act as an SRE performing deep log analysis. Analyze the following application logs and extract actionable insights. **Log Sample:** ``` {{log_sample}} ``` **Log Format:** {{log_format}} **Application Context:** {{application_context}} **Time Window:** {{time_window}} **Known Issues (if any):** {{known_issues}} Perform comprehensive analysis: 1. **Error Pattern Detection**: Group similar errors, identify top error types and frequencies 2. **Anomaly Clustering**: Find temporal clusters where errors spike 3. **Correlation Analysis**: Correlate errors with warnings/info logs that preceded them 4. **Root Cause Hypotheses**: Top 3 most likely root causes with confidence scores 5. **Performance Signatures**: Identify latency spikes, throughput drops, resource exhaustion patterns 6. **Affected Components**: Which services/endpoints/users are impacted 7. **Severity Assessment**: Classify each issue by severity (P0 critical to P3 minor) 8. **Recommended Actions**: Immediate fixes and long-term preventive measures 9. **Log Quality Assessment**: Identify noisy logs, missing context, structured logging improvements 10. **Alert Recommendations**: Suggest log-based alerts to catch these patterns automatically Output as a structured incident report with timelines and prioritized action items.

Ergebnisse

## Incident Report — `checkout-api` error spike, 14:00–14:22 UTC **Log format:** JSON (pino). **Window:** 22 min. Below is the top-line analysis from the sample. **1. Error clustering** — three error types, one dominates: | Error | Count | Share | |-------|-------|-------| | `ECONNREFUSED redis:6379` | 1,847 | 71% | | `TimeoutError: query exceeded 5000ms` | 612 | 24% | | `ValidationError: missing sku` | 134 | 5% | **2. Temporal correlation:** the Redis refusals begin at 14:00:11, exactly 4 seconds after a `redis client reconnecting` warning. The Postgres timeouts start 90 seconds later — a classic secondary failure: cache down → every request falls through to the DB → connection pool saturates. **3. Root-cause hypothesis (confidence 0.85):** Redis was OOM-killed (the last healthy log shows `used_memory growing, maxmemory 256mb`). The app has no cache-miss circuit breaker, so it stampeded Postgres. **4. Immediate actions:** - Raise Redis `maxmemory` and set `maxmemory-policy allkeys-lru`. - Add a short-circuit: if Redis is down, serve a degraded response instead of hammering the DB. **5. Alert recommendation:** page when `rate({app="checkout"} |= "ECONNREFUSED" [2m]) > 5`. The `ValidationError` noise (5%) is a client bug, not an outage — route it to a daily digest, not the pager. **Log-quality note:** none of these lines carry a `trace_id`, so cross-service correlation was manual. Add request-id propagation.

Modell: Claude Sonnet 4

7 Likes2 SavesScore: 5

1 Kommentar

Priya Nair·

This is the kind of prompt that pays for the subscription by itself.