Ryan Mitchell·
Finally got burn-rate alerts instead of the flaky 5xx threshold that paged us at 3am for nothing
Configure complete monitoring stacks with Prometheus, Grafana, alerting rules, dashboards, and on-call integration.
Monitoring & Alerting Stack Setup
You are a site reliability engineer (SRE) building a monitoring and alerting stack. Design a complete observability solution.
**Infrastructure Under Monitoring:**
{{infrastructure}}
**Tech Stack:**
{{tech_stack}}
**SLI/SLO Requirements:**
{{sli_slo}}
**Alerting Destinations:**
{{alerting_destinations}}
**Existing Tools:**
{{existing_tools}}
Generate:
1. **Prometheus Configuration**: scrape_configs for all services, recording rules, retention
2. **Alertmanager Configuration**: Routing tree, inhibition rules, grouping, silence configuration
3. **Alert Rules (YAML)**: 15+ specific alert rules for:
   - High error rate (threshold: {{error_threshold}})
   - High latency p99 (threshold: {{latency_threshold}})
   - Low availability/uptime
   - Resource exhaustion (CPU > 80%, memory > 85%, disk > 90%)
   - Database connection pool exhaustion
   - Queue depth/backup
   - Certificate expiry (30, 14, 7 days)
   - Unusual traffic patterns
4. **Grafana Dashboards**: JSON model for:
   - Service overview dashboard
   - Infrastructure resource dashboard
   - Application performance dashboard
   - Error analysis dashboard
5. **Log Aggregation**: Promtail/Loki configuration for log collection
6. **On-Call Integration**: PagerDuty/OpsGenie webhook configurations
7. **Runbook Links**: Link each alert to a corresponding runbook
8. **SLI Dashboard Panel**: Real-time SLO burn rate display
Output complete configuration files ready to deploy. Include severity levels and escalation policies.
Results
Prometheus + Alertmanager stack for a Go microservice fleet. SLO is 99.9% availability, p99 < 300ms. Below are the recording rules and the alert set that page on burn-rate, not raw errors.
```yaml
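# rules/slo.yml: multi-window burn-rate alerting (Google SRE method)
groups:
  - name: checkout-slo
    rules:
      # Short window: 5xx share of checkout requests over 5 minutes.
      # "by (job)" keeps the job label that Alertmanager groups and inhibits on.
      - record: job:http_error_ratio:rate5m
        expr: |
          sum by (job) (rate(http_requests_total{job="checkout",code=~"5.."}[5m]))
            / sum by (job) (rate(http_requests_total{job="checkout"}[5m]))
      # Long window: same ratio over 1 hour, so a brief spike alone cannot page.
      - record: job:http_error_ratio:rate1h
        expr: |
          sum by (job) (rate(http_requests_total{job="checkout",code=~"5.."}[1h]))
            / sum by (job) (rate(http_requests_total{job="checkout"}[1h]))
      # 14.4x burn of the 0.1% budget (99.9% SLO); both windows must exceed it.
      - alert: ErrorBudgetBurnFast
        expr: |
          job:http_error_ratio:rate1h > (14.4 * 0.001)
            and job:http_error_ratio:rate5m > (14.4 * 0.001)
        for: 2m
        labels: { severity: page }
        annotations:
          summary: "checkout burning error budget 14x, pages on-call"
          runbook: "https://runbooks.acme.io/checkout/error-budget"
      # p99 latency above the 300 ms SLO target.
      - alert: HighLatencyP99
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket{job="checkout"}[5m])) by (job, le)) > 0.3
        for: 5m
        labels: { severity: page }
      # More than 90% of the Postgres connection pool in use.
      - alert: DbPoolNearExhaustion
        expr: pg_pool_in_use / pg_pool_max > 0.9
        for: 3m
        labels: { severity: warn }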
```
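For the rules file above to take effect, Prometheus has to load it and know where to send firing alerts. A minimal sketch of that wiring follows; the hostnames, ports, and intervals are placeholders chosen for illustration, not values from the original output.

```yaml
# prometheus.yml: minimal sketch; hostnames, ports, and intervals are assumptions
global:
  scrape_interval: 15s
  evaluation_interval: 15s        # how often rule groups are evaluated
rule_files:
  - rules/slo.yml                 # path is relative to this config file
alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]   # assumed Alertmanager address
scrape_configs:
  - job_name: checkout            # must match job="checkout" in the rule expressions
    static_configs:
      - targets: ["checkout:8080"]         # hypothetical service endpoint
```

The `job_name` is the part that matters most here: the rule expressions select on `job="checkout"`, so a different scrape job name would leave the alerts permanently silent.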
```yaml
# alertmanager.yml: route by severity, inhibit warns during a page
route:
  group_by: [alertname, job]
  receiver: slack
  routes:
    - matchers: [severity="page"]
      receiver: pagerduty
inhibit_rules:
  - source_matchers: [severity="page"]
    target_matchers: [severity="warn"]
    equal: [job]
```
```
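The routing tree names two receivers, `slack` and `pagerduty`, which also need a definition in the same file before Alertmanager will accept it. A hedged sketch of that section might look like the following; the webhook URL, channel name, and routing key are placeholders, not real credentials or values from the original output.

```yaml
# alertmanager.yml (continued): receiver definitions, sketch only
receivers:
  - name: slack
    slack_configs:
      - api_url: "https://hooks.slack.com/services/REPLACE/ME"   # placeholder webhook URL
        channel: "#checkout-alerts"                              # hypothetical channel
        send_resolved: true
  - name: pagerduty
    pagerduty_configs:
      - routing_key: "REPLACE_WITH_EVENTS_V2_KEY"                # placeholder integration key
        send_resolved: true
```

Setting `send_resolved: true` on both receivers is a design choice, not a requirement: it lets the channel and the incident close on their own once the alert clears.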
**Why burn-rate over a flat threshold:** a fixed "5xx > 5%" alert pages on a 30-second blip and stays silent on a slow bleed that never crosses the threshold. The 14.4x fast-burn rule pages only at an error rate that would exhaust the 30-day budget in about 2 days (30 / 14.4 ≈ 2.1). `ErrorBudgetBurnFast` links its runbook through the `runbook` annotation; the cert-expiry alert (omitted for length) fires at 30/14/7 days.
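The "slow bleed" case is usually covered by a second, slower burn-rate rule. The sketch below assumes the 6x burn rate over a 6h/30m window pair suggested in the Google SRE Workbook (at 6x, a 30-day budget lasts about 5 days). The rule name, file name, and `warn` severity are illustrative choices, not part of the original output.

```yaml
# rules/slo-slow-burn.yml: sketch of a slow-burn companion rule (assumed values)
groups:
  - name: checkout-slo-slow-burn
    rules:
      # Short window: confirms the errors are still happening right now.
      - record: job:http_error_ratio:rate30m
        expr: |
          sum by (job) (rate(http_requests_total{job="checkout",code=~"5.."}[30m]))
            / sum by (job) (rate(http_requests_total{job="checkout"}[30m]))
      # Long window: confirms the burn has been sustained.
      - record: job:http_error_ratio:rate6h
        expr: |
          sum by (job) (rate(http_requests_total{job="checkout",code=~"5.."}[6h]))
            / sum by (job) (rate(http_requests_total{job="checkout"}[6h]))
      # 6x burn of the 0.1% error budget; both windows must exceed it.
      - alert: ErrorBudgetBurnSlow
        expr: |
          job:http_error_ratio:rate6h > (6 * 0.001)
            and job:http_error_ratio:rate30m > (6 * 0.001)
        labels: { severity: warn }   # assumption; the Workbook pages at this rate
        annotations:
          summary: "checkout burning error budget 6x over 6h"
          runbook: "https://runbooks.acme.io/checkout/error-budget"
```

Requiring both windows means the alert stops firing soon after the error rate recovers, while the 6h window keeps short spikes from triggering it.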
Model: Claude Sonnet 4
1 Comment
Tobias Keller·
Bookmarked: exactly the debugging and testing approach I was missing.