Ryan Mitchell·

Finally got burn-rate alerts instead of the flaky 5xx threshold that paged us at 3am for nothing

Configure complete monitoring stacks with Prometheus, Grafana, alerting rules, dashboards, and on-call integration.

Monitoring & Alerting Stack Setup

You are a site reliability engineer (SRE) building a monitoring and alerting stack. Design a complete observability solution.

**Infrastructure Under Monitoring:** {{infrastructure}}
**Tech Stack:** {{tech_stack}}
**SLI/SLO Requirements:** {{sli_slo}}
**Alerting Destinations:** {{alerting_destinations}}
**Existing Tools:** {{existing_tools}}

Generate:

1. **Prometheus Configuration**: scrape_configs for all services, recording rules, retention
2. **AlertManager Configuration**: Routing tree, inhibition rules, grouping, silence configuration
3. **Alert Rules (YAML)**: 15+ specific alert rules for:
   - High error rate (threshold: {{error_threshold}})
   - High latency p99 (threshold: {{latency_threshold}})
   - Low availability/uptime
   - Resource exhaustion (CPU > 80%, memory > 85%, disk > 90%)
   - Database connection pool exhaustion
   - Queue depth/backup
   - Certificate expiry (30, 14, 7 days)
   - Unusual traffic patterns
4. **Grafana Dashboards**: JSON model for:
   - Service overview dashboard
   - Infrastructure resource dashboard
   - Application performance dashboard
   - Error analysis dashboard
5. **Log Aggregation**: Promtail/Loki configuration for log collection
6. **On-Call Integration**: PagerDuty/OpsGenie webhook configurations
7. **Runbook Links**: Link each alert to a corresponding runbook
8. **SLI Dashboard Panel**: Real-time SLO burn rate display

Output complete configuration files ready to deploy. Include severity levels and escalation policies.

Results

Prometheus + Alertmanager stack for a Go microservice fleet. SLO is 99.9% availability, p99 < 300ms. Below are the recording rules and the alert set that pages on burn rate, not raw errors.

```yaml
# rules/slo.yml: multi-window burn-rate alerting (Google SRE method)
groups:
  - name: checkout-slo
    rules:
      # Short window: 5xx ratio over the last 5 minutes.
      - record: job:http_error_ratio:rate5m
        expr: |
          sum(rate(http_requests_total{job="checkout",code=~"5.."}[5m]))
          /
          sum(rate(http_requests_total{job="checkout"}[5m]))
      # Long window: 5xx ratio over the last hour.
      - record: job:http_error_ratio:rate1h
        expr: |
          sum(rate(http_requests_total{job="checkout",code=~"5.."}[1h]))
          /
          sum(rate(http_requests_total{job="checkout"}[1h]))
      # Fast burn: 14.4x the 0.1% budget rate, required on both windows.
      - alert: ErrorBudgetBurnFast
        expr: |
          job:http_error_ratio:rate1h > (14.4 * 0.001)
          and
          job:http_error_ratio:rate5m > (14.4 * 0.001)
        for: 2m
        labels: { severity: page }
        annotations:
          summary: "checkout burning error budget at 14.4x, pages on-call"
          runbook: "https://runbooks.acme.io/checkout/error-budget"
      - alert: HighLatencyP99
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket{job="checkout"}[5m])) by (le)) > 0.3
        for: 5m
        labels: { severity: page }
      - alert: DbPoolNearExhaustion
        expr: pg_pool_in_use / pg_pool_max > 0.9
        for: 3m
        labels: { severity: warn }
```

```yaml
# alertmanager.yml: route by severity, inhibit warns during a page
# The `slack` and `pagerduty` receivers must be defined under `receivers:` (omitted here).
route:
  group_by: [alertname, job]
  receiver: slack
  routes:
    - matchers: [severity="page"]
      receiver: pagerduty
inhibit_rules:
  - source_matchers: [severity="page"]
    target_matchers: [severity="warn"]
    equal: [job]
```

**Why burn-rate over a flat threshold:** a fixed "5xx > 5%" alert pages on a 30-second blip and stays silent on a slow bleed. The 14.4x fast-burn rule pages only when errors are consuming the budget fast enough to exhaust the 30-day allowance in about 2 days (30 / 14.4 ≈ 2.1), and only while both the 1h and 5m windows are over the threshold, so a brief blip alone does not page. The burn-rate alert links its runbook; the latency and pool alerts should carry a runbook annotation too. The cert-expiry alert (omitted for length) fires at 30/14/7 days.
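The rules above include only a fast-burn alert, so the slow bleed mentioned in the explanation would still go unnoticed. A minimal sketch of a slow-burn companion could look like the following, assuming the same `http_requests_total` metric and the 99.9% SLO. The file name, group name, recording-rule names, and alert name are hypothetical, and the 6x factor with 6h/30m windows comes from the Google SRE Workbook's suggested values for a 30-day window, not from the original output.

```yaml
# rules/slo-slow-burn.yml (hypothetical file name): slow-burn companion sketch
groups:
  - name: checkout-slo-slow-burn
    rules:
      # Long window: sustained 5xx ratio over the last 6 hours.
      - record: job:http_error_ratio:rate6h
        expr: |
          sum(rate(http_requests_total{job="checkout",code=~"5.."}[6h]))
          /
          sum(rate(http_requests_total{job="checkout"}[6h]))
      # Short window: confirms the errors are still happening now.
      - record: job:http_error_ratio:rate30m
        expr: |
          sum(rate(http_requests_total{job="checkout",code=~"5.."}[30m]))
          /
          sum(rate(http_requests_total{job="checkout"}[30m]))
      # Fires only when both windows exceed 6x the 0.1% budget rate.
      - alert: ErrorBudgetBurnSlow
        expr: |
          job:http_error_ratio:rate6h > (6 * 0.001)
          and
          job:http_error_ratio:rate30m > (6 * 0.001)
        labels: { severity: page }  # assumption: some teams route this to a ticket instead
        annotations:
          summary: "checkout burning error budget at 6x over 6h"
```

At 6x, the 30-day budget would last about 5 days (30 / 6), so this rule is meant to catch degradation that never crosses the 14.4x line. Whether it pages or opens a ticket is a team decision.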

Model: Claude Sonnet 4

8 Likes · 3 Saves · Score: 5

1 Comment

Tobias Keller·

Bookmarked. Exactly the debugging and testing approach I was missing.