Monitoring That Actually Tells You Something
I once inherited a Grafana instance with 47 dashboard panels. CPU utilization, memory usage, disk I/O, network bytes, JVM heap — every metric you could imagine. Everything was green. All the time.
Two days later, the API went down for 4 hours. Not a single alert fired.
Why? Because CPU was at 22%, memory at 45%, and disk at 30%. All "healthy." The actual problem was connection pool exhaustion, a metric nobody was watching.
The Four Golden Signals (and Nothing Else)
Google's SRE book nailed this. You need exactly four signals:
1. Latency — How long do requests take? Not average latency — that hides problems. Track P50, P95, and P99:
- P50 = 200ms means half your users get a response within 200ms (good)
- P95 = 800ms means 1 in 20 users waits longer than 800ms (acceptable)
- P99 = 5000ms means 1 in 100 users waits longer than 5 seconds (problem)
Your P99 is your real performance. The average lies.
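If you want to see the gap for yourself, here's a minimal sketch using nothing but the standard library. The latencies are simulated, not from a real service:

```python
import math
import random

def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: the value below which p% of samples fall."""
    ordered = sorted(samples)
    index = max(0, math.ceil(len(ordered) * p / 100) - 1)
    return ordered[index]

# Simulate 1,000 requests: 98% fast, 2% stuck in a slow tail.
latencies_ms = [random.gauss(200, 40) for _ in range(980)]
latencies_ms += [random.uniform(2000, 6000) for _ in range(20)]

for p in (50, 95, 99):
    print(f"P{p}: {percentile(latencies_ms, p):.0f}ms")

# The mean looks healthy even though 2% of users wait seconds:
print(f"Mean: {sum(latencies_ms) / len(latencies_ms):.0f}ms")
```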
2. Traffic — How many requests are you handling? This is your baseline. If traffic drops 80% at 2pm on a Tuesday, something is wrong even if all other metrics are green.
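Here's roughly what that check looks like in code. This is a sketch, not my production alerter; the names are illustrative and the 50% threshold matches the Traffic Drop alert further down:

```python
from collections import deque

WINDOW_MINUTES = 60
history: deque[int] = deque(maxlen=WINDOW_MINUTES)

def traffic_dropped(requests_this_minute: int) -> bool:
    """Fire when this minute's traffic is under 50% of the trailing hour."""
    dropped = (
        len(history) == WINDOW_MINUTES
        and requests_this_minute < 0.5 * (sum(history) / len(history))
    )
    history.append(requests_this_minute)
    return dropped
```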
3. Errors — What percentage of requests fail? Track error rate, not error count. 100 errors out of 1 million requests (0.01%) is fine. 100 errors out of 200 requests (50%) is an outage.
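The code version is one division. Same numbers as above:

```python
def error_rate(errors: int, total: int) -> float:
    """Rate, not count: divide by traffic before you judge."""
    return errors / total if total else 0.0

print(f"{error_rate(100, 1_000_000):.2%}")  # 0.01% -> fine
print(f"{error_rate(100, 200):.2%}")        # 50.00% -> outage
```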
4. Saturation — How full is your system? Database connections, memory, queue depth, thread pools. When any resource hits 80% utilization, you need to act — not because it's broken, but because you've lost your headroom.
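A toy version of the sweep, with made-up numbers; in a real system you'd pull these from your pool, executor, and queue:

```python
# Flag any resource that has crossed 80% utilization, i.e. lost its headroom.
HEADROOM_THRESHOLD = 0.80

resources = {
    "db_connections": (82, 100),     # (in use, capacity)
    "worker_threads": (40, 64),
    "queue_depth":    (900, 10_000),
}

for name, (used, capacity) in resources.items():
    utilization = used / capacity
    if utilization >= HEADROOM_THRESHOLD:
        print(f"ALERT {name}: {utilization:.0%} used, headroom is gone")
```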
My Actual Monitoring Setup
For the Nexural platform:
```yaml
# What I alert on
alerts:
  - name: "High Error Rate"
    condition: error_rate > 5% for 5 minutes
    severity: critical
    notify: email + slack

  - name: "High Latency"
    condition: p95_latency > 2000ms for 5 minutes
    severity: warning
    notify: slack

  - name: "Traffic Drop"
    condition: requests_per_minute < 50% of 1h_average
    severity: warning
    notify: slack

  - name: "DB Connection Saturation"
    condition: active_connections > 80% of pool_size
    severity: critical
    notify: email + slack
```
That's 4 alerts. Not 40. Every alert requires action. If an alert fires and the response is "ignore it," delete the alert.
The Anti-Patterns
Dashboard-driven development. Adding a panel for every metric because "more data is better." More data is more noise. You end up with 47 panels and zero insight.
Alerting on symptoms, not causes. "CPU is high" is a symptom. "Request queue depth is growing because the database is slow" is a cause. Alert on the cause.
Percentage-based thresholds without baselines. "Alert when CPU > 80%" means nothing if your baseline is 75%. Alert on deviation from baseline, not absolute values.
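One simple way to encode "deviation from baseline" is a rolling mean plus standard deviation with a sigma cutoff. The 3-sigma default here is a common starting point, not gospel:

```python
import statistics

def deviates_from_baseline(history: list[float], current: float,
                           sigmas: float = 3.0) -> bool:
    """True when `current` sits more than `sigmas` deviations off the mean."""
    mean = statistics.fmean(history)
    stdev = statistics.pstdev(history)
    if stdev == 0:
        return current != mean
    return abs(current - mean) > sigmas * stdev

cpu_history = [74, 76, 75, 73, 77, 75, 74, 76]   # baseline hovers near 75%
print(deviates_from_baseline(cpu_history, 80))   # True: 80% is a real anomaly here
print(deviates_from_baseline(cpu_history, 76))   # False: normal wiggle
```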
No alert for the absence of data. If your monitoring system stops receiving data, do you get an alert? Most people's answer is no. Add a heartbeat check: if no data received for 5 minutes, something is wrong.
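A minimal dead man's switch looks something like this; `record_datapoint` is a made-up hook your ingest path would call:

```python
import time

HEARTBEAT_TIMEOUT_S = 5 * 60
last_seen = time.monotonic()

def record_datapoint() -> None:
    """Call this every time a datapoint arrives from the pipeline."""
    global last_seen
    last_seen = time.monotonic()

def monitoring_is_silent() -> bool:
    """True when no data has arrived for 5 minutes: something is wrong."""
    return time.monotonic() - last_seen > HEARTBEAT_TIMEOUT_S
```

And the check that pages you on `monitoring_is_silent()` has to run somewhere outside the system it watches, or it dies along with it.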
The Dashboard I Actually Look At
One dashboard. Four panels. That's it.
```
┌─────────────────────┬─────────────────────┐
│ Request Latency     │ Error Rate          │
│ P50/P95/P99         │ 5xx / total (%)     │
│ (15 min window)     │ (15 min window)     │
├─────────────────────┼─────────────────────┤
│ Traffic             │ Saturation          │
│ Requests/min        │ DB connections      │
│ (vs 24h ago)        │ (% of pool)         │
└─────────────────────┴─────────────────────┘
```
If all four panels are normal, the system is healthy. I don't need to check anything else. If one panel is abnormal, I know exactly where to look.
The Lesson
Good monitoring isn't about collecting data. It's about answering one question quickly: "Is the system working for users right now?"
If your monitoring can't answer that in 10 seconds, it's decoration.