DevOps · 10 min read

Monitoring That Actually Tells You Something

Dashboards with 47 panels where everything is green aren't monitoring. They're decoration. Here's what I actually monitor and why most alerting is useless noise.

By Jason Teixeira · November 1, 2025
Tags: Monitoring, SRE, Alerting, DevOps, Production, Observability

I once inherited a Grafana instance with 47 dashboard panels. CPU utilization, memory usage, disk I/O, network bytes, JVM heap — every metric you could imagine. Everything was green. All the time.

Two days later, the API went down for 4 hours. Not a single alert fired.

Why? Because CPU was at 22%, memory at 45%, and disk at 30%. All "healthy." The actual problem was connection pool exhaustion, and nobody was watching pool usage.

The Four Golden Signals (and Nothing Else)

Google's SRE book nailed this. You need exactly four signals:

1. Latency — How long do requests take? Not average latency — that hides problems. Track P50, P95, and P99:

  • P50 = 200ms means half your users get responses in 200ms or less (good)
  • P95 = 800ms means 1 in 20 users waits 800ms or longer (acceptable)
  • P99 = 5000ms means 1 in 100 users waits 5 seconds or longer (problem)

Your P99 is your real performance. The average lies.
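To make the "average lies" point concrete, here is a minimal sketch that computes a mean and nearest-rank percentiles over a batch of latencies. The sample data and the `percentile` helper are made up for illustration; a real setup would more likely rely on histogram metrics from the monitoring stack than on raw samples.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical latencies in milliseconds: 98 fast requests and 2 very slow ones.
latencies_ms = [200] * 98 + [5000] * 2

mean_ms = sum(latencies_ms) / len(latencies_ms)
p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
p99 = percentile(latencies_ms, 99)

print(f"mean={mean_ms:.0f}ms p50={p50}ms p95={p95}ms p99={p99}ms")
```

With this sample the mean is 296ms, which looks healthy, while the P99 is 5000ms. Only the percentile view exposes the slow tail.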

2. Traffic — How many requests are you handling? This is your baseline. If traffic drops 80% at 2pm on a Tuesday, something is wrong even if all other metrics are green.
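A traffic check along these lines could be sketched as a comparison against a baseline for the same time window. The function names, the idea of a per-hour baseline, and the 80% cutoff are assumptions for illustration, not a description of any particular tool.

```python
def traffic_drop_ratio(current_rps, baseline_rps):
    """Fraction by which current traffic is below the baseline (0.0 if at or above it)."""
    if baseline_rps <= 0:
        return 0.0  # no usable baseline, so nothing to compare against
    return max(0.0, (baseline_rps - current_rps) / baseline_rps)

def is_traffic_anomalous(current_rps, baseline_rps, max_drop=0.8):
    """True when traffic has dropped by max_drop (80% by default) or more."""
    return traffic_drop_ratio(current_rps, baseline_rps) >= max_drop

# Hypothetical numbers: the usual Tuesday 2pm rate is 500 requests per second.
print(is_traffic_anomalous(current_rps=100, baseline_rps=500))  # 80% drop
print(is_traffic_anomalous(current_rps=450, baseline_rps=500))  # 10% drop
```

The baseline matters more than the threshold: comparing against the same hour on previous weeks avoids flagging the normal overnight dip.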

3. Errors — What percentage of requests fail? Track error rate, not error count. 100 errors out of 1 million requests (0.01%) is fine. 100 errors out of 200 requests (50%) is an outage.
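The rate-versus-count distinction is simple arithmetic, sketched below with the two cases from the paragraph above. The 1% alert threshold is an assumed value for illustration; pick one that matches your own service.

```python
def error_rate(errors, total_requests):
    """Fraction of requests that failed; 0.0 when there was no traffic."""
    if total_requests == 0:
        return 0.0
    return errors / total_requests

ALERT_THRESHOLD = 0.01  # assumed: alert when more than 1% of requests fail

for errors, total in [(100, 1_000_000), (100, 200)]:
    rate = error_rate(errors, total)
    status = "ALERT" if rate > ALERT_THRESHOLD else "ok"
    print(f"{errors} errors / {total} requests = {rate:.2%} -> {status}")
```

Both cases have the same error count, so an alert on count alone would treat them identically.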

4. Saturation — How full is your system? Database connections, memory, queue depth, thread pools. When any resource hits 80% utilization, you need to act — not because it's broken, but because you've lost your headroom.
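A saturation check can be sketched as "used divided by capacity" for each bounded resource, flagged at the 80% mark. The resource names and numbers below are hypothetical; in practice they would come from your database driver, queue, and runtime metrics.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Resource:
    name: str
    used: float
    capacity: float

    @property
    def utilization(self):
        """Fraction of capacity currently in use."""
        if self.capacity <= 0:
            raise ValueError(f"{self.name}: capacity must be positive")
        return self.used / self.capacity

def saturated(resources, threshold=0.8):
    """Names of resources at or above the utilization threshold."""
    return [r.name for r in resources if r.utilization >= threshold]

# Hypothetical snapshot: host-level numbers look fine, the connection pool does not.
snapshot = [
    Resource("memory_gb", used=45, capacity=100),
    Resource("db_connections", used=19, capacity=20),
    Resource("queue_depth", used=300, capacity=1000),
    Resource("thread_pool", used=160, capacity=200),
]

print(saturated(snapshot))
```

In this snapshot memory sits at 45% while the connection pool is at 95%, the kind of gap that CPU and memory panels alone would not show.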

My Actual Monitoring Setup

For the Nexural platform:


Written by Jason Teixeira, Founder, Sage Ideas Studio · Principal Engineer