
How I Debug Production Issues (A Real Framework, Not Guessing)

January 5, 2026 · 10 min read

Debugging · Production · Incident Response · Engineering · Framework

Early in my career, I debugged by vibes. Something broke, I'd stare at the code, change something, redeploy, hope. Sometimes it worked. Often it made things worse.

At Home Depot — where a bug affected 2,300+ stores — I couldn't afford to guess. I developed a framework. It's not glamorous, but it works every time.

The Framework: ISOLATE

I — Identify the symptom (not the cause)
S — Scope the blast radius
O — Observe the data (logs, metrics, traces)
L — List hypotheses (minimum 3)
A — Assess each hypothesis with evidence
T — Test the fix in isolation
E — Explain what happened (postmortem)

Let me walk through a real example.

Real Case: A Dashboard Taking 30 Seconds to Load

I — Identify the symptom. Users report the quality dashboard takes 30+ seconds to load. Locally it loads in 2 seconds. Production only.

Don't jump to "it's a database problem" or "it's a network issue" yet. Just describe what you see.

S — Scope the blast radius. Is it all users or specific ones? All browsers? Started when? Correlated with a deploy?

In this case: all users, started 3 days ago, no deploy in that window. That eliminates "we shipped broken code" as the cause.
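
A quick way to sanity-check the "no deploy in that window" claim, assuming deploys run through GitHub Actions (the rest of this stack already leans on GitHub CI); `deploy.yml` and the repo are placeholders:

```bash
# List recent runs of the deploy workflow and compare their timestamps
# against when the slowdown started (workflow name is a placeholder)
gh run list --workflow deploy.yml --limit 10
```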

O — Observe the data.

```bash
# Check API response times
curl -o /dev/null -s -w "Total: %{time_total}s\n" https://api.example.com/quality
```

```sql
-- Check database query times
SELECT query, calls, mean_exec_time
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;

-- Check for connection pool exhaustion
SELECT count(*) FROM pg_stat_activity WHERE state = 'active';
```

The data showed: API response time was 28 seconds. Database query took 0.3 seconds. So the bottleneck wasn't the database.
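
To attribute those 28 seconds, curl's write-out variables can split a request into phases (DNS, TCP, TLS, time to first byte). A minimal sketch against the same endpoint:

```bash
# Phase-by-phase timing: a slow first byte with fast DNS/connect points
# at server-side work rather than the network path
curl -o /dev/null -s -w 'dns:        %{time_namelookup}s\nconnect:    %{time_connect}s\ntls:        %{time_appconnect}s\nfirst byte: %{time_starttransfer}s\ntotal:      %{time_total}s\n' \
  https://api.example.com/quality
```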

L — List hypotheses.

  1. GitHub API rate limiting (we fetch CI artifacts)
  2. Artifact ZIP file grew too large (parsing bottleneck)
  3. DNS resolution delay on the serverless function cold start

Never go with just one hypothesis. If you only have one, you'll confirm it whether it's right or not.

A — Assess each hypothesis.

Hypothesis 1: Checked GitHub API response headers — `X-RateLimit-Remaining: 4823`. Not rate limited.
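
You can pull the same headers yourself; GitHub's `/rate_limit` endpoint reports quota without counting against it (token assumed in `$GITHUB_TOKEN`):

```bash
# Show remaining API quota from the response headers
curl -sI -H "Authorization: Bearer $GITHUB_TOKEN" \
  https://api.github.com/rate_limit | grep -i '^x-ratelimit-'
```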

Hypothesis 2: Downloaded the latest artifact ZIP — 340KB. Normal size. But wait — the function was downloading artifacts from the LAST 20 workflow runs to find the newest one. And GitHub's artifact API was paginating slowly because the repo had 500+ workflow runs.

Found it. The API was making 20 sequential HTTP requests to GitHub, each taking ~1.4 seconds.
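
The sequential cost is easy to reproduce from a shell. A sketch of what the function was effectively doing, with `OWNER/REPO` as placeholders:

```bash
# Time an artifacts call for each of the last 20 runs, one request at a time;
# at ~1.4s each, the total lands near the observed 28 seconds
for id in $(gh api 'repos/OWNER/REPO/actions/runs?per_page=20' \
    --jq '.workflow_runs[].id'); do
  curl -o /dev/null -s -w "run $id: %{time_total}s\n" \
    -H "Authorization: Bearer $GITHUB_TOKEN" \
    "https://api.github.com/repos/OWNER/REPO/actions/runs/$id/artifacts"
done
```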

T — Test the fix. Changed the logic to fetch only the latest 3 runs instead of 20. Tested locally against production GitHub data. Load time: 3.2 seconds.
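
In shell terms, the fix amounts to capping the listing call (again, `OWNER/REPO` are placeholders):

```bash
# One small response instead of a paginated crawl: only the 3 newest runs
curl -s -H "Authorization: Bearer $GITHUB_TOKEN" \
  "https://api.github.com/repos/OWNER/REPO/actions/runs?per_page=3" \
  | jq '.workflow_runs[] | {id, created_at}'
```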

Deployed behind a feature flag. Monitored for 24 hours. Confirmed fix.

E — Explain what happened. Wrote a 1-page postmortem: what broke, why, how we found it, what we changed, and what we'd do to prevent similar issues (added a `perPage=3` parameter and a circuit breaker for GitHub API calls).
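
A minimal fail-fast guard, sketched in shell (the real breaker lives in the function code, and the cache path here is hypothetical):

```bash
# Sketch, not the production circuit breaker: cap each GitHub call at 5s
# and fall back to the last good response on failure or timeout
url="https://api.github.com/repos/OWNER/REPO/actions/runs?per_page=3"
if body=$(curl -sf --max-time 5 -H "Authorization: Bearer $GITHUB_TOKEN" "$url"); then
  echo "$body" > /tmp/last_good_runs.json   # hypothetical cache path
else
  body=$(cat /tmp/last_good_runs.json)
fi
echo "$body" | jq '.workflow_runs | length'
```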

Why This Works Better Than Guessing

The framework forces you to separate observation from interpretation. Most debugging goes wrong at step 1: someone observes "it's slow" and immediately concludes "it's a database problem." They spend 4 hours optimizing queries while the real issue is a network call.

By listing 3+ hypotheses before investigating any of them, you prevent confirmation bias. The fix is usually in the hypothesis you almost didn't write down.

The One-Liner Version

When someone asks me how I debug, I say: "I don't look for the bug. I look for the data that tells me where the bug isn't. I eliminate everything it's not, and what's left is the answer."

It's Sherlock Holmes, but for API calls.

Want to see this in action?

Check out the projects and case studies behind these articles.