Ops & Reliability

SLOs + Incident Drills

This portfolio is intentionally operated like a production system. Recruiters can skim this page and see the exact signals senior cloud/platform teams look for: SLOs/SLIs, error budgets, alerting intent, and a repeatable incident drill loop.

SLOs (targets)

  • Dashboard availability: 99.9% monthly (public pages + API health)
  • Telemetry freshness: metrics updated within 24h via CI snapshot; the “live” feed is best-effort
  • AWS proxy reliability: 99.9% for /metrics/latest
Why this matters: engineers in high-comp cloud roles are hired to hit SLOs under cost and security constraints.
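The 99.9% targets above imply a concrete monthly error budget. A minimal sketch of the arithmetic:

```python
# Error budget: the downtime a monthly availability SLO permits.
def downtime_budget_minutes(slo: float, days: int = 30) -> float:
    """Minutes of allowed downtime per month at a given availability SLO."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - slo)

# 99.9% over a 30-day month leaves roughly 43.2 minutes of budget.
print(round(downtime_budget_minutes(0.999), 1))
```

Alerting intent follows directly: burn the budget faster than ~43 minutes/month and a page fires.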

SLIs (how we measure)

  • Availability: synthetic HTTP checks (dashboard + /api/quality)
  • Latency: p95 response times (CloudWatch + logs)
  • Error rate: Lambda errors + API Gateway 4xx/5xx
Pattern: measure → alert → drill → postmortem → fix.
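The availability and latency SLIs can be sketched as a synthetic probe plus a p95 helper (a sketch only; the URLs fed to `probe` would be placeholders, not the real deployment):

```python
# Synthetic availability check plus a nearest-rank p95 for the latency SLI.
import math
import time
import urllib.request

def probe(url: str, timeout: float = 5.0) -> dict:
    """One synthetic check: did the endpoint answer 2xx, and how fast?"""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            ok = 200 <= resp.status < 300
    except Exception:
        ok = False  # timeouts, DNS failures, and 4xx/5xx all count as misses
    return {"url": url, "ok": ok, "latency_s": time.monotonic() - start}

def p95(latencies: list[float]) -> float:
    """Nearest-rank 95th percentile of a latency sample."""
    ranked = sorted(latencies)
    return ranked[max(0, math.ceil(0.95 * len(ranked)) - 1)]
```

Run `probe` on a schedule, feed the `ok` ratio into availability and the `latency_s` samples into `p95`, and the SLIs fall out.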

Incident drills (repeatable loop)

I run small, contained drills that simulate common failure modes (rate limits, missing artifacts, AWS proxy token mismatch, missing S3 object). The goal isn’t perfection — it’s proving the feedback loop and the operator mindset.
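One such drill, sketched with hypothetical stand-ins (`fetch_live`, `load_snapshot`) for the real data paths: inject the missing-S3-object failure and assert the snapshot fallback engages.

```python
# Drill: inject a "missing S3 object" failure and verify graceful degradation.
def fetch_live() -> dict:
    # Drill injection point: the real fetch would hit S3 / the AWS proxy.
    raise FileNotFoundError("drill: simulated missing S3 object")

def load_snapshot() -> dict:
    # CI-published snapshot acting as the fail-safe baseline.
    return {"source": "snapshot", "p95_ms": 120}

def get_metrics() -> dict:
    """Serve live metrics, degrading to the snapshot on any failure."""
    try:
        return fetch_live()
    except Exception:
        return load_snapshot()

def run_drill() -> str:
    result = get_metrics()
    assert result["source"] == "snapshot", "fallback did not engage"
    return "pass"
```

A passing drill is the receipt: the failure mode was exercised, the degradation path held, and the result feeds the postmortem.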

Security + blast radius

  • No AWS credentials in Vercel runtime (proxy pattern)
  • Least-privilege IAM + GitHub OIDC for CI publishing
  • Fail-safe degradation (snapshot baseline) prevents cascading outages
Next: add an explicit “Security & Guardrails” receipts page (WAF + IAM + threat model).
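The proxy pattern above can be sketched as a server-side forwarder: the browser calls the portfolio API, which attaches a server-only token and relays to the AWS endpoint, so no AWS credential ever ships to the client. Names here (`AWS_PROXY_URL`, `PROXY_TOKEN`, the header name) are illustrative assumptions, not the real configuration.

```python
# Proxy pattern: the secret lives only in server-side env vars.
import os
import urllib.request

def build_proxy_request(base_url: str, token: str) -> urllib.request.Request:
    """Build the upstream /metrics/latest request with the shared token attached."""
    return urllib.request.Request(
        base_url.rstrip("/") + "/metrics/latest",
        headers={"X-Proxy-Token": token},  # never rendered to the browser
    )

def metrics_latest() -> bytes:
    # Hypothetical env var names; the client only ever sees this route's response.
    req = build_proxy_request(os.environ["AWS_PROXY_URL"], os.environ["PROXY_TOKEN"])
    with urllib.request.urlopen(req, timeout=5) as resp:
        return resp.read()
```

The blast-radius win: compromising the frontend yields no AWS credential, only a narrowly scoped proxy route.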