Ops & Reliability

SLOs + Incident Drills

This portfolio is intentionally operated like a production system. Recruiters can skim this page and see the exact signals senior cloud/platform teams look for: SLOs/SLIs, error budgets, alerting intent, and a repeatable incident drill loop.

SLOs (targets)

  • Dashboard availability: 99.9% monthly (public pages + API health)
  • Telemetry freshness: metrics updated within 24h via CI snapshot; the “live” feed is best-effort
  • AWS proxy reliability: 99.9% for /metrics/latest
Why this matters: engineers in high-comp cloud roles are hired to hit SLOs under cost and security constraints.
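The 99.9% targets above imply a concrete monthly error budget. A minimal sketch of the arithmetic:

```python
# Error budget: the downtime a monthly availability SLO permits.
def downtime_budget_minutes(slo: float, days: int = 30) -> float:
    """Minutes of allowed downtime per month at a given availability SLO."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - slo)

# 99.9% over a 30-day month leaves roughly 43.2 minutes of budget.
print(round(downtime_budget_minutes(0.999), 1))
```

Alerting intent follows directly: burn the budget faster than ~43 minutes/month and a page fires.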

SLIs (how we measure)

  • Availability: synthetic HTTP checks (dashboard + /api/quality)
  • Latency: p95 response times (CloudWatch + logs)
  • Error rate: Lambda errors + API Gateway 4xx/5xx
Pattern: measure → alert → drill → postmortem → fix.
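The availability and latency SLIs can be sketched as a synthetic probe plus a p95 helper (a sketch only; the URLs fed to `probe` would be placeholders, not the real deployment):

```python
# Synthetic availability check plus a nearest-rank p95 for the latency SLI.
import math
import time
import urllib.request

def probe(url: str, timeout: float = 5.0) -> dict:
    """One synthetic check: did the endpoint answer 2xx, and how fast?"""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            ok = 200 <= resp.status < 300
    except Exception:
        ok = False  # timeouts, DNS failures, and 4xx/5xx all count as misses
    return {"url": url, "ok": ok, "latency_s": time.monotonic() - start}

def p95(latencies: list[float]) -> float:
    """Nearest-rank 95th percentile of a latency sample."""
    ranked = sorted(latencies)
    return ranked[max(0, math.ceil(0.95 * len(ranked)) - 1)]
```

Run `probe` on a schedule, feed the `ok` ratio into availability and the `latency_s` samples into `p95`, and the SLIs fall out.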

Incident drills (repeatable loop)

I run small, contained drills that simulate common failure modes (rate limits, missing artifacts, AWS proxy token mismatch, missing S3 object). The goal isn’t perfection — it’s proving the feedback loop and the operator mindset.
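One such drill, sketched with hypothetical stand-ins (`fetch_live`, `load_snapshot`) for the real data paths: inject the missing-S3-object failure and assert the snapshot fallback engages.

```python
# Drill: inject a "missing S3 object" failure and verify graceful degradation.
def fetch_live() -> dict:
    # Drill injection point: the real fetch would hit S3 / the AWS proxy.
    raise FileNotFoundError("drill: simulated missing S3 object")

def load_snapshot() -> dict:
    # CI-published snapshot acting as the fail-safe baseline.
    return {"source": "snapshot", "p95_ms": 120}

def get_metrics() -> dict:
    """Serve live metrics, degrading to the snapshot on any failure."""
    try:
        return fetch_live()
    except Exception:
        return load_snapshot()

def run_drill() -> str:
    result = get_metrics()
    assert result["source"] == "snapshot", "fallback did not engage"
    return "pass"
```

A passing drill is the receipt: the failure mode was exercised, the degradation path held, and the result feeds the postmortem.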

Security + blast radius

  • No AWS credentials in Vercel runtime (proxy pattern)
  • Least-privilege IAM + GitHub OIDC for CI publishing
  • Fail-safe degradation (snapshot baseline) prevents cascading outages
Next: add an explicit “Security & Guardrails” receipts page (WAF + IAM + threat model).
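The proxy pattern above can be sketched as a server-side forwarder: the browser calls the portfolio API, which attaches a server-only token and relays to the AWS endpoint, so no AWS credential ever ships to the client. Names here (`AWS_PROXY_URL`, `PROXY_TOKEN`, the header name) are illustrative assumptions, not the real configuration.

```python
# Proxy pattern: the secret lives only in server-side env vars.
import os
import urllib.request

def build_proxy_request(base_url: str, token: str) -> urllib.request.Request:
    """Build the upstream /metrics/latest request with the shared token attached."""
    return urllib.request.Request(
        base_url.rstrip("/") + "/metrics/latest",
        headers={"X-Proxy-Token": token},  # never rendered to the browser
    )

def metrics_latest() -> bytes:
    # Hypothetical env var names; the client only ever sees this route's response.
    req = build_proxy_request(os.environ["AWS_PROXY_URL"], os.environ["PROXY_TOKEN"])
    with urllib.request.urlopen(req, timeout=5) as resp:
        return resp.read()
```

The blast-radius win: compromising the frontend yields no AWS credential, only a narrowly scoped proxy route.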