Ops & Reliability
SLOs + Incident Drills
This portfolio is intentionally operated like a production system. Recruiters can skim this page and see the exact signals senior cloud/platform teams look for: SLOs/SLIs, error budgets, alerting intent, and a repeatable incident drill loop.
SLOs (targets)
- Dashboard availability: 99.9% monthly (public pages + API health)
- Telemetry freshness: metrics updated within 24h via the CI snapshot; “live” data is best-effort
- AWS proxy reliability: 99.9% for /metrics/latest
Why this matters: engineers in high-comp cloud roles are hired to hit SLOs under cost and security constraints.
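A 99.9% monthly target translates directly into an error budget. A minimal sketch of that arithmetic (assuming a 30-day window; the function name is illustrative, not from this repo):

```python
# Error-budget math for a 99.9% monthly availability SLO.

def error_budget_minutes(slo: float, days: int = 30) -> float:
    """Minutes of allowed downtime per window for a given SLO target."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - slo)

print(round(error_budget_minutes(0.999), 1))  # → 43.2 minutes/month
```

In other words: roughly 43 minutes of downtime per month before the budget is spent, which is what alerting thresholds and drill cadence should be sized against.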
SLIs (how we measure)
- Availability: synthetic HTTP checks (dashboard + /api/quality)
- Latency: p95 response times (CloudWatch + logs)
- Error rate: Lambda errors + API Gateway 4xx/5xx
Pattern: measure → alert → drill → postmortem → fix.
Incident drills (repeatable loop)
I run small, contained drills that simulate common failure modes (rate limits, missing artifacts, AWS proxy token mismatch, missing S3 object). The goal isn’t perfection — it’s proving the feedback loop and the operator mindset.
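One of those failure modes (missing S3 object) can be drilled as a small injected-fault test. Everything here is a stand-in for the real fetch path, sketched to show the loop rather than the actual code:

```python
# Hypothetical drill: inject a "missing S3 object" failure and confirm
# the read path degrades to the CI snapshot instead of erroring out.

class MissingObjectError(Exception):
    """Stand-in for the real S3 not-found error."""

def fetch_live(inject_failure: bool) -> dict:
    """Simulated live fetch; raises when the drill injects the fault."""
    if inject_failure:
        raise MissingObjectError("s3 object not found")
    return {"source": "live"}

def load_snapshot() -> dict:
    """Simulated CI-snapshot baseline."""
    return {"source": "snapshot"}

def get_metrics(inject_failure: bool = False) -> dict:
    """Serve live data, falling back to the snapshot on known failures."""
    try:
        return fetch_live(inject_failure)
    except MissingObjectError:
        return load_snapshot()

# The drill passes if the fallback answers while the fault is injected.
assert get_metrics(inject_failure=True)["source"] == "snapshot"
```

A drill that fails here is itself the signal: it means the degradation path regressed before any real user saw it.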
Security + blast radius
- No AWS credentials in Vercel runtime (proxy pattern)
- Least-privilege IAM + GitHub OIDC for CI publishing
- Fail-safe degradation (snapshot baseline) prevents cascading outages
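The least-privilege CI publishing role can be illustrated with a policy fragment. The bucket name and path are placeholders, not this project's actual resources; the shape follows standard AWS IAM policy syntax:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "CiSnapshotPublishOnly",
      "Effect": "Allow",
      "Action": ["s3:PutObject"],
      "Resource": "arn:aws:s3:::example-metrics-bucket/metrics/*"
    }
  ]
}
```

Paired with a role trust policy that restricts the GitHub OIDC `sub` claim to one repo and branch, CI can publish snapshots without any long-lived AWS keys existing anywhere.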
Next: add explicit “Security & Guardrails” receipts page (WAF + IAM + threat model)