Prompt Details
Model: claude-4-6-sonnet
Token size: 1,483
Example input
[INCIDENT_SIGNAL]: 3% of users seeing 500 errors on login, started 20 min ago
[SYSTEM_CONTEXT]: Kubernetes cluster, 12 pods, recent HPA scale-up event
[CURRENT_STATE]: Errors cluster on 2 specific pods, others healthy; no deploy in 48h
[CONSTRAINTS]: Can't disable login, must preserve sessions
Example output
## 🚨 INCIDENT TRIAGE REPORT
**Severity:** SEV2 — 3% user impact on authentication (critical function), blast radius trajectory unknown, revenue/session loss risk active
**Confidence in diagnosis:** Medium — pod-specific clustering is a strong discriminating signal
**Recommended action window:** 10 minutes to mitigation decision
---
### 1. Signal Breakdown
**Observed (direct):**
- 500 errors on login endpoint, started ~20 min ago
- Errors isolated to 2 specific pods (named below as P-A, P-B)
- 10 remaining pods serving login without errors
- No deploy in 48h
- HPA scale-up event occurred (timing relative to incident: TBD — **critical gap**)
**Reported (secondhand):**
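- ~3% of users affected (assumed uniform hash distribution across pods; note that 2 of 12 pods failing every request would imply ~17%, so either the figure is off, the bad pods fail only some requests, or they receive less traffic)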
**{ASSUMPTION}:**
- {ASSUMPTION} HPA scale-up preceded or coincided with incident start — not confirmed
- {ASSUMPTION} The 2 bad pods are the newly provisioned ones from the scale-up
- {ASSUMPTION} Sessions are stored externally (Redis/DB), not in-pod memory
- {ASSUMPTION} 500s are application-layer, not infrastructure (OOM/CrashLoop)
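The reported ~3% can be sanity-checked against the pod counts above. The following is a minimal sketch, assuming traffic is spread evenly across all 12 pods (an assumption, not something the incident data confirms); the function name is hypothetical.

```python
def expected_impact(bad_pods: int, total_pods: int, failure_rate_on_bad: float = 1.0) -> float:
    """Fraction of requests expected to fail if traffic is spread evenly (assumption)."""
    if total_pods <= 0 or not 0 <= bad_pods <= total_pods:
        raise ValueError("invalid pod counts")
    return bad_pods / total_pods * failure_rate_on_bad


# 2 bad pods out of 12, every request to a bad pod failing
full_failure = expected_impact(2, 12)

# Per-pod failure rate that would reproduce the reported 3% overall
implied_rate = 0.03 / (2 / 12)

print(f"expected if bad pods always fail: {full_failure:.1%}")
print(f"per-pod failure rate implied by 3%: {implied_rate:.0%}")
```

Under these assumptions the two bad pods would be failing roughly one login in five rather than all of them, which is worth confirming from per-pod error counts before trusting the 3% figure.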
---
### 2. Ranked Hypotheses
| # | Hypothesis | Prior | Cheapest Test | Time |
|---|---|---|---|---|
| H1 | New HPA pods failed to mount a secret/configmap (e.g., JWT signing key, DB creds) | **High** | `kubectl describe pod <P-A>` → check Events + env vars | <2 min |
| H2 | New pods can't reach a dependency (auth DB, Redis, identity service) — network policy or DNS not applied to new pod IPs | **High** | `kubectl exec <P-A> -- curl <auth-db-host>:<port>` | <2 min |
| H3 | New pods are running a stale/different image pulled from cache (digest mismatch) | **Med** | `kubectl get pod <P-A> -o jsonpath='{.status.containerStatuses[*].imageID}'` vs healthy pod | <2 min |
| H4 | Pod-local state corruption — bad pods received requests mid-startup before readiness probe passed | **Med** | Check pod age vs incident start: `kubectl get pods -o wide --sort-by=.metadata.creationTimestamp` | <2 min |
| H5 | Resource exhaustion on new nodes (CPU throttle, OOM) causing handler timeouts | **Low** | `kubectl top pod <P-A>` | <2 min |
**Ranking rationale:** H1 and H2 are highest prior because HPA provisioning is the only system change in the window and both failure modes are classic "new pod, old environment" pathologies.
---
### 3. Bisection Diagnostic Plan
```
START
│
├── Run: kubectl describe pod <P-A> | grep -A5 "Events\|Warning\|Error"
│
├── IF MountError / SecretNotFound / EnvVar missing
│ → H1 CONFIRMED → skip to Mitigation M1
│ → Eliminates H2, H3, H4, H5
│
├── IF No mount errors, pod looks "Healthy" in describe
│ │
│ ├── Run: kubectl exec <P-A> -- curl -sv <auth-db>:<port> (or redis-cli ping)
│ │
│ ├── IF connection refused / timeout
│ │ → H2 CONFIRMED → skip to Mitigation M1
│ │ → Eliminates H1, H3, H4
│ │
│ └── IF connection succeeds
│ │
│ ├── Compare image digest: run kubectl get pod <name> -o jsonpath='{..imageID}' for <P-A> and for <healthy-pod>
│ │
│ ├── IF digest differs → H3 CONFIRMED → Mitigation M2
│ └── IF digest same → Check resource metrics (kubectl top)
│ IF saturated → H5
│ IF normal → H4 (readiness race) — check pod age vs error onset
```
Each step either confirms one hypothesis or rules it out, narrowing the remaining candidates with a test that takes under 2 minutes.
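The branching rules in the plan above can be written as a small decision function. This is a sketch only: the `Findings` fields are hypothetical names for the results of the manual checks (`kubectl describe`, the in-pod connectivity test, the image digest comparison, and `kubectl top`), not output parsed from any tool.

```python
from dataclasses import dataclass


@dataclass
class Findings:
    mount_or_env_error: bool      # from `kubectl describe pod` events
    dependency_reachable: bool    # from the in-pod connectivity check
    image_digest_matches: bool    # bad pod imageID equals healthy pod imageID
    resources_saturated: bool     # from `kubectl top pod`


def diagnose(f: Findings) -> str:
    """Walk the plan in order and return the hypothesis it lands on."""
    if f.mount_or_env_error:
        return "H1"
    if not f.dependency_reachable:
        return "H2"
    if not f.image_digest_matches:
        return "H3"
    if f.resources_saturated:
        return "H5"
    # Nothing else explains the errors: fall through to the readiness race
    return "H4"
```

For example, `diagnose(Findings(False, True, True, False))` returns `"H4"`, the fallback branch that still needs the pod-age versus error-onset check.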
---
### 4. Mitigation (Fast Path)
**M1 — Cordon + drain the 2 bad pods (if H1 or H2 confirmed)**
- Action: `kubectl cordon <node-of-P-A>` (repeat for P-B's node if it differs), then `kubectl delete pod <P-A> <P-B>` (let scheduler rebuild elsewhere, or reduce HPA min)
- Traces to: H1, H2, H4
- Blast radius: Login traffic redistributes to 10 healthy pods; verify they can absorb load before executing (`kubectl top pods`)
- Reversibility: Seconds. Deleted pods are recreated automatically; run `kubectl uncordon <node>` to return a cordoned node to service
- Pre-check: Confirm 10 healthy pods have headroom (`kubectl top pods`; ensure they are not near CPU/memory limits). **Do not drain if healthy pods are already at >70% CPU.**
- Session constraint: Safe — sessions are external; pod deletion doesn't destroy session state ({ASSUMPTION} — verify before executing)
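The headroom pre-check for M1 can be made explicit. The sketch below assumes CPU usage scales linearly with traffic and that load from the removed pods spreads evenly over the healthy ones; the 70% ceiling comes from the pre-check above, while the 90% projected ceiling and the function name are illustrative assumptions.

```python
def safe_to_remove(healthy_cpu_pct, bad_pods, current_ceiling=70.0, projected_ceiling=90.0):
    """Return True if the healthy pods can absorb traffic from the bad pods.

    healthy_cpu_pct: CPU usage (percent of limit) for each healthy pod.
    bad_pods: number of pods about to be removed from rotation.
    """
    if not healthy_cpu_pct:
        return False
    total_pods = len(healthy_cpu_pct) + bad_pods
    # Assumption: the same total load is now shared by fewer pods
    scale = total_pods / len(healthy_cpu_pct)
    peak = max(healthy_cpu_pct)
    return peak <= current_ceiling and peak * scale <= projected_ceiling


print(safe_to_remove([50.0] * 10, bad_pods=2))
print(safe_to_remove([72.0] * 10, bad_pods=2))
```

With 10 healthy pods at 50% CPU, removing 2 pods projects to 60% and passes; at 72% the check fails on the 70% rule before the projection is even considered.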
**M2 — Force pod replacement via rollout restart (if H3 confirmed)**
- Action: `kubectl rollout restart deployment/<login-deployment>`
- Traces to: H3
- Blast radius: Rolling restart, no downtime if `maxUnavailable=0` is set
- Reversibility: Minutes; rollback via `kubectl rollout undo`
- Pre-check: Confirm rollout strategy allows zero-downtime
> ⚠️ **Do NOT** disable login (per constraint). Both mitigations route around bad pods while keeping login globally available.
---
### 5. Root-Cause Fix
**If H1 (secret/config mount failure):**
- Fix: Add the missing secret/configmap reference to the Deployment spec with the correct `envFrom` or `volumeMounts` path. Confirm the nodes that received the scaled-up pods have the IAM/RBAC permissions to read the secret (if using external secrets operator).
- Test coverage needed: Integration test that spins a pod from scratch in a staging namespace and validates login end-to-end before merge.
- Deploy strategy: Canary (10% traffic) with error rate gate — auto-rollback if login 5xx > 0.5% for 2 min.
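The canary gate described above (auto-rollback if login 5xx exceeds 0.5% for 2 minutes) reduces to detecting a sustained breach. A minimal sketch, assuming error ratios are sampled at a fixed interval; the 15-second interval and the function name are assumptions, not part of any deployment tool's API.

```python
def should_rollback(samples, threshold=0.005, window_s=120, interval_s=15):
    """Return True if the 5xx ratio stayed above threshold for a full window.

    samples: 5xx ratios (0.0 to 1.0), one per sampling interval, oldest first.
    """
    needed = window_s // interval_s  # consecutive breaching samples required
    run = 0
    for ratio in samples:
        run = run + 1 if ratio > threshold else 0
        if run >= needed:
            return True
    return False
```

A single healthy sample resets the counter, so short spikes do not trigger a rollback; only a breach lasting the whole window does.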
**If H2 (network policy / DNS):**
- Fix: Update NetworkPolicy to include the CIDR range or pod selector covering the nodes that host the scaled-up pods. Alternatively, validate CoreDNS is resolving service names from the new node subnet.
- Test coverage: Automated network reachability test (e.g., netpol validation in CI via `kubectl-netpol` or Cilium editor).
- Deploy strategy: Apply NetworkPolicy change to staging cluster first; validate with `kubectl exec` connectivity test before prod.
---
### 6. Verification Thresholds
- Login 500 error rate: must drop to **< 0.1%** sustained for **5 consecutive minutes**
- P-A and P-B pod health: `kubectl get pods` shows `Running` + all containers `Ready 1/1`
- No new 500 errors from previously healthy pods for **10 minutes** post-mitigation
- HPA status: `kubectl get hpa` shows the expected replica count, and `kubectl get deployment <login-deployment>` shows READY equal to desired replicas (no pending pods)
- Synthetic login probe (if you have one): must return HTTP 200 for **3 consecutive checks**
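Two of the thresholds above (error rate below 0.1% for 5 consecutive minutes, and 3 consecutive HTTP 200 probe results) can be checked mechanically. A minimal sketch, assuming one error-rate sample per minute and a list of probe status codes, both ordered oldest first; the function names are hypothetical.

```python
def trailing_run(values, ok):
    """Count how many of the most recent values satisfy ok(), without a gap."""
    count = 0
    for value in reversed(values):
        if not ok(value):
            break
        count += 1
    return count


def mitigation_verified(error_rate_per_min, probe_statuses):
    """True when both the error-rate and synthetic-probe thresholds are met."""
    rate_ok = trailing_run(error_rate_per_min, lambda r: r < 0.001) >= 5
    probe_ok = trailing_run(probe_statuses, lambda s: s == 200) >= 3
    return rate_ok and probe_ok
```

The pod readiness, healthy-pod regression, and HPA checks still need to be confirmed separately with `kubectl`.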
---
### 7. Post-Incident Follow-Up
**Detection gap:** The 20-minute window before detection suggests login error alerting threshold is too high or too slow. A 1% error rate on a critical auth endpoint should page in under 3 minutes. Review alert `for:` duration and threshold.
**One systemic improvement:** Add a **post-scale-up smoke test** for newly scheduled pods, because a `readinessProbe` that only confirms the process is up doesn't validate dependency connectivity. A startup validation job (or `postStart` hook with a dependency health check) that gates traffic until auth-db and session-store are confirmed reachable from the new pod would have surfaced H1/H2 at pod start rather than at user impact.
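A dependency health check of this kind can be as simple as a TCP reachability test run at pod start. The sketch below is an illustration under stated assumptions: the dependency names, hosts, and ports are placeholders, and a real check would likely also verify credentials, which a TCP connect does not cover.

```python
import socket


def tcp_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def startup_check(dependencies):
    """Return the names of dependencies that are not reachable.

    dependencies: mapping of name -> (host, port); all values are placeholders.
    """
    return [
        name
        for name, (host, port) in dependencies.items()
        if not tcp_reachable(host, port)
    ]
```

A startup script could call `startup_check({"auth-db": (host, port), "session-store": (host, port)})` and exit non-zero when the returned list is non-empty, so the pod never reports ready while a dependency is unreachable.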
---
### 10-Pillar Coverage Check
| Pillar | Status |
|---|---|
| **P1 Detection** | 20-min lag detected via user-facing errors — alerting threshold likely too coarse for auth path |
| **P2 Observability** | Missing: exact HPA scale-up timestamp relative to first error; pod startup logs for P-A/P-B |
| **P3 Blast Radius** | Currently 3% / 2 pods — contained but unknown if HPA will provision more bad pods |
| **P4 Dependencies** | Auth DB, Redis/session store, identity service — reachability from new pods unverified |
| **P5 Recent Changes** | No deploy in 48h — HPA scale-up is the sole change vector; confirm its exact timestamp |
| **P6 Rollback Readiness** | Pod deletion is clean rollback; Deployment rollout undo available if image is root cause |
| **P7 Data Integrity** | Low risk — 500s on login don't mutate data; verify no partial session writes on failure path |
| **P8 Communication** | Notify on-call auth team lead now; if SEV2 holds at 15 min, page engineering manager |
| **P9 Runbook Coverage** | Verify if HPA scale-up failure is documented — classic gap in runbooks that assume stable replica count |
| **P10 Learning Capture** | Note now: HPA event time, which node pool got new pods, which secrets/policies were in scope |
---
**What would change my mind on H1 being top hypothesis:** If `kubectl describe` shows clean mounts and all env vars present, H1 is eliminated and H2 immediately becomes the working hypothesis. Run the connectivity check next with no further delay.
Production Crisis → Structured Report in One Response
✅ Severity classification with signal breakdown
✅ 3–6 ranked hypotheses with cheapest confirmation tests
✅ Bisection-style diagnostic plan with branching rules
✅ Fast mitigation + root-cause fix, each traced to a hypothesis
✅ Verification thresholds + post-incident follow-up
Added 10 hours ago
