Prompt Details

Model

(claude-4-6-sonnet)

Token size

1,483

Example input

[INCIDENT_SIGNAL]: 3% of users seeing 500 errors on login, started 20 min ago [SYSTEM_CONTEXT]: Kubernetes cluster, 12 pods, recent HPA scale-up event [CURRENT_STATE]: Errors cluster on 2 specific pods, others healthy; no deploy in 48h [CONSTRAINTS]: Can't disable login, must preserve sessions

Example output

## 🚨 INCIDENT TRIAGE REPORT **Severity:** SEV2 — 3% user impact on authentication (critical function), blast radius trajectory unknown, revenue/session loss risk active **Confidence in diagnosis:** Medium — pod-specific clustering is a strong discriminating signal **Recommended action window:** 10 minutes to mitigation decision --- ### 1. Signal Breakdown **Observed (direct):** - 500 errors on login endpoint, started ~20 min ago - Errors isolated to 2 specific pods (named below as P-A, P-B) - 10 remaining pods serving login without errors - No deploy in 48h - HPA scale-up event occurred (timing relative to incident: TBD — **critical gap**) **Reported (secondhand):** - ~3% of users affected (assumed uniform hash distribution across pods) **{ASSUMPTION}:** - {ASSUMPTION} HPA scale-up preceded or coincided with incident start — not confirmed - {ASSUMPTION} The 2 bad pods are the newly provisioned ones from the scale-up - {ASSUMPTION} Sessions are stored externally (Redis/DB), not in-pod memory - {ASSUMPTION} 500s are application-layer, not infrastructure (OOM/CrashLoop) --- ### 2. Ranked Hypotheses | # | Hypothesis | Prior | Cheapest Test | Time | |---|---|---|---|---| | H1 | New HPA pods failed to mount a secret/configmap (e.g., JWT signing key, DB creds) | **High** | `kubectl describe pod <P-A>` → check Events + env vars | <2 min | | H2 | New pods can't reach a dependency (auth DB, Redis, identity service) — network policy or DNS not applied to new pod IPs | **High** | `kubectl exec <P-A> -- curl <auth-db-host>:<port>` | <2 min | | H3 | New pods are running a stale/different image pulled from cache (digest mismatch) | **Med** | `kubectl get pod <P-A> -o jsonpath='{.spec.containers[].image}'` vs healthy pod | <2 min | | H4 | Pod-local state corruption — bad pods received requests mid-startup before readiness probe passed | **Med** | Check pod age vs incident start: `kubectl get pods -o wide --sort-by=.metadata.creationTimestamp` | <2 min | | H5 | Resource exhaustion on new nodes (CPU throttle, OOM) causing handler timeouts | **Low** | `kubectl top pod <P-A>` | <2 min | **Ranking rationale:** H1 and H2 are highest prior because HPA provisioning is the only system change in the window and both failure modes are classic "new pod, old environment" pathologies. --- ### 3. Bisection Diagnostic Plan ``` START │ ├── Run: kubectl describe pod <P-A> | grep -A5 "Events\|Warning\|Error" │ ├── IF MountError / SecretNotFound / EnvVar missing │ → H1 CONFIRMED → skip to Mitigation M1 │ → Eliminates H2, H3, H4, H5 │ ├── IF No mount errors, pod looks "Healthy" in describe │ │ │ ├── Run: kubectl exec <P-A> -- curl -sv <auth-db>:<port> (or redis-cli ping) │ │ │ ├── IF connection refused / timeout │ │ → H2 CONFIRMED → skip to Mitigation M2 │ │ → Eliminates H1, H3, H4 │ │ │ └── IF connection succeeds │ │ │ ├── Compare image digest: kubectl get pod <P-A> vs <healthy-pod> -o jsonpath='{..imageID}' │ │ │ ├── IF digest differs → H3 CONFIRMED → M3 │ └── IF digest same → Check resource metrics (kubectl top) │ IF saturated → H5 │ IF normal → H4 (readiness race) — check pod age vs error onset ``` Each branch eliminates ≥50% of the hypothesis space. --- ### 4. Mitigation (Fast Path) **M1 — Cordon + drain the 2 bad pods (if H1 or H2 confirmed)** - Action: `kubectl cordon <node-of-P-A>` then `kubectl delete pod <P-A> <P-B>` (let scheduler rebuild elsewhere, or reduce HPA min) - Traces to: H1, H2, H4 - Blast radius: Login traffic redistributes to 10 healthy pods; verify they can absorb load before executing (`kubectl top pods`) - Reversibility: Seconds — pods reschedule automatically - Pre-check: Confirm 10 healthy pods have headroom (`kubectl top pods`; ensure they are not near CPU/memory limits). **Do not drain if healthy pods are already at >70% CPU.** - Session constraint: Safe — sessions are external; pod deletion doesn't destroy session state ({ASSUMPTION} — verify before executing) **M2 — Force pod replacement via rollout restart (if H3 confirmed)** - Action: `kubectl rollout restart deployment/<login-deployment>` - Traces to: H3 - Blast radius: Rolling restart, no downtime if `maxUnavailable=0` is set - Reversibility: Minutes; rollback via `kubectl rollout undo` - Pre-check: Confirm rollout strategy allows zero-downtime > ⚠️ **Do NOT** disable login (per constraint). Both mitigations route around bad pods while keeping login globally available. --- ### 5. Root-Cause Fix **If H1 (secret/config mount failure):** - Fix: Add the missing secret/configmap reference to the Deployment spec with correct `envFrom` or `volumeMount` path. Confirm HPA node group has IAM/RBAC to read the secret (if using external secrets operator). - Test coverage needed: Integration test that spins a pod from scratch in a staging namespace and validates login end-to-end before merge. - Deploy strategy: Canary (10% traffic) with error rate gate — auto-rollback if login 5xx > 0.5% for 2 min. **If H2 (network policy / DNS):** - Fix: Update NetworkPolicy to include CIDR range or pod selector covering new HPA node pool. Alternatively, validate CoreDNS is resolving service names from new node subnet. - Test coverage: Automated network reachability test (e.g., netpol validation in CI via `kubectl-netpol` or Cilium editor). - Deploy strategy: Apply NetworkPolicy change to staging cluster first; validate with `kubectl exec` connectivity test before prod. --- ### 6. Verification Thresholds - Login 500 error rate: must drop to **< 0.1%** sustained for **5 consecutive minutes** - P-A and P-B pod health: `kubectl get pods` shows `Running` + all containers `Ready 1/1` - No new 500 errors from previously healthy pods for **10 minutes** post-mitigation - HPA status: `kubectl get hpa` shows desired == ready replicas (no pending) - Synthetic login probe (if you have one): must return HTTP 200 for **3 consecutive checks** --- ### 7. Post-Incident Follow-Up **Detection gap:** The 20-minute window before detection suggests login error alerting threshold is too high or too slow. A 1% error rate on a critical auth endpoint should page in under 3 minutes. Review alert `for:` duration and threshold. **One systemic improvement:** Add a **post-scale-up smoke test** to the HPA lifecycle — a `readinessProbe` alone doesn't validate dependency connectivity. A startup validation job (or `postStart` hook with a dependency health check) that gates traffic until auth-db and session-store are confirmed reachable from the new pod would have surfaced H1/H2 at pod start rather than at user impact. --- ### 10-Pillar Coverage Check | Pillar | Status | |---|---| | **P1 Detection** | 20-min lag detected via user-facing errors — alerting threshold likely too coarse for auth path | | **P2 Observability** | Missing: exact HPA scale-up timestamp relative to first error; pod startup logs for P-A/P-B | | **P3 Blast Radius** | Currently 3% / 2 pods — contained but unknown if HPA will provision more bad pods | | **P4 Dependencies** | Auth DB, Redis/session store, identity service — reachability from new pods unverified | | **P5 Recent Changes** | No deploy in 48h — HPA scale-up is the sole change vector; confirm its exact timestamp | | **P6 Rollback Readiness** | Pod deletion is clean rollback; Deployment rollout undo available if image is root cause | | **P7 Data Integrity** | Low risk — 500s on login don't mutate data; verify no partial session writes on failure path | | **P8 Communication** | Notify on-call auth team lead now; if SEV2 holds at 15 min, page engineering manager | | **P9 Runbook Coverage** | Verify if HPA scale-up failure is documented — classic gap in runbooks that assume stable replica count | | **P10 Learning Capture** | Note now: HPA event time, which node pool got new pods, which secrets/policies were in scope | --- **What would change my mind on H1 being top hypothesis:** If `kubectl describe` shows clean mounts and all env vars present, H1 is eliminated and H2 immediately becomes the working hypothesis. Run the connectivity check next with no further delay.

🌀 Claude

Incident Triage Architect

Name: Incident Triage Architect Claude Prompt
Brand: PromptBase
Price: 19.99 USD
Availability: InStock
Author: monna

@monna

$19.99

What do I get when I buy a prompt?

Add to Cart

Instant access

Commercial use

Money‑back

By purchasing this prompt, you agree to our terms of service

CLAUDE-4-6-SONNET

Production Crisis → Structured Report in One Response ✅ Severity classification with signal breakdown ✅ 3–6 ranked hypotheses with cheapest confirmation tests ✅ Bisection-style diagnostic plan with branching rules ✅ Fast mitigation + root-cause fix, each traced to a hypothesis ✅ Verification thresholds + post-incident follow-up

...more

Added over 1 month ago

Browse Marketplace