Prompt Details
Model
(claude-4-6-opus)
Token size
876
Example input
[CODE_SNIPPET]: async function createOrder(userId, cart) { const payment = await stripe.charges.create({ amount: cart.total, customer: userId }); await db.orders.insert({ userId, items: cart.items, paymentId: payment.id }); return payment; }
[TECH_STACK]: Node.js 20, Express, PostgreSQL 15, Stripe SDK, deployed on AWS ECS Fargate behind an ALB
[SYMPTOMS]: 5% of checkout requests returning 500, duplicate charges in Stripe dashboard for ~12 users, log pattern "duplicate key value violates unique constraint orders_payment_id_key" appearing 40 times/hour since 14:20 UTC
[OCCURRENCE_PATTERN]: Started 25 minutes after deploying release v2.8.1 which added ECS auto-scaling from 2 to 6 tasks. Concentrated in peak hours 14:00-19:00 UTC. Not reproducible in staging with single task.
[BLAST_RADIUS]: Affects roughly 8% of logged-in users on the checkout flow. Revenue path degraded but not down. 12 confirmed duplicate charges requiring manual refund. Critical severity for the business.
Example output
# Production Incident Triage Report
## 1. Incident Snapshot
**Problem statement:** Since 14:20 UTC today — 25 minutes after deploying v2.8.1 with ECS auto-scaling (2→6 tasks) — approximately 5% of checkout requests are returning HTTP 500 errors, and a subset of users (~12 confirmed) have been double-charged in Stripe. The immediate trigger is a unique constraint violation on `orders_payment_id_key`, meaning the same Stripe charge ID is being inserted into the `orders` table more than once.
**Severity: SEV2 — Major degradation.** Revenue path is functional but degraded for ~8% of checkout users, and real money has been incorrectly taken from 12 customers. Not SEV1 because checkout is not fully down; elevated above SEV3 because duplicate charges carry financial and regulatory exposure.
**Signal breakdown:**
| Signal | Classification |
|---|---|
| 5% of checkouts returning 500 | **Load-dependent / environment-dependent** — correlates with peak hours and multi-task scaling |
| Duplicate Stripe charges for ~12 users | **Intermittent, load-dependent** — subset of the 500s, requires a race condition to manifest |
| `duplicate key value violates unique constraint orders_payment_id_key` 40×/hr | **Deterministic given the race** — direct consequence of two tasks inserting the same payment ID |
| Onset 25 min after v2.8.1 deploy (scaling 2→6 tasks) | **Environment-dependent** — not reproducible on single-task staging |
| Concentrated 14:00–19:00 UTC peak | **Load-dependent** — higher concurrency increases the race window |
**Ruled in:** A concurrency problem introduced or exposed by horizontal scaling. The code charges Stripe first, then inserts the order — any retry or duplicate request hitting a second task will re-charge Stripe and then collide on the DB insert.
**Ruled out:** Database schema migration bug (the constraint exists and is enforced correctly). Stripe SDK version defect (charges succeed; the problem is *too many* charges, not failed ones). Single-task logic error (staging with one task cannot reproduce it). Network-layer 500s unrelated to application logic (the error message is application-specific, from PostgreSQL).
---
## 2. Hypothesis Ledger
### H1 — ALB retry / client retry hitting a second ECS task (no idempotency)
**Mechanism:** A user submits checkout. The request is slow (Stripe latency), the ALB or client times out and retries, and the retry lands on a different ECS task. Both tasks call `stripe.charges.create` independently, producing two real charges with two *different* `payment.id` values. That explains the duplicate charges, but different IDs cannot collide on `orders_payment_id_key`.
**Revised mechanism:** ALB retries are connection-level by default; it does not replay a POST after an HTTP 5xx. More probable is a client-side retry: the first request succeeds end-to-end (Stripe charge + DB insert), the client never sees the success (timeout) and retries, and the second request calls `stripe.charges.create` again — getting a *different* charge ID — and inserts successfully. That causes duplicate charges but still not the unique constraint error.
**Re-evaluation:** The constraint violation on `orders_payment_id_key` means the *identical* `payment.id` is being inserted twice. That can only happen if the same Stripe API response (`payment` object) reaches two `db.orders.insert` calls — i.e., the same request execution path running twice, or a second writer reusing the same charge ID.
**Supporting signals:** Onset with multi-task scaling; not reproducible with a single task; peak-hour concentration.
**Disconfirming signals:** ALB does not retry POST on HTTP 5xx by default — only on connection failures.
**Likelihood:** Medium.
**Cheapest confirmation:** Check ALB access logs for two requests from the same client IP with the same request body within a 1–5 second window to the `/checkout` or `/order` endpoint.
---
### H2 — Client-side double-submit without a Stripe idempotency key
**Mechanism:** The frontend checkout button lacks debounce or disable-on-click. A user double-clicks, or the SPA retries on a perceived failure, and the two requests reach different ECS tasks. Both call `stripe.charges.create` — and because `createOrder` does **not pass an idempotency key** to Stripe, each call creates a *separate* charge with a *different* `payment.id`. This produces duplicate charges with different payment IDs, so the unique constraint on `payment_id` would NOT fire.
**Contradiction with symptom:** The constraint violation is on `payment_id`, meaning the same ID was inserted twice. This hypothesis explains the duplicate charges for some users but not the constraint error.
**Likelihood:** Medium for the duplicate-charge symptom; does NOT explain the constraint violation.
**Cheapest confirmation:** Check the Stripe dashboard — do the 12 duplicate-charged users have two *different* charge IDs or the *same* charge ID?
---
### H3 — ECS task crash/restart replaying the DB insert after the Stripe charge succeeded
**Mechanism:** `stripe.charges.create` succeeds and returns `payment`. Before or during `db.orders.insert`, the ECS task is killed (OOM, scaling event, rolling restart) or the DB connection drops transiently. The request fails with a 500 to the user. The user (or client) retries. On retry, `stripe.charges.create` runs again — creating a **second** charge — and `db.orders.insert` then runs with the **new** payment ID. This explains duplicate charges but not the unique constraint violation.
**Alternative sub-mechanism — the critical one:** The task doesn't crash. Instead, the **same request** is somehow processed twice within the same task or across tasks. Node.js 20 + Express has no built-in request deduplication, and the violation on `orders_payment_id_key` requires the *same* `payment.id`. That leaves two options: (a) two concurrent calls to `db.orders.insert` with the same payment ID, or (b) a retry of the insert after a transient failure where the first insert actually committed.
**(b) is the stronger fit:** `db.orders.insert` succeeds, but the acknowledgment is lost (TCP reset, connection pool timeout under load). The `await` throws, the function returns a 500, and the user retries and gets a new Stripe charge (duplicate charge). Separately, something would have to retry the insert itself — a framework, middleware, or retry wrapper around the DB call. However, no retry logic is visible in the snippet.
**Likelihood:** Low-Medium.
**Cheapest confirmation:** Correlate ECS task termination/restart events with the timestamps of the 40/hr constraint violations.
---
### H4 — Express middleware or upstream proxy replaying the request body to a second handler invocation within the same task
**Mechanism:** A middleware (e.g., a body-parser stream re-read, or a custom retry middleware added in v2.8.1) inadvertently calls `next()` twice or re-invokes the route handler, so the same request runs `createOrder` twice in the same task. First call: Stripe charge succeeds, DB insert succeeds. Second call: Stripe creates a *new* charge (duplicate) — but its DB insert would carry a *different* payment ID from the second Stripe call, so it would not collide on the constraint.
**Does not explain the same payment ID being inserted twice** — unless the middleware replays *after* Stripe returns but *before* the DB insert, sharing the same `payment` variable in a closure. In Express this is unlikely unless the handler is written as a generator or there is shared mutable state.
**Likelihood:** Low.
**Cheapest confirmation:** Log a request ID + timestamp at entry to `createOrder` — look for duplicate invocations with the same request ID.
---
### H5 — PostgreSQL connection pool exhaustion causing insert timeout + ghost commit
**Mechanism:** Scaling from 2 to 6 tasks tripled connection demand. PostgreSQL's `max_connections` (default 100) is under pressure during peak: with a per-task pool of 10–20, 6 tasks open 60–120 connections. `db.orders.insert` acquires a connection and sends the INSERT; the row is committed on the server side, but the **acknowledgment back to the client** is delayed or lost due to pool contention or a TCP-level timeout. The `await` on `db.orders.insert` **throws** (connection or pool timeout), `createOrder` throws, and Express returns 500 — yet the row is **actually in the database**. The user retries checkout. On retry, `stripe.charges.create` creates a **new, second charge** (duplicate!), and `db.orders.insert` then **succeeds**, because the payment ID is different.
**But this doesn't explain the unique constraint violation on the same payment ID.**
**Revised mechanism for the constraint violation:** The pool is exhausted, the query is queued (by PgBouncer or the client pool), and a driver- or ORM-level retry executes it twice. The raw Node.js `pg` driver does not retry queries by default, but several ORMs (Sequelize, Prisma, TypeORM) can be configured to. If `db.orders` wraps an ORM with automatic query retry, the first attempt commits and the second hits the unique constraint.
**Supporting signals:** Onset exactly with scaling 2→6 (3× connection demand); peak-hour concentration (highest pool pressure); `duplicate key` is the DB error, consistent with a committed row + retry; not reproducible on a single task (pool never stressed).
**Disconfirming signals:** We don't know the ORM or driver. If it is raw `pg` with no retry, this mechanism needs a different retry source.
**Likelihood:** **High.**
**Cheapest confirmation:** Check PostgreSQL `max_connections` and the connection count during peak (`SELECT count(*) FROM pg_stat_activity;`), and check the Node.js DB library for built-in query retry.
---
### H6 — Stripe webhook + synchronous flow race creating duplicate order inserts (MOST LIKELY)
**Mechanism:** A Stripe webhook handler (e.g., for `charge.succeeded`) also calls `db.orders.insert` with the same `paymentId`. The synchronous `createOrder` function inserts the order, *and* the webhook fires and tries to insert the same order. With 6 tasks, the webhook may land on a different task, and the two inserts race; whichever loses hits `orders_payment_id_key`.
**Supporting signals:** The constraint violation is on the *same* `payment.id` — this is the most natural explanation for the same ID appearing twice. Webhooks are asynchronous and can arrive within milliseconds of charge creation, and multi-task scaling means the webhook request and the synchronous insert can truly run in parallel.
**Disconfirming signals:** We have no direct evidence of a webhook handler. However, Stripe integrations very commonly have one, and the `orders_payment_id_key` violation on the *same* payment ID is hard to explain otherwise.
**Likelihood:** **High.**
**Cheapest confirmation:** Search the codebase for `charge.succeeded`, `webhook`, or `stripe.webhooks.constructEvent`. Check Stripe Dashboard → Webhooks → recent deliveries for `charge.succeeded` events matching the 12 affected users.
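If H6 is the culprit, the racing code would look roughly like the sketch below. Everything here is hypothetical (the handler, the `db.orders.insert` API shape, and the in-memory table stand-in are assumptions); it only illustrates why a sync insert plus a webhook insert on the same `payment.id` yields exactly the observed error:

```javascript
// Hypothetical webhook handler (assumed to exist per H6) that races the
// synchronous checkout path: both write the same payment id to `orders`.
async function handleStripeEvent(event, db) {
  if (event.type !== 'charge.succeeded') return;
  const charge = event.data.object;
  // Same insert the sync path performs; whichever runs second violates
  // the orders_payment_id_key unique constraint.
  await db.orders.insert({
    userId: charge.customer,
    paymentId: charge.id,
  });
}

// Minimal in-memory stand-in for the orders table with the unique constraint.
function fakeOrdersDb() {
  const seen = new Set();
  return {
    orders: {
      async insert(row) {
        if (seen.has(row.paymentId)) {
          throw new Error(
            'duplicate key value violates unique constraint "orders_payment_id_key"'
          );
        }
        seen.add(row.paymentId);
      },
    },
  };
}
```

Whichever writer runs second throws the same `duplicate key` error seen in the logs, which the Express error path would surface as a 500.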
---
**Ranked summary:**
| Rank | ID | Title | Likelihood |
|---|---|---|---|
| 1 | H6 | Stripe webhook + sync flow race on insert | High |
| 2 | H5 | Connection pool exhaustion → driver-level retry → double insert | High |
| 3 | H1 | ALB/client retry hitting second task | Medium |
| 4 | H2 | Client double-submit (explains dup charges, not constraint error) | Medium |
| 5 | H3 | ECS task crash replaying partial work | Low-Medium |
| 6 | H4 | Middleware double-invocation | Low |
---
## 3. Diagnostic Plan
**Step 1 β Identify whether the same `payment.id` or different `payment.id` values exist for duplicate-charged users**
- **Action:** Query `SELECT user_id, payment_id, created_at FROM orders WHERE user_id IN (<12 affected user IDs>) ORDER BY user_id, created_at;` Also check Stripe dashboard for those 12 users β count distinct charge IDs per user.
- **Expected outcome if H6 is true:** Each user has ONE charge in Stripe but TWO insert attempts (one succeeded, one threw the constraint error). Only one row per user in the orders table.
- **Expected outcome if H1/H2 is true:** Each user has TWO charges in Stripe with DIFFERENT charge IDs. Possibly two rows in the orders table.
- **Branching:** If one charge ID per user → go to Step 2 (confirms H6 or H5). If two charge IDs per user → go to Step 3 (confirms the retry/double-submit path, H1/H2).
**Step 2 β Search codebase for Stripe webhook handler**
- **Action:** `grep -r "charge.succeeded\|webhook\|constructEvent\|stripe.*event" --include="*.js" --include="*.ts" .`
- **Expected outcome if H6 is true:** A webhook handler is found that also calls `db.orders.insert` with the payment ID from the event payload.
- **Branching:** If a webhook handler with insert logic is found → H6 confirmed, proceed to mitigation. If no webhook handler → H5 is the primary suspect, go to Step 4.
**Step 3 β Check ALB access logs for duplicate POSTs**
- **Action:** Query ALB access logs in CloudWatch for the `/checkout` or `/order` endpoint: filter for same `client_ip + request_path` pairs within a 5-second window.
- **Expected outcome if H1 is true:** Paired requests visible with different `target_processing_time` values and different backend task IPs.
- **Branching:** If duplicate POSTs found → client-side or ALB retry confirmed (H1/H2). If not → loop back to H5.
**Step 4 β Check DB connection pressure**
- **Action:** `SELECT count(*), state FROM pg_stat_activity GROUP BY state;` during peak. Also check the DB driver configuration for retry settings (`retry`, `retryAttempts`, `retryDelay`).
- **Expected outcome if H5 is true:** Connections near `max_connections`; driver has retry enabled.
- **Branching:** If connections are saturated + retry exists → H5 confirmed. If connections are healthy and no retry logic → revisit H3/H4.
---
## 4. Mitigation Options
### Fast Mitigation (deployable in minutes)
**FM1 — Scale ECS tasks back to 2 (addresses H5, H6, H1)**
- **What it does:** Reduces concurrency to pre-incident level, shrinking the race window and connection pool pressure.
- **What it sacrifices:** Reduced capacity during peak; risk of latency-induced timeouts if traffic exceeds 2-task capacity.
- **Rollback:** Scale back to 6 via ECS service update.
- **Hypothesis addressed:** All hypotheses — reduces the probability of any concurrent-execution race.
**FM2 — Add a `SELECT ... FOR UPDATE` or `INSERT ... ON CONFLICT DO NOTHING` guard around the order insert (addresses H6, H5)**
- **What it does:** Changes the insert to `INSERT INTO orders (...) VALUES (...) ON CONFLICT (payment_id) DO NOTHING`. The second insert silently no-ops instead of throwing a 500.
- **What it sacrifices:** Masks the duplicate; the second execution path returns `payment` without a corresponding order row (if using DO NOTHING). Acceptable as a stop-the-bleeding measure.
- **Rollback:** Revert the query change.
- **Hypothesis addressed:** H6, H5 — prevents the constraint violation from surfacing as a 500 regardless of cause.
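A minimal sketch of the guarded insert, assuming node-postgres (`pg`) and column names inferred from the snippet (`user_id`, `items`, `payment_id` are assumptions; align with the real schema). The client is injected so the helper can be exercised with a stub:

```javascript
// Guarded insert: the losing writer no-ops instead of throwing a 500.
// Table and column names are assumed from the incident context.
function buildGuardedOrderInsert() {
  return `
    INSERT INTO orders (user_id, items, payment_id)
    VALUES ($1, $2, $3)
    ON CONFLICT (payment_id) DO NOTHING
    RETURNING id
  `;
}

async function insertOrderGuarded(client, { userId, items, paymentId }) {
  const res = await client.query(buildGuardedOrderInsert(), [
    userId,
    JSON.stringify(items),
    paymentId,
  ]);
  // rowCount === 0 means another writer already inserted this payment_id.
  return res.rowCount > 0;
}
```

Callers can log the `false` case so the masked duplicates remain visible while the root cause is investigated.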
**FM3 — Disable the Stripe webhook endpoint temporarily (addresses H6)**
- **What it does:** In Stripe Dashboard → Webhooks, disable the endpoint. Eliminates the race between the sync flow and the async webhook.
- **What it sacrifices:** Any downstream logic dependent on webhooks (e.g., email triggers, fulfillment) stops.
- **Rollback:** Re-enable the webhook endpoint in Stripe.
- **Hypothesis addressed:** H6 directly.
### Root-Cause Fix
**RCF1 — Add Stripe idempotency keys to `stripe.charges.create` (closes H1, H2, H3)**
- **What it changes:** Pass `{ idempotencyKey: deterministicHash(userId, nonce) }` as the second (options) argument to `stripe.charges.create`. Any retry with the same key reuses the original charge instead of creating a new one.
- **Dependencies:** Requires choosing a stable idempotency key derivation (e.g., a cart session ID or checkout nonce generated client-side).
- **Testing:** Replay a checkout request twice with the same key; confirm Stripe returns the same charge ID.
- **Regression risk:** Low. Stripe's idempotency is well-documented. Edge case: user legitimately re-orders identical cart (use a session-scoped nonce, not cart hash).
**RCF2 — Make the webhook handler and sync flow mutually exclusive via upsert (closes H6)**
- **What it changes:** Both paths use `INSERT ... ON CONFLICT (payment_id) DO UPDATE SET ...` (or a distributed lock / "claim" pattern). The first writer wins; the second is a no-op or merges cleanly.
- **Dependencies:** Requires auditing both the sync handler and webhook handler to ensure consistent column sets.
- **Testing:** Simulate concurrent sync + webhook insert in integration test with 6 parallel workers.
- **Regression risk:** Medium β must verify that downstream consumers of the order row handle the upsert semantics correctly.
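One possible shape for the shared upsert, with the column list and the `updated_at` merge assumed for illustration; both the sync handler and the webhook handler would call this single helper:

```javascript
// Single upsert shared by the sync checkout path and the webhook handler.
// First writer inserts; second writer merges instead of erroring.
// Column list is assumed; align it with what both paths actually write.
const UPSERT_ORDER_SQL = `
  INSERT INTO orders (user_id, items, payment_id, updated_at)
  VALUES ($1, $2, $3, now())
  ON CONFLICT (payment_id) DO UPDATE
    SET updated_at = now()
  RETURNING id
`;

async function upsertOrder(client, { userId, items, paymentId }) {
  const res = await client.query(UPSERT_ORDER_SQL, [
    userId,
    JSON.stringify(items),
    paymentId,
  ]);
  // Winner and loser both get the same row id back, so downstream logic
  // (emails, fulfillment) can key off one canonical order.
  return res.rows[0].id;
}
```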
**RCF3 — Right-size the connection pool per task and set PostgreSQL `max_connections` appropriately (closes H5)**
- **What it changes:** Set the per-task pool size to `max_connections / max_tasks` (e.g., `100 / 6 ≈ 15`). Or increase `max_connections`. Add connection pool monitoring.
- **Dependencies:** May require PgBouncer if not already in place.
- **Testing:** Load test with 6 tasks at peak concurrency; confirm no pool exhaustion.
- **Regression risk:** Low if PgBouncer is used; medium if changing `max_connections` on RDS (requires reboot on some configurations).
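The sizing rule can be captured as a small helper; the reserved-connection headroom of 10 below is an assumption (for superuser and maintenance sessions), not a measured value:

```javascript
// Per-task pool budget: leave headroom for superuser/maintenance sessions,
// then divide the remainder across the maximum number of ECS tasks.
// Feed the result into each task's pool config (e.g. `max` in node-postgres).
function poolSizePerTask(maxConnections, maxTasks, reservedConnections = 10) {
  const usable = maxConnections - reservedConnections;
  return Math.max(1, Math.floor(usable / maxTasks));
}
```

With the defaults assumed in this report (`max_connections = 100`, scaling ceiling of 6 tasks), this yields 15 connections per task, matching the `100 / 6 ≈ 15` figure above.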
### Recommended Order of Application
1. **FM1 — Scale back to 2 tasks** (immediate, stops the bleeding for all hypotheses)
2. **Run Diagnostic Step 1 + Step 2** (5–10 minutes, disambiguates H6 vs H5)
3. **FM2 — ON CONFLICT DO NOTHING** (deploy while investigating, prevents 500s if scaling back up)
4. **FM3 — Disable webhook** (only if Step 2 confirms H6)
5. **RCF1 + RCF2** (the proper fix, deploy together in the next release)
6. **RCF3** (before re-enabling auto-scaling)
---
## 5. Verification
**Mitigation success signals:**
- `duplicate key value violates unique constraint orders_payment_id_key` drops to **0 occurrences/hour** within 15 minutes of FM1 + FM2 deployment.
- Checkout 500 rate drops below **0.5%** (baseline) within 15 minutes.
- No new duplicate charges appearing in Stripe dashboard for 1 hour after mitigation.
**Root-cause fix verification:**
- Load test with 6 ECS tasks at 2× peak traffic: zero constraint violations, zero duplicate Stripe charges.
- Integration test: fire a `charge.succeeded` webhook simultaneously with the sync `createOrder` return path — confirm exactly one order row is created.
- Replay the exact request pattern from the 12 affected users in staging with 6 tasks β confirm idempotency key prevents duplicate charges.
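The concurrent-insert check above can be sketched like this, with an in-memory store standing in for the database; a real test should target a disposable PostgreSQL instance so the actual `ON CONFLICT` path is exercised:

```javascript
// Race both write paths against a store that deduplicates on payment id,
// then count surviving rows. Exactly one row must remain.
function uniqueStore() {
  const rows = new Map();
  return {
    async upsert(paymentId, row) {
      if (!rows.has(paymentId)) rows.set(paymentId, row);
      return rows.get(paymentId);
    },
    size: () => rows.size,
  };
}

async function raceBothPaths(store, paymentId) {
  await Promise.all([
    store.upsert(paymentId, { source: 'sync' }),     // createOrder path
    store.upsert(paymentId, { source: 'webhook' }),  // charge.succeeded path
  ]);
  return store.size();
}
```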
**Regression watch:**
- **Order fulfillment pipeline:** If it consumes webhook events, disabling the webhook (FM3) may cause orders to be created but not fulfilled. Monitor fulfillment queue depth.
- **Stripe reconciliation:** The `ON CONFLICT DO NOTHING` in FM2 means some webhook-originated inserts silently vanish. Confirm that reporting/reconciliation queries are not affected.
- **Connection pool health:** After re-enabling auto-scaling with RCF3, monitor `pg_stat_activity` count and pool wait time metrics for 48 hours.
---
## 6. Post-Incident Follow-Up
**Detection gap:** No alert existed on the `duplicate key` PostgreSQL error rate or on the Stripe duplicate-charge pattern. A simple alert — "rate of `orders_payment_id_key` violations > 5/hour" — would have fired within the first 10 minutes of the incident, 15 minutes before the first customer complaint. Additionally, a Stripe webhook reconciliation check (charge count vs. order count per user per hour) would surface duplicate charges proactively.
**Hardening opportunity:** Implement an **idempotency layer at the application boundary** for all payment-mutating endpoints. This means generating a client-side idempotency token at checkout initiation, threading it through to Stripe (as an idempotency key) and to PostgreSQL (as an `INSERT ... ON CONFLICT` guard). This single pattern eliminates the entire class of duplicate-charge incidents caused by retries, webhooks, scaling races, or client double-submits.
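One shape such a boundary guard could take is sketched below. The header name, store API, and response contract are all assumptions; crucially, the store must be shared across tasks (e.g. Redis or a Postgres table), since an in-process cache cannot deduplicate across 6 ECS tasks:

```javascript
// Idempotency guard for an Express-style payment handler: replay the stored
// response for a repeated token instead of re-running side effects.
// Best-effort sketch; a production store needs an atomic claim (e.g. a
// unique-keyed INSERT) to close the window between get() and set().
function withIdempotency(store, handler) {
  return async (req, res) => {
    const token = req.headers['x-idempotency-token']; // assumed header name
    if (!token) {
      res.status(400).json({ error: 'missing idempotency token' });
      return;
    }
    const cached = await store.get(token);
    if (cached) {
      res.status(cached.status).json(cached.body); // replay, no side effects
      return;
    }
    const result = await handler(req); // result: { status, body }
    await store.set(token, result);
    res.status(result.status).json(result.body);
  };
}
```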
**Blameless timeline note:** At 13:55 UTC, release v2.8.1 was deployed, enabling ECS auto-scaling from 2 to 6 tasks. At 14:20 UTC, during peak traffic ramp-up, the first `orders_payment_id_key` constraint violation appeared in application logs. By 14:45 UTC, the error rate stabilized at approximately 40 occurrences per hour. The incident was detected at [DETECTION_TIME] via [customer report / monitoring — to be filled in]. The most probable root cause is a race condition between the synchronous checkout flow and an asynchronous Stripe webhook handler, both attempting to insert an order row with the same payment ID — a race whose window was negligible at 2 tasks but became significant at 6 concurrent tasks under peak load.
Turn a live production incident into a structured triage report in one response.
- Severity classification and signal breakdown
- 3 to 6 ranked hypotheses with cheapest confirmation tests
- Bisection-style diagnostic plan with branching rules
- Fast mitigation and root-cause fix, each traced to a hypothesis
- Verification thresholds and post-incident follow-up
Built for SREs, on-call engineers, tech leads, and small teams.
Instructions and tips included.
Added 3 weeks ago
