PromptBase

Prompt Details

Model
(claude-4-6-opus)
Token size
876
Example input
[CODE_SNIPPET]:
async function createOrder(userId, cart) {
  const payment = await stripe.charges.create({ amount: cart.total, customer: userId });
  await db.orders.insert({ userId, items: cart.items, paymentId: payment.id });
  return payment;
}

[TECH_STACK]: Node.js 20, Express, PostgreSQL 15, Stripe SDK, deployed on AWS ECS Fargate behind an ALB

[SYMPTOMS]: 5% of checkout requests returning 500, duplicate charges in Stripe dashboard for ~12 users, log pattern "duplicate key value violates unique constraint orders_payment_id_key" appearing 40 times/hour since 14:20 UTC

[OCCURRENCE_PATTERN]: Started 25 minutes after deploying release v2.8.1, which added ECS auto-scaling from 2 to 6 tasks. Concentrated in peak hours 14:00-19:00 UTC. Not reproducible in staging with a single task.

[BLAST_RADIUS]: Affects roughly 8% of logged-in users on the checkout flow. Revenue path degraded but not down. 12 confirmed duplicate charges requiring manual refund. Critical severity for the business.
Example output
# Production Incident Triage Report

## 1. Incident Snapshot

**Problem statement:** Since 14:20 UTC today, 25 minutes after deploying v2.8.1 with ECS auto-scaling (2→6 tasks), approximately 5% of checkout requests are returning HTTP 500 errors, and a subset of users (~12 confirmed) have been double-charged in Stripe. The immediate trigger is a unique constraint violation on `orders_payment_id_key`, meaning the same Stripe charge ID is being inserted into the `orders` table more than once.

**Severity: SEV2 (major degradation).** The revenue path is functional but degraded for ~8% of checkout users, and real money has been incorrectly taken from 12 customers. Not SEV1 because checkout is not fully down; elevated above SEV3 because duplicate charges carry financial and regulatory exposure.

**Signal breakdown:**

| Signal | Classification |
|---|---|
| 5% of checkouts returning 500 | **Load-dependent / environment-dependent**: correlates with peak hours and multi-task scaling |
| Duplicate Stripe charges for ~12 users | **Intermittent, load-dependent**: subset of the 500s; requires a race condition to manifest |
| `duplicate key value violates unique constraint orders_payment_id_key` 40×/hr | **Deterministic given the race**: direct consequence of two tasks inserting the same payment ID |
| Onset 25 min after v2.8.1 deploy (scaling 2→6 tasks) | **Environment-dependent**: not reproducible on single-task staging |
| Concentrated 14:00–19:00 UTC peak | **Load-dependent**: higher concurrency widens the race window |

**Ruled in:** A concurrency problem introduced or exposed by horizontal scaling. The code charges Stripe first, then inserts the order; any retry or duplicate request hitting a second task will re-charge Stripe and then collide on the DB insert.

**Ruled out:**
- Database schema migration bug (the constraint exists and is enforced correctly).
- Stripe SDK version defect (charges succeed; the problem is *too many* charges, not failed ones).
- Single-task logic error (staging with one task cannot reproduce it).
- Network-layer 500s unrelated to application logic (the error message is application-specific, from PostgreSQL).

---

## 2. Hypothesis Ledger

### H1: ALB retry / client retry hitting a second ECS task (no idempotency)

**Mechanism:** The user submits checkout. The request is slow (Stripe latency), or the ALB/client times out and retries. The retry lands on a different ECS task. Both tasks call `stripe.charges.create` independently, producing two real charges with two different `payment.id` values. The first insert succeeds; the second hits the unique constraint *only if* both happen to derive the same payment ID. More likely, the first request actually completed, so the second request creates a *new* charge and inserts a new payment ID that doesn't collide; the user sees a 500 from the timed-out first request, retries manually, and creates a *second* order row with a *second* charge.

**However**, the log says the constraint violated is `orders_payment_id_key`, meaning the *same* `payment.id` is being inserted twice. This points to the exact same Stripe charge response being processed twice, i.e. the same request execution path running twice.

**Revised mechanism:** An ALB retry is unlikely for POST (ALB retries are connection-level by default, not HTTP-level on 5xx). More probable: client-side code retries the POST on timeout or 500. The first request succeeds end-to-end (Stripe charge + DB insert). The client doesn't see the success (timeout) and retries; the second request calls `stripe.charges.create` again, gets a *different* charge ID, and inserts successfully with that different payment ID. This would cause duplicate charges but *not* the unique constraint error, so it doesn't fully explain the `orders_payment_id_key` violation unless the same charge ID is reused.
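The same-`payment.id` collision this section circles around can be sketched as a small Node.js simulation. This is a hypothetical illustration, not code from the incident: the in-memory `Set` stands in for PostgreSQL's `orders_payment_id_key` unique index, and all names (`insertOrder`, `demo`, `ch_abc123`) are invented.

```javascript
// Sketch of the race: two handlers processing the SAME Stripe charge
// object concurrently. The Set stands in for the orders_payment_id_key
// unique index; all names here are hypothetical.
const insertedPaymentIds = new Set();

async function insertOrder(order) {
  // Simulate the DB round-trip so the two "tasks" interleave.
  await new Promise((resolve) => setTimeout(resolve, 10));
  if (insertedPaymentIds.has(order.paymentId)) {
    throw new Error(
      'duplicate key value violates unique constraint "orders_payment_id_key"'
    );
  }
  insertedPaymentIds.add(order.paymentId);
}

async function demo() {
  const payment = { id: "ch_abc123" }; // the same charge, processed twice
  const results = await Promise.allSettled([
    insertOrder({ userId: 42, paymentId: payment.id }),
    insertOrder({ userId: 42, paymentId: payment.id }),
  ]);
  // One insert wins; the other surfaces the constraint error as a 500.
  return results.map((r) => r.status);
}
```

In the real system the two executions run in separate ECS tasks, so no in-process lock can help; the deduplication has to live in the database or in Stripe's idempotency layer.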
**Re-evaluation:** The constraint violation on `orders_payment_id_key` means the *identical* `payment.id` is being inserted twice. This can only happen if the same Stripe API response (the `payment` object) is used in two concurrent `db.orders.insert` calls. That means two executions of the same function with the same `payment` result: likely the same in-flight request being processed twice, or the function being called twice with a cached/shared payment object.

**Supporting signals:** Onset with multi-task scaling; not reproducible with a single task; peak-hour concentration.

**Disconfirming signals:** The ALB does not retry POST requests on HTTP 5xx by default, only on connection failures.

**Likelihood:** Medium.

**Cheapest confirmation:** Check ALB access logs for two requests from the same client IP with the same request body within a 1–5 second window to the `/checkout` or `/order` endpoint.

---

### H2: Client-side double-submit with no Stripe idempotency key

**Mechanism:** The frontend checkout button lacks debounce or disabling on click. The user double-clicks, or the SPA retries on perceived failure. Both requests reach different ECS tasks. Both call `stripe.charges.create`, but because `createOrder` does **not pass an idempotency key** to Stripe, each call creates a *separate* charge with a *different* `payment.id`. This would produce duplicate charges but different payment IDs, so the unique constraint on `payment_id` would NOT fire.

**Contradiction with symptom:** The constraint violation is on `payment_id`, meaning the same ID was inserted twice. This hypothesis explains the duplicate charges for some users but not the constraint error.

**Likelihood:** Medium for the duplicate-charges symptom; does NOT explain the constraint violation.

**Cheapest confirmation:** Check the Stripe dashboard: do the 12 duplicate-charged users have two *different* charge IDs or the *same* charge ID?
---

### H3: ECS task crash/restart replaying the DB insert after the Stripe charge succeeded

**Mechanism:** `stripe.charges.create` succeeds and returns `payment`. Before or during `db.orders.insert`, the ECS task is killed (OOM, scaling event, rolling restart) or the DB connection drops transiently. The request fails with a 500 to the user. The user (or client) retries. On retry, `stripe.charges.create` runs again, creating a **second** charge, and `db.orders.insert` then runs with the **new** payment ID. This explains duplicate charges but not the unique constraint violation.

**Alternative sub-mechanism (the critical one):** The task doesn't crash. Instead, the **same request** is somehow processed twice within the same task or across tasks. Given Node.js 20 + Express, there is no built-in request deduplication. But the unique constraint violation on `orders_payment_id_key` requires the *same* `payment.id`, and the only way this happens is if the same `payment` object's ID is inserted twice. That means either: (a) two concurrent calls to `db.orders.insert` with the same payment ID, or (b) a retry of the insert after a transient failure where the first insert actually committed.

**(b) is the strongest fit:** `db.orders.insert` succeeds, but the acknowledgment is lost (TCP reset, connection pool timeout under load). The `await` throws, the function returns a 500, and the user retries, getting a new Stripe charge (the duplicate). Separately, the **framework or middleware retries the insert** on the same connection, or a retry wrapper around the DB call re-issues it. However, no retry logic is visible in the snippet.

**Likelihood:** Low-Medium.

**Cheapest confirmation:** Correlate ECS task termination/restart events with the timestamps of the 40/hr constraint violations.
---

### H4: Express middleware or upstream proxy replaying the request to a second handler invocation within the same task

**Mechanism:** A middleware (e.g. a body-parser stream re-read, or a custom retry middleware added in v2.8.1) inadvertently calls `next()` twice or re-invokes the route handler, so the same request runs `createOrder` twice in the same task. First call: Stripe charge succeeds, DB insert succeeds. Second call: Stripe creates a *new* charge (duplicate), but the DB insert would use a *different* payment ID from the second Stripe call. So this **does not explain the same payment ID being inserted twice**, unless the middleware replays *after* Stripe returns but *before* the DB insert, sharing the same `payment` variable in a closure. In Express this is unlikely unless the handler is written as a generator or there is shared mutable state.

**Likelihood:** Low.

**Cheapest confirmation:** Add a `console.log` of `request.id` + timestamp at the entry of `createOrder` and look for duplicate invocations with the same request ID.

---

### H5: PostgreSQL connection pool exhaustion causing insert timeout + ghost commit

**Mechanism:** Scaling from 2 to 6 tasks tripled connection demand. PostgreSQL's `max_connections` (default 100) or the per-task pool (commonly 10–20, so 6 tasks means 60–120 connections) comes under pressure during peak. The `db.orders.insert` call acquires a connection and sends the INSERT; the row is committed on the server side, but the **acknowledgment back to the client** is delayed or lost due to pool contention or a TCP-level timeout. The `await` on `db.orders.insert` **throws** (connection or pool timeout), `createOrder` throws, and Express returns 500, even though the row is **actually in the database**. The user retries the checkout, and `stripe.charges.create` creates a **new, second charge** (the duplicate). `db.orders.insert` then inserts the **new** payment ID and **succeeds**, because the payment ID is different. **But this doesn't explain the unique constraint violation on the same payment ID.**

**Revised mechanism for the constraint violation:** The pool is exhausted. The `await db.orders.insert(...)` call times out at the client pool level *before* sending the query; or it sends the query, the query is queued by PgBouncer or the pool, and it executes twice due to a driver-level retry. The Node.js `pg` driver does not retry queries by default, but some ORMs (Sequelize, Prisma, TypeORM) have retry logic. If `db.orders` uses an ORM with automatic query retry, the first attempt can commit and the second attempt hits the unique constraint.

**Supporting signals:** Onset exactly with scaling 2→6 (3× connection demand); peak-hour concentration (highest pool pressure); `duplicate key` is the DB error, consistent with a committed row plus a retry; not reproducible on a single task (the pool is never stressed).

**Disconfirming signals:** We don't know the ORM or driver. If it is raw `pg` with no retry, this mechanism needs a different retry source.

**Likelihood:** **High.**

**Cheapest confirmation:** Check PostgreSQL `max_connections` and the current connection count during peak. Check the Node.js DB library for built-in query retry. Run `SELECT count(*) FROM pg_stat_activity;` during peak hours.

---

### H6: Stripe webhook + synchronous flow race creating duplicate order inserts

**Mechanism:** A Stripe webhook handler (e.g. for `charge.succeeded`) also calls `db.orders.insert` with the same `paymentId`. The synchronous `createOrder` function inserts the order, *and* the webhook fires and tries to insert the same order. With 6 tasks, the webhook may land on a different task, and the two inserts race. The one that loses hits `orders_payment_id_key`.
**Supporting signals:** The constraint violation is on the *same* `payment.id`, which this hypothesis explains most naturally. Webhooks are asynchronous and can arrive within milliseconds of charge creation. Multi-task scaling means the webhook request and the synchronous insert can truly run in parallel.

**Disconfirming signals:** We have no direct evidence of a webhook handler. However, one is extremely common in Stripe integrations, and the `orders_payment_id_key` violation on the *same* payment ID is hard to explain otherwise.

**Likelihood:** **High.**

**Cheapest confirmation:** Search the codebase for `charge.succeeded`, `webhook`, or `stripe.webhooks.constructEvent`. Check Stripe dashboard → Webhooks → recent deliveries for `charge.succeeded` events matching the 12 affected users.

---

**Ranked summary:**

| Rank | ID | Title | Likelihood |
|---|---|---|---|
| 1 | H6 | Stripe webhook + sync flow race on insert | High |
| 2 | H5 | Connection pool exhaustion → driver-level retry → double insert | High |
| 3 | H1 | ALB/client retry hitting a second task | Medium |
| 4 | H2 | Client double-submit (explains duplicate charges, not the constraint error) | Medium |
| 5 | H3 | ECS task crash replaying partial work | Low-Medium |
| 6 | H4 | Middleware double-invocation | Low |

---

## 3. Diagnostic Plan

**Step 1: Determine whether duplicate-charged users have the same or different `payment.id` values**
- **Action:** Query `SELECT user_id, payment_id, created_at FROM orders WHERE user_id IN (<12 affected user IDs>) ORDER BY user_id, created_at;`. Also check the Stripe dashboard for those 12 users and count distinct charge IDs per user.
- **Expected outcome if H6 is true:** Each user has ONE charge in Stripe but TWO insert attempts (one succeeded, one threw the constraint error), and only one row per user in the orders table.
- **Expected outcome if H1/H2 is true:** Each user has TWO charges in Stripe with DIFFERENT charge IDs, and possibly two rows in the orders table.
- **Branching:** One charge ID per user → go to Step 2 (supports H6 or H5). Two charge IDs per user → go to Step 3 (supports the retry/double-submit path, H1/H2).

**Step 2: Search the codebase for a Stripe webhook handler**
- **Action:** `grep -r "charge.succeeded\|webhook\|constructEvent\|stripe.*event" --include="*.js" --include="*.ts" .`
- **Expected outcome if H6 is true:** A webhook handler is found that also calls `db.orders.insert` with the payment ID from the event payload.
- **Branching:** Webhook handler with insert logic found → H6 confirmed, proceed to mitigation. No webhook handler → H5 becomes the primary suspect, go to Step 4.

**Step 3: Check ALB access logs for duplicate POSTs**
- **Action:** Query ALB access logs in CloudWatch for the `/checkout` or `/order` endpoint: filter for identical `client_ip + request_path` pairs within a 5-second window.
- **Expected outcome if H1 is true:** Paired requests visible with different `target_processing_time` values and different backend task IPs.
- **Branching:** Duplicate POSTs found → client-side or ALB retry confirmed (H1/H2). Not found → loop back to H5.

**Step 4: Check DB connection pressure**
- **Action:** Run `SELECT count(*), state FROM pg_stat_activity GROUP BY state;` during peak. Also check the DB driver configuration for retry settings (`retry`, `retryAttempts`, `retryDelay`).
- **Expected outcome if H5 is true:** Connections near `max_connections`, and the driver has retry enabled.
- **Branching:** Connections saturated and retry exists → H5 confirmed. Connections healthy and no retry logic → revisit H3/H4.

---

## 4. Mitigation Options

### Fast Mitigation (deployable in minutes)

**FM1: Scale ECS tasks back to 2 (addresses H5, H6, H1)**
- **What it does:** Reduces concurrency to the pre-incident level, shrinking the race window and the connection pool pressure.
- **What it sacrifices:** Reduced capacity during peak; risk of latency-induced timeouts if traffic exceeds 2-task capacity.
- **Rollback:** Scale back to 6 via an ECS service update.
- **Hypothesis addressed:** All of them; reduces the probability of any concurrent-execution race.

**FM2: Add a `SELECT ... FOR UPDATE` or `INSERT ... ON CONFLICT DO NOTHING` guard around the order insert (addresses H6, H5)**
- **What it does:** Changes the insert to `INSERT INTO orders (...) VALUES (...) ON CONFLICT (payment_id) DO NOTHING`. The second insert silently no-ops instead of throwing a 500.
- **What it sacrifices:** Masks the duplicate; with DO NOTHING, the second execution path returns `payment` without a corresponding order row. Acceptable as a stop-the-bleeding measure.
- **Rollback:** Revert the query change.
- **Hypothesis addressed:** H6, H5; prevents the constraint violation from surfacing as a 500 regardless of cause.

**FM3: Disable the Stripe webhook endpoint temporarily (addresses H6)**
- **What it does:** In the Stripe Dashboard → Webhooks, disable the endpoint, eliminating the race between the sync flow and the async webhook.
- **What it sacrifices:** Any downstream logic dependent on webhooks (e.g. email triggers, fulfillment) stops.
- **Rollback:** Re-enable the webhook endpoint in Stripe.
- **Hypothesis addressed:** H6 directly.

### Root-Cause Fixes

**RCF1: Add Stripe idempotency keys to `stripe.charges.create` (closes H1, H2, H3)**
- **What it changes:** Pass `{ idempotencyKey: deterministic_hash(userId, cart) }` to the Stripe call. Any retry with the same inputs reuses the original charge.
- **Dependencies:** Requires choosing a stable idempotency key derivation (e.g. a cart session ID or a checkout nonce generated client-side).
- **Testing:** Replay a checkout request twice with the same key; confirm Stripe returns the same charge ID.
- **Regression risk:** Low; Stripe's idempotency is well documented. Edge case: a user legitimately re-orders an identical cart (use a session-scoped nonce, not a cart hash).

**RCF2: Make the webhook handler and sync flow mutually exclusive via upsert (closes H6)**
- **What it changes:** Both paths use `INSERT ... ON CONFLICT (payment_id) DO UPDATE SET ...` (or a distributed lock / "claim" pattern). The first writer wins; the second is a no-op or merges cleanly.
- **Dependencies:** Requires auditing both the sync handler and the webhook handler to ensure consistent column sets.
- **Testing:** Simulate a concurrent sync + webhook insert in an integration test with 6 parallel workers.
- **Regression risk:** Medium; must verify that downstream consumers of the order row handle the upsert semantics correctly.

**RCF3: Right-size the connection pool per task and set PostgreSQL `max_connections` appropriately (closes H5)**
- **What it changes:** Set the per-task pool size to `max_connections / max_tasks` (e.g. `100 / 6 ≈ 15`), or increase `max_connections`. Add connection pool monitoring.
- **Dependencies:** May require PgBouncer if not already in place.
- **Testing:** Load test with 6 tasks at peak concurrency; confirm no pool exhaustion.
- **Regression risk:** Low if PgBouncer is used; medium if changing `max_connections` on RDS (requires a reboot in some configurations).

### Recommended Order of Application

1. **FM1: scale back to 2 tasks** (immediate; stops the bleeding for all hypotheses)
2. **Run Diagnostic Steps 1 + 2** (5–10 minutes; disambiguates H6 vs H5)
3. **FM2: ON CONFLICT DO NOTHING** (deploy while investigating; prevents 500s when scaling back up)
4. **FM3: disable the webhook** (only if Step 2 confirms H6)
5. **RCF1 + RCF2** (proper fix; deploy together in the next release)
6. **RCF3** (before re-enabling auto-scaling)

---

## 5. Verification

**Mitigation success signals:**
- `duplicate key value violates unique constraint orders_payment_id_key` drops to **0 occurrences/hour** within 15 minutes of the FM1 + FM2 deployment.
- The checkout 500 rate drops below **0.5%** (baseline) within 15 minutes.
- No new duplicate charges appear in the Stripe dashboard for 1 hour after mitigation.

**Root-cause fix verification:**
- Load test with 6 ECS tasks at 2× peak traffic: zero constraint violations, zero duplicate Stripe charges.
- Integration test: fire a `charge.succeeded` webhook simultaneously with the sync `createOrder` return path and confirm exactly one order row is created.
- Replay the exact request pattern from the 12 affected users in staging with 6 tasks; confirm the idempotency key prevents duplicate charges.

**Regression watch:**
- **Order fulfillment pipeline:** If it consumes webhook events, disabling the webhook (FM3) may cause orders to be created but not fulfilled. Monitor fulfillment queue depth.
- **Stripe reconciliation:** The `ON CONFLICT DO NOTHING` in FM2 means some webhook-originated inserts silently vanish. Confirm that reporting/reconciliation queries are not affected.
- **Connection pool health:** After re-enabling auto-scaling with RCF3, monitor the `pg_stat_activity` count and pool wait-time metrics for 48 hours.

---

## 6. Post-Incident Follow-Up

**Detection gap:** No alert existed on the `duplicate key` PostgreSQL error rate or on the Stripe duplicate-charge pattern. A simple alert ("rate of `orders_payment_id_key` violations > 5/hour") would have fired within the first 10 minutes of the incident, 15 minutes before the first customer complaint. Additionally, a Stripe reconciliation check (charge count vs. order count per user per hour) would surface duplicate charges proactively.

**Hardening opportunity:** Implement an **idempotency layer at the application boundary** for all payment-mutating endpoints: generate a client-side idempotency token at checkout initiation, then thread it through to Stripe (as an idempotency key) and to PostgreSQL (as an `INSERT ... ON CONFLICT` guard). This single pattern eliminates the entire class of duplicate-charge incidents caused by retries, webhooks, scaling races, or client double-submits.

**Blameless timeline note:** At 13:55 UTC, release v2.8.1 was deployed, enabling ECS auto-scaling from 2 to 6 tasks. At 14:20 UTC, during peak traffic ramp-up, the first `orders_payment_id_key` constraint violation appeared in application logs. By 14:45 UTC, the error rate stabilized at approximately 40 occurrences per hour. The incident was detected at [DETECTION_TIME] via [customer report / monitoring, to be filled in]. The most probable root cause is a race condition between the synchronous checkout flow and an asynchronous Stripe webhook handler, both attempting to insert an order row with the same payment ID: a race whose window was negligible at 2 tasks but became significant at 6 concurrent tasks under peak load.
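The `INSERT ... ON CONFLICT` guard recommended in the mitigation section reduces to a single parameterized statement. A minimal sketch, assuming the node-postgres (`pg`) driver and hypothetical column names; the builder is invented for illustration:

```javascript
// Sketch of the FM2/RCF2 guard, assuming node-postgres (`pg`) and
// hypothetical column names. The caller inspects `rowCount` to learn
// whether this execution created the order (1) or lost the race to a
// concurrent insert, e.g. a webhook (0).
function buildOrderInsert(userId, items, paymentId) {
  return {
    text:
      "INSERT INTO orders (user_id, items, payment_id) " +
      "VALUES ($1, $2, $3) " +
      "ON CONFLICT (payment_id) DO NOTHING",
    values: [userId, JSON.stringify(items), paymentId],
  };
}

// Usage with a pg Pool (not runnable without a live database):
//
//   const { rowCount } = await pool.query(
//     buildOrderInsert(userId, cart.items, payment.id)
//   );
//   if (rowCount === 0) {
//     // A webhook or retry already recorded this charge; treat as success.
//   }
```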
🌀 Claude

Production Incident Triage Engine

πŸ› οΈ Turn a live production incident into a structured triage report in one response. β–ͺ️ Severity classification and signal breakdown β–ͺ️ 3 to 6 ranked hypotheses with cheapest confirmation tests β–ͺ️ Bisection-style diagnostic plan with branching rules β–ͺ️ Fast mitigation and root-cause fix, each traced to a hypothesis β–ͺ️ Verification thresholds and post-incident follow-up Built for SREs, on-call engineers, tech leads, and small teams. πŸ“ Instructions and tips included.
Added 3 weeks ago