Prompt Details

Model
(claude-4-5-sonnet)
Token size
454
Example input
[QUESTION]: What is our refund policy for digital products purchased via the mobile app?

[CONTEXT]: Four chunks from internal docs, mostly about physical product returns. Only one short line mentions digital products, and it is incomplete.

[ANSWER]: Gives a detailed policy for physical returns and shipping labels; says nothing about digital products.

[EXPECTED]: Should say: "Digital products are non-refundable except in cases of failed access within 14 days..." etc.

[SYSTEM_NOTES]:
- Chunk size: 512 tokens, overlap 64
- top_k: 5
- No metadata filters
- Prompt: generic "Answer based only on the context."
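For reference, the failing baseline described in [SYSTEM_NOTES] corresponds to roughly the following setup. This is a minimal sketch, not the actual pipeline (which the listing does not show): the `retrieve` and `llm` callables and the `chunk_tokens` helper are hypothetical stand-ins.

```python
# Rough reconstruction of the failing baseline (hypothetical names throughout).

GENERIC_PROMPT = "Answer based only on the context."  # the weak prompt under test

def chunk_tokens(tokens: list[str], size: int = 512, overlap: int = 64) -> list[list[str]]:
    """Fixed-size chunking with overlap, matching the 512/64 setup in [SYSTEM_NOTES]."""
    step = size - overlap
    return [tokens[i:i + size] for i in range(0, max(len(tokens) - overlap, 1), step)]

def answer(query: str, retrieve, llm) -> str:
    """top_k=5, no metadata filters, generic instruction: the configuration that failed."""
    context = "\n\n".join(retrieve(query, top_k=5))
    return llm(f"{GENERIC_PROMPT}\n\nContext:\n{context}\n\nQuestion: {query}")
```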
Example output
# RAG Failure Forensics Analysis

## Failure Classification

**Primary Failure Type:** Retrieval Mismatch + Context Insufficiency

**Plain Language Explanation:** The retrieval system successfully fetched relevant documents but failed to retrieve the specific chunk containing the digital products refund policy. The model then answered based on what it *did* receive (physical product returns), resulting in a topically adjacent but factually incorrect response. This is a classic "plausible but wrong" failure: the system retrieved content from the correct domain (refunds) but the wrong subdomain (physical vs. digital).

**Secondary Issue:** Prompt Weakness

The generic prompt ("Answer based only on the context") didn't instruct the model to explicitly acknowledge missing information, so it filled the gap with irrelevant but contextually similar content.

---

## Root Cause Analysis

### Component Breakdown

**1. Retrieval Layer (Primary Culprit - 60% responsibility)**

- **Issue:** Query-document mismatch in semantic space
  - User query: "refund policy for **digital products** via **mobile app**"
  - Retrieved chunks: predominantly physical product returns
  - The one relevant line about digital products was either:
    - Split across chunk boundaries (chunking artifact)
    - Buried in a lower-ranked position (beyond top-5)
    - Semantically diluted by surrounding physical product context
- **Contributing factors:**
  - No metadata filtering on product type (digital vs. physical)
  - 512-token chunks may have grouped digital policy with unrelated content, weakening semantic coherence
  - Overlap of 64 tokens (12.5%) is relatively small and may miss critical context at boundaries

**2. Ranking Layer (25% responsibility)**

- **Issue:** Relevance scoring failed to prioritize digital-specific content
- Likely due to term frequency: "refund" and "policy" appear in both physical and digital contexts, but "digital products" appears rarely
- No metadata signals to boost digital product chunks
- No query expansion to emphasize "digital," "non-physical," or "downloadable"

**3. Prompt Engineering (10% responsibility)**

- **Issue:** No safety net for incomplete context
- Prompt doesn't instruct: "If context doesn't contain information about the specific question, say so explicitly"
- No guidance to distinguish between product types
- Generic instruction encourages the model to "make do" with available context

**4. Metadata & Filtering (5% responsibility)**

- **Issue:** No product-type tagging
- Chunks lack metadata like `product_type: digital` or `channel: mobile_app`
- Top-k operates blindly without domain-specific filters

**5. Model Behavior (Negligible)**

- The model performed as expected given poor inputs; it answered based on retrieved context, per instructions

---
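The retrieval-layer diagnosis is easy to reproduce in miniature. In the sketch below, everything is illustrative: a three-chunk toy corpus and an assumed off-the-shelf embedder (`all-MiniLM-L6-v2`; [SYSTEM_NOTES] does not name the real one). The point is that all three chunks score well on the shared "refund policy" vocabulary, so the truncated digital line has no guaranteed edge in the ranking.

```python
# Miniature reproduction of the retrieval mismatch (hypothetical corpus;
# the production embedding model is unknown, so an off-the-shelf one is assumed).
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

chunks = [
    "Refund policy: physical products may be returned within 30 days with a prepaid shipping label.",
    "Returns of physical items require original packaging; refunds are issued after inspection.",
    "Digital products are non-refundable except",  # the truncated digital-policy line
]
query = "What is our refund policy for digital products purchased via the mobile app?"

# Cosine similarity of the query against each chunk; print in ranked order.
scores = util.cos_sim(model.encode(query), model.encode(chunks))[0]
for score, chunk in sorted(zip(scores.tolist(), chunks), reverse=True):
    print(f"{score:.3f}  {chunk[:60]}")
```

Whether the digital line wins here depends on the embedder; the fragility of the ranking itself is the diagnosis.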
## Assumptions & Gaps

Since [SYSTEM_NOTES] is limited, I'm explicitly noting unknowns:

- **Unknown:** Embedding model used (e.g., OpenAI ada-002, sentence-transformers, custom)
- **Unknown:** Whether the digital products policy exists as a standalone document or is embedded within a larger policy doc
- **Unknown:** Query preprocessing (stemming, entity extraction, etc.)
- **Unknown:** Reranking model (if any)
- **Assumed:** No hybrid search (keyword + semantic) in use
- **Assumed:** No user context (e.g., "this user is on mobile app") passed to retrieval

---

## Fix Plan

### High-Impact Fixes (Implement First)

**Fix 1: Add Metadata Filters with Product Type Tags**

- **Action:**
  - Tag all policy chunks with `product_type: [physical|digital|service]` and `channel: [web|mobile_app|in-store]`
  - Modify retrieval to apply filters when the query contains product type indicators
  - Example: query "digital products" → filter `product_type: digital`
- **Validation:**
  - Run 20 test queries about digital products; measure retrieval of digital-tagged chunks (target: 80%+ precision@5)
  - A/B test: filtered vs. unfiltered retrieval on mixed product-type queries

**Fix 2: Reduce Chunk Size + Increase Overlap for Policy Documents**

- **Action:**
  - For policy/legal docs, reduce chunk size to **256 tokens** with **128-token overlap** (50%)
  - Rationale: policies are dense; smaller chunks mean tighter semantic coherence and less noise
  - Apply selectively to the "policy" document type (don't change the knowledge base globally)
- **Validation:**
  - Compare retrieval precision before/after on 50 policy-related questions
  - Check for reduced chunk boundary artifacts (manual review of 10 split examples)

**Fix 3: Enhance Prompt with Explicit Handling of Missing Info**

- **Action:**
  - Replace: `"Answer based only on the context."`
  - With: `"Answer based only on the provided context. If the context does not contain information to fully answer the question—especially regarding specific product types, channels, or edge cases—explicitly state what is missing. Do not infer or use information about different products/scenarios."`
- **Validation:**
  - Run the failed query again; the model should now say "Context does not specify digital product refund policy"
  - Test on 10 "missing context" scenarios; measure explicit acknowledgment rate (target: 90%)

### Medium-Impact Fixes (Implement Second)

**Fix 4: Implement Hybrid Search (Keyword + Semantic)**

- **Action:**
  - Add BM25 keyword search alongside semantic embeddings
  - Weight: 0.3 keyword, 0.7 semantic (tune on a validation set)
  - Ensures exact matches like "digital products" boost relevance
- **Validation:**
  - Measure MRR (Mean Reciprocal Rank) improvement on 100 diverse queries
  - Target: +15-20% MRR for queries with rare/specific terms

**Fix 5: Add Query Expansion for Product-Type Queries**

- **Action:**
  - When the query contains "digital products," expand to: `["digital products", "non-physical items", "downloadable content", "electronic purchases"]`
  - Use OR logic in retrieval
- **Validation:**
  - Check whether the digital product chunk's rank improves (target: moves from position 8+ to top-3)
  - Monitor precision@1 on 30 digital-specific queries

**Fix 6: Rerank Top-K with Cross-Encoder**

- **Action:**
  - Retrieve top-20 with the current method, then rerank with a cross-encoder model (e.g., `ms-marco-MiniLM-L-6-v2`); a minimal sketch follows this plan
  - Return the final top-5 to the model
- **Validation:**
  - Measure NDCG@5 improvement on a labeled test set (target: +10%)
  - Latency check: ensure p95 stays under 300ms

### Low-Impact / Monitoring Fixes

**Fix 7: Add Negative Feedback Loop**

- **Action:**
  - Log queries where the model says "context incomplete" or the user downvotes the answer
  - Quarterly review to identify systematic retrieval gaps (e.g., mobile app policies underrepresented)
- **Validation:**
  - Track reduction in "incomplete context" responses over 3 months (target: -30%)

---
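As referenced in Fix 6, here is a minimal reranking sketch. The cross-encoder checkpoint is the publicly available `cross-encoder/ms-marco-MiniLM-L-6-v2`; the first-stage `retrieve_top_k` callable is a hypothetical stand-in for whatever retriever is currently deployed.

```python
# Sketch of Fix 6: over-retrieve with the existing retriever, rerank with a
# cross-encoder, return the top 5. `retrieve_top_k` is a hypothetical stand-in.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, retrieve_top_k, k_first: int = 20, k_final: int = 5) -> list[str]:
    candidates = retrieve_top_k(query, k=k_first)              # first-stage top-20
    scores = reranker.predict([(query, c) for c in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda p: -p[1])
    return [chunk for chunk, _ in ranked[:k_final]]            # final top-5 for the LLM
```

Scoring all 20 pairs in a single `predict` call keeps the added cost to one batched cross-encoder forward pass, which is what the 300ms p95 budget above has to absorb.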
## Test Cases

### Test Suite: Digital Products & Specific Channels

| # | Test Question | Good Answer Indicators | Bad Answer Indicators |
|---|---------------|------------------------|-----------------------|
| 1 | "Are digital purchases refundable if I bought them on the mobile app?" | Mentions digital-specific policy, 14-day access window, mobile app acknowledged | Describes physical product returns or ignores "digital" |
| 2 | "Can I return an e-book after downloading it?" | States non-refundable after download unless technical failure | Says "ship it back" or refers to packaging |
| 3 | "What's the refund policy for in-app purchases?" | Distinguishes in-app purchases from other digital goods, cites platform rules (iOS/Android) if applicable | Generic refund policy without digital context |
| 4 | "I bought a digital course but can't access it—can I get a refund?" | Mentions technical access failure exception, support contact | Says no refunds without acknowledging access issue |
| 5 | "Do digital products have the same 30-day return window as physical items?" | Explicitly contrasts digital (non-refundable or shorter window) vs. physical (30 days) | Treats all products identically |
| 6 | "Can I get a refund for a subscription purchased on the website vs. mobile app?" | Notes channel differences if any (e.g., App Store vs. direct billing) | Ignores channel or conflates policies |
| 7 | "What items are excluded from your standard refund policy?" | Lists digital products, gift cards, personalized items, etc. | Omits digital products from exclusion list |
| 8 | "I accidentally bought a digital product twice—what's the refund process?" | Mentions duplicate purchase exception or support escalation | Blanket "no refunds" without nuance |
| 9 | "Are there any circumstances where digital products can be refunded?" | Lists specific exceptions (failed delivery, technical defect, accidental purchase within X hours) | Says "never" or describes unrelated policies |
| 10 | "How do I request a refund for a digital product I can't use?" | Provides support contact, describes evidence needed (screenshots, error logs), sets expectations | Tells user to mail something or visit a store |

### Success Criteria for Each Test

- **Precision:** Answer addresses the exact product type and channel mentioned
- **Completeness:** Covers edge cases (technical failure, accidental purchase, etc.)
- **Safety:** Explicitly states limitations if the policy is unclear ("Contact support for...")
- **Contrast:** Distinguishes digital from physical when both could apply

---

## Implementation Priority

**Week 1 (Quick Wins):**
- Fix 3: Update prompt (1 hour)
- Fix 1: Add metadata tags to policy docs (8 hours)

**Weeks 2-3 (Core Retrieval):**
- Fix 2: Adjust chunking for policy docs (4 hours + reindexing)
- Fix 4: Implement hybrid search (16 hours)

**Month 2 (Advanced):**
- Fix 6: Deploy reranker (20 hours)
- Fix 5: Query expansion logic (12 hours)

**Ongoing:**
- Fix 7: Negative feedback monitoring (2 hours/month)

---

## Expected Outcomes

- **Immediate (Fix 3):** Model stops hallucinating physical policies for digital questions
- **Short-term (Fixes 1-2):** Retrieval recall for digital product queries improves from ~20% to ~75%
- **Medium-term (Fixes 4-6):** End-to-end answer accuracy on product-specific questions increases by 35-40%
- **Long-term (Fix 7):** Systematic policy gaps identified and closed quarterly

**Key Metric to Track:** Percentage of product-type-specific queries where the top-1 retrieved chunk matches the queried product type (baseline: estimated 40% → target: 85%).
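To make the key metric concrete, a small harness might look like the sketch below. The `retrieve` callable, the chunk dict shape, and the labeled `eval_set` are all assumptions layered on the Fix 1 metadata tags, not part of the listing.

```python
# Sketch of the key metric: share of product-type-specific queries whose top-1
# retrieved chunk carries the queried product type. `retrieve` and `eval_set`
# are hypothetical; chunks are dicts tagged per Fix 1.

def product_type_match_rate(eval_set, retrieve) -> float:
    """eval_set: list of (query, expected_product_type) pairs."""
    hits = 0
    for query, expected in eval_set:
        top1 = retrieve(query, top_k=1)[0]                     # chunk dict with metadata
        hits += top1["metadata"].get("product_type") == expected
    return hits / len(eval_set)

# Example labeled pair; baseline estimate is ~0.40, target 0.85:
# eval_set = [("Can I return an e-book after downloading it?", "digital"), ...]
```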
🌀 Claude

RAG Failure Forensics

Instant access
Usage rights: Commercial use
Money-back guarantee
CLAUDE-4-5-SONNET · Tested · Guide · 4 examples · Free credits
Turn messy RAG failures into clear root causes and concrete fixes for your retrieval-augmented systems.
Added over 1 month ago