Prompt Details

Model
(claude-4-5-sonnet)
Token size
1,317
Example input
[The original user question that triggered the failure.]:
* **QUESTION:** “Does our plan include SSO for contractors?”

[Retrieved passages provided to the model. Include chunk text and any available metadata (source, chunk_id, timestamp, language, version).]:
* **CONTEXT:** Support articles + policy snippets

[The model’s response that is wrong, incomplete, or misleading.]:
* **ANSWER:** Confident “Yes” with no citation

[Notes about the RAG setup, such as chunk size / overlap, embedding model, top-k, filters or rerankers, update cadence. If unknown, write "unknown".]:
* **SYSTEM_NOTES:** chunk=500, top-k=3, no metadata filters
Example output
# RAG Failure Diagnostic Report

## FAILURE CLASSIFICATION

**FAILURE_TYPE:** `mixed_failure` (hallucination + missing_context)

**EXPLANATION:** The model produced a confident affirmative answer ("Yes") without citing any supporting evidence from the retrieved context. This suggests either (1) the necessary information about SSO for contractors was not retrieved, (2) it was retrieved but the model failed to recognize it or misinterpreted it, or (3) the model hallucinated based on general knowledge rather than grounding its response in the provided passages. The lack of citation is a critical red flag indicating the answer cannot be verified against the source material.

---

## ROOT CAUSE ANALYSIS

### AFFECTED_LAYERS
1. Retrieval
2. Ranking_or_Reranking
3. Chunking_and_Metadata
4. Prompt_and_Instructions
5. Model_Behavior

### LAYER_DETAILS

**Retrieval**
- **What likely failed:** The query "Does our plan include SSO for contractors?" contains multiple semantic concepts (plan scope, SSO feature, contractor user type). With top-k=3, the system may have retrieved generic SSO documentation or plan overview pages that don't specifically address the contractor+SSO intersection.
- **Why this produces the outcome:** If the retrieved chunks discuss SSO in general or plans in general but never mention contractors explicitly, the model has no grounding for the specific question asked.
- **Observable signal:** The answer lacks citations. If relevant context existed in the retrieved set, a well-prompted model would typically reference it.

**Ranking_or_Reranking**
- **What likely failed:** No reranker is mentioned in system notes. With only semantic similarity and top-k=3, chunks mentioning "SSO" or "plan" but not "contractors" may rank higher than a chunk that mentions all three concepts together.
- **Why this produces the outcome:** The most relevant passage (if it exists) may be ranked 4th or lower and excluded from context.
- **Observable signal:** Unknown without seeing retrieval scores, but a common pattern when relying solely on embedding similarity.

**Chunking_and_Metadata**
- **What likely failed:** 500-character chunks are quite small. A policy statement like "Enterprise plan includes SSO for employees; contractors require separate authentication" might be split across two chunks, with one mentioning SSO+plan and another mentioning contractors separately.
- **Why this produces the outcome:** Context fragmentation prevents the model from connecting related statements, leading to incomplete or incorrect synthesis.
- **Observable signal:** The question requires correlating three entities (plan, SSO, contractors). Small chunks increase the chance this correlation is broken.

**Prompt_and_Instructions**
- **What likely failed:** The system prompt likely lacks strict grounding instructions. It may not enforce citation requirements or penalties for unsupported claims.
- **Why this produces the outcome:** Without explicit "only answer if context supports it" or "cite chunk_id for claims" instructions, the model defaults to helpful but ungrounded responses.
- **Observable signal:** The confident "Yes" with no hedging or citation suggests the model was not constrained to stay within context boundaries.

**Model_Behavior**
- **What likely failed:** The model may have pattern-matched "SSO" questions to common knowledge (e.g., "enterprise plans usually include SSO") rather than verifying the specific retrieved content.
- **Why this produces the outcome:** LLMs have strong priors from pre-training. When context is ambiguous or missing, they fill gaps with plausible-sounding general knowledge.
- **Observable signal:** High confidence + no citation = hallucination indicator.
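A quick way to confirm the observable signals above is to check whether the retrieved set jointly covers the question's key concepts. A minimal sketch, where the chunk texts and keyword lists are illustrative placeholders rather than the actual retrieved passages:

```python
# Sketch: check whether the retrieved top-k chunks cover the query's key concepts.
# The chunk texts below are placeholders standing in for the real retrieved set.
retrieved_chunks = [
    {"chunk_id": "kb-012", "text": "Enterprise plans include SSO via SAML 2.0 ..."},
    {"chunk_id": "kb-087", "text": "Single sign-on setup guide for administrators ..."},
    {"chunk_id": "kb-134", "text": "Billing and plan comparison overview ..."},
]

# Concepts the question requires the context to cover together.
required_concepts = {
    "plan": ["plan", "tier", "subscription"],
    "sso": ["sso", "single sign-on", "saml"],
    "contractors": ["contractor", "external user", "third-party"],
}

for chunk in retrieved_chunks:
    text = chunk["text"].lower()
    covered = {
        concept for concept, keywords in required_concepts.items()
        if any(kw in text for kw in keywords)
    }
    print(f'{chunk["chunk_id"]}: covers {sorted(covered) or "nothing relevant"}')

# If neither a single chunk nor the combined set covers "contractors", the
# confident "Yes" could not have been grounded -- a retrieval/coverage failure.
```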
---

## IMPROVEMENT_SUGGESTIONS

### 1. Enforce Citation Requirements
- **Target layer:** Prompt_and_Instructions
- **Specific change:** Add system prompt instruction: "For every factual claim, cite the source chunk_id. If the context does not contain information to answer the question, respond with 'The provided documentation does not specify [X]. I cannot answer without additional context.'"
- **Expected effect:** Forces the model to ground answers in retrieved text or explicitly admit ignorance.
- **Validation:** Re-run the query; expect either a cited answer or an "information not found" response. Manual review confirms citations match chunk content.

### 2. Increase Top-K and Add Reranking
- **Target layer:** Ranking_or_Reranking
- **Specific change:** Increase top-k from 3 to 10, then apply a cross-encoder reranker (e.g., `ms-marco-MiniLM`) to select the final 3–5 chunks based on query-passage relevance.
- **Expected effect:** Captures more candidate passages initially, then reranks to surface chunks that mention multiple query concepts (plan + SSO + contractors).
- **Validation:** Log retrieval scores before/after reranking. Check if contractor-specific SSO docs move into top-3. A/B test answer quality on similar queries.

### 3. Increase Chunk Size with Overlap
- **Target layer:** Chunking_and_Metadata
- **Specific change:** Increase chunk size to 1000–1500 characters with 200-character overlap. This keeps related policy statements together.
- **Expected effect:** Reduces context fragmentation; increases the likelihood that a single chunk contains all three concepts (plan, SSO, contractors).
- **Validation:** Re-chunk the corpus and measure retrieval recall on a test set of multi-concept queries. Expect improved hit rate for complex questions.

### 4. Add Metadata Filters
- **Target layer:** Chunking_and_Metadata
- **Specific change:** Tag chunks with metadata (e.g., `user_type: [employees, contractors]`, `feature: [SSO, MFA]`, `plan_tier: [basic, enterprise]`). Enable hybrid search that filters by `user_type=contractors` when the query mentions contractors.
- **Expected effect:** Narrows retrieval to contractor-relevant passages, reducing noise from employee-only policies.
- **Validation:** Query "contractor SSO" and verify retrieved chunks are tagged with `user_type=contractors`. Measure precision improvement.

### 5. Implement Query Decomposition
- **Target layer:** Retrieval
- **Specific change:** Decompose complex queries into sub-queries: (a) "What plans include SSO?" (b) "Do contractors have access to SSO?" Then retrieve for each sub-query and merge results.
- **Expected effect:** Handles multi-faceted questions by retrieving targeted passages for each facet.
- **Validation:** Compare recall on composite queries (like the original) using single vs. decomposed retrieval. Expect higher recall with decomposition.

### 6. Add Uncertainty Detection
- **Target layer:** Model_Behavior
- **Specific change:** Fine-tune or prompt the model to output a confidence score or include hedging language ("Based on the retrieved context, it appears...") when evidence is weak.
- **Expected effect:** Reduces overconfident hallucinations; signals to users when answers are uncertain.
- **Validation:** Human raters score 50 answers for appropriate confidence calibration. Expect a reduction in confident-but-wrong responses.

### 7. Monitor for Data Staleness
- **Target layer:** Data_Freshness_and_Drift
- **Specific change:** Add timestamp metadata to chunks. If a query implies current state ("our plan"), prioritize recently updated documents. Set up alerts when documents older than 90 days are frequently retrieved.
- **Expected effect:** Ensures answers reflect current policies, not outdated information.
- **Validation:** Simulate a policy change (e.g., contractors gain SSO access). Verify the system retrieves and cites the updated doc within the expected refresh window.
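A minimal sketch of Suggestion 2 (over-retrieve, then rerank with a cross-encoder). It assumes the `sentence-transformers` library is available; `vector_search` is a hypothetical stub standing in for the pipeline's existing embedding retriever:

```python
from sentence_transformers import CrossEncoder

# Stand-in for the existing embedding retriever; in the real pipeline these
# candidates would come from the vector store.
def vector_search(query: str, top_k: int) -> list[dict]:
    corpus = [
        {"chunk_id": "kb-012", "text": "Enterprise plans include SSO via SAML 2.0."},
        {"chunk_id": "kb-087", "text": "Single sign-on setup guide for administrators."},
        {"chunk_id": "kb-201", "text": "Contractors authenticate with one-time passcodes; SSO covers employees only."},
    ]
    return corpus[:top_k]

def retrieve_and_rerank(query: str, k_initial: int = 10, k_final: int = 3) -> list[dict]:
    # 1. Over-retrieve with the existing embedding search (top-k 3 -> 10).
    candidates = vector_search(query, top_k=k_initial)

    # 2. Score each (query, passage) pair with a cross-encoder reranker.
    reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    scores = reranker.predict([(query, c["text"]) for c in candidates])

    # 3. Keep only the highest-scoring passages for the generation step.
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in ranked[:k_final]]

print(retrieve_and_rerank("Does our plan include SSO for contractors?"))
```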
---

## TEST_CASES

### Test Case 1
- **Question:** "Can contractors use single sign-on with our enterprise plan?"
- **Good answer:** Cites a specific policy doc stating whether contractors are included in SSO coverage, or explicitly states no information is available.
- **Bad answer:** Generic "Yes, SSO is included in enterprise plans" without contractor-specific confirmation.

### Test Case 2
- **Question:** "What authentication methods are available for external users?"
- **Good answer:** Lists methods (SSO, MFA, password) with user-type distinctions (employees vs. contractors vs. vendors).
- **Bad answer:** Lists methods without clarifying which user types can access each.

### Test Case 3
- **Question:** "Is SSO limited to full-time employees?"
- **Good answer:** Directly addresses whether part-time, contractors, or other user types have SSO access, with citations.
- **Bad answer:** Focuses only on full-time employees, ignores other user types.

### Test Case 4
- **Question:** "Does the basic plan support SSO for any user type?"
- **Good answer:** Specifies plan tier requirements for SSO and which user types are eligible at each tier.
- **Bad answer:** Says "SSO requires enterprise plan" without mentioning user-type restrictions.

### Test Case 5
- **Question:** "Are there different SSO policies for employees vs. contractors?"
- **Good answer:** Compares policies explicitly, or states that policies are identical/differ, with evidence.
- **Bad answer:** Discusses only one user type, leaving the comparison unclear.

### Test Case 6
- **Question:** "What features are included in the contractor access package?"
- **Good answer:** Lists features (including or excluding SSO) specific to contractors, with source references.
- **Bad answer:** Lists general plan features without contractor-specific context.

### Test Case 7
- **Question:** "Do we need to upgrade our plan to enable contractor SSO?"
- **Good answer:** Explains the current plan's contractor SSO status and upgrade requirements if applicable.
- **Bad answer:** Generic upgrade sales pitch without addressing the current state.

### Test Case 8
- **Question:** "Is multi-factor authentication required for contractors if SSO is not available?"
- **Good answer:** Addresses fallback authentication requirements for contractors, with policy citations.
- **Bad answer:** Discusses MFA in general terms without contractor-specific guidance.

### Test Case 9
- **Question:** "Can guest users access SSO, or is it restricted to employees?"
- **Good answer:** Clarifies SSO eligibility for guests, contractors, and employees separately.
- **Bad answer:** Conflates user types or ignores guests entirely.

### Test Case 10
- **Question:** "When was the contractor authentication policy last updated?"
- **Good answer:** Provides a timestamp or version from metadata, or states the information is unavailable.
- **Bad answer:** Ignores recency, potentially citing outdated policy.
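The test cases above can be wired into a small regression harness. A minimal sketch, assuming a hypothetical `answer_question` entry point for the pipeline and an assumed `[chunk-id]` citation format:

```python
import re

# `answer_question` is a placeholder for the real RAG pipeline call; the citation
# pattern and refusal phrases are assumptions about the target prompt's format.
def answer_question(question: str) -> str:
    return "The provided documentation does not specify SSO access for contractors."

TEST_QUESTIONS = [
    "Can contractors use single sign-on with our enterprise plan?",
    "Is SSO limited to full-time employees?",
    "Does the basic plan support SSO for any user type?",
]

CITATION = re.compile(r"\[(chunk|doc)[-_ ]?\w+\]", re.IGNORECASE)
REFUSAL = re.compile(r"does not specify|cannot answer|no information", re.IGNORECASE)

for question in TEST_QUESTIONS:
    answer = answer_question(question)
    # An answer counts as grounded if it cites a chunk or explicitly declines.
    grounded = bool(CITATION.search(answer) or REFUSAL.search(answer))
    print(f'{"PASS" if grounded else "FAIL (uncited claim)"}: {question}')
```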
---

## ASSUMPTIONS_AND_GAPS

### Missing System Details

1. **Embedding model:** Unknown. Semantic similarity performance depends heavily on model quality. Older or domain-generic models may not capture nuanced distinctions like "contractor" vs. "employee."
   - **Impact on confidence:** Cannot assess if poor retrieval is due to weak embeddings vs. other factors.
2. **Overlap in chunking:** Unknown. If chunks have no overlap, context fragmentation is more severe.
   - **Impact on confidence:** Limits the ability to pinpoint whether chunking strategy is the primary issue.
3. **Actual retrieved context content:** Not provided. Cannot verify if the right information exists in the corpus or was simply not retrieved.
   - **Impact on confidence:** High. All analysis assumes the information exists somewhere in the corpus. If it doesn't, this is a data coverage problem, not a retrieval problem.
4. **System prompt details:** Unknown. Cannot assess current grounding instructions or citation requirements.
   - **Impact on confidence:** Moderate. Generic prompts are common, but without seeing the actual prompt, citation enforcement recommendations are speculative.
5. **Reranking or filtering:** Confirmed absent, but hybrid search or metadata filtering capabilities are unknown.
   - **Impact on confidence:** Moderate. Affects feasibility of metadata-based improvements.
6. **Update cadence of knowledge base:** Unknown. The corpus may be stale, but the frequency of updates is unclear.
   - **Impact on confidence:** Low for the immediate issue, high for long-term reliability.
7. **Query preprocessing:** Unknown if queries are expanded, reformulated, or decomposed.
   - **Impact on confidence:** Moderate. Query engineering could address multi-concept retrieval challenges.
8. **Evaluation metrics:** Unknown if the system tracks precision@k, recall, or hallucination rates.
   - **Impact on confidence:** Moderate. Without baselines, improvement validation is harder to benchmark.

### Inherent Uncertainties

- **User intent:** "Our plan" is ambiguous (which plan? current subscription? proposed plan?). Disambiguation logic is unknown.
- **Corpus quality:** Even with perfect retrieval, if the documentation never addresses contractor SSO, the system cannot answer correctly.
- **Answer history:** Unknown if this is a one-time failure or a recurring pattern. Pattern analysis would strengthen the diagnosis.

---

**End of Report**
🌀 Claude

Atomic RAG Failure Analyzer

Instant access
Usage rights: Commercial use
Money-back guarantee
CLAUDE-4-5-SONNET
Tested · Guide included · 4 examples · Free credits
Analyze RAG failure cases using a structured diagnostic framework. This prompt classifies common retrieval failure types, maps symptoms to likely RAG pipeline layers, and generates practical improvement suggestions and validation tests based only on the provided inputs.

**Best for:** production RAG, agents, internal tools, enterprise QA
**Not for:** toy demos, one-shot prompt experiments
Added over 1 month ago