Prompt Details
Model
(claude-4-7-opus)
Token size
4,118
Example input
[RAG_SYSTEM_DESCRIPTION]: Dense-only retrieval using text-embedding-3-large, Pinecone index, fixed 512-token chunks with 10% overlap. No reranker. Query passed directly to retrieval without transformation.
[FAILURE_SYMPTOMS]: Misses queries that use informal language to describe formally-worded policy content. Works well on exact policy titles.
[CORPUS_CHARACTERISTICS]: 85,000 government regulatory documents, primarily legal/policy text, English only. Medium corpus.
[DEPLOYMENT_CONTEXT]: State government court platform / HIGH
Example output
AUDIT HALTED — {RAG_SYSTEM_DESCRIPTION} is missing. Please provide:
---
Wait — re-reading the inputs: the user has supplied all four fields as bullet points outside the XML tags. The pre-flight check should evaluate the *content* provided, not the tag structure. All four fields are populated with substantive content. Proceeding.
---
# RAG RETRIEVAL STRATEGY AUDIT™ v1.2
**System under review:** State government court platform RAG pipeline
**Audit date:** 2026-05-17
---
## SECTION 1 — SYSTEM PROFILE SUMMARY
```
RETRIEVAL METHOD: Dense only
EMBEDDING MODEL: text-embedding-3-large (named)
INDEX TYPE: Pinecone (named)
RERANKER: Absent
CHUNKING STRATEGY: Fixed 512-token, 10% overlap (described)
QUERY HANDLING: Passthrough (no transformation)
CORPUS SIZE: Medium (85,000 documents)
CORPUS TYPE: Legal / policy (government regulatory)
DEPLOYMENT STAKES: HIGH (state government court platform)
```
**Unspecified fields flagged for Section 7:**
- Chunk count vs. document count is unstated (85K documents may yield significantly more chunks)
- No metadata filtering or hybrid flag confirmed
- No latency SLA or budget constraint stated
**Contradiction check:**
CONTRADICTION FLAG: The system is described as working well on exact policy titles, yet uses dense-only retrieval — dense embeddings typically *struggle* with exact-term matching (FM-02). Strong exact-title performance may indicate title text is lexically common enough to survive embedding, or the test set for "works well" is narrow. Meanwhile, the reported failure on informal-to-formal vocabulary mapping is consistent with dense-only behavior, but the co-reported exact-title success creates an asymmetric picture.
RESOLUTION: Conservative interpretation applied. Failure symptoms (informal query misses) are treated as genuine and representative. The exact-title success is treated as a narrow favorable condition, not evidence of general retrieval health. Proceeding with symptom evidence prioritized.
---
## SECTION 2 — FAILURE MODE DIAGNOSIS
**FM-01 SEMANTIC MISS**
*Dense retrieval fails when user query and document use different vocabulary for the same concept.*
Diagnosis: **PRESENT**
Evidence: Reported failure is precisely vocabulary mismatch — informal language queries fail to retrieve formally-worded policy content. This is the canonical FM-01 pattern in dense-only systems without query rewriting.
---
**FM-02 EXACT-TERM MISS**
*Vector search fails on product names, error codes, regulatory clause references, numerical values, identifiers, etc.*
Diagnosis: **PRESENT**
Evidence: Corpus is 85,000 government regulatory documents — legal/policy text is dense with clause references (e.g., "§ 47.3(b)"), statute numbers, case identifiers, and date-specific provisions. Dense-only retrieval with no sparse layer provides no structural guarantee on exact-term retrieval for these identifiers. The reported success on exact policy *titles* does not cover clause-level or identifier-level exact matching.
---
**FM-03 PRECISION FAILURE**
*Correct content is retrieved somewhere in top-k but ranked below lower-relevance chunks.*
Diagnosis: **INSUFFICIENT EVIDENCE**
Evidence: No reranker is present, which structurally increases FM-03 risk in medium-to-large corpora. However, no symptom describing top-k ordering problems (e.g., "right answer appears but LLM misses it") was reported. Cannot confirm or rule out without top-k inspection data.
---
**FM-04 QUERY-INDEX MISMATCH**
*User queries are vague, compound, or underspecified relative to how content is indexed.*
Diagnosis: **PRESENT**
Evidence: Passthrough query handling means informal, colloquial, or layperson queries hit the index without rewriting. Government court platform users (litigants, clerks, self-represented parties) are likely to query in everyday language against an index shaped entirely by formal legal register. Query and index exist in different linguistic registers by design.
---
**FM-05 CONTEXT ASSEMBLY FAILURE**
*Correct chunks retrieved but chunking boundaries or assembly causes the LLM to receive fragmented context.*
Diagnosis: **PRESENT**
Evidence: Fixed 512-token chunking with 10% overlap is a known boundary-risk configuration for legal/policy text, where a single regulatory provision, definition, or clause may span more than 512 tokens or straddle chunk boundaries; 10% overlap (≈51 tokens) is frequently insufficient to preserve cross-boundary semantic continuity in dense statutory language.
---
## SECTION 3 — RETRIEVAL STRATEGY COVERAGE MATRIX
```
FAILURE MODE | CURRENT STRATEGY ADDRESSES? | COVERAGE RATING
─────────────────────────────────────────────────────────────────
FM-01 Semantic | PARTIAL | GAP
FM-02 Exact | NO | BLIND SPOT
FM-03 Precision | NO | BLIND SPOT
FM-04 Query | NO | BLIND SPOT
FM-05 Assembly | NO | BLIND SPOT
─────────────────────────────────────────────────────────────────
BLIND SPOTS: 4 failure modes with no current coverage
GAPS: 1 failure mode with partial coverage
```
**Coverage notes:**
- FM-01 rated PARTIAL rather than NO because text-embedding-3-large has meaningful semantic generalization capability; it addresses some vocabulary variation but demonstrably not the informal-to-formal gap reported here.
- FM-03 rated BLIND SPOT: absence of a reranker means no mechanism exists to correct ranking errors after retrieval.
**Primary diagnostic finding:** This system has a single retrieval mechanism (dense passthrough) that partially addresses one of four confirmed failure modes, with no coverage for exact-term retrieval, query register mismatch, ranking correction, or chunk boundary fragmentation — a significant coverage deficit for a HIGH-stakes legal platform.
---
## SECTION 4 — STRATEGY RECOMMENDATIONS
### PATH A — MINIMAL INTERVENTION
**Target FMs:** FM-01, FM-04
**Changes:**
1. **Add HyDE (Hypothetical Document Embeddings) query transformation** — Before retrieval, prompt a small LLM (GPT-4o-mini or equivalent) to generate a hypothetical policy document excerpt matching the user query. Embed the hypothesis, not the raw query. This bridges the informal-to-formal register gap by generating formal-register text from informal input before the vector lookup. Implementable as a pre-retrieval step without touching Pinecone or the embedding model.
2. **Add query rewriting via a lightweight prompt template** — Classify incoming queries as informal/formal and apply a rewrite prompt that casts informal queries into legal register before embedding. Simpler than HyDE; lower latency; implement both and A/B test.
**FM Coverage:** 2 of 5 failure modes addressed after this path
**Complexity:** LOW
**Tradeoff:** FM-02 (exact-term miss on clause references and identifiers), FM-03 (precision ranking), and FM-05 (chunk boundary fragmentation) remain entirely unaddressed. Query rewriting adds a latency step and introduces a new failure point if the rewriter generates poor hypotheses.
---
### PATH B — BALANCED UPGRADE *(default recommendation)*
**Target FMs:** FM-01, FM-02, FM-03, FM-04
**Changes:**
1. **Introduce BM25 sparse retrieval alongside Pinecone dense retrieval (hybrid)** — Deploy BM25 via Elasticsearch or OpenSearch over the same corpus. Use Reciprocal Rank Fusion (RRF) or a weighted linear combination to merge dense and sparse result sets. This directly addresses FM-02 by providing lexical exact-match coverage for statute numbers, clause references, and identifiers that dense embeddings miss.
2. **Add query rewriting / HyDE for informal query handling** — As described in Path A. Addresses FM-01 and FM-04 by normalizing query register before the retrieval step.
3. **Add a cross-encoder reranker** (e.g., Cohere Rerank, BGE-reranker-large, or a fine-tuned cross-encoder on legal text) **over the merged top-k result set** — Reranking operates on the combined hybrid results to address FM-03, surfacing the highest-relevance chunks before context assembly.
**FM Coverage:** 4 of 5 failure modes addressed after this path
**Complexity:** MEDIUM
**Tradeoff:** FM-05 (chunk assembly fragmentation) remains unaddressed. Adding BM25 infrastructure requires an additional service (Elasticsearch/OpenSearch) alongside Pinecone, increasing operational surface area. Reranker adds latency (typically 100–400ms per query depending on top-k size and model); latency budget must be confirmed against any SLA.
---
### PATH C — FULL PRODUCTION HARDENING
**Target FMs:** FM-01, FM-02, FM-03, FM-04, FM-05
**Changes:**
1. **Hybrid retrieval (BM25 + dense) with RRF** — as described in Path B.
2. **Query rewriting / HyDE** — as described in Path B.
3. **Cross-encoder reranker over hybrid top-k** — as described in Path B.
4. **Re-chunk corpus using structure-aware segmentation** — Replace fixed 512-token chunking with a legal-document-aware chunker that respects section, subsection, and paragraph boundaries (e.g., using document structure markers, heading detection, or a library such as `unstructured` with legal document processing). Increase overlap to 15–20% *or* implement parent-child chunking (small chunks for retrieval, parent chunks for context delivery) to preserve provision-level coherence across chunk boundaries. This directly addresses FM-05.
5. **Add metadata enrichment at index time** — Extract and store statute number, document section, regulatory domain, and effective date as Pinecone metadata fields. Enable metadata pre-filtering to scope retrieval before vector search, reducing noise in high-density regulatory corpora.
**FM Coverage:** 5 of 5 failure modes addressed after this path
**Complexity:** HIGH
**Tradeoff:** Re-chunking requires a full re-index of 85,000 documents and revalidation of the retrieval pipeline end-to-end. Structure-aware chunking requires a preprocessing pipeline that handles the document format variation common in government document repositories (PDFs, HTML, DOCX). Operational cost increases with BM25 infrastructure plus reranker inference. Total implementation timeline estimated at 6–12 engineering weeks depending on team capacity.
---
**Recommended path for this system: PATH B**
Reason: Four of five confirmed failure modes fall within Path B's coverage, the remaining gap (FM-05) is lower-severity than the reporting symptoms warrant, and Path B avoids the full re-index risk of Path C while delivering hybrid retrieval and ranking correction appropriate for a HIGH-stakes court platform.
**Constraint-conflict check:**
No explicit latency SLA or budget constraint was stated in the deployment context. However:
CONSTRAINT NOTE (not a blocker): Reranker latency and dual-index infrastructure (Pinecone + Elasticsearch) add operational cost and query latency. If a latency SLA is subsequently confirmed at under 500ms end-to-end, the reranker model selection in Path B must be scoped accordingly (lighter cross-encoders or Cohere Rerank API with caching). This should be confirmed before Path B implementation begins.
---
## SECTION 5 — STAKES-CALIBRATED DEPLOYMENT GATE
**Stakes level: HIGH — State government court platform**
### PRE-DEPLOYMENT REQUIREMENTS
**GROUNDING VALIDATION** *(baselines — calibrate to legal domain before finalizing)*
```
□ Faithfulness ≥ 0.85 on domain test set (RAGAS)
□ Answer relevancy ≥ 0.80
□ Context precision ≥ 0.75
□ Context recall ≥ 0.70
□ Factual correctness ≥ 0.82
```
Domain calibration note: These are starting baselines. A state court platform delivering regulatory guidance to litigants or clerks may require faithfulness ≥ 0.90 given the legal consequence of incorrect citation or missed provision retrieval. Thresholds should be finalized in consultation with the accountable system owner and any applicable court IT governance standards before the evaluation suite is locked.
**DETERMINISTIC TESTING**
```
□ Sandboxed test suite covering FM-01 through FM-05 triggers
□ FM-01 test set: paired informal/formal query variants for
the same policy content (minimum 30 pairs)
□ FM-02 test set: queries containing statute numbers, clause
references (e.g., "§ 47.3(b)"), case IDs, effective dates
□ FM-04 test set: vague, layperson-phrased queries expected
from self-represented litigants
□ Adversarial queries: out-of-distribution inputs, queries
for policies not in corpus (should return appropriate
no-answer response, not hallucination)
□ Regression suite: minimum 50 labeled query-answer pairs
with ground-truth retrieval targets confirmed by a subject
matter expert (legal domain reviewer required)
```
**DEPLOYMENT BLOCKERS (current system, pre-Path B)**
The following confirmed failure modes are unaddressed by the current system and constitute deployment blockers for a HIGH-stakes context:
→ **FM-02 (Exact-Term Miss):** No sparse retrieval layer. Clause references and statute identifiers are not reliably retrievable.
→ **FM-04 (Query-Index Mismatch):** No query transformation. Informal queries from court users hit the index without normalization.
→ **FM-05 (Context Assembly Failure):** Fixed chunking does not respect legal document structure.
After Path B implementation, FM-05 remains a residual risk (not a blocker, but a known gap). Document in the system risk register and confirm with the accountable system owner before deployment. FM-05 escalates to a blocker if evaluation reveals provision fragmentation causing factual errors in generated answers.
**Do NOT deploy the current system as-described into the court platform production environment without addressing FM-02 and FM-04 at minimum.**
---
## SECTION 6 — CONFIDENCE REGISTER
```
CONFIDENCE RATINGS
─────────────────────────────────────────────────────────────────
FM diagnosis accuracy: MEDIUM
Basis: FM-01 and FM-04 are well-supported by the reported
symptom (informal query misses on formal content) and the
confirmed passthrough query handling. FM-02 is inferred from
corpus type (regulatory/legal) and absence of sparse retrieval,
not a reported symptom. FM-03 is structural inference from
absent reranker — no symptom confirms it. FM-05 is inferred
from chunking configuration, not confirmed by a reported symptom.
Path A recommendation: HIGH
Path B recommendation: HIGH
Path C recommendation: MEDIUM
Basis for Path C MEDIUM: Re-chunking quality depends heavily
on the actual document format distribution across 85,000
documents, which was not described. Structure-aware chunking
may yield inconsistent results if document formatting is
irregular across the corpus.
```
**REFUSED INFERENCES**
RI-01: Cannot determine whether FM-03 (precision failure) is actively degrading answer quality — no symptom describes top-k results containing the correct answer in a low-ranked position. Top-k result inspection would be required to confirm or rule out.
RI-02: Cannot evaluate actual chunk boundary quality without sample chunk output or knowledge of the document format distribution across the 85,000-document corpus. The FM-05 diagnosis is based on the known risk profile of fixed 512-token chunking applied to legal text, not observed fragmentation.
RI-03: Cannot assess whether the exact-title retrieval success reported generalizes to section-level or clause-level retrieval. The favorable result may reflect a narrow favorable condition (title text is lexically common) rather than general dense retrieval health in this corpus.
RI-04: Cannot confirm whether the Pinecone index is configured with metadata filtering that could partially address FM-02 for structured identifiers — no metadata strategy was described. If metadata filtering is already in use, FM-02 severity may be lower than assessed.
---
## SECTION 7 — GAP DISCLOSURE
**GAP-01: Chunk count vs. document count unspecified**
Audit impact: FM-05 assessment cannot confirm whether 85,000 documents produce a chunk count that creates meaningful top-k saturation or retrieval noise at medium corpus scale.
To resolve: Provide total chunk count after 512-token segmentation.
**GAP-02: Metadata filtering strategy unspecified**
Audit impact: FM-02 diagnosis may be overestimated if Pinecone metadata filtering is already applied to scope retrieval by document type, statute, or date. Conversely, absence of metadata filtering is a missed coverage opportunity for exact-term scoping.
To resolve: Confirm whether Pinecone metadata fields are populated and whether pre-filtering is applied at query time.
**GAP-03: Latency SLA unspecified**
Audit impact: Path B recommendation (hybrid + reranker) adds query latency. If a latency SLA exists, the reranker model selection and top-k size must be scoped accordingly. No constraint-conflict could be confirmed or ruled out.
To resolve: State end-to-end latency budget (p95 target in milliseconds).
**GAP-04: Document format distribution unspecified**
Audit impact: FM-05 and Path C (structure-aware chunking) recommendation confidence is reduced. Legal document structure chunking quality depends heavily on whether documents are consistently formatted PDFs, HTML, DOCX, or mixed formats.
To resolve: Describe the primary document formats in the 85,000-document corpus and any known formatting inconsistency.
**GAP-05: Top-k value unspecified**
Audit impact: FM-03 (precision failure) risk scales with top-k size. A larger top-k increases the probability that correct content is present but ranked low. Cannot assess precision failure severity without knowing how many results are passed to the LLM.
To resolve: State the top-k value used in retrieval and how many chunks are assembled into the final prompt context.
---
OPEN-WORLD NOTICE: This audit is bounded by the information provided. Absence of evidence for a failure mode is not evidence of absence. Fields marked INSUFFICIENT EVIDENCE require additional system description before those failure modes can be ruled out. In particular, FM-03 (precision failure) and the full severity of FM-05 (context assembly) cannot be assessed without top-k inspection data and sample chunk output respectively.
By purchasing this prompt, you agree to our terms of service
CLAUDE-4-7-OPUS
This prompt audits a described RAG system against a structured failure-mode taxonomy, diagnoses which retrieval gaps are present, and produces strategy recommendations calibrated to the actual failure evidence — not generic best practices.
...more
Added 1 week ago
