Prompt Details
Model
(claude-4-6-sonnet)
Token size
2,706
Example input
[AGENT_DATA]: Query: "What's the return policy for international orders?" → Retrieved chunks: Chunk 1: domestic return policy, 30 days; Chunk 2: shipping rates table → Response: "International orders can be returned within 30 days for a full refund including shipping costs." (shipping cost refund not in any chunk)
[AGENT_NAME]: E-commerce Support RAG Agent
[INSPECTION_FOCUS]: RAG Fidelity
[TEAM_CONTEXT]: Claude 3.5 Sonnet, custom retrieval pipeline, 2-person team, customer complaints about wrong refund info started 3 days ago
Example output
```
═══════════════════════════════════════════
AI AGENT INSPECTOR™ — DIAGNOSTIC REPORT
═══════════════════════════════════════════
INSPECTION SCOPE
- Modules activated: Module 2 (RAG Fidelity Auditor) + Module 5 (Eval Test Cases)
- Data analyzed: RAG interaction — query, 2 retrieved chunks, generated response
- Agent: E-commerce Support RAG Agent
- Date: March 13, 2026
═══════════════════════════════════════════
MODULE 2: RAG FIDELITY AUDIT
═══════════════════════════════════════════
5-DIMENSION SCORES
──────────────────
D1 — RETRIEVAL RELEVANCE: 3 / 10
D2 — CONTEXT SUFFICIENCY: 1 / 10
D3 — ANSWER FAITHFULNESS: 2 / 10
D4 — ANSWER COMPLETENESS: 3 / 10
D5 — ANSWER COHERENCE: 7 / 10
SCORE RATIONALE
────────────────
D1 (3): Neither chunk answers the query. Chunk 1 covers domestic
returns only — topical overlap with returns earns the few points.
Chunk 2 is a shipping rates table — noise, not signal. Zero
international return policy content was retrieved.
D2 (1): The context contains no information sufficient to answer the
query. The pipeline sent the model into generation with an empty
knowledge base for this topic. Minimum possible useful score.
D3 (2): Two claims in the response — neither is grounded in context.
D4 (3): The response appears complete in form, but is substantively
empty — and actively wrong. Structural coherence does not equal
factual completeness.
D5 (7): The sentence is well-formed and reads confidently. This is
the danger: a coherent hallucination is more harmful than an
obviously broken one.
──────────────────────────────────────────
FAITHFULNESS TRACE
──────────────────────────────────────────
CLAIM 1: "International orders can be returned within 30 days"
Source check → Chunk 1 mentions 30-day window, but DOMESTIC only
Verdict: ✗ MISREPRESENTATION — domestic policy applied to
international scope without basis
CLAIM 2: "for a full refund including shipping costs"
Source check → No chunk contains shipping cost refund language
Verdict: ✗ HALLUCINATION — fabricated detail with direct financial
consequence; customers may expect shipping reimbursement
your policy does not provide
──────────────────────────────────────────
PIPELINE STAGE DIAGNOSIS
──────────────────────────────────────────
ROOT CAUSE STAGE: RETRIEVAL
The generation model performed as expected — it synthesized from
available context. The problem is it had no valid context to work
from, so it confabulated plausibly.
Stage 1 — RETRIEVAL: ❌ FAILED
The query "international orders return policy" returned zero
international-specific chunks. Either:
(a) The knowledge base has no international return policy document
(b) The embedding similarity search is failing to match
"international returns" to the correct document
(c) The document exists but is not indexed
Stage 2 — CONTEXT ASSEMBLY: ⚠️ NO GUARDRAIL
No relevance threshold or fallback is in place. The pipeline
passed low-relevance domestic/shipping chunks to the model
rather than triggering a "no relevant context found" path.
This is the enabling failure.
Stage 3 — GENERATION: ⚠️ DOWNSTREAM FAILURE
Given insufficient context, Claude 3.5 Sonnet filled the gap with
a plausible-sounding answer. No grounding instruction told it to
refuse or flag missing information.
──────────────────────────────────────────
FINDINGS SUMMARY
──────────────────────────────────────────
FINDING RAG-01
Code: H1 (Factual Hallucination) + H3 (Reasoning Hallucination)
Severity: CRITICAL
Claim: "including shipping costs" — no source, invented detail
Impact: Customers filing refund claims for shipping costs your
policy may not cover → financial disputes, support escalations
Root cause: No retrieval fallback; generation without grounding
instruction
FINDING RAG-02
Code: M3 (Inaccurate Response) + S4 (Stale/Wrong Data Served)
Severity: HIGH
Claim: 30-day window applied to international orders
Impact: May be correct by accident, or may be wrong — you cannot
know from this trace. Either way, the agent is providing
policy information it has no authority to provide.
Root cause: Domestic policy chunk used as proxy for
international policy — retrieval mismatch
═══════════════════════════════════════════
MODULE 5: EVAL TEST CASES
═══════════════════════════════════════════
TEST-01
Name: international_return_window_grounded_response
Category: output_quality
Priority: P0
Input: "What is the return policy for international orders?"
Context: Knowledge base contains only domestic return policy docs
Expected behavior: Agent responds "I don't have information on
international return policies — please contact support" or
equivalent refusal/escalation
Failure indicator: Any specific timeframe or refund detail stated
without a sourced international policy chunk
Assertion type: not_contains("days") unless chunk explicitly
covers international scope
TEST-02
Name: shipping_cost_refund_claim_not_hallucinated
Category: output_quality
Priority: P0
Input: "Will I get my shipping costs refunded on an international return?"
Context: No chunk addresses shipping cost refunds
Expected behavior: Agent explicitly states it cannot confirm
shipping refund policy and directs to human support
Failure indicator: Any affirmative claim about shipping refunds
Assertion type: not_contains("shipping") + not_contains("refund")
in same sentence unless sourced
TEST-03
Name: low_relevance_chunks_trigger_fallback
Category: operational
Priority: P0
Input: Any policy query where top retrieved chunk relevance
score is below threshold (suggest: cosine sim < 0.75)
Context: Retrieval pipeline configured with score logging
Expected behavior: Pipeline returns "no sufficient context"
response template rather than passing weak chunks to model
Failure indicator: Model generates specific policy claims from
chunks with low relevance scores
Assertion type: score_above(relevance_threshold) before
generation is permitted
TEST-04
Name: domestic_policy_not_applied_to_international_query
Category: output_quality
Priority: P1
Input: "Do international return rules differ from domestic ones?"
Context: Only domestic return policy chunk available
Expected behavior: Agent distinguishes the two explicitly;
does not assert domestic rules apply internationally
Failure indicator: Response treats domestic policy as universal
Assertion type: not_contains("same") + not_contains("also apply")
unless an international-specific source is cited
TEST-05
Name: positive_grounded_domestic_return_query
Category: output_quality
Priority: P1
Input: "What is the return policy for US orders?"
Context: Chunk 1 (domestic, 30-day policy) retrieved
Expected behavior: Agent states 30-day window, attributes to
policy; does not add unverified details
Failure indicator: Any claim not traceable to Chunk 1
Assertion type: contains("30 days") + matches_pattern(no
additional financial claims beyond chunk content)
═══════════════════════════════════════════
EXECUTIVE SUMMARY
═══════════════════════════════════════════
Total findings: 2
Critical: 1 | High: 1 | Medium: 0 | Low: 0
Estimated cost impact: Not quantifiable from a single trace, but
operationally significant — customer disputes over fabricated
shipping refunds are actively occurring (complaints started 3 days
ago; this trace explains why)
Top 3 priority fixes:
1. Add retrieval relevance threshold — if no chunk scores above
similarity cutoff, route to fallback response template instead
of generation. Addresses RAG-01 and RAG-02 at the pipeline level.
2. Add grounding instruction to system prompt — explicit directive:
"If the retrieved context does not contain the answer, say so.
Do not infer, extrapolate, or fill gaps from general knowledge."
One-line fix, immediate impact.
3. Audit knowledge base for international return policy document —
confirm it exists and is indexed. If missing, add it before
re-enabling the query path.
Eval test cases generated: 5
Overall agent health: 🔴
NOTE FOR YOUR TEAM: The 3-day complaint window aligns precisely
with this failure mode. The fix sequence above is achievable by
one engineer in a single session — start with the system prompt
grounding instruction (fix 2) as it can ship in minutes, then
implement the relevance threshold (fix 1) as the structural solution.
═══════════════════════════════════════════
```
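The top-priority fix in the example report — a retrieval relevance gate with a fallback path — can be sketched in a few lines. This is an illustrative sketch only, not part of the product: `cosine_sim`, `gate_context`, the 0.75 cutoff (the value suggested in TEST-03), and the fallback message are all assumptions standing in for whatever a real pipeline uses.

```python
# Hypothetical sketch of a relevance-threshold fallback for a RAG pipeline.
# All names and values here are illustrative assumptions, not a real API.
import math

FALLBACK = "I don't have information on that topic — please contact support."
THRESHOLD = 0.75  # suggested cosine-similarity cutoff (assumption from TEST-03)

def cosine_sim(a, b):
    """Cosine similarity between two plain-list embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def gate_context(query_vec, chunks):
    """Keep only chunks above the threshold; an empty result triggers fallback."""
    scored = [(cosine_sim(query_vec, vec), text) for vec, text in chunks]
    return [text for score, text in scored if score >= THRESHOLD]

def answer(query_vec, chunks, generate):
    """Route to the fallback template instead of generating from weak chunks."""
    relevant = gate_context(query_vec, chunks)
    if not relevant:
        return FALLBACK  # "no sufficient context" path — generation never runs
    return generate(relevant)
```

The design point is the early return: generation is never reached with below-threshold context, which is the structural guardrail the report says is missing at the context-assembly stage.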
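The assertion types named in the eval test cases (`not_contains`, `contains`, `score_above`) could be implemented as simple helpers. The function names mirror the report's notation, but these implementations are hypothetical — they are not the API of any specific eval framework.

```python
# Illustrative implementations of the assertion types used in TEST-01..05.
# Names follow the report's notation; the code itself is an assumption.

def contains(response: str, phrase: str) -> bool:
    """Passes when the phrase appears in the response (case-insensitive)."""
    return phrase.lower() in response.lower()

def not_contains(response: str, phrase: str) -> bool:
    """Passes when the phrase is absent — flags ungrounded specifics."""
    return not contains(response, phrase)

def score_above(score: float, threshold: float) -> bool:
    """Passes when retrieval relevance clears the cutoff (TEST-03 style)."""
    return score > threshold

# TEST-01 style check: a refusal should not assert a specific timeframe.
refusal = "I don't have information on international return policies."
hallucinated = "International orders can be returned within 30 days."
```

Here `not_contains(refusal, "days")` passes while `not_contains(hallucinated, "days")` fails, which is exactly the failure indicator TEST-01 describes.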
Your agents ship bugs you cannot see. The Agent Inspector is a complete system prompt that turns Claude into a dedicated QA agent — feed it any agent's outputs and it runs a full observability analysis: anomaly classification, faithfulness checks, security scanning, cost analysis, and eval test case generation. All in one pass.
✅ Runs 5-layer analysis on any agent output in a single interaction
✅ Produces structured reports matching Langfuse, RAGAS, and Braintrust formats
✅ Generates regression eval test cases
