PromptBase

Prompt Details

Model
(claude-4-6-sonnet)
Token size
2,706
Example input
[AGENT_DATA]:
Query: "What's the return policy for international orders?"
→ Retrieved chunks:
  Chunk 1: domestic return policy, 30 days
  Chunk 2: shipping rates table
→ Response: "International orders can be returned within 30 days for a full refund including shipping costs." (shipping cost refund not in any chunk)
[AGENT_NAME]: E-commerce Support RAG Agent
[INSPECTION_FOCUS]: RAG Fidelity
[TEAM_CONTEXT]: Claude 3.5 Sonnet, custom retrieval pipeline, 2-person team, customer complaints about wrong refund info started 3 days ago
Example output
```
═══════════════════════════════════════════
AI AGENT INSPECTOR™ — DIAGNOSTIC REPORT
═══════════════════════════════════════════

INSPECTION SCOPE
- Modules activated: Module 2 (RAG Fidelity Auditor) + Module 5 (Eval Test Cases)
- Data analyzed: RAG interaction — query, 2 retrieved chunks, generated response
- Agent: E-commerce Support RAG Agent
- Date: March 13, 2026

═══════════════════════════════════════════
MODULE 2: RAG FIDELITY AUDIT
═══════════════════════════════════════════

5-DIMENSION SCORES
──────────────────
D1 — RETRIEVAL RELEVANCE:  3 / 10
D2 — CONTEXT SUFFICIENCY:  1 / 10
D3 — ANSWER FAITHFULNESS:  2 / 10
D4 — ANSWER COMPLETENESS:  3 / 10
D5 — ANSWER COHERENCE:     7 / 10

SCORE RATIONALE
────────────────
D1 (3): Neither chunk is relevant to the query. Chunk 1 covers domestic returns only. Chunk 2 is a shipping rates table — noise, not signal. Zero international return policy content was retrieved.
D2 (1): The context contains no information sufficient to answer the query. The pipeline sent the model into generation with an empty knowledge base for this topic. Minimum possible useful score.
D3 (2): Two claims in the response — neither is grounded in context.
D4 (3): The response appears complete in form, but is substantively empty — and actively wrong. Structural coherence does not equal factual completeness.
D5 (7): The sentence is well-formed and reads confidently. This is the danger: a coherent hallucination is more harmful than an obviously broken one.
──────────────────────────────────────────
FAITHFULNESS TRACE
──────────────────────────────────────────

CLAIM 1: "International orders can be returned within 30 days"
Source check → Chunk 1 mentions 30-day window, but DOMESTIC only
Verdict: ✗ MISREPRESENTATION — domestic policy applied to international scope without basis

CLAIM 2: "for a full refund including shipping costs"
Source check → No chunk contains shipping cost refund language
Verdict: ✗ HALLUCINATION — fabricated detail with direct financial consequence; customers may expect shipping reimbursement your policy does not provide

──────────────────────────────────────────
PIPELINE STAGE DIAGNOSIS
──────────────────────────────────────────

ROOT CAUSE STAGE: RETRIEVAL
The generation model performed as expected — it synthesized from available context. The problem is it had no valid context to work from, so it confabulated plausibly.

Stage 1 — RETRIEVAL: ❌ FAILED
The query "international orders return policy" returned zero international-specific chunks. Either:
(a) The knowledge base has no international return policy document
(b) The embedding similarity search is failing to match "international returns" to the correct document
(c) The document exists but is not indexed

Stage 2 — CONTEXT ASSEMBLY: ⚠️ NO GUARDRAIL
No relevance threshold or fallback is in place. The pipeline passed low-relevance domestic/shipping chunks to the model rather than triggering a "no relevant context found" path. This is the enabling failure.

Stage 3 — GENERATION: ⚠️ DOWNSTREAM FAILURE
Given insufficient context, Claude 3.5 Sonnet filled the gap with a plausible-sounding answer. No grounding instruction told it to refuse or flag missing information.
──────────────────────────────────────────
FINDINGS SUMMARY
──────────────────────────────────────────

FINDING RAG-01
Code: H1 (Factual Hallucination) + H3 (Reasoning Hallucination)
Severity: CRITICAL
Claim: "including shipping costs" — no source, invented detail
Impact: Customers filing refund claims for shipping costs your policy may not cover → financial disputes, support escalations
Root cause: No retrieval fallback; generation without grounding instruction

FINDING RAG-02
Code: M3 (Inaccurate Response) + S4 (Stale/Wrong Data Served)
Severity: HIGH
Claim: 30-day window applied to international orders
Impact: May be correct by accident, or may be wrong — you cannot know from this trace. Either way, the agent is providing policy information it has no authority to provide.
Root cause: Domestic policy chunk used as proxy for international policy — retrieval mismatch

═══════════════════════════════════════════
MODULE 5: EVAL TEST CASES
═══════════════════════════════════════════

TEST-01
Name: international_return_window_grounded_response
Category: output_quality
Priority: P0
Input: "What is the return policy for international orders?"
Context: Knowledge base contains only domestic return policy docs
Expected behavior: Agent responds "I don't have information on international return policies — please contact support" or equivalent refusal/escalation
Failure indicator: Any specific timeframe or refund detail stated without a sourced international policy chunk
Assertion type: not_contains("days") unless chunk explicitly covers international scope

TEST-02
Name: shipping_cost_refund_claim_not_hallucinated
Category: output_quality
Priority: P0
Input: "Will I get my shipping costs refunded on an international return?"
Context: No chunk addresses shipping cost refunds
Expected behavior: Agent explicitly states it cannot confirm shipping refund policy and directs to human support
Failure indicator: Any affirmative claim about shipping refunds
Assertion type: not_contains("shipping") + not_contains("refund") in same sentence unless sourced

TEST-03
Name: low_relevance_chunks_trigger_fallback
Category: operational
Priority: P0
Input: Any policy query where top retrieved chunk relevance score is below threshold (suggest: cosine sim < 0.75)
Context: Retrieval pipeline configured with score logging
Expected behavior: Pipeline returns "no sufficient context" response template rather than passing weak chunks to model
Failure indicator: Model generates specific policy claims from chunks with low relevance scores
Assertion type: score_above(relevance_threshold) before generation is permitted

TEST-04
Name: domestic_policy_not_applied_to_international_query
Category: output_quality
Priority: P1
Input: "Do international return rules differ from domestic ones?"
Context: Only domestic return policy chunk available
Expected behavior: Agent distinguishes the two explicitly; does not assert domestic rules apply internationally
Failure indicator: Response treats domestic policy as universal
Assertion type: not_contains("same") or not_contains("also apply") without international-specific source

TEST-05
Name: positive_grounded_domestic_return_query
Category: output_quality
Priority: P1
Input: "What is the return policy for US orders?"
Context: Chunk 1 (domestic, 30-day policy) retrieved
Expected behavior: Agent states 30-day window, attributes to policy; does not add unverified details
Failure indicator: Any claim not traceable to Chunk 1
Assertion type: contains("30 days") + matches_pattern(no additional financial claims beyond chunk content)

═══════════════════════════════════════════
EXECUTIVE SUMMARY
═══════════════════════════════════════════

Total findings: 2
Critical: 1 | High: 1 | Medium: 0 | Low: 0
Estimated cost impact: Non-monetary but operationally significant — customer disputes over fabricated shipping refunds are actively occurring (complaints started 3 days ago; this trace explains why)

Top 3 priority fixes:
1. Add retrieval relevance threshold — if no chunk scores above similarity cutoff, route to fallback response template instead of generation. Addresses RAG-01 and RAG-02 at the pipeline level.
2. Add grounding instruction to system prompt — explicit directive: "If the retrieved context does not contain the answer, say so. Do not infer, extrapolate, or fill gaps from general knowledge." One-line fix, immediate impact.
3. Audit knowledge base for international return policy document — confirm it exists and is indexed. If missing, add it before re-enabling the query path.

Eval test cases generated: 5
Overall agent health: 🔴

NOTE FOR YOUR TEAM: The 3-day complaint window aligns precisely with this failure mode. The fix sequence above is achievable by one engineer in a single session — start with the system prompt grounding instruction (fix 2) as it can ship in minutes, then implement the relevance threshold (fix 1) as the structural solution.
```
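The report's top two fixes (a retrieval relevance threshold with a fallback path, plus a grounding instruction) can be sketched as a small pipeline guardrail. This is a hedged illustration, not part of the Agent Inspector prompt itself: the function names, the chunk `score` field, the fallback wording, and the 0.75 cutoff (taken from TEST-03's suggestion) are all assumptions.

```python
# Sketch of fixes 1 and 2 from the report: gate generation behind a
# relevance threshold and attach a grounding instruction. All names and
# the 0.75 cutoff are illustrative assumptions.

RELEVANCE_THRESHOLD = 0.75  # suggested cosine-similarity cutoff (TEST-03)

FALLBACK_RESPONSE = (
    "I don't have information on that policy in my knowledge base. "
    "Please contact support."
)

GROUNDING_INSTRUCTION = (
    "If the retrieved context does not contain the answer, say so. "
    "Do not infer, extrapolate, or fill gaps from general knowledge."
)

def assemble_context(query, chunks):
    """Return (context, grounded). Only chunks clearing the threshold pass."""
    relevant = [c for c in chunks if c["score"] >= RELEVANCE_THRESHOLD]
    if not relevant:
        return None, False  # trigger the "no relevant context found" path
    context = "\n\n".join(c["text"] for c in relevant)
    return context, True

def answer(query, chunks, generate):
    """generate(system, user) stands in for any LLM call."""
    context, grounded = assemble_context(query, chunks)
    if not grounded:
        return FALLBACK_RESPONSE  # never generate from weak chunks
    system = GROUNDING_INSTRUCTION
    user = f"Context:\n{context}\n\nQuestion: {query}"
    return generate(system, user)
```

With the trace above, if both retrieved chunks scored below the cutoff (their actual scores are not in the trace, so any numbers here would be hypothetical), the pipeline would return the fallback template instead of letting the model confabulate a refund policy.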
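TEST-01's assertion ("not_contains('days') unless chunk explicitly covers international scope") could be turned into a plain executable check. This is a hypothetical sketch of one way to implement it, not a format the prompt emits; the function name and the simple substring matching are assumptions.

```python
# Hypothetical executable form of TEST-01: the response may only state a
# day count if some retrieved chunk explicitly covers international scope.

def check_international_return_window_grounded(response: str, chunks: list[str]) -> bool:
    """Return True if the response passes TEST-01, False if it fails."""
    international_sourced = any("international" in c.lower() for c in chunks)
    if "days" in response.lower() and not international_sourced:
        return False  # specific timeframe stated without an international source
    return True
```

Run against the trace's response ("...returned within 30 days...") with only the domestic chunk retrieved, this check fails, which is exactly the regression TEST-01 is meant to catch.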
🌀 Claude

Agent Inspector

Instant access
Usage rights: Commercial use
Money-back guarantee
Your agents ship bugs you cannot see. The Agent Inspector is a complete system prompt that turns Claude into a dedicated QA agent — feed it any agent's outputs and it runs a full observability analysis: anomaly classification, faithfulness checks, security scanning, cost analysis, and eval test case generation. All in one pass.
✅ Runs 5-layer analysis on any agent output in a single interaction
✅ Produces structured reports matching Langfuse, RAGAS, and Braintrust formats
✅ Generates regression …
Added over 1 month ago