Name: Llm Eval Designer Cursor Skill
Brand: PromptBase
Price: 9.99 USD
Availability: InStock
Author: monna

Runtime

(gpt-5.5)

Example user message

[userMessage]:
System type: RAG pipeline (internal enterprise knowledge assistant)
Task description: The system answers employee questions using company policy documents, HR manuals, and technical documentation with grounded, citation-based responses.
Failure modes you care about:
- Hallucinated policy statements not present in documents
- Missing relevant document chunks in retrieval
- Over-reliance on partial context (incomplete answers)
- Incorrect or outdated policy interpretation
- Weak citation alignment (answer not supported by retrieved text)
Deployment context: Internal enterprise tool used by HR and engineering teams. Medium-to-high stakes (policy compliance, onboarding accuracy). Must be reliable but not safety-critical like medical/legal systems.
Evaluation resources: Combination of automated metrics + light human annotation (review sampling only)

Example agent response

LLM EVALUATION FRAMEWORK
System: RAG pipeline, internal enterprise HR + engineering knowledge assistant
Deployment: Medium–high stakes internal tool (policy compliance, onboarding accuracy, operational guidance)
────────────────────────────────────────────────────────────

## SELECTED METRICS (7 dimensions)

---

╔══════════════════════════════════════════════════════════════════╗
║                       EVAL DIMENSION CARD                        ║
╠══════════════════════════════════════════════════════════════════╣
║                                                                  ║
║  DIMENSION:      Context Precision                               ║
║  CATEGORY:       Retrieval                                       ║
║                                                                  ║
║  METRIC:         Context Precision (RAGAS-style)                 ║
║  IMPLEMENTATION: Automated (LLM-judged relevance per chunk)      ║
║                                                                  ║
║  THRESHOLD:      Production min: 0.80 · Alert at: 0.70           ║
║                  [MONNA-Analysis-2026]                           ║
║                                                                  ║
║  ADDRESSES:      Irrelevant or noisy retrieved chunks            ║
║                  leading to diluted answers                      ║
║                                                                  ║
║  CONFIDENCE:     ● WELL-VALIDATED                                ║
║                                                                  ║
╚══════════════════════════════════════════════════════════════════╝

---

╔══════════════════════════════════════════════════════════════════╗
║                       EVAL DIMENSION CARD                        ║
╠══════════════════════════════════════════════════════════════════╣
║                                                                  ║
║  DIMENSION:      Context Recall                                  ║
║  CATEGORY:       Retrieval                                       ║
║                                                                  ║
║  METRIC:         Context Recall (RAGAS-style)                    ║
║  IMPLEMENTATION: Automated + sampled human verification          ║
║                                                                  ║
║  THRESHOLD:      Production min: 0.85 · Alert at: 0.75           ║
║                  [MONNA-Analysis-2026]                           ║
║                                                                  ║
║  ADDRESSES:      Missing relevant policy sections in retrieval   ║
║                  leading to incomplete or wrong answers          ║
║                                                                  ║
║  CONFIDENCE:     ● WELL-VALIDATED                                ║
║                                                                  ║
╚══════════════════════════════════════════════════════════════════╝

---

╔══════════════════════════════════════════════════════════════════╗
║                       EVAL DIMENSION CARD                        ║
╠══════════════════════════════════════════════════════════════════╣
║                                                                  ║
║  DIMENSION:      Answer Faithfulness                             ║
║  CATEGORY:       Generation                                      ║
║                                                                  ║
║  METRIC:         Faithfulness Score (context-grounded checking)  ║
║  IMPLEMENTATION: Automated LLM-based entailment vs context       ║
║                                                                  ║
║  THRESHOLD:      Production min: 0.90 · Alert at: 0.80           ║
║                  [MONNA-Analysis-2026]                           ║
║                                                                  ║
║  ADDRESSES:      Hallucinated HR policy statements not present   ║
║                  in retrieved documents                          ║
║                                                                  ║
║  CONFIDENCE:     ● WELL-VALIDATED                                ║
║                                                                  ║
╚══════════════════════════════════════════════════════════════════╝

---

╔══════════════════════════════════════════════════════════════════╗
║                       EVAL DIMENSION CARD                        ║
╠══════════════════════════════════════════════════════════════════╣
║                                                                  ║
║  DIMENSION:      Answer Relevance                                ║
║  CATEGORY:       Generation                                      ║
║                                                                  ║
║  METRIC:         Query-Answer Relevance Score                    ║
║  IMPLEMENTATION: Automated LLM judge + periodic human audit      ║
║                                                                  ║
║  THRESHOLD:      Production min: 0.85 · Alert at: 0.75           ║
║                  [MONNA-Analysis-2026]                           ║
║                                                                  ║
║  ADDRESSES:      Answers that are correct but do not actually    ║
║                  address employee question intent                ║
║                                                                  ║
║  CONFIDENCE:     ● WELL-VALIDATED                                ║
║                                                                  ║
╚══════════════════════════════════════════════════════════════════╝

---

╔══════════════════════════════════════════════════════════════════╗
║                       EVAL DIMENSION CARD                        ║
╠══════════════════════════════════════════════════════════════════╣
║                                                                  ║
║  DIMENSION:      Answer Completeness                             ║
║  CATEGORY:       Generation                                      ║
║                                                                  ║
║  METRIC:         Coverage vs Expected Policy Coverage            ║
║  IMPLEMENTATION: Hybrid (LLM rubric + human sampling review)     ║
║                                                                  ║
║  THRESHOLD:      Production min: 0.80 · Alert at: 0.70           ║
║                  [MONNA-Analysis-2026]                           ║
║                                                                  ║
║  ADDRESSES:      Partial answers that omit key policy clauses    ║
║                  or steps required for compliance                ║
║                                                                  ║
║  CONFIDENCE:     ○ CONTEXT-DEPENDENT                             ║
║                                                                  ║
╚══════════════════════════════════════════════════════════════════╝

---

╔══════════════════════════════════════════════════════════════════╗
║                       EVAL DIMENSION CARD                        ║
╠══════════════════════════════════════════════════════════════════╣
║                                                                  ║
║  DIMENSION:      Policy Version Accuracy                         ║
║  CATEGORY:       Retrieval / Generation                          ║
║                                                                  ║
║  METRIC:         Version Match Rate (doc metadata alignment)     ║
║  IMPLEMENTATION: Automated (doc ID + timestamp validation)       ║
║                                                                  ║
║  THRESHOLD:      Production min: 0.95 · Alert at: 0.90           ║
║                  [MONNA-Analysis-2026]                           ║
║                                                                  ║
║  ADDRESSES:      Use of outdated HR policy versions or mixed     ║
║                  policy references                               ║
║                                                                  ║
║  CONFIDENCE:     ○ CONTEXT-DEPENDENT                             ║
║                                                                  ║
╚══════════════════════════════════════════════════════════════════╝

---

╔══════════════════════════════════════════════════════════════════╗
║                       EVAL DIMENSION CARD                        ║
╠══════════════════════════════════════════════════════════════════╣
║                                                                  ║
║  DIMENSION:      Citation Alignment Quality                      ║
║  CATEGORY:       Generation / Grounding                          ║
║                                                                  ║
║  METRIC:         Citation Support Score (span overlap + LLM)     ║
║  IMPLEMENTATION: Hybrid (automated overlap + human audit)        ║
║                                                                  ║
║  THRESHOLD:      Production min: 0.85 · Alert at: 0.75           ║
║                  [MONNA-Analysis-2026]                           ║
║                                                                  ║
║  ADDRESSES:      Answers that cite documents not actually        ║
║                  supporting the claim                            ║
║                                                                  ║
║  CONFIDENCE:     ○ CONTEXT-DEPENDENT                             ║
║                                                                  ║
╚══════════════════════════════════════════════════════════════════╝

---

## RUBRICS (human-scored dimensions)

### 1. Answer Completeness

Dimension: Answer Completeness
Definition: Measures whether the response includes all required policy steps, conditions, and exceptions needed for correct employee action.
Scale: 1–5
Score 5: Fully covers all relevant policy clauses, including exceptions, edge cases, and procedural steps
Score 3: Covers main policy guidance but misses at least one important condition or exception
Score 1: Major omissions; response
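
Each dimension card in the example response pairs a production minimum with a lower alert level. As a rough illustration of how such thresholds might be applied to a batch of scores, here is a minimal Python sketch. It is not part of the skill's output: the names `Threshold`, `THRESHOLDS`, `classify`, and `evaluate_run`, the dictionary keys, and the three status labels are all hypothetical, and the reading of "Alert at" as "alert when the score is at or below this value" is an assumption. Only the numeric values are taken from the cards.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    production_min: float  # "Production min" on the card
    alert_at: float        # "Alert at" on the card


# Numeric values copied from the seven dimension cards; key names are hypothetical.
THRESHOLDS = {
    "context_precision": Threshold(0.80, 0.70),
    "context_recall": Threshold(0.85, 0.75),
    "answer_faithfulness": Threshold(0.90, 0.80),
    "answer_relevance": Threshold(0.85, 0.75),
    "answer_completeness": Threshold(0.80, 0.70),
    "policy_version_accuracy": Threshold(0.95, 0.90),
    "citation_alignment": Threshold(0.85, 0.75),
}


def classify(dimension: str, score: float) -> str:
    """Map a score in [0, 1] to a status label (labels are illustrative)."""
    t = THRESHOLDS[dimension]
    if score >= t.production_min:
        return "pass"
    # Assumption: a score at or below the alert level triggers an alert.
    if score <= t.alert_at:
        return "alert"
    return "below_min"


def evaluate_run(scores: dict[str, float]) -> dict[str, str]:
    """Classify every dimension present in one evaluation run."""
    return {name: classify(name, value) for name, value in scores.items()}


if __name__ == "__main__":
    run = {"answer_faithfulness": 0.92, "context_recall": 0.78, "citation_alignment": 0.70}
    for name, status in evaluate_run(run).items():
        print(f"{name}: {status}")
```

Keeping the thresholds in one table separate from the classification logic means the values can be retuned per deployment without touching the code that consumes them.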

Skill Details

Llm Eval Designer