Skill Details
Runtime
(gpt-5.5)
Example user message
[userMessage]:
System type: RAG pipeline (internal enterprise knowledge assistant)
Task description: The system answers employee questions using company policy documents, HR manuals, and technical documentation with grounded, citation-based responses.
Failure modes you care about:
- Hallucinated policy statements not present in documents
- Missing relevant document chunks in retrieval
- Over-reliance on partial context (incomplete answers)
- Incorrect or outdated policy interpretation
- Weak citation alignment (answer not supported by retrieved text)
Deployment context: Internal enterprise tool used by HR and engineering teams. Medium-to-high stakes (policy compliance, onboarding accuracy). Must be reliable but not safety-critical like medical/legal systems.
Evaluation resources: Combination of automated metrics + light human annotation (review sampling only)
Example agent response
LLM EVALUATION FRAMEWORK
System: RAG pipeline — internal enterprise HR + engineering knowledge assistant
Deployment: Medium–high stakes internal tool (policy compliance, onboarding accuracy, operational guidance)
────────────────────────────────────────────────────────────
## SELECTED METRICS (7 dimensions)
---
╔══════════════════════════════════════════════════════════════════╗
║ EVAL DIMENSION CARD ║
╠══════════════════════════════════════════════════════════════════╣
║ ║
║ DIMENSION: Context Precision ║
║ CATEGORY: Retrieval ║
║ ║
║ METRIC: Context Precision (RAGAS-style) ║
║ IMPLEMENTATION: Automated (LLM-judged relevance per chunk) ║
║ ║
║ THRESHOLD: Production min: 0.80 · Alert at: 0.70 ║
║ [MONNA-Analysis-2026] ║
║ ║
║ ADDRESSES: Irrelevant or noisy retrieved chunks ║
║ leading to diluted answers ║
║ ║
║ CONFIDENCE: ● WELL-VALIDATED ║
║ ║
╚══════════════════════════════════════════════════════════════════╝
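
As a rough illustration of how a RAGAS-style context precision score could be computed once per-chunk relevance verdicts exist, here is a minimal Python sketch. It assumes the verdicts arrive from an LLM judge as booleans in retrieval order, and it uses a rank-weighted average of precision@k, which is one common formulation. The card does not specify the formula, and the function name `context_precision` is hypothetical.

```python
from typing import Sequence


def context_precision(relevance: Sequence[bool]) -> float:
    """Rank-weighted precision over retrieved chunks.

    `relevance[i]` is the judge's verdict for the chunk at rank i + 1
    (assumed input; producing the verdicts is out of scope here).
    """
    hits = 0
    weighted_sum = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            # precision@rank, counted only at positions holding a relevant chunk
            weighted_sum += hits / rank
    # No relevant chunk retrieved at all: score the query as 0.0
    return weighted_sum / hits if hits else 0.0
```

Under this formulation, verdicts of `[True, False, True]` score about 0.83, which would clear the 0.80 production minimum on the card, while `[False, True]` scores 0.5 because the only relevant chunk is ranked second.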
---
╔══════════════════════════════════════════════════════════════════╗
║ EVAL DIMENSION CARD ║
╠══════════════════════════════════════════════════════════════════╣
║ ║
║ DIMENSION: Context Recall ║
║ CATEGORY: Retrieval ║
║ ║
║ METRIC: Context Recall (RAGAS-style) ║
║ IMPLEMENTATION: Automated + sampled human verification ║
║ ║
║ THRESHOLD: Production min: 0.85 · Alert at: 0.75 ║
║ [MONNA-Analysis-2026] ║
║ ║
║ ADDRESSES: Missing relevant policy sections in retrieval ║
║ leading to incomplete or wrong answers ║
║ ║
║ CONFIDENCE: ● WELL-VALIDATED ║
║ ║
╚══════════════════════════════════════════════════════════════════╝
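
A context recall score of this kind can be sketched as the share of reference-answer claims that the retrieved chunks support. The snippet below is an assumption-laden illustration: `context_recall` and `substring_judge` are hypothetical names, and the substring check is only a toy stand-in for the LLM judge or human verifier that the card calls for.

```python
from typing import Callable, Sequence

Judge = Callable[[str, Sequence[str]], bool]


def context_recall(
    reference_claims: Sequence[str],
    retrieved_chunks: Sequence[str],
    is_supported: Judge,
) -> float:
    """Fraction of reference claims attributable to the retrieved chunks."""
    if not reference_claims:
        return 0.0  # nothing to recall; treated as 0.0 in this sketch
    supported = sum(
        1 for claim in reference_claims if is_supported(claim, retrieved_chunks)
    )
    return supported / len(reference_claims)


def substring_judge(claim: str, chunks: Sequence[str]) -> bool:
    """Toy judge: case-insensitive substring match (not a real entailment check)."""
    return any(claim.lower() in chunk.lower() for chunk in chunks)
```

Keeping the judge as a parameter means the same scoring function can be run with an automated judge on the full set and with human verdicts on the sampled subset.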
---
╔══════════════════════════════════════════════════════════════════╗
║ EVAL DIMENSION CARD ║
╠══════════════════════════════════════════════════════════════════╣
║ ║
║ DIMENSION: Answer Faithfulness ║
║ CATEGORY: Generation ║
║ ║
║ METRIC: Faithfulness Score (context-grounded checking) ║
║ IMPLEMENTATION: Automated LLM-based entailment vs context ║
║ ║
║ THRESHOLD: Production min: 0.90 · Alert at: 0.80 ║
║ [MONNA-Analysis-2026] ║
║ ║
║ ADDRESSES: Hallucinated HR policy statements not present ║
║ in retrieved documents ║
║ ║
║ CONFIDENCE: ● WELL-VALIDATED ║
║ ║
╚══════════════════════════════════════════════════════════════════╝
---
╔══════════════════════════════════════════════════════════════════╗
║ EVAL DIMENSION CARD ║
╠══════════════════════════════════════════════════════════════════╣
║ ║
║ DIMENSION: Answer Relevance ║
║ CATEGORY: Generation ║
║ ║
║ METRIC: Query-Answer Relevance Score ║
║ IMPLEMENTATION: Automated LLM judge + periodic human audit ║
║ ║
║ THRESHOLD: Production min: 0.85 · Alert at: 0.75 ║
║ [MONNA-Analysis-2026] ║
║ ║
║ ADDRESSES: Answers that are correct but do not actually ║
║ address the intent of the employee's question ║
║ ║
║ CONFIDENCE: ● WELL-VALIDATED ║
║ ║
╚══════════════════════════════════════════════════════════════════╝
---
╔══════════════════════════════════════════════════════════════════╗
║ EVAL DIMENSION CARD ║
╠══════════════════════════════════════════════════════════════════╣
║ ║
║ DIMENSION: Answer Completeness ║
║ CATEGORY: Generation ║
║ ║
║ METRIC: Coverage vs Expected Policy Coverage ║
║ IMPLEMENTATION: Hybrid (LLM rubric + human sampling review) ║
║ ║
║ THRESHOLD: Production min: 0.80 · Alert at: 0.70 ║
║ [MONNA-Analysis-2026] ║
║ ║
║ ADDRESSES: Partial answers that omit key policy clauses ║
║ or steps required for compliance ║
║ ║
║ CONFIDENCE: ○ CONTEXT-DEPENDENT ║
║ ║
╚══════════════════════════════════════════════════════════════════╝
---
╔══════════════════════════════════════════════════════════════════╗
║ EVAL DIMENSION CARD ║
╠══════════════════════════════════════════════════════════════════╣
║ ║
║ DIMENSION: Policy Version Accuracy ║
║ CATEGORY: Retrieval / Generation ║
║ ║
║ METRIC: Version Match Rate (doc metadata alignment) ║
║ IMPLEMENTATION: Automated (doc ID + timestamp validation) ║
║ ║
║ THRESHOLD: Production min: 0.95 · Alert at: 0.90 ║
║ [MONNA-Analysis-2026] ║
║ ║
║ ADDRESSES: Use of outdated HR policy versions or mixed ║
║ policy references ║
║ ║
║ CONFIDENCE: ○ CONTEXT-DEPENDENT ║
║ ║
╚══════════════════════════════════════════════════════════════════╝
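
One way to picture the version match rate is as a lookup of each cited document against a registry of current versions. The sketch below assumes citations are available as `(doc_id, version)` pairs and that a mapping of current versions exists; both structures and the name `version_match_rate` are hypothetical, since the card only mentions doc ID and timestamp validation.

```python
from typing import Mapping, Sequence, Tuple


def version_match_rate(
    cited: Sequence[Tuple[str, str]],
    current_versions: Mapping[str, str],
) -> float:
    """Share of cited documents whose version equals the registry's current one.

    `cited` holds (doc_id, version) pairs taken from retrieved-chunk metadata.
    A doc_id missing from the registry counts as a mismatch.
    """
    if not cited:
        return 0.0  # no citations to validate; treated as a failure in this sketch
    matches = sum(
        1
        for doc_id, version in cited
        if current_versions.get(doc_id) == version
    )
    return matches / len(cited)
```

The version value could equally be an ISO timestamp string, as long as the registry and the chunk metadata use the same representation.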
---
╔══════════════════════════════════════════════════════════════════╗
║ EVAL DIMENSION CARD ║
╠══════════════════════════════════════════════════════════════════╣
║ ║
║ DIMENSION: Citation Alignment Quality ║
║ CATEGORY: Generation / Grounding ║
║ ║
║ METRIC: Citation Support Score (span overlap + LLM) ║
║ IMPLEMENTATION: Hybrid (automated overlap + human audit) ║
║ ║
║ THRESHOLD: Production min: 0.85 · Alert at: 0.75 ║
║ [MONNA-Analysis-2026] ║
║ ║
║ ADDRESSES: Answers that cite documents not actually ║
║ supporting the claim ║
║ ║
║ CONFIDENCE: ○ CONTEXT-DEPENDENT ║
║ ║
╚══════════════════════════════════════════════════════════════════╝
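
The automated span-overlap half of the citation support score might look like the following sketch. It is a simple lexical check under stated assumptions: the tokenizer, the `min_overlap` default of 0.6, and the function names are all hypothetical, and the LLM and human-audit half of the hybrid is not shown.

```python
import re
from typing import Sequence, Set, Tuple


def _tokens(text: str) -> Set[str]:
    """Lowercased alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def span_overlap(claim: str, cited_text: str) -> float:
    """Fraction of the claim's tokens that also appear in the cited span."""
    claim_tokens = _tokens(claim)
    if not claim_tokens:
        return 0.0
    return len(claim_tokens & _tokens(cited_text)) / len(claim_tokens)


def citation_support_score(
    pairs: Sequence[Tuple[str, str]],
    min_overlap: float = 0.6,  # hypothetical cutoff, not taken from the card
) -> float:
    """Share of (claim, cited_text) pairs whose overlap reaches `min_overlap`."""
    if not pairs:
        return 0.0
    supported = sum(
        1 for claim, cited in pairs if span_overlap(claim, cited) >= min_overlap
    )
    return supported / len(pairs)
```

Lexical overlap cannot tell "requires VP approval" from "requires manager approval" (four of five tokens match), which is a reason the card pairs it with an LLM check and a human audit rather than relying on overlap alone.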
---
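
The thresholds on the seven cards above can be collected into one gate, sketched below. The numbers are copied from the cards; the three-way interpretation (pass at or above the production minimum, alert at or below the alert level, warn in between) is an assumption, as are the names `THRESHOLDS` and `classify`.

```python
from typing import Dict, Tuple

# (production_min, alert_at) per dimension, as listed on the cards
THRESHOLDS: Dict[str, Tuple[float, float]] = {
    "Context Precision": (0.80, 0.70),
    "Context Recall": (0.85, 0.75),
    "Answer Faithfulness": (0.90, 0.80),
    "Answer Relevance": (0.85, 0.75),
    "Answer Completeness": (0.80, 0.70),
    "Policy Version Accuracy": (0.95, 0.90),
    "Citation Alignment Quality": (0.85, 0.75),
}


def classify(dimension: str, score: float) -> str:
    """Return 'pass', 'warn', or 'alert' for a dimension's aggregate score."""
    production_min, alert_at = THRESHOLDS[dimension]
    if score >= production_min:
        return "pass"
    if score > alert_at:
        return "warn"  # below the production minimum, above the alert level
    return "alert"
```

For example, `classify("Context Precision", 0.75)` returns `"warn"` under this reading, while the same score for Policy Version Accuracy returns `"alert"`.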
## RUBRICS (human-scored dimensions)
### 1. Answer Completeness
Dimension: Answer Completeness
Definition: Measures whether the response includes all required policy steps, conditions, and exceptions needed for correct employee action.
Scale: 1–5
Score 5 — Fully covers all relevant policy clauses, including exceptions, edge cases, and procedural steps
Score 3 — Covers main policy guidance but misses at least one important condition or exception
Score 1 — Major omissions; response
Stop guessing how good your LLM is. Design a rigorous eval framework
in one session — metrics, rubrics, thresholds, and test suite structure.
Added 2 days ago
