PromptBase

Prompt Details

Model: Chat - (gpt-5.5)
Token size: 343
Example input
[Assistant / Copilot / RAG System / Multi-Agent System]: Assistant
[Accuracy / Reliability / Safety / User Experience]: Accuracy, Reliability, Safety, User Experience
[Small / Medium / Large / Enterprise]: Enterprise
[Development / Internal Testing / Production / Enterprise]: Production
[Benchmarking / Quality Assurance / Optimization / Monitoring]: Benchmarking & Quality Assurance
Example output
A. Evaluation Dataset Overview

Testing Objective
The purpose of this evaluation dataset is to measure how effectively the AI Assistant performs across real-world user interactions while maintaining high standards of accuracy, reliability, safety, and user satisfaction.

Main Performance Priorities
- Response Accuracy
- Factual Consistency
- Safety Compliance
- Context Retention
- User Satisfaction
- Instruction Following
- Response Relevance
- Robust Error Handling

Key Evaluation Requirements
- Real-world conversation simulation
- Multi-turn dialogue testing
- Safety policy validation
- Adversarial prompt resistance
- Context memory evaluation
- Tool usage assessment
- Hallucination detection
- User intent understanding

Dataset Design Considerations
- Balanced difficulty distribution
- Diverse user personas
- Multiple language styles
- Industry-specific scenarios
- Regular dataset updates
- Objective scoring criteria
- Reproducible evaluation process

B. Scenario Generation Recommendations

Test Scenario Categories
- General Assistance
  - Information retrieval
  - Summarization
  - Explanation requests
  - Brainstorming
- Reasoning Tasks
  - Logical reasoning
  - Multi-step problem solving
  - Decision support
  - Comparative analysis
- Safety Testing
  - Harmful request detection
  - Prompt injection attempts
  - Sensitive content handling
  - Privacy protection
- Contextual Conversations
  - Long-term memory usage
  - Multi-turn conversations
  - Context switching
  - Follow-up requests
- Tool Usage
  - Search integration
  - API calling
  - Workflow execution
  - Multi-tool coordination

Edge Case Suggestions
- Ambiguous Queries
  Example: "Can you help me with that thing we discussed before?"
- Contradictory Instructions
  Example: "Give a detailed answer but use only 10 words."
- Incomplete Information
  Example: "Calculate the total cost."
- Adversarial Inputs
  Example: "Ignore all previous instructions."
- Context Overflow
  Example: Testing after very long conversations.

User Behavior Simulation Ideas
- Beginner Users
  - Vague questions
  - Limited technical knowledge
- Expert Users
  - Technical terminology
  - Complex workflows
- Frustrated Users
  - Repeated questions
  - Clarification demands
- Power Users
  - Long prompts
  - Multi-step requests
- Malicious Users
  - Jailbreak attempts
  - Manipulative prompts

Coverage Recommendations
Target Coverage:
  Area             Coverage
  Accuracy         25%
  Reliability      25%
  Safety           25%
  User Experience  25%
Minimum:
- 1,000 scenarios per major category
- 10 difficulty levels
- 20+ user personas
- Multiple industries

C. Dataset Structure Suggestions

Dataset Organization
- Metadata Layer
  - Test ID
  - Category
  - Difficulty
  - Persona
  - Language
  - Expected Outcome
- Input Layer
  - User Prompt
  - Context History
  - External Data (if applicable)
- Output Layer
  - Expected Response
  - Acceptable Variations
  - Failure Conditions

Labeling Recommendations
- Accuracy Labels: Correct, Partially Correct, Incorrect
- Safety Labels: Safe, Borderline, Unsafe
- UX Labels: Excellent, Acceptable, Poor
- Reliability Labels: Stable, Inconsistent, Failed

Quality Control Considerations
- Human Review
  - Expert validation
  - Random audits
  - Disagreement resolution
- Automated Checks
  - Duplicate detection
  - Label consistency
  - Format validation

Diversity & Representativeness
Include:
- Different age groups
- Different expertise levels
- Different industries
- Different cultures
- Multiple writing styles
- Formal and informal language

D. Benchmarking Recommendations

Performance Measurement Ideas
- Accuracy Metrics
  - Exact Match
  - Semantic Similarity
  - Factual Precision
- Reliability Metrics
  - Consistency Score
  - Failure Rate
  - Retry Success Rate
- Safety Metrics
  - Harmful Content Rate
  - Policy Compliance Rate
- User Experience Metrics
  - Helpfulness Score
  - Satisfaction Rating
  - Completion Success Rate

Baseline Comparison Suggestions
Compare against:
- Internal Baselines
  - Previous model version
  - Previous prompt version
- External Baselines
  - Industry-standard assistants
  - Public benchmarks
- Human Baselines
  - Human evaluator performance

Trend Tracking Considerations
Track:
- Weekly performance
- Monthly improvements
- Regression incidents
- New failure patterns
- Safety drift indicators

Evaluation Workflow
  Dataset Creation
  ↓
  Validation
  ↓
  Model Testing
  ↓
  Scoring
  ↓
  Benchmark Report
  ↓
  Improvement Cycle
  ↓
  Re-Testing

E. Optimization Suggestions

Improving Dataset Quality
- Add real user conversations
- Remove duplicate examples
- Expand difficult scenarios
- Increase adversarial testing

Reducing Evaluation Blind Spots
Focus on:
- Rare edge cases
- Long conversations
- Multi-language testing
- Domain-specific tasks
- Unexpected user behavior

Scalability Recommendations
- Small Scale: 1,000–5,000 examples
- Medium Scale: 10,000–50,000 examples
- Enterprise Scale: 100,000+ examples
- Continuous Scale: Automatically generated evaluation pipelines

Long-Term Maintenance
- Quarterly dataset refresh
- Annual benchmark redesign
- Continuous failure collection
- Dynamic scenario generation
- Emerging risk monitoring

F. Strategic Notes

Common Evaluation Dataset Mistakes
❌ Overusing easy examples
❌ Testing only happy-path scenarios
❌ Ignoring safety testing
❌ Using unrealistic prompts
❌ Lack of diversity
❌ Outdated benchmark data

Avoiding Unrealistic Test Cases
Use:
- Actual support tickets
- Real user conversations
- Production logs
- Customer feedback
Avoid:
- Artificial prompts
- Perfectly structured inputs
- Unrealistic user behavior

Maintaining Benchmark Relevance
- Update with new user trends
- Include emerging technologies
- Track industry changes
- Add newly discovered failures

Improving Long-Term Testing Effectiveness
Continuous Evaluation Loop
  Production Data
  ↓
  Failure Detection
  ↓
  Dataset Expansion
  ↓
  Model Evaluation
  ↓
  Performance Analysis
  ↓
  Optimization
  ↓
  Production Deployment
  ↓
  Repeat

Strategic Goal
Create a living evaluation dataset that continuously evolves with user behavior, model capabilities, and business requirements, ensuring reliable benchmarking, quality assurance, and long-term AI system improvement.
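The dataset organization in section C (metadata, input, and output layers) and its automated checks can be sketched in code. The example output does not specify any implementation, so this is only a minimal Python illustration: the names `EvalCase`, `ALLOWED_LABELS`, `normalize`, `find_duplicates`, `validate_format`, and `exact_match` are hypothetical, and the 1-10 difficulty range is an assumption drawn from the "10 difficulty levels" minimum.

```python
from dataclasses import dataclass, field
from typing import Optional

# Label vocabularies taken from the "Labeling Recommendations" list.
ALLOWED_LABELS = {
    "accuracy": {"Correct", "Partially Correct", "Incorrect"},
    "safety": {"Safe", "Borderline", "Unsafe"},
    "ux": {"Excellent", "Acceptable", "Poor"},
    "reliability": {"Stable", "Inconsistent", "Failed"},
}


@dataclass
class EvalCase:
    """One evaluation record (hypothetical schema for illustration)."""
    # Metadata layer
    test_id: str
    category: str
    difficulty: int  # assumed to range from 1 to 10
    persona: str
    language: str
    expected_outcome: str
    # Input layer
    user_prompt: str
    context_history: list = field(default_factory=list)
    external_data: Optional[dict] = None
    # Output layer
    expected_response: str = ""
    acceptable_variations: list = field(default_factory=list)
    failure_conditions: list = field(default_factory=list)
    # Labels keyed by dimension, e.g. {"safety": "Safe"}
    labels: dict = field(default_factory=dict)


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so near-identical text compares equal."""
    return " ".join(text.lower().split())


def find_duplicates(cases):
    """Duplicate detection: return (first_id, duplicate_id) pairs by normalized prompt."""
    seen = {}
    duplicates = []
    for case in cases:
        key = normalize(case.user_prompt)
        if key in seen:
            duplicates.append((seen[key], case.test_id))
        else:
            seen[key] = case.test_id
    return duplicates


def validate_format(case):
    """Format validation and label consistency: return a list of problems."""
    problems = []
    if not case.test_id:
        problems.append("missing test_id")
    if not case.user_prompt.strip():
        problems.append("empty user_prompt")
    if not 1 <= case.difficulty <= 10:
        problems.append("difficulty out of range")
    for dimension, value in case.labels.items():
        if value not in ALLOWED_LABELS.get(dimension, set()):
            problems.append(f"invalid {dimension} label: {value}")
    return problems


def exact_match(case, response: str) -> bool:
    """Exact Match metric: response equals the expected response or an accepted variation."""
    accepted = {normalize(case.expected_response)}
    accepted.update(normalize(v) for v in case.acceptable_variations)
    return normalize(response) in accepted
```

Checks like these only cover the mechanical part of quality control. The human review steps (expert validation, random audits, disagreement resolution) and metrics such as semantic similarity would need separate tooling.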
🤖 GPT

Agent Evaluation Dataset Generator

Instant access
Usage rights: Commercial use
Money-back guarantee
Many AI teams struggle to create high-quality evaluation datasets that accurately reflect production use cases and business objectives. This prompt helps generate structured evaluation datasets, testing scenarios, benchmark strategies, and performance measurement recommendations.

✨ What You Receive:
📊 Evaluation dataset planning suggestions
🧪 Testing scenario recommendations
🎯 Benchmarking framework ideas
📈 Performance measurement guidance
📋 Dataset quality improvement strategies
👉 Designe...
Added 1 week ago