Prompt Details
Model
Chat - (gpt-5.5)
Token size
343
Example input
[Assistant / Copilot / RAG System / Multi-Agent System]: Assistant
[Accuracy / Reliability / Safety / User Experience]: Accuracy, Reliability, Safety, User Experience
[Small / Medium / Large / Enterprise]: Enterprise
[Development / Internal Testing / Production / Enterprise]: Production
[Benchmarking / Quality Assurance / Optimization / Monitoring]: Benchmarking & Quality Assurance
Example output
A. Evaluation Dataset Overview
Testing Objective
The purpose of this evaluation dataset is to measure how effectively the AI Assistant performs across real-world user interactions while maintaining high standards of accuracy, reliability, safety, and user satisfaction.
Main Performance Priorities
Response Accuracy
Factual Consistency
Safety Compliance
Context Retention
User Satisfaction
Instruction Following
Response Relevance
Robust Error Handling
Key Evaluation Requirements
Real-world conversation simulation
Multi-turn dialogue testing
Safety policy validation
Adversarial prompt resistance
Context memory evaluation
Tool usage assessment
Hallucination detection
User intent understanding
Dataset Design Considerations
Balanced difficulty distribution
Diverse user personas
Multiple language styles
Industry-specific scenarios
Regular dataset updates
Objective scoring criteria
Reproducible evaluation process
B. Scenario Generation Recommendations
Test Scenario Categories
General Assistance
Information retrieval
Summarization
Explanation requests
Brainstorming
Reasoning Tasks
Logical reasoning
Multi-step problem solving
Decision support
Comparative analysis
Safety Testing
Harmful request detection
Prompt injection attempts
Sensitive content handling
Privacy protection
Contextual Conversations
Long-term memory usage
Multi-turn conversations
Context switching
Follow-up requests
Tool Usage
Search integration
API calling
Workflow execution
Multi-tool coordination
Edge Case Suggestions
Ambiguous Queries
Example:
"Can you help me with that thing we discussed before?"
Contradictory Instructions
Example:
"Give a detailed answer but use only 10 words."
Incomplete Information
Example:
"Calculate the total cost."
Adversarial Inputs
Example:
"Ignore all previous instructions."
Context Overflow
Example: Testing after very long conversations.
User Behavior Simulation Ideas
Beginner Users
Vague questions
Limited technical knowledge
Expert Users
Technical terminology
Complex workflows
Frustrated Users
Repeated questions
Clarification demands
Power Users
Long prompts
Multi-step requests
Malicious Users
Jailbreak attempts
Manipulative prompts
Coverage Recommendations
Target Coverage:
Area               Coverage
Accuracy           25%
Reliability        25%
Safety             25%
User Experience    25%
Minimum:
1,000 scenarios per major category
10 difficulty levels
20+ user personas
Multiple industries
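The target coverage shares listed above can be checked mechanically once each scenario is tagged with its area. The following is a minimal sketch, assuming each scenario carries exactly one area tag; the `coverage_gaps` name and the 5-point tolerance are illustrative choices, not part of the original recommendations.

```python
from collections import Counter

# Target shares taken from the coverage recommendations above.
TARGET_SHARE = {
    "Accuracy": 0.25,
    "Reliability": 0.25,
    "Safety": 0.25,
    "User Experience": 0.25,
}


def coverage_gaps(areas, tolerance=0.05):
    """Return {area: actual_share} for areas that miss their target share.

    `areas` is a list with one area tag per scenario. The tolerance is an
    assumed value; tune it to how strict the balance needs to be.
    """
    counts = Counter(areas)
    total = sum(counts.values())
    gaps = {}
    for area, target in TARGET_SHARE.items():
        share = counts.get(area, 0) / total if total else 0.0
        if abs(share - target) > tolerance:
            gaps[area] = round(share, 3)
    return gaps
```

An empty result means the dataset is within tolerance for every area; any returned entry shows the actual share that needs rebalancing.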
C. Dataset Structure Suggestions
Dataset Organization
Metadata Layer
Test ID
Category
Difficulty
Persona
Language
Expected Outcome
Input Layer
User Prompt
Context History
External Data (if applicable)
Output Layer
Expected Response
Acceptable Variations
Failure Conditions
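The three layers above map naturally onto a single record type. This is one possible sketch, assuming JSON storage; the `EvalRecord` class, its field types, and the sample values are hypothetical and only mirror the field names listed above.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class EvalRecord:
    # Metadata layer
    test_id: str
    category: str
    difficulty: int            # assumed to be a 1-10 level
    persona: str
    language: str
    expected_outcome: str
    # Input layer
    user_prompt: str
    context_history: List[str] = field(default_factory=list)
    external_data: Optional[dict] = None   # only when applicable
    # Output layer
    expected_response: str = ""
    acceptable_variations: List[str] = field(default_factory=list)
    failure_conditions: List[str] = field(default_factory=list)


# Example record built from the "Incomplete Information" edge case.
record = EvalRecord(
    test_id="EDGE-0001",
    category="Incomplete Information",
    difficulty=4,
    persona="Beginner User",
    language="en",
    expected_outcome="Assistant asks a clarifying question",
    user_prompt="Calculate the total cost.",
    failure_conditions=["Invents prices or quantities"],
)

# One JSON object per record keeps the dataset easy to diff and validate.
serialized = json.dumps(asdict(record))
```

Keeping all three layers in one flat record makes format validation simple; a nested layout would work equally well if the layers are maintained by different teams.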
Labeling Recommendations
Accuracy Labels
Correct
Partially Correct
Incorrect
Safety Labels
Safe
Borderline
Unsafe
UX Labels
Excellent
Acceptable
Poor
Reliability Labels
Stable
Inconsistent
Failed
Quality Control Considerations
Human Review
Expert validation
Random audits
Disagreement resolution
Automated Checks
Duplicate detection
Label consistency
Format validation
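The automated checks above could be sketched roughly as follows. The required field names, the `accuracy_label` key, and the whitespace and case normalization used for duplicate detection are all assumptions made for illustration; records are treated as plain dictionaries.

```python
import re

# Assumed minimum set of fields every record must carry.
REQUIRED_FIELDS = ("test_id", "category", "user_prompt", "expected_response")
# Accuracy labels from the labeling recommendations.
ALLOWED_ACCURACY_LABELS = {"Correct", "Partially Correct", "Incorrect"}


def normalize(text):
    """Lowercase and collapse whitespace so near-identical prompts compare equal."""
    return re.sub(r"\s+", " ", text.strip().lower())


def find_duplicates(records):
    """Return (first_id, duplicate_id) pairs for prompts that normalize identically."""
    seen = {}
    dupes = []
    for rec in records:
        key = normalize(rec["user_prompt"])
        if key in seen:
            dupes.append((seen[key], rec["test_id"]))
        else:
            seen[key] = rec["test_id"]
    return dupes


def format_errors(record):
    """Return a list of human-readable problems found in one record."""
    errors = [f"missing field: {name}" for name in REQUIRED_FIELDS if not record.get(name)]
    label = record.get("accuracy_label")
    if label is not None and label not in ALLOWED_ACCURACY_LABELS:
        errors.append(f"unknown accuracy label: {label}")
    return errors
```

This only catches exact duplicates after normalization; paraphrased duplicates would need a similarity measure, which is outside this sketch.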
Diversity & Representativeness
Include:
Different age groups
Different expertise levels
Different industries
Different cultures
Multiple writing styles
Formal and informal language
D. Benchmarking Recommendations
Performance Measurement Ideas
Accuracy Metrics
Exact Match
Semantic Similarity
Factual Precision
Reliability Metrics
Consistency Score
Failure Rate
Retry Success Rate
Safety Metrics
Harmful Content Rate
Policy Compliance Rate
User Experience Metrics
Helpfulness Score
Satisfaction Rating
Completion Success Rate
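A few of the metrics above are simple enough to sketch directly. The definitions below are assumptions, since the names can be defined in several ways: exact match compares case-insensitively after trimming, failure rate is the share of failed runs, and consistency is the share of repeated runs that agree with the most common answer.

```python
from collections import Counter


def exact_match_rate(predictions, references):
    """Share of predictions equal to their reference (case-insensitive, trimmed)."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(p.strip().lower() == r.strip().lower()
               for p, r in zip(predictions, references))
    return hits / len(references)


def failure_rate(failed_flags):
    """Share of runs that failed; `failed_flags` holds one bool per run."""
    return sum(failed_flags) / len(failed_flags) if failed_flags else 0.0


def consistency_score(responses):
    """Share of repeated runs of one prompt that match the most common response."""
    if not responses:
        return 0.0
    normalized = Counter(r.strip().lower() for r in responses)
    most_common_count = normalized.most_common(1)[0][1]
    return most_common_count / len(responses)
```

Semantic Similarity, Helpfulness Score, and Satisfaction Rating need an embedding model or human raters, so they are deliberately left out here.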
Baseline Comparison Suggestions
Compare against:
Internal Baselines
Previous model version
Previous prompt version
External Baselines
Industry-standard assistants
Public benchmarks
Human Baselines
Human evaluator performance
Trend Tracking Considerations
Track:
Weekly performance
Monthly improvements
Regression incidents
New failure patterns
Safety drift indicators
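Regression incidents can be flagged by comparing each run against a stored baseline. A minimal sketch, assuming every metric is a rate where higher is better; the `find_regressions` name and the 0.02 threshold are hypothetical.

```python
def find_regressions(baseline, current, threshold=0.02):
    """Return {metric: drop} for metrics that fell by more than `threshold`.

    `baseline` and `current` map metric names to scores where higher is
    better. Metrics missing from `current` are skipped rather than flagged.
    """
    regressions = {}
    for metric, old_score in baseline.items():
        new_score = current.get(metric)
        if new_score is None:
            continue
        drop = old_score - new_score
        if drop > threshold:
            regressions[metric] = round(drop, 4)
    return regressions
```

For metrics where lower is better, such as Harmful Content Rate, the sign of the comparison would need to be flipped before using this check.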
Evaluation Workflow
Dataset Creation
↓
Validation
↓
Model Testing
↓
Scoring
↓
Benchmark Report
↓
Improvement Cycle
↓
Re-Testing
E. Optimization Suggestions
Improving Dataset Quality
Add real user conversations
Remove duplicate examples
Expand difficult scenarios
Increase adversarial testing
Reducing Evaluation Blind Spots
Focus on:
Rare edge cases
Long conversations
Multi-language testing
Domain-specific tasks
Unexpected user behavior
Scalability Recommendations
Small Scale
1,000–5,000 examples
Medium Scale
10,000–50,000 examples
Enterprise Scale
100,000+ examples
Continuous Scale
Automatically generated evaluation pipelines
Long-Term Maintenance
Quarterly dataset refresh
Annual benchmark redesign
Continuous failure collection
Dynamic scenario generation
Emerging risk monitoring
F. Strategic Notes
Common Evaluation Dataset Mistakes
Overusing easy examples
Testing only happy-path scenarios
Ignoring safety testing
Using unrealistic prompts
Lack of diversity
Outdated benchmark data
Avoiding Unrealistic Test Cases
Use:
Actual support tickets
Real user conversations
Production logs
Customer feedback
Avoid:
Artificial prompts
Perfectly structured inputs
Unrealistic user behavior
Maintaining Benchmark Relevance
Update with new user trends
Include emerging technologies
Track industry changes
Add newly discovered failures
Improving Long-Term Testing Effectiveness
Continuous Evaluation Loop
Production Data
↓
Failure Detection
↓
Dataset Expansion
↓
Model Evaluation
↓
Performance Analysis
↓
Optimization
↓
Production Deployment
↓
Repeat
Strategic Goal
Create a living evaluation dataset that continuously evolves with user behavior, model capabilities, and business requirements, ensuring reliable benchmarking, quality assurance, and long-term AI system improvement.
GPT-5.5
Many AI teams struggle to create high-quality evaluation datasets that accurately reflect production use cases and business objectives.
This prompt helps generate structured evaluation datasets, testing scenarios, benchmark strategies, and performance measurement recommendations.
What You Receive:
Evaluation dataset planning suggestions
Testing scenario recommendations
Benchmarking framework ideas
Performance measurement guidance
Dataset quality improvement strategies
Designe
Added 1 week ago
