Prompt Details
Model
(claude-4-5-sonnet)
Token size
603
Example input
[TASK]: Train the model to generate accurate, policy-compliant customer support responses that resolve issues clearly without over-verbosity.
[SEED_DATA]: * 150 human-written customer support tickets * 40 approved agent responses * Internal style guide (concise, neutral tone)
[QUALITY_CRITERIA]: Responses must be correct, helpful, grounded in provided information, and avoid hallucinating policies or promises.
[RISKS]: * Hallucinated company policies * Overly verbose explanations * Inconsistent tone across samples
Example output
# SYNTHETIC DATA GENERATION PLAN: CUSTOMER SUPPORT RESPONSE TRAINING
## 1. TARGET DATA PROFILE
### Input/Output Structure
**Input:** Customer inquiry containing:
- Issue description (1-5 sentences)
- Customer sentiment indicator (neutral/frustrated/confused)
- Relevant account context (optional: order ID, subscription tier, previous contact)
- Available resolution pathways (refund eligible, troubleshooting steps, escalation needed)
**Output:** Support response containing:
- Direct acknowledgment of issue (1 sentence)
- Core resolution or next steps (1-3 sentences)
- Additional context if needed (0-2 sentences)
- Closing (1 sentence)
- Total length: 3-7 sentences maximum
### Difficulty Distribution
**Easy (40%):** Single-issue tickets with clear resolution pathway (password reset, order status check, basic product information)
**Medium (45%):** Multi-part inquiries, emotion management required, policy interpretation needed, troubleshooting with 2-3 decision points
**Hard (15%):** Edge cases with policy ambiguity, multiple conflicting customer requests, high-emotion scenarios, escalation judgment calls, missing information requiring clarification
### Diversity Requirements
**Domains:** Account management, billing/refunds, product troubleshooting, shipping/logistics, feature requests, cancellations, technical errors, policy clarification
**Formats:** Direct questions, narrative complaints, multi-issue requests, follow-up inquiries, urgent escalations
**Styles:** Neutral/professional, frustrated/demanding, confused/seeking guidance, appreciative/feedback-oriented
**Confidence: 8/10** — Structure and difficulty tiers are well-defined based on seed data analysis. Uncertainty exists around optimal hard example percentage; 15% may need adjustment based on model performance curves.
### IN-SCOPE EXAMPLES
**Example 1 (Medium Difficulty):**
- **Input:** "I was charged twice for my order #45921 and now my account shows two shipments. I only ordered one item. Can you fix this and refund the extra charge?"
- **Output:** "I can see duplicate charges for order #45921 on your account. I've canceled the second order and initiated a refund of $47.99, which will appear in 3-5 business days. You'll receive only one shipment as originally intended. Is there anything else I can help with?"
**Example 2 (Hard Difficulty):**
- **Input:** "Your service is terrible! I've been waiting 3 weeks for a refund you promised, my case was escalated twice, and nobody has helped. I want my money NOW and I'm canceling everything."
- **Output:** "I sincerely apologize for the refund delay and poor follow-up on your case. I've located your escalation and can confirm the $124.50 refund was processed yesterday—you should see it within 24 hours. I've also added a $20 account credit for the inconvenience. If you'd still like to proceed with cancellation, I can process that immediately, or I'm happy to discuss any concerns you have."
**Example 3 (Easy Difficulty):**
- **Input:** "How do I reset my password? I can't remember it."
- **Output:** "Click 'Forgot Password' on the login page, enter your email address, and you'll receive a reset link within a few minutes. Check your spam folder if you don't see it. Let me know if you run into any issues."
### OUT-OF-SCOPE EXAMPLES
**Example 1:** Technical support responses requiring specialized product knowledge not covered in seed data
- "My API integration is returning a 429 error after implementing OAuth refresh tokens according to your documentation."
- *(Reason: Requires technical API knowledge beyond general customer support scope)*
**Example 2:** Sales/marketing content masquerading as support
- "I have a question about pricing." → Response includes upsell language, promotional offers, or sales tactics
- *(Reason: Support should inform, not sell; violates neutral tone requirement)*
**Example 3:** Responses that defer without resolution
- "We understand your frustration. Your feedback is important to us. Someone will get back to you soon."
- *(Reason: Lacks actionable resolution; violates helpfulness criteria)*
**Confidence: 9/10** — Scope boundaries are clear from seed data and style guide. Minor uncertainty about edge cases involving borderline sales/support overlap.
---
## 2. GENERATION STRATEGY
### Primary Prompting Approaches
**Template-Based Expansion (40% of generation):**
Extract issue patterns from seed data (refund request, troubleshooting, policy question) and generate variations using controlled prompts: "Generate a customer support inquiry about [ISSUE_TYPE] with [SENTIMENT_LEVEL] tone, including [CONTEXT_ELEMENTS]." Follow with: "Generate a policy-compliant response that resolves this in 4-6 sentences, using neutral professional tone."
**Compositional Synthesis (35% of generation):**
Combine multiple seed data elements: "Create a customer inquiry that merges [SEED_ISSUE_A] and [SEED_ISSUE_B], then provide a response that addresses both without exceeding 7 sentences." This increases realistic complexity.
**Adversarial/Edge Case Generation (25% of generation):**
Explicitly prompt for difficult scenarios: "Generate a support ticket where [POLICY_A] and [POLICY_B] appear to conflict, then provide the correct resolution." Or: "Create a high-emotion complaint with incomplete information, then write a response that de-escalates and clarifies without making unauthorized promises."
### Difficulty Variation
**Easy tier:** Constrain prompts to single-issue, explicit context, positive/neutral sentiment. Example: "Generate a straightforward password reset request and resolution."
**Medium tier:** Add constraints requiring 2-3 decision points, implicit context interpretation, or mild frustration. Example: "Generate an inquiry about a delayed refund where the customer mentions previous contact but no ticket number."
**Hard tier:** Introduce ambiguity, policy edge cases, or extreme emotion. Example: "Generate a scenario where company policy says X, but the customer's situation has extenuating circumstances that might warrant exception. Provide a response that follows policy while showing empathy."
**Confidence: 7/10** — Prompt engineering for difficulty control is less deterministic than desired. Risk of Medium-tier prompts producing Easy-tier outputs requires post-generation difficulty validation.
### Domain/Format/Style Variation
**Domain rotation:** Cycle through 8 identified domains systematically; generate 12-15% of samples per domain to prevent concentration.
**Format injection:** Alternate prompt structures: direct questions, narrative formats ("I'm writing to complain about..."), bulleted lists from customers, urgent all-caps sections, formal vs. casual language.
**Sentiment matrix:** Vary customer tone (neutral 40%, frustrated 30%, confused 20%, appreciative 10%) and ensure response tone remains consistently neutral-professional regardless of input sentiment.
**Pairing strategy:** For each synthetic inquiry, generate 2-3 candidate responses, then select the single best match based on quality criteria (detail in Section 3).
### Edge Case & Failure Scenario Generation
**Policy boundary testing:** "Generate inquiries that test refund windows (just inside vs. just outside 30 days), subscription cancellation timing, warranty coverage edge cases."
**Incomplete information scenarios:** "Generate inquiries missing critical details (no order number, vague product description) and responses that request clarification without stalling resolution."
**Escalation judgment:** "Generate scenarios where first-line support cannot resolve the issue and must escalate, ensuring response explains why and sets appropriate expectations."
**Hallucination traps:** Intentionally generate inquiries about policies, timelines, or procedures NOT documented in seed data to train the model to hedge appropriately ("I'll need to verify that with my manager" vs. inventing policy).
**Quality risk flag:** Edge case generation may produce scenarios too obscure for reliable response synthesis without domain expertise. Expect 20-30% of adversarial generations to require manual review.
---
## 3. FILTERING & QUALITY CONTROL
### Rule-Based Filters
**Length violation:** Reject responses <20 words or >150 words. Reject responses with >7 sentences.
**Hallucination indicators:** Flag responses containing specific monetary amounts not in the input, policy references not found in style guide (regex match against approved policy terms), or promises requiring authorization ("I'll refund you double," "We'll ship overnight for free").
**Tone violations:** Reject responses containing:
- Sales language ("upgrade to premium," "take advantage of")
- Overly casual language ("no worries," "yeah," "tbh")
- Excessive apologies (>2 apology phrases per response)
- Deflection without resolution ("we'll look into it," "thanks for your patience" without action)
**Structural completeness:** Require responses to include acknowledgment + action/resolution + closing. Flag responses missing any component.
**Confidence: 8/10** — Regex-based filters are reliable for explicit violations. Concern about false negatives on subtle tone drift.
### Model-Based Quality Checks
**Grounding verification:** Use a secondary LLM call to score whether response information is derivable from the input ticket + style guide. Prompt: "On a scale of 1-10, rate whether this response contains information not present in the customer inquiry or style guide. Explain any score <8." Reject scores <7.
**Helpfulness scoring:** Prompt reviewer model: "Does this response clearly resolve the customer's issue or provide actionable next steps? Rate 1-10 and explain." Reject scores <7.
**Conciseness check:** "Could this response achieve the same outcome in fewer words? If yes, provide a shorter version." If reviewer model produces version >20% shorter with no quality loss, reject original.
**Tone consistency:** Batch-process 50 responses at a time and prompt: "Rate the tone consistency of these support responses on a 1-10 scale. Flag any outliers." Manually review flagged outliers.
**Confidence: 6/10** — Model-based scoring introduces consistency risk; reviewer model may have different bias than production model. Recommend A/B testing reviewer models and human validation of 10% of scored samples.
### Human Review Triggers
**Mandatory review:**
- All hard-difficulty samples (15% of dataset)
- Samples flagged by ≥2 model-based checks
- Samples containing policy references
- Samples handling refund amounts >$100 or escalations
**Spot-check review:**
- 5% random sample across easy/medium tiers
- 10% of edge case generations
- All samples where generation prompt included "policy ambiguity" or "exception"
**Review criteria:** Human reviewers score 1-5 on accuracy, compliance, helpfulness, tone. Reject scores <4. Track inter-rater reliability; retrain reviewers if agreement <85%.
**Confidence: 9/10** — Human review boundaries are clear and conservative; resource-intensive but necessary for policy compliance.
### Rejection Criteria
**Automatic rejection:**
- Fails any rule-based filter
- Grounding score <7
- Helpfulness score <7
- Contains unauthorized promises or invented policy
**Manual rejection:**
- Human reviewer scores <4/5
- Response creates new customer confusion
- Response escalates negative sentiment unnecessarily
- Ambiguous policy interpretation without appropriate hedging
**Blind spot acknowledgment:** Filtering may fail to catch:
- Subtle style drift accumulating across samples
- Culturally insensitive phrasing that is technically neutral
- Responses that are factually correct but strategically poor (e.g., over-sharing company limitations)
- Edge cases where multiple conflicting policies apply but no clear guidance exists in seed data
---
## 4. DATASET ASSEMBLY PLAN
### Synthetic/Real Data Mixing
**Phase 1 (Initial training):**
- 100% seed data (150 tickets, 40 responses)
- Purpose: Establish baseline performance and style anchoring
**Phase 2 (Synthetic integration):**
- 40% seed data, 60% synthetic (filtered)
- Ratio: ~100 seed samples, 150 synthetic samples
- Purpose: Scale data volume while maintaining style fidelity
**Phase 3 (Expansion):**
- 20% seed data, 80% synthetic
- Ratio: ~50 seed samples, 200 synthetic samples
- Purpose: Increase diversity and edge case coverage without seed data overfitting
**Validation set:** Hold out 30 seed samples (20%) + 50 synthetic samples (10% of generated) for evaluation. Never use for training.
**Confidence: 7/10** — Ratios are based on general best practices; optimal mix depends on seed data quality and synthetic generation consistency, which requires empirical testing.
### Versioning & Experiment Tracking
**Dataset versioning:**
- v1.0: Seed data only (190 samples)
- v1.1: +Phase 2 synthetic (250 samples)
- v1.2: +Phase 3 synthetic (450 samples)
- Document: Generation date, prompt templates used, filter pass rates, human review statistics
**Metadata per sample:**
- Source: seed / synthetic-template / synthetic-compositional / synthetic-adversarial
- Difficulty tier: easy / medium / hard
- Domain tag(s): account, billing, product, shipping, etc.
- Filter scores: grounding, helpfulness, conciseness, tone (if applicable)
- Human review status: pass / fail / not reviewed
**Experiment tracking:**
- Model checkpoints trained on each dataset version
- Evaluation metrics: BLEU/ROUGE (as baseline), grounding accuracy, policy compliance rate, human preference scores
- A/B test results comparing seed-only vs. mixed models on live holdout set
### Deduplication Strategy
**Input-level deduplication:**
- Fuzzy match on customer inquiries (>85% token overlap) → reject duplicates
- Hash-based exact duplicate removal
**Output-level deduplication:**
- Cluster responses by semantic similarity (embed + cosine similarity >0.95) → retain only most diverse exemplar per cluster
- Specific concern: Template-based generation may produce repetitive phrasing ("I've initiated a refund..."). Cap phrase reuse at 5% of dataset.
**Cross-version deduplication:** When creating v1.2, deduplicate against v1.0 and v1.1 to prevent redundancy.
### Assumptions
1. Seed data is representative of production distribution across domains and difficulty
2. Style guide is comprehensive enough to adjudicate ambiguous policy scenarios
3. Synthetic generation does not introduce systematic biases not present in seed data (requires monitoring)
4. Human reviewers have sufficient domain expertise to evaluate policy compliance
5. Model used for synthetic generation has sufficient instruction-following capability to maintain tone constraints
6. Production customer sentiment distribution roughly matches training data distribution (40% neutral, 30% frustrated, 20% confused, 10% appreciative)
**Confidence in assumptions: 6/10** — Assumptions 1, 4, and 6 are weakly validated given limited seed data. Recommend production monitoring to detect distribution shift.
### Gaps (Requiring External Resources)
**Data gaps:**
- No examples of multilingual support inquiries (if needed)
- No examples of accessibility accommodations in responses (if needed)
- Limited seed data on certain domains (feature requests: 3 samples, technical errors: 5 samples) — may need targeted collection
**Tool gaps:**
- No automated policy knowledge base for grounding verification — currently relying on style guide only
- No sentiment analysis tool for customer input classification — relying on manual tagging
- No production feedback loop to measure deployed model performance vs. training data — requires instrumentation
**Validation gaps:**
- No external benchmark for customer support response quality (industry-standard CSAT correlation)
- No adversarial testing suite for policy hallucination specifically — requires domain expert development
- No longitudinal study of style drift over multiple training iterations — requires tracking across versions
**Process gaps:**
- Human reviewer training protocol not defined — requires SOP development
- No defined threshold for when to stop synthetic generation and collect more seed data (filter pass rate <X%?)
- No plan for incorporating production corrections/feedback into training loop — requires MLOps integration
---
## FINAL VERIFICATION
✓ **All four inputs explicitly addressed:**
- TASK: Section 1 defines target behavior; all sections align to accurate, policy-compliant, concise responses
- SEED_DATA: Sections 1, 2, 4 reference 150 tickets, 40 responses, style guide throughout
- QUALITY_CRITERIA: Section 3 operationalizes correctness, helpfulness, grounding, anti-hallucination
- RISKS: Sections 2-3 include hallucination prevention, verbosity controls, tone consistency measures
✓ **Confidence ratings included where uncertainty exists:**
- Section 1: 8/10 (difficulty distribution), 9/10 (scope)
- Section 2: 7/10 (difficulty control)
- Section 3: 8/10 (rule-based), 6/10 (model-based), 9/10 (human review)
- Section 4: 7/10 (mixing ratios), 6/10 (assumptions)
✓ **Edge cases covered:**
- Section 2: Adversarial generation, policy boundary testing, incomplete information, escalation judgment, hallucination traps
- Section 3: Policy reference triggers, high-value scenarios, blind spot acknowledgment
✓ **No unlisted tools/data sources assumed:**
- Section 4 Gaps explicitly documents missing tools (policy knowledge base, sentiment analysis, production feedback, benchmark)
✓ **Gaps clearly documented:**
- Section 4 Gaps lists data, tool, validation, and process gaps requiring external resources
By purchasing this prompt, you agree to our terms of service
CLAUDE-4-5-SONNET
Design a high-signal, low-noise synthetic data generation plan for LLM training and fine-tuning.
This prompt guides the model to define clear data scope, generation strategies, filtering pipelines, and dataset assembly decisions—while explicitly managing uncertainty, edge cases, and known risks.
Built for practitioners who need reliable datasets, not just more data.
...more
Added over 1 month ago
