Prompt Details
Model
Chat - (gpt-5.2)
Token size
598
Example input
[Documentation / Blog / Ecommerce / Knowledge Base]: Documentation
[Small / Medium / Large]: Medium (2,000–5,000 pages)
[Static / Weekly / Daily]: Weekly
[Structured Docs / Mixed Pages / Dynamic Content]: Structured Docs
[Chatbot / Internal Search / Customer Support]: Customer Support Chatbot
Example output
1️⃣ Website Crawling Strategy
Use sitemap.xml as the primary discovery source
Secondary discovery through internal link traversal
Crawl only documentation directories (example: /docs/, /guides/)
Exclude login pages, marketing pages, and duplicate URLs
Crawl depth rule: maximum 3 internal link levels from root documentation page
Use change detection through Last-Modified headers or page hash comparison
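The change-detection step above can be sketched with a simple content hash. This is a minimal sketch: `hash_store` is a hypothetical in-memory stand-in for whatever database tracks previously crawled pages.

```python
import hashlib

def page_changed(url: str, new_html: str, hash_store: dict) -> bool:
    """Return True if the page differs from the last crawled version.

    hash_store maps URL -> SHA-256 hex digest of the last seen content
    (a hypothetical stand-in for a real persistence layer).
    """
    new_hash = hashlib.sha256(new_html.encode("utf-8")).hexdigest()
    if hash_store.get(url) == new_hash:
        return False  # unchanged: skip re-processing entirely
    hash_store[url] = new_hash
    return True
```

In production this check would run after comparing Last-Modified headers, so unchanged pages are skipped before they are even fetched.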
---
2️⃣ Content Cleaning & Normalization
Remove navigation menus, sidebars, and footer content
Strip cookie banners, ads, and UI elements
Extract only article body content and headings
Convert HTML to clean markdown or structured JSON
Normalize whitespace, punctuation, and encoding
Standardize code blocks, tables, and bullet lists
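A minimal boilerplate stripper using only Python's standard library, assuming the target pages wrap chrome in semantic tags such as `<nav>`, `<aside>`, and `<footer>`; real sites usually need site-specific selectors on top of this.

```python
from html.parser import HTMLParser

# Tags whose entire subtree is boilerplate and should be dropped.
SKIP_TAGS = {"nav", "aside", "footer", "header", "script", "style"}

class BodyTextExtractor(HTMLParser):
    """Keep text that sits outside nav/sidebar/footer subtrees."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # >0 while inside a boilerplate subtree
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0 and data.strip():
            self.parts.append(data.strip())

def extract_body_text(html: str) -> str:
    parser = BodyTextExtractor()
    parser.feed(html)
    return " ".join(parser.parts)
```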
---
3️⃣ Document Structuring
Split pages based on heading hierarchy (H1 → H2 → H3)
Maintain parent-child relationship between sections
Store document structure in hierarchical format
Example structure: Page Title → Section → Subsection → Paragraph blocks
Metadata tags
Page URL
Section title
Document category
Last updated timestamp
Content language
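The heading-based split can be sketched as follows, assuming pages were already converted to markdown as described above; a real implementation would also need to handle tables and fenced code blocks.

```python
import re

def split_by_headings(markdown: str):
    """Split markdown into sections keyed by their heading path.

    Each section records its parent chain (H1 > H2 > H3), preserving
    the parent-child relationship between sections.
    """
    sections = []
    path = []   # current heading chain, e.g. ["Page Title", "Section"]
    body = []
    for line in markdown.splitlines():
        m = re.match(r"^(#{1,3})\s+(.*)", line)
        if m:
            if body:  # flush the section collected so far
                sections.append({"path": " > ".join(path),
                                 "text": "\n".join(body).strip()})
                body = []
            level = len(m.group(1))
            # Truncate the chain to the parent level, then append.
            path = path[: level - 1] + [m.group(2).strip()]
        else:
            body.append(line)
    if body:
        sections.append({"path": " > ".join(path),
                         "text": "\n".join(body).strip()})
    return sections
```

The heading path doubles as section-title metadata for each chunk later in the pipeline.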
---
4️⃣ Chunking Framework
Chunk size target: 400–600 tokens
Overlap strategy: 80–100 tokens overlap between chunks
Section-aware chunking so chunks never break semantic boundaries
Code snippets stored as separate chunks when possible
Large sections split while preserving heading context
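A sketch of the overlap chunker, using whitespace-separated words as a rough token proxy; a production pipeline would count tokens with the embedding model's own tokenizer.

```python
def chunk_section(heading: str, text: str,
                  max_tokens: int = 500, overlap: int = 90):
    """Split one section into overlapping chunks.

    The section heading is prepended to every chunk so that heading
    context survives the split. Words stand in for tokens here.
    """
    words = text.split()
    if not words:
        return []
    chunks = []
    step = max_tokens - overlap  # advance so consecutive chunks overlap
    for start in range(0, len(words), step):
        piece = words[start : start + max_tokens]
        chunks.append(heading + "\n" + " ".join(piece))
        if start + max_tokens >= len(words):
            break  # last chunk already covers the tail
    return chunks
```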
---
5️⃣ Embedding Preparation
Convert cleaned chunks to plain text format
Remove duplicate whitespace and unnecessary markup
Preserve headings inside chunk text for context
Embedding model considerations:
High semantic understanding
Balanced dimensionality (768–1536 dimensions)
Normalize vectors before indexing to improve similarity search
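The normalization step is worth showing explicitly: after L2 normalization, dot product and cosine similarity coincide, which is what makes similarity search cheaper.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length.

    On unit vectors, dot product equals cosine similarity, so the
    index can use the cheaper dot-product comparison.
    """
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        return list(vec)  # leave zero vectors untouched
    return [x / norm for x in vec]
```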
---
6️⃣ Vector Indexing Strategy
Namespace segmentation by document category
Example
product_docs
api_docs
troubleshooting_guides
Metadata filtering logic
Filter by product version
Filter by documentation category
Filter by recency (recent updates prioritized)
Freshness mechanism
Replace vectors for updated pages
Mark outdated vectors for deletion
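The namespace, metadata-filter, and freshness logic can be illustrated with a toy in-memory index. The method names (`upsert`, `delete`, `filter`) are illustrative, not any specific vector database's API.

```python
class NamespacedIndex:
    """Toy in-memory stand-in for a namespaced vector store."""

    def __init__(self):
        self.namespaces = {}  # namespace -> {page_url: record}

    def upsert(self, namespace, page_url, vector, metadata):
        # Records are keyed by URL, so re-indexing an updated page
        # automatically replaces its old vector (freshness mechanism).
        self.namespaces.setdefault(namespace, {})[page_url] = {
            "vector": vector,
            "metadata": metadata,
        }

    def delete(self, namespace, page_url):
        # Remove vectors for pages deleted from the site.
        self.namespaces.get(namespace, {}).pop(page_url, None)

    def filter(self, namespace, **conditions):
        """Return URLs whose metadata matches every given condition."""
        records = self.namespaces.get(namespace, {})
        return [
            url for url, rec in records.items()
            if all(rec["metadata"].get(k) == v for k, v in conditions.items())
        ]
```

Usage mirrors the strategy above: one namespace per document category (`product_docs`, `api_docs`, `troubleshooting_guides`), with version and category filters applied at query time.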
---
7️⃣ Retrieval Optimization
Query routing logic
Detect query intent: troubleshooting, feature explanation, or configuration steps
Context assembly rules
Retrieve Top-5 chunks initially
Group results by document section
Merge adjacent chunks from the same document
Relevance improvements
Hybrid retrieval: semantic vector search with keyword search fallback
Apply re-ranking layer to prioritize most relevant chunks
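A minimal sketch of the hybrid idea: semantic ranking first, keyword fallback when no retrieved chunk mentions a query term. `embed` stands in for whatever embedding model the pipeline uses, and a real system would add a proper re-ranking model on top.

```python
def hybrid_retrieve(query, chunks, embed, top_k=5):
    """Rank chunks by cosine similarity, with a keyword fallback.

    chunks is a list of (text, vector) pairs; embed is an assumed
    callable mapping text -> vector (any embedding model).
    """
    query_vec = embed(query)

    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = sum(x * x for x in a) ** 0.5
        nb = sum(x * x for x in b) ** 0.5
        return dot / (na * nb) if na and nb else 0.0

    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c[1]),
                    reverse=True)
    top = [text for text, _ in ranked[:top_k]]

    # Keyword fallback: if no semantic hit shares a term with the
    # query, append chunks that contain at least one query term.
    terms = set(query.lower().split())
    if not any(terms & set(t.lower().split()) for t in top):
        top += [t for t, _ in chunks
                if terms & set(t.lower().split())][:top_k]
    return top
```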
---
8️⃣ Update & Synchronization
Incremental crawling workflow
1. Weekly crawler scans sitemap
2. Detect new URLs
3. Compare content hash for changed pages
4. Process only modified pages
Content change detection
Hash comparison of cleaned text
Last updated metadata check
Re-indexing triggers
Page content change
New documentation pages
Deleted pages removed from vector store
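The weekly sync decision reduces to a set diff plus hash comparison. A sketch, where `stored_hashes` is a hypothetical persistence layer holding the hash recorded at last indexing:

```python
import hashlib

def plan_sync(sitemap_urls, fetched_text, stored_hashes):
    """Decide which pages to (re)index or remove on a weekly run.

    sitemap_urls:  URLs found in the current sitemap scan
    fetched_text:  {url: cleaned text} for those URLs
    stored_hashes: {url: sha256 of last-indexed text} (hypothetical)
    """
    current = set(sitemap_urls)
    to_index, to_delete = [], []
    for url in current:
        digest = hashlib.sha256(
            fetched_text[url].encode("utf-8")).hexdigest()
        if stored_hashes.get(url) != digest:
            to_index.append(url)   # new page or changed content
    for url in set(stored_hashes) - current:
        to_delete.append(url)      # page removed from the site
    return sorted(to_index), sorted(to_delete)
```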
---
9️⃣ Risk & Failure Analysis
Duplicate content risk
Same article accessible from multiple URLs
Mitigation
Canonical URL enforcement
Retrieval noise risk
Large generic chunks causing irrelevant retrieval
Mitigation
Section-aware chunking
Index drift risk
Old vectors remain after page updates
Mitigation
Versioned indexing with automatic replacement
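The canonical URL enforcement mentioned under the duplicate-content risk can start with plain normalization. A sketch using the standard library; real pipelines should also honor `<link rel="canonical">` tags when the page provides them.

```python
from urllib.parse import urlsplit, urlunsplit

def canonicalize(url: str) -> str:
    """Normalize a URL so the same article maps to one index key.

    Lowercases scheme and host, drops query string, fragment, and
    any trailing slash (a simplifying assumption; some sites treat
    query parameters as significant).
    """
    parts = urlsplit(url)
    path = parts.path.rstrip("/") or "/"
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, "", ""))
```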
---
🔟 Executive Pipeline Blueprint
Most critical pipeline stage
Document structuring and chunking
Poor structure leads to irrelevant retrieval
Biggest scalability risk
Large-scale crawling and re-indexing overhead
First optimization to implement
Incremental indexing with change detection
This reduces unnecessary embedding generation and indexing load.
---
Final Outcome
This pipeline converts structured documentation into:
Clean knowledge chunks
Embedded semantic vectors
Optimized retrieval context
Result: A reliable RAG-powered customer support chatbot that answers complex multi-step documentation queries accurately.
Many organizations want to turn their website content into an AI assistant or knowledge base, but raw web pages are not structured for retrieval.
This prompt designs a pipeline that crawls, cleans, chunks, embeds, and indexes website data for accurate AI retrieval.
Buyer Benefits
Website crawling structure
Data cleaning & normalization logic
Smart document chunking strategy
Vector embedding preparation
Retrieval optimization planning
Use this prompt before building a website AI assistant.
