Prompt Details
Model
(claude-4-7-opus)
Token size
479
Example input
[SOURCE_DATA]: MongoDB replica set (Users and Orders collections)
[DESTINATION_WAREHOUSE]: Snowflake
[TRANSFORMATION_LOGIC]: Anonymize PII, flatten JSON arrays, join users with orders, calculate daily sales.
[ORCHESTRATION_TOOL]: Apache Airflow
Example output
# Data Pipeline Architecture Blueprint
## PHASE 1: PIPELINE ARCHITECTURE & STRATEGY
**Pattern:** ELT (Extract, Load, Transform). We will extract raw JSON from MongoDB, load it directly into Snowflake Variant columns, and use Snowflake's massive compute to perform the transformations.
**Extraction Strategy:** Incremental Load. We will use an updated_at timestamp cursor (watermark) in MongoDB to only pull documents created or modified since the last successful pipeline run.
## PHASE 2: DATA MODELING & LAYERS
We will adopt a Medallion Architecture approach within Snowflake:
- **Bronze Layer (Raw):** Stores the exact JSON documents extracted from MongoDB in a single VARIANT column, plus an ingestion_timestamp.
- **Silver Layer (Cleaned):** A Snowflake task parses the JSON. Here, we apply the requested logic: We anonymize PII by hashing the email_address and phone_number fields using SHA-256. We flatten nested arrays (like order_items) into tabular format.
- **Gold Layer (Business):** We join the cleaned Users table with the Orders table to create an aggregated daily_sales_by_user materialized view.
## PHASE 3: ORCHESTRATION & EXECUTION
**Orchestration Tool:** Apache Airflow.
**Conceptual DAG (Directed Acyclic Graph):**
1. `check_mongo_connection`: Sensor to ensure source availability.
2. `get_last_watermark`: Fetches the max updated_at from Snowflake Bronze.
3. `extract_and_load_to_s3`: Uses a MongoToS3Operator to dump the incremental JSON batch.
4. `copy_s3_to_snowflake`: Uses S3ToSnowflakeOperator (COPY INTO command) to load the Bronze layer.
5. `run_silver_transform`: Executes Snowflake SQL to flatten JSON and hash PII.
6. `run_gold_transform`: Executes the aggregation queries.
## PHASE 4: DATA QUALITY & FAILURE HANDLING
**Data Quality Tests:**
- Uniqueness Check: Ensure order_id remains completely unique in the Silver layer after flattening.
- Freshness Check: Alert if the max ingestion_timestamp in the Gold layer is older than 26 hours.
**Failure Handling & Idempotency:**
Airflow tasks will use execution_date templating. If the pipeline fails on Tuesday, rerunning Tuesday's DAG will explicitly filter for Tuesday's data and use a MERGE (Upsert) statement in Snowflake based on the document _id, ensuring zero data duplication regardless of how many times the job is retried.
By purchasing this prompt, you agree to our terms of service
CLAUDE-4-7-OPUS
Stop building fragile data scripts that break silently. This prompt acts as a Principal Data Engineer, designing robust ETL/ELT pipelines for your data stack. It defines extraction strategies (incremental vs full), transformation logic, staging layers, and orchestration DAGs for tools like Airflow or dbt. Perfect for modernizing data architectures with idempotency, backfilling, and strict data quality checks built-in.
...more
Added 1 week ago
