Prompt Details
Model: Chat (gpt-5.2)
Token size: 301
Example input
[DATA_SOURCE]: AWS S3 (JSON remittance logs, hourly) + Kafka (CDC from PostgreSQL)
[TARGET_SYSTEM]: Snowflake finance-grade analytics warehouse
[KEY_TRANSFORMS]: Currency normalization, sanctions enrichment, PII tokenization, deduplication, SCD Type 2
[SLA_AND_QA]: 99.5% availability, <10-min latency, 0.1% error tolerance, automated validation, 24-hr SLA
Example output
Below is an implementation-grade specification aligned to 99.5% monthly availability, <10-minute settlement latency, ≤0.1% reconciliation error tolerance, automated Great Expectations validation, and 24-hour incident resolution SLA.
---
1) YAML Pipeline Specification
pipeline:
  name: remittance_analytics_pipeline
  version: 1.0
  owner: data_platform
  availability_target: 99.5_percent_monthly
  settlement_latency_sla: "<10_minutes"
  reconciliation_error_tolerance: "0.1_percent"
sources:
  s3_batch:
    type: aws_s3
    format: json
    schedule: hourly
    path: s3://remittance-logs/YYYY/MM/DD/HH/
    schema_registry: internal_glue_catalog
    expected_volume_per_hour: 50k-200k_records
  kafka_cdc:
    type: kafka
    topic: payments.cdc.postgres
    connector: debezium_postgres
    delivery_semantics: at_least_once
    expected_tps: 200-500
staging_layer:
  storage: snowflake_raw
  tables:
    - raw_remittance_logs
    - raw_payments_cdc
  ingestion_mode:
    s3_batch: snowpipe_auto_ingest
    kafka_cdc: snowflake_kafka_connector
canonical_model:
  dimensions:
    dim_customer:
      surrogate_key: customer_sk
      natural_key: customer_id
      scd_type: 2
      effective_from: timestamp
      effective_to: timestamp
      current_flag: boolean
    dim_currency:
      surrogate_key: currency_sk
      natural_key: currency_code
    dim_sanctions_status:
      surrogate_key: sanctions_sk
      attributes: [match_score, list_name, screened_at]
  facts:
    fact_transaction:
      grain: one_row_per_unique_transaction_event
      composite_dedupe_key:
        - transaction_id
        - event_type
        - event_timestamp
      measures:
        - amount_original
        - amount_usd
        - fx_rate_applied
        - fee_amount
      fks:
        - customer_sk
        - currency_sk
        - sanctions_sk
transformations:
  currency_normalization:
    priority_order:
      - use_fx_rate_from_transaction_if_present
      - else_lookup_internal_daily_fx_table
      - else_flag_for_manual_review
    rule: "amount_usd = amount_original * fx_rate"
    rounding: 2_decimal_bankers_rounding
  pii_tokenization:
    method: sha256_with_salt
    fields: [customer_name, email, phone, id_number]
    token_store: vault_hsm_backed
  sanctions_enrichment:
    provider: ofac_un_consolidated_list
    match_logic: fuzzy_name_match_threshold_0.85
  duplicate_suppression:
    strategy: merge_on_composite_key
data_quality:
  framework: great_expectations
  checks:
    - non_null: [transaction_id, amount_original, currency_code]
    - range_check: amount_original > 0
    - uniqueness: composite_dedupe_key
    - fx_rate_reasonable_range: 0.0001 - 1000
cdc_strategy:
  snapshot_mode: initial_snapshot_then_incremental
  watermark_column: updated_at
backfill:
  strategy: time_range_replay_from_s3_archives
  idempotent_merge: true
schema_evolution:
  approach: additive_only_non_breaking
  contract_tests: required_before_deploy
  backward_compatibility_window: 30_days
idempotency:
  guarantee: merge_into_with_deterministic_hash_key
  hash_key: sha256(transaction_id||event_type||event_timestamp)
observability:
  metrics:
    - ingestion_lag_seconds
    - duplicate_rate_percent
    - fx_missing_rate_percent
    - dq_failure_rate
  alerting: pagerduty_on_threshold_breach
security:
  encryption:
    at_rest: snowflake_aes_256
    in_transit: tls_1_2_plus
  access_control: role_based_access_control
  pii_access: masked_views
retention_policy:
  raw: 90_days
  curated: 7_years_finance_requirement
fault_tolerance:
  retry_policy: exponential_backoff_5_retries
  dlq_topic: payments_dead_letter
  replay_supported: true
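The currency_normalization priority order and rounding rule in the spec can be sketched in Python as follows. This is a minimal illustration, not part of the generated specification: the function name normalize_to_usd, the DAILY_FX_TO_USD lookup, and the rates in it are hypothetical placeholders, not real market rates.

```python
from decimal import Decimal, ROUND_HALF_EVEN
from typing import Optional, Tuple

# Hypothetical stand-in for the internal daily FX table (currency -> USD rate).
# The rates are illustrative placeholders only.
DAILY_FX_TO_USD = {"NGN": Decimal("0.00065"), "EUR": Decimal("1.08")}


def normalize_to_usd(
    amount_original: Decimal,
    currency_code: str,
    txn_fx_rate: Optional[Decimal] = None,
) -> Tuple[Optional[Decimal], Optional[Decimal], bool]:
    """Return (amount_usd, fx_rate_applied, needs_manual_review)."""
    # 1) Prefer the FX rate carried on the transaction itself.
    rate = txn_fx_rate
    # 2) Otherwise fall back to the daily lookup table.
    if rate is None:
        rate = DAILY_FX_TO_USD.get(currency_code)
    # 3) No rate available: flag for manual review instead of guessing.
    if rate is None:
        return None, None, True
    # amount_usd = amount_original * fx_rate, rounded half-to-even to 2 decimals.
    amount_usd = (amount_original * rate).quantize(
        Decimal("0.01"), rounding=ROUND_HALF_EVEN
    )
    return amount_usd, rate, False
```

Using Decimal rather than float keeps the half-to-even (banker's) rounding exact, which matters when amounts are reconciled against a 0.1% tolerance.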
---
2) CSV-Ready ETL Mapping Table (Sample Rows)
source_system,source_field,target_table,target_field,transformation_rule,example_source,example_target
S3_JSON,txn_id,fact_transaction,transaction_id,Direct Map,"TX123","TX123"
S3_JSON,amount,fact_transaction,amount_original,"Cast Decimal(18,4)","1500.50","1500.50"
S3_JSON,currency,fact_transaction,currency_sk,Lookup dim_currency,"NGN","23"
S3_JSON,amount,fact_transaction,amount_usd,"amount_original * fx_rate","1500.50 NGN","1.92"
Kafka_CDC,customer_email,dim_customer,email_token,"SHA256(email+salt)","user@email.com","A94A8FE5..."
Kafka_CDC,updated_at,dim_customer,effective_from,Direct Map,"2026-02-28T09:00:00Z","2026-02-28T09:00:00Z"
Kafka_CDC,name,dim_sanctions_status,match_score,FuzzyMatch(),"Ali Musa","0.91"
S3_JSON,txn_id,fact_transaction,hash_key,"SHA256(txn_id||event_type||event_ts)","TX123|SETTLED|ts","8C9F..."
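The SHA256(email+salt) rule in the mapping table could look like the sketch below. The function name tokenize is hypothetical, and the trim and lowercase normalization before hashing is an assumption added so that equivalent inputs map to the same token; the spec does not state it.

```python
import hashlib


def tokenize(value: str, salt: str) -> str:
    """Salted SHA-256 of a PII value, returned as uppercase hex."""
    # Assumption: normalize whitespace and case so " User@Email.com " and
    # "user@email.com" produce the same token.
    normalized = value.strip().lower()
    digest = hashlib.sha256((normalized + salt).encode("utf-8"))
    return digest.hexdigest().upper()
```

A salted hash is one-way, but low-entropy fields such as phone numbers can be brute-forced if the salt leaks. A keyed HMAC with the key held in the vault is a common alternative worth evaluating.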
---
3) Runnable Runbook
A. Snowflake Table DDL (Canonical Layer)
CREATE TABLE dim_customer (
    customer_sk NUMBER AUTOINCREMENT,
    customer_id STRING,
    email_token STRING,
    effective_from TIMESTAMP,
    effective_to TIMESTAMP,
    current_flag BOOLEAN
);
CREATE TABLE fact_transaction (
    transaction_id STRING,
    event_type STRING,
    event_timestamp TIMESTAMP,
    amount_original NUMBER(18,4),
    amount_usd NUMBER(18,4),
    fx_rate_applied NUMBER(18,6),
    fee_amount NUMBER(18,4),
    hash_key STRING,
    customer_sk NUMBER,
    currency_sk NUMBER,
    sanctions_sk NUMBER
);
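The SCD Type 2 behavior implied by effective_from, effective_to, and current_flag on dim_customer can be sketched in memory as below. This is an illustration under stated assumptions, not the warehouse implementation: CustomerRow and apply_scd2 are hypothetical names, email_token is the only tracked attribute, and late or replayed events are simply skipped.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class CustomerRow:
    customer_sk: int
    customer_id: str
    email_token: str
    effective_from: datetime
    effective_to: Optional[datetime]
    current_flag: bool


def apply_scd2(rows: List[CustomerRow], customer_id: str,
               email_token: str, changed_at: datetime) -> None:
    """Close the current version if the attribute changed, then append a new one."""
    current = next(
        (r for r in rows if r.customer_id == customer_id and r.current_flag), None
    )
    if current is not None:
        if current.email_token == email_token:
            return  # no attribute change, nothing to version
        if changed_at <= current.effective_from:
            return  # late or replayed event: leave the current row untouched
        # Expire the old version at the moment the new one takes effect.
        current.effective_to = changed_at
        current.current_flag = False
    rows.append(CustomerRow(
        customer_sk=len(rows) + 1,  # stand-in for AUTOINCREMENT
        customer_id=customer_id,
        email_token=email_token,
        effective_from=changed_at,
        effective_to=None,
        current_flag=True,
    ))
```

Skipping out-of-order events keeps a backfill from overriding current_flag, at the cost of not inserting historical versions retroactively. A production merge would need an explicit policy for that case.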
B. Idempotent MERGE (Duplicate Suppression)
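-- Assumes staging_transactions is already deduplicated on hash_key and uses the same column names as fact_transaction.
MERGE INTO fact_transaction tgt
USING staging_transactions src
    ON tgt.hash_key = src.hash_key
WHEN NOT MATCHED THEN
    INSERT (transaction_id, event_type, event_timestamp, amount_original, amount_usd, fx_rate_applied, hash_key, customer_sk)
    VALUES (src.transaction_id, src.event_type, src.event_timestamp, src.amount_original, src.amount_usd, src.fx_rate_applied, src.hash_key, src.customer_sk);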
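The insert-only merge on a deterministic hash key can be simulated in Python to show why replays are idempotent. This is a sketch only: hash_key and merge_insert_only are hypothetical helpers, a dict stands in for fact_transaction, and the "|" delimiter follows the mapping table's example ("TX123|SETTLED|ts") rather than a confirmed convention.

```python
import hashlib
from typing import Dict, List


def hash_key(transaction_id: str, event_type: str, event_timestamp: str) -> str:
    """Deterministic SHA-256 over the composite dedupe key."""
    # Assumption: a "|" delimiter, so ("AB", "C") and ("A", "BC") cannot collide.
    joined = "|".join((transaction_id, event_type, event_timestamp))
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def merge_insert_only(target: Dict[str, dict], staging: List[dict]) -> int:
    """Insert staging rows whose hash_key is absent from target; return the insert count."""
    inserted = 0
    for row in staging:
        key = hash_key(row["transaction_id"], row["event_type"], row["event_timestamp"])
        if key not in target:
            target[key] = {**row, "hash_key": key}
            inserted += 1
    return inserted
```

Replaying the same batch inserts nothing the second time. Note that this loop also suppresses duplicates within a single batch, whereas a SQL MERGE inserts every unmatched source row, so the staging data should be deduplicated before the merge runs.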
C. Great Expectations CLI
great_expectations checkpoint run remittance_checkpoint
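For reference, the four checks listed under data_quality can be expressed in plain Python, independent of Great Expectations. This sketch does not use the Great Expectations API; run_checks and the row layout (dicts with Decimal amounts) are assumptions for illustration.

```python
from decimal import Decimal
from typing import List, Tuple

FX_MIN, FX_MAX = Decimal("0.0001"), Decimal("1000")


def run_checks(rows: List[dict]) -> List[Tuple[int, str]]:
    """Return (row_index, failed_check) for every violated rule."""
    failures = []
    seen = set()
    for i, row in enumerate(rows):
        # non_null: transaction_id, amount_original, currency_code
        for field in ("transaction_id", "amount_original", "currency_code"):
            if row.get(field) is None:
                failures.append((i, f"non_null:{field}"))
        # range_check: amount_original > 0
        amount = row.get("amount_original")
        if amount is not None and not amount > 0:
            failures.append((i, "range_check:amount_original"))
        # fx_rate_reasonable_range: 0.0001 - 1000
        rate = row.get("fx_rate_applied")
        if rate is not None and not (FX_MIN <= rate <= FX_MAX):
            failures.append((i, "fx_rate_reasonable_range"))
        # uniqueness: composite_dedupe_key
        key = (row.get("transaction_id"), row.get("event_type"), row.get("event_timestamp"))
        if key in seen:
            failures.append((i, "uniqueness:composite_dedupe_key"))
        seen.add(key)
    return failures
```

Dividing the number of distinct failing rows by the batch size gives a figure comparable to the dq_failure_rate metric.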
D. Backfill Replay
aws s3 cp s3://remittance-logs/2026/01/01/ ./ --recursive
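# Assumes raw_stage is a Snowflake stage holding the replayed files and raw_remittance_logs has a single VARIANT column
snowsql -q "COPY INTO raw_remittance_logs FROM @raw_stage PATTERN = '.*2026/01/01.*' FILE_FORMAT = (TYPE = 'JSON');"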
E. Performance Benchmark Target
500 TPS sustained streaming ingest
<2 min raw-to-curated transform
≤0.05% duplicate rate
DQ runtime <5 min per batch
---
4) Executive Summary
Architecture Summary
Hybrid batch (S3 hourly) + streaming (Kafka CDC) pipeline into Snowflake using deterministic merge logic and canonical dimensional modeling.
Estimated Timeline
Design & Contracts: 3 weeks
Build & Integrations: 6–8 weeks
Hardening & Compliance Validation: 4 weeks
Total: ~13–15 weeks
Estimated Monthly Cost (Rough Order)
Snowflake: $8k–$20k (compute + storage, mid-volume)
Kafka infra: $3k–$8k
S3 + Data transfer: $1k–$3k
Sanctions API: $2k–$10k depending on provider
Total: $14k–$41k/month
Commercial Risks
FX rate dependency inconsistencies
Sanctions false positives (operational overhead)
CDC schema drift breaking ingestion
Network constraints increasing lag
---
Acceptance Criteria
1. End-to-end latency <10 min P95
2. 99.5% uptime monthly
3. Reconciliation variance ≤0.1%
4. 100% PII tokenized before curated layer
5. DQ failure rate <1%
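Criteria 1 and 2 can be turned into concrete numbers with a short sketch. The helper names are hypothetical, a 30-day month is assumed, and the nearest-rank method is one of several P95 definitions; the source does not say which one applies.

```python
from typing import Sequence


def downtime_budget_minutes(availability: float, days_in_month: int = 30) -> float:
    """Minutes of allowed downtime per month for a given availability target."""
    return (1 - availability) * days_in_month * 24 * 60


def p95(values: Sequence[float]) -> float:
    """Nearest-rank 95th percentile."""
    if not values:
        raise ValueError("p95 needs at least one sample")
    ordered = sorted(values)
    # ceil(0.95 * n) computed with integers to avoid float rounding surprises.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def latency_sla_met(latencies_seconds: Sequence[float], limit_seconds: float = 600) -> bool:
    """True when P95 end-to-end latency is under the limit (10 minutes by default)."""
    return p95(latencies_seconds) < limit_seconds
```

Under these assumptions, 99.5% monthly availability leaves about 216 minutes (3.6 hours) of downtime in a 30-day month.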
---
Automated Tests
Unit
Hash key deterministic across runs
FX conversion correct to 2 decimal places
Tokenization irreversible
Integration
CDC replay produces no duplicates
Backfill merge does not override current_flag incorrectly
Sanctions enrichment logs match_score ≥ threshold
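The threshold test on match_score can be sketched with Python's standard difflib. The source does not name the matching algorithm, so SequenceMatcher is a stand-in, and match_score, screen, and the sample names are hypothetical.

```python
from difflib import SequenceMatcher
from typing import Iterable, Optional, Tuple

THRESHOLD = 0.85  # from match_logic: fuzzy_name_match_threshold_0.85


def match_score(name: str, listed_name: str) -> float:
    """Similarity in [0, 1] after lowercasing and collapsing whitespace."""
    a = " ".join(name.lower().split())
    b = " ".join(listed_name.lower().split())
    return SequenceMatcher(None, a, b).ratio()


def screen(name: str, sanctions_names: Iterable[str],
           threshold: float = THRESHOLD) -> Optional[Tuple[str, float]]:
    """Return (listed_name, score) for the best match at or above threshold, else None."""
    scored = ((listed, match_score(name, listed)) for listed in sanctions_names)
    best = max(scored, key=lambda pair: pair[1], default=None)
    if best is not None and best[1] >= threshold:
        return best
    return None
```

Character-level similarity of this kind is exactly what produces the false positives noted among the commercial risks, so a hit should route to review rather than block automatically.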
---
Security & Compliance Checklist
TLS enforced end-to-end
AES-256 encryption at rest
Role-based access + least privilege
Column masking for PII
Key rotation every 90 days
Audit logs immutable (7 years retention)
Data residency confirmed
---
Operations Implementation Checklist
[ ] Create Snowflake roles & warehouses
[ ] Configure Kafka connector with DLQ
[ ] Deploy Great Expectations suites
[ ] Configure PagerDuty alerts
[ ] Run synthetic load test
[ ] Document replay playbook
[ ] Validate reconciliation report
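The retry and DLQ behavior behind the connector checklist item (exponential_backoff_5_retries, then dead-letter) can be sketched as follows. The function name, the 0.5-second base delay, and the use of a plain list in place of the payments_dead_letter topic are assumptions for illustration.

```python
import time
from typing import Any, Callable, List


def process_with_retry(
    record: Any,
    handler: Callable[[Any], Any],
    dead_letters: List[dict],
    max_retries: int = 5,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    """Call handler(record); retry with exponential backoff, then dead-letter."""
    for attempt in range(max_retries + 1):  # 1 initial attempt + max_retries retries
        try:
            return handler(record)
        except Exception as exc:
            if attempt == max_retries:
                # Retries exhausted: park the record for later replay.
                dead_letters.append({"record": record, "error": str(exc)})
                return None
            # Delay doubles each time: base, 2x base, 4x base, ...
            sleep(base_delay * (2 ** attempt))
```

Passing sleep as a parameter lets a synthetic load test run without real waiting. Records that land in the dead-letter store remain available for the replay playbook.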
---
Realistic Limitations
99.5% availability excludes upstream outages
Fuzzy sanctions matching can create false positives
Near-real-time latency dependent on Snowflake warehouse sizing
Backfill reprocessing may temporarily breach SLA
---
If desired, I can next produce:
Full ER diagram (logical + physical)
Detailed SCD Type 2 merge SQL
Detailed CDC event envelope schema
Cost-optimized variant for constrained African network environments
Terraform IaC blueprint
Built for engineers and product teams, this prompt produces an implementation-ready data pipeline specification that shortens design cycles and reduces integration risk.
It tackles ambiguous mappings, schema drift, brittle backfills, and compliance gaps by generating ER/field-level mappings, deterministic transforms, CDC/backfill strategies, and repair playbooks.
Ideal for migrations, analytics platforms, regulatory compliance projects, and vendor evaluations where reproducible specs, contract..
Added over 1 month ago
