Ahmed Hassan·
Lambda pipeline with an Airflow DAG and a Great Expectations gate blocking bad data from gold
Designs end-to-end data pipelines with ingestion, processing, transformation, and serving layers optimized for batch, streaming, or lambda architectures.
Data Pipeline Architecture Design
You are a Data Engineering Architect who built pipelines processing petabytes at Netflix and Spotify. Design a comprehensive data pipeline architecture.
**Data Sources**: {{data_sources}} (databases, APIs, event streams, files, IoT devices and their formats/volumes)
**Processing Pattern**: {{processing_pattern}} (batch, streaming, lambda/kappa, micro-batch)
**Latency Requirements**: {{latency_requirements}} (real-time <1s, near-real-time <1min, hourly, daily)
**Destination Systems**: {{destination_systems}} (data warehouse, data lake, BI tools, ML platform, reverse ETL)
Design the following layers:
1. **Ingestion Layer** - CDC (Debezium), API polling, webhook receivers, file watchers, streaming (Kafka/Kinesis/PubSub)
2. **Message Bus** - Kafka partitioning strategy, schema registry (Confluent/Apicurio), topic naming conventions
3. **Stream Processing** - Flink, Spark Streaming, Kafka Streams, ksqlDB comparison with job topology for each use case
4. **Batch Processing** - Spark, dbt, Airflow DAGs, Databricks, EMR with cluster sizing and spot instance strategy
5. **Transformation Layer** - medallion architecture (bronze/silver/gold), dbt models, data quality (Great Expectations/Soda)
6. **Storage Layer** - Data lake (S3 + Delta Lake/Iceberg/Hudi), data warehouse (Snowflake/BigQuery/Redshift), hot/warm/cold tiers
7. **Orchestration** - Airflow, Prefect, Dagster, AWS Step Functions comparison with DAG structure
8. **Data Quality** - Schema enforcement, data validation rules, anomaly detection, data lineage (OpenLineage/Marquez)
9. **Serving Layer** - Materialized views, feature store (Feast/Tecton), reverse ETL (Census/Hightouch), query optimization
10. **Governance** - Data catalog (Amundsen/DataHub), PII detection and masking, access control (RBAC/ABAC), retention policies
11. **Monitoring** - Pipeline health dashboards, data freshness SLAs, schema drift detection, alert thresholds
12. **Cost Optimization** - Storage lifecycle policies, compute auto-scaling, partitioning strategy, query optimization
Include actual Apache Airflow DAG pseudocode and Spark job structure for the core pipeline.
Ergebnisse
Lambda-architecture pipeline for **clickstream + order events** (Kafka source, near-real-time dashboards + daily warehouse loads). Medallion layout on S3 with Iceberg.
**Topology**
```
events ─▶ Kafka ─┬─▶ Flink (streaming) ─▶ Iceberg silver ─▶ live dashboard
└─▶ S3 raw (bronze) ─▶ Airflow ─▶ dbt (gold) ─▶ Snowflake
```
```python
# airflow/dags/daily_gold.py — bronze → silver → gold, idempotent by ds
with DAG("daily_gold", schedule="0 2 * * *", catchup=False) as dag:
ingest = SparkSubmitOperator(
task_id="bronze_to_silver",
application="jobs/dedupe_and_cast.py",
conf={"spark.sql.shuffle.partitions": "200"},
)
quality = GreatExpectationsOperator(
task_id="validate_silver", expectation_suite_name="orders.silver")
transform = DbtRunOperator(task_id="dbt_gold", select="marts.orders")
ingest >> quality >> transform # quality gate blocks bad data from gold
```
**Stream side:** Flink (not Spark Streaming) for true event-time windows and exactly-once sinks. Bronze is partitioned `dt=YYYY-MM-DD/hour=HH`; Iceberg handles compaction so small files don't pile up. **Data quality** is a hard gate — Great Expectations failing means gold never refreshes and PagerDuty fires. **Cost:** lifecycle policy moves bronze to Glacier after 30 days; Snowflake warehouse auto-suspends after 60s idle. Lineage is emitted to OpenLineage so a schema drift in the source is traceable to every downstream model.
Modell: Claude Opus 4
31 Likes11 SavesScore: 21
2 Kommentare
Ryan Mitchell·
Tried a variant of this and it caught a bug I'd been chasing for days.
Tobias Keller·
Solid. The typing is tight and it actually compiles, unlike half the snippets I find.