CIOPages Insights · Enterprise Technology Operations Guide

Building Data Pipelines That Scale: ETL, ELT, and Streaming Architectures

Covers batch, micro-batch, and streaming pipeline architectures. Examines the shift from ETL to ELT with cloud data warehouses, change data capture patterns, and how to evaluate orchestration tools like Airflow, Prefect, and dbt.

CIOPages Editorial Team · 15 min read · April 1, 2025



$12.9M — average annual cost of poor data quality for enterprises, with data pipeline failures, duplicates, and integration errors among the leading contributors (Gartner, 2024)

Data pipelines are the circulatory system of the modern data-driven enterprise. They move data from where it is generated — operational databases, SaaS applications, IoT devices, event streams — to where it needs to be consumed: analytics platforms, machine learning systems, operational dashboards, and downstream applications. When pipelines work well, they are invisible infrastructure. When they fail — producing stale dashboards, incorrect reports, or corrupted analytical models — the business impact is immediate and costly.

The data pipeline landscape has undergone significant evolution. The traditional ETL (Extract, Transform, Load) pattern — transforming data before loading it into a warehouse — has been supplemented and in many cases supplanted by ELT (Extract, Load, Transform), which leverages the computational power of modern cloud data warehouses to perform transformations after loading raw data. Simultaneously, batch processing has been complemented by streaming architectures for use cases requiring sub-minute data freshness.

This guide provides the architectural framework for making pipeline design decisions that age well: when to use ETL vs. ELT, when batch is sufficient vs. when streaming is necessary, how to orchestrate complex pipeline graphs, and what the operational requirements of each approach look like at enterprise scale.

Explore data integration and pipeline vendors: Data Integration & ETL Directory →


ETL vs. ELT: The Paradigm Shift

The choice between ETL and ELT is not merely technical — it reflects a fundamental difference in where analytical compute power lives and who owns the transformation logic.

ETL (Extract, Transform, Load)

In the traditional ETL model, data is extracted from source systems, transformed by a dedicated ETL engine into the target format, and then loaded into the destination (typically a data warehouse or data mart).

ETL strengths:

  • Sensitive data can be masked, filtered, or redacted before reaching the analytical environment — valuable when the destination has broader access than the source
  • Transformation logic runs on dedicated ETL infrastructure, not consuming warehouse compute
  • Well-suited to destinations with limited transformation capabilities (legacy data marts, non-SQL targets)
  • Mature tooling with decades of enterprise adoption

ETL limitations:

  • Raw source data is not preserved in the destination — if the transformation logic is wrong, data must be reloaded from source
  • Transformation changes require re-processing historical data
  • ETL infrastructure becomes a bottleneck and operational dependency
  • Column-level lineage is harder to maintain through heavy transformation logic

ELT (Extract, Load, Transform)

In the ELT model, raw data is extracted from sources and loaded directly into a cloud data warehouse or lake, preserving it in its original form. Transformations are then performed within the warehouse using SQL.

ELT strengths:

  • Raw data is always available — transformation errors can be corrected and rerun without reloading from source
  • SQL-based transformation logic (typically managed with dbt) is versioned, tested, and reviewed like application code
  • Modern cloud warehouses (Snowflake, BigQuery, Redshift, Databricks) provide elastic, effectively unbounded compute for transformation — no dedicated ETL infrastructure bottleneck
  • Faster time to initial data availability — load raw data immediately, build transformations iteratively
  • Better lineage visibility — dbt generates lineage through the transformation DAG (model-level in dbt Core, with column-level lineage available in dbt Cloud)

ELT limitations:

  • Raw sensitive data lands in the analytical environment — requires careful access control and data masking at the transformation layer
  • Transformation compute costs increase with data volume and query complexity
  • Teams must have SQL/dbt expertise for transformation ownership

| Dimension | ETL | ELT |
| --- | --- | --- |
| Where transforms run | Dedicated ETL engine | Data warehouse / lake |
| Raw data preservation | No — transformed data only | Yes — raw + transformed |
| Transformation language | Tool-specific (GUI or code) | SQL (typically dbt) |
| Error recovery | Reload from source | Rerun SQL transformation |
| Sensitive data handling | Mask before loading (advantage) | Requires warehouse-level controls |
| Infrastructure | ETL server/cluster | Warehouse compute (elastic) |
| Best for | Legacy targets, sensitive data | Cloud warehouses, modern stacks |
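
The ELT flow can be sketched in miniature with SQLite standing in for the cloud warehouse: raw rows land untouched, and the transformation is plain SQL run inside the engine. The table and column names here are illustrative, not from any particular source system.

```python
import sqlite3

# In-memory SQLite stands in for the cloud warehouse.
conn = sqlite3.connect(":memory:")

# Extract + Load: raw source rows land unmodified in a raw table.
conn.execute("CREATE TABLE raw_orders (order_id INTEGER, amount TEXT, status TEXT)")
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [(1, "100.50", "complete"), (2, "20.00", "cancelled"), (3, "75.25", "complete")],
)

# Transform: SQL inside the "warehouse" produces the analytical model.
# Because the raw table is preserved, a wrong transform can simply be
# dropped and rerun — no reload from source required.
conn.execute("""
    CREATE TABLE fct_orders AS
    SELECT order_id, CAST(amount AS REAL) AS amount
    FROM raw_orders
    WHERE status = 'complete'
""")

total = conn.execute("SELECT SUM(amount) FROM fct_orders").fetchone()[0]
print(total)  # 175.75
```

The key property of ELT is visible here: the raw table and the transformed model coexist in the same engine, so error recovery is a SQL rerun rather than a source reload.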

dbt: The ELT Transformation Standard

dbt (data build tool) has become the standard tool for ELT transformation management. It enables data teams to write transformation logic as SQL SELECT statements, manage dependencies between transformations, test data quality, and generate documentation and lineage automatically.

-- dbt model: fct_orders.sql
-- Transforms raw orders with business rules applied
WITH orders AS (
    SELECT * FROM {{ ref('stg_orders') }}
),
customers AS (
    SELECT * FROM {{ ref('stg_customers') }}
)
SELECT
    o.order_id,
    o.order_date,
    c.customer_tier,
    o.order_value,
    o.order_value * CASE WHEN c.customer_tier = 'premium' THEN 0.9 ELSE 1.0 END AS adjusted_value
FROM orders o
LEFT JOIN customers c ON o.customer_id = c.customer_id

dbt's dependency graph enables full lineage from source tables to final analytical models, automated testing of data quality at each transformation step, and incremental materialization that processes only changed data — critical for performance at scale.
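
The incremental idea can be illustrated outside dbt with a simple high-water-mark pattern: each run processes only rows newer than the last processed timestamp. The data shape and state handling here are a hypothetical sketch of the concept, not dbt's actual implementation.

```python
# Hypothetical sketch of incremental processing with a high-water mark:
# each run picks up only source rows newer than the last processed timestamp.
source = [
    {"id": 1, "updated_at": "2025-01-01", "value": 10},
    {"id": 2, "updated_at": "2025-01-02", "value": 20},
    {"id": 3, "updated_at": "2025-01-03", "value": 30},
]

target = []      # the materialized model
high_water = ""  # last processed timestamp (ISO strings compare correctly)

def run_incremental():
    """Append only rows newer than the high-water mark, then advance it."""
    global high_water
    new_rows = [r for r in source if r["updated_at"] > high_water]
    target.extend(new_rows)
    if new_rows:
        high_water = max(r["updated_at"] for r in new_rows)
    return len(new_rows)

first = run_incremental()   # first run processes the full history
source.append({"id": 4, "updated_at": "2025-01-04", "value": 40})
second = run_incremental()  # later runs process only what changed
print(first, second)  # 3 1
```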


Batch vs. Streaming: Choosing the Right Processing Model

The choice between batch and streaming processing is determined by the latency requirements of the consuming use case, not by a preference for modern vs. legacy technology.

When Batch Processing Is Sufficient

Batch processing reads a bounded dataset, processes it, and produces output. It is the right model when:

  • Latency tolerance is hours or days: Daily financial reports, monthly billing summaries, weekly marketing analytics — these use cases are not harmed by data being 1–24 hours old
  • Full dataset operations are required: Machine learning training runs, full aggregate recalculations, end-of-period reporting — operations that need the complete dataset rather than an incremental update
  • Debugging and observability are priorities: Batch jobs produce bounded, inspectable datasets. Failures are easier to diagnose and replay than in continuous streaming systems

Apache Spark: The dominant large-scale batch processing engine. Python (PySpark), Scala, and SQL APIs. Excellent for ML feature engineering, large-scale data transformations, and complex multi-dataset operations. Available as a managed service (Amazon EMR, Azure HDInsight, Databricks).

Apache Hadoop MapReduce: The predecessor to Spark. Still in production at large enterprises with existing Hadoop investments but rarely chosen for new workloads.

dbt + warehouse compute: For SQL-native transformations, running dbt against a cloud warehouse is effectively batch processing at the warehouse layer — and for many analytical use cases, the simplest and most maintainable approach.

When Streaming Is Required

Streaming processes an unbounded stream of events as they arrive, maintaining state and producing output continuously. It is the right model when:

  • Sub-minute latency is required: Fraud detection, real-time inventory, live leaderboards, dynamic pricing — use cases where stale data has business cost
  • Event-time semantics matter: Correctly handling out-of-order events and late arrivals requires stream processing capabilities that batch does not naturally support
  • Continuous aggregation: Maintaining running totals, sliding window averages, or rolling statistics that update with each new event
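
The event-time point can be sketched in plain Python: events carry their own timestamps, arrive out of order, and are still assigned to the window where they belong. Real engines such as Flink add watermarks and managed state on top of this idea; the event shape and window size here are illustrative.

```python
from collections import defaultdict

# Events arrive out of order; each carries its own event time (seconds).
events = [
    {"ts": 3, "value": 10},
    {"ts": 12, "value": 5},
    {"ts": 7, "value": 2},   # arrives late, after the ts=12 event
    {"ts": 14, "value": 1},
]

WINDOW = 10  # tumbling window size in seconds
windows = defaultdict(int)

for event in events:
    # Assign by event time, not arrival order, so the late ts=7 event
    # still lands in the [0, 10) window where it belongs.
    window_start = (event["ts"] // WINDOW) * WINDOW
    windows[window_start] += event["value"]

print(dict(windows))  # {0: 12, 10: 6}
```

A batch job naturally gets this right because it sees the whole dataset at once; a streaming engine must maintain open windows and decide (via watermarks) how long to wait for stragglers.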

Apache Kafka: The de facto standard for streaming data transport. Durable, distributed log that decouples data producers from consumers. Kafka stores streams as ordered, replayable logs — unlike traditional message queues, consumers can replay historical events by resetting their offset.
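
The replayable-log model can be pictured as an append-only list plus per-consumer offsets: resetting an offset replays history. This is a conceptual miniature, not the Kafka client API.

```python
class ReplayableLog:
    """Append-only log; each consumer tracks its own read offset."""

    def __init__(self):
        self.records = []
        self.offsets = {}  # consumer id -> next offset to read

    def produce(self, record):
        self.records.append(record)

    def consume(self, consumer):
        """Return unread records and advance the consumer's offset."""
        start = self.offsets.get(consumer, 0)
        batch = self.records[start:]
        self.offsets[consumer] = len(self.records)
        return batch

    def seek(self, consumer, offset):
        """Reset the offset to replay historical records."""
        self.offsets[consumer] = offset

log = ReplayableLog()
for e in ["order_created", "order_paid", "order_shipped"]:
    log.produce(e)

first_read = log.consume("analytics")  # reads all three events
log.seek("analytics", 0)               # rewind to the beginning
replayed = log.consume("analytics")    # same events, replayed
print(first_read == replayed)  # True
```

This is the property that distinguishes Kafka from a traditional queue: consuming does not destroy the record, so a new consumer (or a fixed bug) can reprocess history from any offset.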

Apache Flink: The leading stateful stream processing engine. Exactly-once processing semantics, event-time windowing, and a rich API for complex stream transformations. Increasingly the choice for mission-critical streaming workloads.

Apache Spark Structured Streaming: Spark's streaming API, using the same DataFrame/SQL model as batch Spark. Micro-batch processing (not true event-at-a-time streaming) with configurable trigger intervals. Good for teams already using Spark who want unified batch/streaming logic.

Amazon Kinesis / Azure Event Hubs / Google Pub/Sub: Cloud-managed streaming transport services. Lower operational overhead than self-managed Kafka; typically used with cloud-native stream processing services.

Lambda vs. Kappa Architecture: The Lambda architecture maintains separate batch and streaming layers for the same data, combining them in a serving layer. The Kappa architecture uses a single streaming layer for both real-time and historical reprocessing. Modern streaming systems (Flink, Kafka with sufficient retention) have made Kappa the preferred pattern for new systems — maintaining two separate code paths for the same logic is an operational burden that the Kappa approach eliminates.


Pipeline Orchestration: Managing Complex Dependency Graphs

As pipeline complexity grows, orchestration becomes essential — managing dependencies between pipeline stages, scheduling runs, handling failures, retrying failed tasks, and providing visibility into pipeline health.

Apache Airflow

The most widely deployed open-source pipeline orchestrator. DAGs (Directed Acyclic Graphs) define pipeline logic in Python, with a rich operator library for interacting with databases, cloud services, and external APIs.

Airflow strengths: Maximum flexibility, large plugin ecosystem, strong community, extensive enterprise adoption.

Airflow limitations: Operationally complex to run at scale; Python DAG code can become difficult to maintain; scheduler scaling challenges in large deployments.
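
The core of what any orchestrator does — resolving a dependency graph into a valid execution order — can be sketched with a topological sort. The task names below are illustrative; Airflow, Prefect, and Dagster all layer scheduling, retries, and observability on top of this basic idea.

```python
def topo_order(dag):
    """Return a valid execution order for a task dependency graph.

    dag maps each task name to the set of tasks it depends on.
    """
    order, done = [], set()
    while len(done) < len(dag):
        # A task is runnable once all of its dependencies have completed.
        ready = [t for t, deps in dag.items() if t not in done and deps <= done]
        if not ready:
            raise ValueError("cycle detected in DAG")
        for task in sorted(ready):  # deterministic order within a level
            order.append(task)
            done.add(task)
    return order

# Illustrative pipeline: extraction feeds staging, staging feeds the
# fact model, and data-quality tests run last.
pipeline = {
    "extract": set(),
    "load_raw": {"extract"},
    "stg_orders": {"load_raw"},
    "fct_orders": {"stg_orders"},
    "dq_tests": {"fct_orders"},
}

print(topo_order(pipeline))
# ['extract', 'load_raw', 'stg_orders', 'fct_orders', 'dq_tests']
```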

Prefect and Dagster

Modern Airflow alternatives designed to address Airflow's operational complexity:

Prefect: Python-first workflow orchestration with a focus on developer experience. Dynamic workflows (create tasks at runtime), automatic retry logic, and a managed cloud offering. Strong for data engineering teams who want Airflow-like capabilities with lower operational overhead.

Dagster: Asset-centric orchestration that frames pipelines in terms of data assets (tables, files, ML models) rather than tasks. Excellent lineage visibility, integrated testing, and strong observability. Growing rapidly in data-platform-centric organizations.

dbt Cloud / Airflow / Prefect: A Practical Pattern

A common enterprise pipeline architecture combines:

  • Fivetran or Airbyte for extraction (managed connectors for hundreds of sources)
  • Snowflake, BigQuery, or Databricks as the destination and transformation engine
  • dbt for transformation logic management
  • Airflow, Prefect, or dbt Cloud for orchestration and scheduling

This "Modern Data Stack" pattern delivers ELT with managed extraction, SQL transformation, and orchestrated scheduling — the majority of analytical pipeline requirements addressed by well-maintained managed services.


Data Quality in Pipelines

Pipeline reliability is inseparable from data quality. A pipeline that runs successfully but produces incorrect data is worse than a pipeline that fails — silent data quality issues corrupt downstream analytics and ML models without triggering any alerts.

Data quality dimensions:

  • Completeness: Are all expected records present? Is the row count within expected bounds?
  • Accuracy: Do values match expected ranges, formats, and referential constraints?
  • Freshness: Was the pipeline run within the expected schedule window? Is the most recent data timestamp recent enough?
  • Consistency: Are values consistent across related tables? Does revenue in the orders table match revenue in the invoices table?
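
These dimensions translate directly into assertions a pipeline can run against its own output before publishing it. The thresholds, row shape, and check names below are hypothetical; frameworks like dbt tests and Great Expectations formalize the same pattern.

```python
from datetime import datetime, timedelta, timezone

def check_quality(rows, min_rows, max_age_hours):
    """Run simple completeness, accuracy, and freshness checks on pipeline output."""
    failures = []

    # Completeness: row count within expected bounds.
    if len(rows) < min_rows:
        failures.append(f"completeness: {len(rows)} rows < {min_rows}")

    # Accuracy: values fall within expected ranges.
    if any(r["amount"] < 0 for r in rows):
        failures.append("accuracy: negative amount found")

    # Freshness: the newest record is recent enough.
    newest = max(r["loaded_at"] for r in rows)
    if datetime.now(timezone.utc) - newest > timedelta(hours=max_age_hours):
        failures.append("freshness: newest record too old")

    return failures

now = datetime.now(timezone.utc)
rows = [
    {"amount": 100.0, "loaded_at": now - timedelta(hours=1)},
    {"amount": -5.0, "loaded_at": now},
]
print(check_quality(rows, min_rows=1, max_age_hours=24))
# ['accuracy: negative amount found']
```

A failing check should block downstream publication — surfacing the error loudly is exactly what prevents the silent-corruption scenario described above.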

Data quality tooling:

  • dbt tests: Built-in uniqueness, not-null, referential integrity, and accepted-values tests. Custom SQL tests for business rule validation.
  • Great Expectations: Open-source data quality framework with rich expectation library and validation checkpoints.
  • Monte Carlo / Acceldata: Data observability platforms providing automated anomaly detection on pipeline outputs — alerting when volume, freshness, or statistical distributions deviate from baselines.

Vendor Ecosystem

Explore data integration and pipeline vendors at the Data & Analytics Directory.

Managed Extraction (EL)

  • Fivetran — Market leader in managed connectors. 500+ pre-built connectors. Fully managed, minimal configuration. Premium pricing at scale.
  • Airbyte — Open-source with managed cloud option. 300+ connectors. More flexibility than Fivetran; more operational overhead in self-hosted mode.
  • Stitch (Talend) — Mid-market managed extraction. Good connector breadth at lower cost than Fivetran.

Transformation

  • dbt (dbt Labs) — The standard for SQL-based ELT transformation. dbt Core (open-source), dbt Cloud (managed). Essential for modern data stacks.
  • Spark / Databricks — For Python/Scala-based transformations on large datasets requiring distributed compute.

Orchestration

  • Apache Airflow — Open-source standard. Astronomer for managed Airflow.
  • Prefect — Modern Airflow alternative with better developer experience.
  • Dagster — Asset-centric orchestration with strong lineage.

Streaming

  • Confluent (managed Kafka) — Enterprise Kafka with managed connectors, schema registry, and stream processing.
  • Amazon MSK / Azure Event Hubs — Cloud-managed Kafka / event streaming.
  • Databricks Delta Live Tables — Declarative streaming pipeline framework on Databricks.

Key Takeaways

Data pipeline architecture is one of the most consequential technical decisions in the modern enterprise because pipelines are the foundation upon which every analytical, ML, and operational intelligence capability is built. The shift from ETL to ELT has unlocked faster iteration and better data lineage. The availability of managed extraction services (Fivetran, Airbyte) has reduced the engineering burden of source connectivity. And the emergence of dbt as a transformation standard has brought software engineering discipline — versioning, testing, documentation — to data transformation code.

The organizations that build reliable, scalable pipeline infrastructure invest in three things: data quality monitoring that catches silent failures before they corrupt downstream analysis, orchestration that provides visibility into pipeline health and graceful handling of failures, and a clear architectural decision about when streaming is genuinely required vs. when batch processing is sufficient. Streaming adds significant operational complexity — it should be adopted for the use cases that genuinely require sub-minute latency, not as a default architectural choice.


Tags: data pipelines, ETL, ELT, streaming, Apache Kafka, Apache Spark, dbt, Airflow, data integration, real-time data, batch processing, data lakehouse