The Load phase represents the culmination of the ETL pipeline—the moment when extracted, cleansed, standardized, and enriched data finally reaches its target destination. This isn't merely a file copy or bulk insert; it's a carefully orchestrated process that must maintain data integrity, handle historical changes, optimize for query performance, and complete within operational windows.
Loading strategies directly impact how analysts and applications consume data. Decisions made here determine whether users see consistent snapshots or partially updated states, whether historical analysis is possible, whether queries run in seconds or minutes, and whether the warehouse can scale as data volumes grow.
The Load phase must answer critical questions: How do we handle records that already exist in the target? How do we preserve historical values when dimensions change? How do we load billions of rows within a nightly window? How do we ensure loading failures don't corrupt the warehouse?
These questions don't have universal answers—the right loading strategy depends on business requirements, data characteristics, and infrastructure capabilities. What's universal is the need to understand the trade-offs and choose deliberately.
By the end of this page, you will master the complete spectrum of data loading: initial vs. incremental loading patterns, merge/upsert operations, slowly changing dimension techniques (Type 1, Type 2, Type 3), bulk loading optimizations, transaction handling, and the architectural patterns that enable enterprise-scale data warehousing.
Loading strategies fall into several fundamental patterns, each appropriate for different scenarios. Understanding these patterns enables you to select the right approach for each table and use case.
The loading strategy decision tree:
| Pattern | Description | Use Case | Complexity |
|---|---|---|---|
| Full Refresh (Truncate/Load) | Delete all existing data, reload completely | Small tables, reference data, complete rebuilds | Low |
| Append Only | Insert new records without touching existing | Immutable event data, log tables, fact tables | Low |
| Upsert (Merge) | Insert new, update existing based on key | Dimension tables, mutable entities | Medium |
| Incremental with Deletes | Insert, update, and soft/hard delete | Complete synchronization with source | Medium-High |
| Type 2 SCD Load | Preserve history with versioning | Dimension tracking over time | High |
| Partition Swap | Load to staging partition, swap atomically | Large fact tables, minimal downtime | High |
Initial load vs. incremental load:
Most data warehouses have two distinct loading phases:
Initial Load (Backfill): a one-time (or rare) load of the complete historical dataset, typically performed when the warehouse is first built or when a table must be rebuilt from scratch.
Incremental Load (Delta): the recurring load that applies only the changes since the previous run, usually on a daily or hourly schedule.
Initial Load: [═══════════════════════════════════════════]
Full historical data (years of transactions)
Incremental: [═══] [═══] [═══] [═══] [═══] ...
Daily changes only
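A minimal sketch of a watermark-driven incremental load, assuming a hypothetical `etl_audit.load_watermark` table and an `updated_at` column on the extracted source data (table and column names are illustrative):

```sql
-- Pull only rows changed since the last successful load.
-- etl_audit.load_watermark and raw_orders are hypothetical names.
INSERT INTO staging_orders_delta
SELECT *
FROM raw_orders r
WHERE r.updated_at > (
    SELECT last_loaded_at
    FROM etl_audit.load_watermark
    WHERE table_name = 'raw_orders'
);

-- Advance the watermark only after the load commits successfully.
UPDATE etl_audit.load_watermark
SET last_loaded_at = (SELECT MAX(updated_at) FROM staging_orders_delta)
WHERE table_name = 'raw_orders';
```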
Best practice: Load transformed data to staging tables first, then merge or swap into production tables. This separation enables validation before production impact, provides rollback capability, and isolates transformation failures from the warehouse. Production tables only see validated, complete data.
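One way to realize this staging-then-swap pattern is a PostgreSQL-flavored table rename; the staging and backup table names here are illustrative:

```sql
-- Load and validate outside production, then swap in with a near-instant rename.
CREATE TABLE dim_product_stage (LIKE dim_product INCLUDING ALL);

INSERT INTO dim_product_stage
SELECT * FROM transformed_products;

-- Validation gate: check row counts, NOT NULL business keys, referential
-- spot checks here, and abort (leaving production untouched) if any fail.

BEGIN;
ALTER TABLE dim_product RENAME TO dim_product_old;
ALTER TABLE dim_product_stage RENAME TO dim_product;
COMMIT;

DROP TABLE dim_product_old;  -- or keep briefly as a rollback point
```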
Merge (also called Upsert) is the most common loading pattern for dimension tables and mutable entities. It combines insert and update logic in a single operation: if a record with the matching key exists, update it; otherwise, insert it.
Most modern databases provide native MERGE statements, though implementations vary in syntax and capabilities.
```sql
-- SQL Server MERGE syntax (Oracle's MERGE is similar,
-- but has no NOT MATCHED BY SOURCE clause)
MERGE INTO dim_product AS target
USING staging_products AS source
ON target.product_id = source.product_id

-- When record exists, update it
WHEN MATCHED THEN UPDATE SET
    target.product_name = source.product_name,
    target.category = source.category,
    target.unit_price = source.unit_price,
    target.last_updated = CURRENT_TIMESTAMP

-- When record doesn't exist, insert it
WHEN NOT MATCHED BY TARGET THEN INSERT (
    product_id, product_name, category, unit_price, created_at, last_updated
) VALUES (
    source.product_id, source.product_name, source.category,
    source.unit_price, CURRENT_TIMESTAMP, CURRENT_TIMESTAMP
)

-- Optionally handle deletes (records in target not in source)
WHEN NOT MATCHED BY SOURCE THEN UPDATE SET
    target.is_deleted = 1,
    target.deleted_at = CURRENT_TIMESTAMP;

-- PostgreSQL UPSERT (INSERT ... ON CONFLICT)
INSERT INTO dim_product (
    product_id, product_name, category, unit_price, created_at, last_updated
)
SELECT
    product_id, product_name, category, unit_price,
    CURRENT_TIMESTAMP, CURRENT_TIMESTAMP
FROM staging_products
ON CONFLICT (product_id) DO UPDATE SET
    product_name = EXCLUDED.product_name,
    category = EXCLUDED.category,
    unit_price = EXCLUDED.unit_price,
    last_updated = CURRENT_TIMESTAMP;

-- Snowflake MERGE syntax
MERGE INTO dim_product target
USING staging_products source
ON target.product_id = source.product_id

WHEN MATCHED AND (
    target.product_name != source.product_name
    OR target.category != source.category
    OR target.unit_price != source.unit_price
) THEN UPDATE SET
    product_name = source.product_name,
    category = source.category,
    unit_price = source.unit_price,
    last_updated = CURRENT_TIMESTAMP()

WHEN NOT MATCHED THEN INSERT (
    product_id, product_name, category, unit_price, created_at, last_updated
) VALUES (
    source.product_id, source.product_name, source.category,
    source.unit_price, CURRENT_TIMESTAMP(), CURRENT_TIMESTAMP()
);
```

Optimizing merge operations:
- Only update when changed: add a condition that checks whether values actually differ before updating. Unnecessary updates waste I/O and bloat transaction logs.
- Hash comparison: for wide tables, compute row-level hashes to quickly identify changed records (see the sketch after this list): `WHEN MATCHED AND target.row_hash != source.row_hash THEN UPDATE...`
- Batch sizing: large merges may cause lock contention. Break them into batches by key range or limit rows per transaction.
- Index considerations: temporarily disable non-essential indexes during large merges and rebuild them afterward.
- Statistics refresh: update table statistics after significant merges so the query optimizer retains accurate estimates.
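A PostgreSQL-flavored sketch of the hash approach, assuming a `row_hash` column on staging and an MD5 over concatenated tracked columns (column list and delimiter are illustrative):

```sql
-- Compute a change-detection hash per row while populating staging.
-- COALESCE guards against NULLs collapsing the concatenation.
UPDATE staging_products
SET row_hash = MD5(
    COALESCE(product_name, '')      || '|' ||
    COALESCE(category, '')          || '|' ||
    COALESCE(unit_price::TEXT, '')
);

-- The merge then touches only rows whose hash differs:
-- WHEN MATCHED AND target.row_hash != source.row_hash THEN UPDATE ...
```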
Watch for: (1) Non-deterministic matches—if multiple source rows match one target row, behavior is undefined or raises an error, so deduplicate staging first (see the sketch below). (2) Concurrent modifications—if the target table is modified during the merge, conflicts or lock waits may occur. (3) NULL handling—in standard SQL, NULL = NULL evaluates to unknown, so NULL keys or attributes never match unless handled with COALESCE or IS NOT DISTINCT FROM.
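A minimal deduplication sketch, assuming an `updated_at` column identifies the most recent version of each staged row:

```sql
-- Keep only the latest staged row per business key before merging.
-- Adjust the ORDER BY to whatever defines "latest" in your data.
CREATE TEMP TABLE staging_products_dedup AS
SELECT *
FROM (
    SELECT s.*,
           ROW_NUMBER() OVER (
               PARTITION BY product_id
               ORDER BY updated_at DESC
           ) AS rn
    FROM staging_products s
) ranked
WHERE rn = 1;  -- drop the rn column or list columns explicitly if preferred
```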
Slowly Changing Dimensions (SCD) address a fundamental challenge in dimensional modeling: dimension attributes change over time, but historical analysis often needs to see the values as they were when facts occurred, not the current values.
For example, when a customer moves from California to Texas, should their historical orders show 'California' (where they were when they ordered) or 'Texas' (where they are now)? The answer depends on the analytical question—and SCD techniques provide the mechanisms to support both.
SCD Type Classification:
| Type | Strategy | History | Implementation |
|---|---|---|---|
| Type 0 | Retain original value | Never changes | No update logic needed |
| Type 1 | Overwrite with current value | No history preserved | Simple UPDATE statement |
| Type 2 | Add new row for each change | Full history preserved | Expire old row, insert new |
| Type 3 | Add column for previous value | Limited history (1 previous) | Store prior value in separate column |
| Type 4 | Separate history table | Full history in mini-dimension | Current in main, history in separate |
| Type 6 | Hybrid: Type 1 + 2 + 3 | Full history + current flag | Combines multiple techniques |
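Before going deeper on Type 2, minimal sketches of Type 1 and Type 3 handling, using PostgreSQL-style UPDATE ... FROM; the columns previous_state and state_change_date are illustrative:

```sql
-- Type 1: overwrite in place, no history kept.
UPDATE dim_customer d
SET state = s.state,
    last_updated = CURRENT_TIMESTAMP
FROM staging_customers s
WHERE d.customer_id = s.customer_id
  AND d.state IS DISTINCT FROM s.state;

-- Type 3: keep the prior value in a dedicated column (one level of history).
UPDATE dim_customer d
SET previous_state = d.state,        -- d.state still refers to the old value
    state = s.state,
    state_change_date = CURRENT_DATE
FROM staging_customers s
WHERE d.customer_id = s.customer_id
  AND d.state IS DISTINCT FROM s.state;
```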
Type 2 SCD in depth:
Type 2 is the most powerful and commonly used SCD technique. It preserves complete history by creating new dimension records for each change, with metadata columns indicating validity periods.
Type 2 dimension structure:
dim_customer:
┌─────────────┬─────────────┬───────────┬───────────────┬───────────────┬────────────┐
│ customer_sk │ customer_id │ state │ effective_dt │ expiration_dt │ is_current │
├─────────────┼─────────────┼───────────┼───────────────┼───────────────┼────────────┤
│ 1001 │ C001 │ California│ 2020-01-01 │ 2023-06-14 │ FALSE │
│ 1047 │ C001 │ Texas │ 2023-06-15 │ 9999-12-31 │ TRUE │
└─────────────┴─────────────┴───────────┴───────────────┴───────────────┴────────────┘
The surrogate key (customer_sk) uniquely identifies each version. The natural key (customer_id) identifies the business entity. Effective and expiration dates define the validity window. The is_current flag enables fast filtering to current records.
```sql
-- Type 2 SCD Loading Pattern

-- Step 1: Identify changed records (records where tracked attributes differ)
CREATE TEMP TABLE scd_changes AS
SELECT
    s.customer_id, s.customer_name, s.address, s.city, s.state, s.segment,
    'UPDATE' AS change_type
FROM staging_customers s
JOIN dim_customer d
    ON s.customer_id = d.customer_id
    AND d.is_current = TRUE
WHERE (
    s.customer_name != d.customer_name
    OR s.address != d.address
    OR s.city != d.city
    OR s.state != d.state
    OR s.segment != d.segment
)

UNION ALL

-- Identify new records (inserts)
SELECT
    s.customer_id, s.customer_name, s.address, s.city, s.state, s.segment,
    'INSERT' AS change_type
FROM staging_customers s
LEFT JOIN dim_customer d
    ON s.customer_id = d.customer_id
    AND d.is_current = TRUE
WHERE d.customer_id IS NULL;

-- Step 2: Expire current records that have changes
UPDATE dim_customer
SET expiration_date = CURRENT_DATE - INTERVAL '1 day',
    is_current = FALSE
WHERE customer_id IN (
    SELECT customer_id FROM scd_changes WHERE change_type = 'UPDATE'
)
AND is_current = TRUE;

-- Step 3: Insert new versions (both updates and new inserts)
INSERT INTO dim_customer (
    customer_sk, customer_id, customer_name, address, city, state, segment,
    effective_date, expiration_date, is_current
)
SELECT
    NEXTVAL('customer_sk_seq'),
    customer_id, customer_name, address, city, state, segment,
    CURRENT_DATE,
    '9999-12-31'::DATE,
    TRUE
FROM scd_changes;

-- Verification: Count records by type
SELECT change_type, COUNT(*) AS record_count
FROM scd_changes
GROUP BY change_type;
```

When facts join to Type 2 dimensions, use the surrogate key stored with the fact at transaction time—this captures the dimension state when the fact occurred. For analyses needing current dimension values, join on the natural key WHERE is_current = TRUE. Understanding this distinction is critical for accurate historical analysis.
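Two query sketches illustrating that distinction, using the table and column names from the example above:

```sql
-- As-of analysis: the fact row carries the surrogate key that was current
-- when the transaction occurred, so a plain key join returns historical values.
SELECT d.state, SUM(f.net_amount) AS revenue
FROM fact_sales f
JOIN dim_customer d ON f.customer_key = d.customer_sk
GROUP BY d.state;

-- Current-view analysis: re-join through the natural key and restrict to the
-- current dimension row to report against today's attribute values.
SELECT cur.state, SUM(f.net_amount) AS revenue
FROM fact_sales f
JOIN dim_customer hist ON f.customer_key = hist.customer_sk
JOIN dim_customer cur
    ON cur.customer_id = hist.customer_id
    AND cur.is_current = TRUE
GROUP BY cur.state;
```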
Fact tables present unique loading challenges due to their size (often billions of rows) and performance-critical nature (the primary query target). Loading strategies must optimize for both load performance and query performance.
Fact table characteristics that impact loading: sheer row counts, append-dominant write patterns, foreign keys that must resolve to dimension surrogate keys, and date-based partitioning schemes.
Partition-based loading:
For large fact tables, partition exchange loading is the gold standard:
┌─────────────────────────────────────────────────────────┐
│ fact_sales Table │
├─────────────┬─────────────┬─────────────┬──────────────┤
│ 2024-01 │ 2024-02 │ 2024-03 │ 2024-04 │
│ (partition) │ (partition) │ (partition) │ (STAGING) │
│ ↓ │ ↓ │ ↓ │ ↓ │
│ [Complete] │ [Complete] │ [Complete] │ [Loading] ──┼──▶ [Swap In]
└─────────────┴─────────────┴─────────────┴──────────────┘
Benefits: the swap is a metadata-only operation, so it completes in seconds regardless of partition size; queries never observe a partially loaded partition; a bad load can be discarded by simply not swapping; and the heavy write work happens outside the production table.
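A PostgreSQL-flavored sketch of the swap itself, assuming fact_sales is range-partitioned by order_date and the April data is loaded into a standalone table with matching structure (Oracle's equivalent is ALTER TABLE ... EXCHANGE PARTITION):

```sql
-- Load the month into a standalone table shaped like one partition.
CREATE TABLE fact_sales_2024_04_stage
    (LIKE fact_sales INCLUDING DEFAULTS INCLUDING CONSTRAINTS);

COPY fact_sales_2024_04_stage
FROM '/data/staging/sales_2024_04.csv'
WITH (FORMAT CSV, HEADER TRUE);

-- Validate row counts, totals, and key resolution here before touching production.

-- Atomically attach the fully loaded table as the April partition.
BEGIN;
ALTER TABLE fact_sales
    ATTACH PARTITION fact_sales_2024_04_stage
    FOR VALUES FROM ('2024-04-01') TO ('2024-05-01');
COMMIT;
```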
Beyond partition management, fact loading must resolve dimension surrogate keys. The pattern below stages the facts, performs point-in-time lookups, validates, and then loads:

```sql
-- Fact table loading with dimension key lookup

-- Step 1: Load staging fact data with natural keys
CREATE TEMP TABLE staging_sales_facts (
    order_id VARCHAR(50),
    order_date DATE,
    customer_natural_key VARCHAR(50),
    product_natural_key VARCHAR(50),
    quantity INT,
    unit_price DECIMAL(10,2),
    discount DECIMAL(10,2)
);

-- Object-storage COPY syntax is platform-specific; shown here for illustration
COPY staging_sales_facts
FROM 's3://data-lake/sales/2024-04-15/*.parquet';

-- Step 2: Lookup dimension surrogate keys
CREATE TEMP TABLE enriched_sales_facts AS
SELECT
    s.order_id,
    s.order_date,

    -- Date dimension key (typically the date itself or formatted)
    TO_CHAR(s.order_date, 'YYYYMMDD')::INT AS date_key,

    -- Customer dimension surrogate key (point-in-time lookup)
    COALESCE(c.customer_sk, -1) AS customer_key,

    -- Product dimension surrogate key (current product record)
    COALESCE(p.product_sk, -1) AS product_key,

    -- Measures
    s.quantity,
    s.unit_price,
    s.discount,
    (s.quantity * s.unit_price) - s.discount AS net_amount,

    -- Audit columns
    CURRENT_TIMESTAMP AS load_timestamp
FROM staging_sales_facts s

-- Customer lookup: Find the customer version valid on the order date
LEFT JOIN dim_customer c
    ON s.customer_natural_key = c.customer_natural_key
    AND s.order_date >= c.effective_date
    AND s.order_date < c.expiration_date

-- Product lookup: Current product (could also be point-in-time)
LEFT JOIN dim_product p
    ON s.product_natural_key = p.product_natural_key
    AND p.is_current = TRUE;

-- Step 3: Validate before loading
-- Check for unresolved dimension keys
SELECT 'Unresolved customers' AS issue, COUNT(*) AS count
FROM enriched_sales_facts WHERE customer_key = -1
UNION ALL
SELECT 'Unresolved products', COUNT(*)
FROM enriched_sales_facts WHERE product_key = -1;

-- Step 4: Load to production fact table
INSERT INTO fact_sales (
    order_id, date_key, customer_key, product_key,
    quantity, unit_price, discount, net_amount, load_timestamp
)
SELECT
    order_id, date_key, customer_key, product_key,
    quantity, unit_price, discount, net_amount, load_timestamp
FROM enriched_sales_facts;

-- Step 5: Update load metadata
-- (load_start_time is assumed to be captured earlier by the orchestration job)
INSERT INTO etl_audit.fact_load_history (
    table_name, load_date, rows_loaded, load_duration_seconds
)
VALUES (
    'fact_sales',
    CURRENT_DATE,
    (SELECT COUNT(*) FROM enriched_sales_facts),
    EXTRACT(EPOCH FROM (CURRENT_TIMESTAMP - load_start_time))
);
```

Transactions sometimes arrive after their natural time period has closed (delayed feeds, corrections, restatements). Loading late-arriving facts requires: (1) the ability to insert into historical partitions, (2) correct dimension key lookup using the transaction date rather than the load date, and (3) potential aggregate recalculation if summary tables exist. Design for late arrivals from the start.
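A minimal sketch of the aggregate-recalculation step for late arrivals, assuming a hypothetical agg_daily_sales summary table and that the late batch is still available in enriched_sales_facts:

```sql
-- Rebuild only the summary rows for dates touched by the late-arriving batch.
-- agg_daily_sales is a hypothetical daily summary table.
DELETE FROM agg_daily_sales
WHERE date_key IN (SELECT DISTINCT date_key FROM enriched_sales_facts);

INSERT INTO agg_daily_sales (date_key, total_quantity, total_net_amount)
SELECT date_key, SUM(quantity), SUM(net_amount)
FROM fact_sales
WHERE date_key IN (SELECT DISTINCT date_key FROM enriched_sales_facts)
GROUP BY date_key;
```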
Loading large data volumes efficiently requires techniques that bypass normal transaction processing overhead. Bulk loading operations sacrifice some safety features for dramatic performance improvement.
Standard insert vs. bulk load: row-by-row INSERTs pay per-statement overhead for parsing, constraint checking, index maintenance, and transaction logging; bulk loaders stream data in large blocks, minimize logging, and defer index work, which is typically far faster on large volumes.
Bulk loading techniques by platform:
| Platform | Bulk Load Command | Key Options |
|---|---|---|
| PostgreSQL | COPY | FROM file, binary format |
| SQL Server | BULK INSERT, bcp | TABLOCK, ROWS_PER_BATCH |
| Oracle | SQL*Loader, External Tables | DIRECT path, PARALLEL |
| Snowflake | COPY INTO | FROM stage, FILE_FORMAT |
| BigQuery | bq load, LOAD DATA | source format (Avro/Parquet/CSV), schema autodetect |
| Redshift | COPY | FROM S3, MANIFEST, COMPROWS |
Performance optimization strategies: disable or drop non-essential indexes and triggers for the duration of the load, load pre-sorted and compressed files (columnar formats where supported), parallelize file ingestion, size batches to the platform's sweet spot, and refresh optimizer statistics afterward; the pattern below applies several of these.
```sql
-- Optimized bulk loading pattern for a large fact table
-- (syntax varies by platform: DISABLE/REBUILD index statements are
-- SQL Server-style, while COPY and ANALYZE are PostgreSQL-style)

-- Step 1: Prepare target table
ALTER TABLE fact_sales DISABLE TRIGGER ALL;
ALTER INDEX idx_fact_sales_date ON fact_sales DISABLE;
ALTER INDEX idx_fact_sales_customer ON fact_sales DISABLE;

-- Step 2: Perform bulk load with optimal settings (PostgreSQL example)
COPY fact_sales (
    date_key, customer_key, product_key,
    quantity, unit_price, net_amount
)
FROM '/data/staging/sales_20240415.csv'
WITH (
    FORMAT CSV,
    HEADER TRUE,
    NULL '',
    QUOTE '"'
);

-- Step 3: Rebuild indexes (parallel where supported)
ALTER INDEX idx_fact_sales_date ON fact_sales REBUILD;
ALTER INDEX idx_fact_sales_customer ON fact_sales REBUILD;

-- Step 4: Re-enable triggers
ALTER TABLE fact_sales ENABLE TRIGGER ALL;

-- Step 5: Update statistics for the optimizer
ANALYZE fact_sales;

-- Snowflake parallel COPY example
COPY INTO fact_sales
FROM @my_s3_stage/sales/2024-04-15/
FILE_FORMAT = (TYPE = PARQUET)
MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE
FORCE = FALSE
ON_ERROR = CONTINUE;
-- Snowflake automatically parallelizes based on warehouse size and file count
```

Loading operations must be atomic and recoverable. A failed load should leave the warehouse in a consistent state—either the load completes entirely or it's as if it never happened. This requires careful transaction management.
Atomicity patterns: wrap each batch in a single transaction where volume allows; load to staging and swap or attach atomically for larger volumes; tag every row with a batch ID so a partial or bad load can be removed cleanly; and record load status in an audit table only after the data transaction commits. A sketch of the single-transaction variant follows.
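A minimal sketch of the single-transaction pattern, assuming the batch fits comfortably in one transaction and that fact_sales carries a load_batch_id column (the batch ID literal is illustrative):

```sql
-- Either everything in the batch lands, or nothing does.
BEGIN;

-- Remove any rows from a previous, partially applied attempt of this batch.
DELETE FROM fact_sales WHERE load_batch_id = 'batch_20240415';

INSERT INTO fact_sales (
    order_id, date_key, customer_key, product_key,
    quantity, unit_price, discount, net_amount, load_batch_id
)
SELECT
    order_id, date_key, customer_key, product_key,
    quantity, unit_price, discount, net_amount, 'batch_20240415'
FROM enriched_sales_facts;

-- Record success inside the same transaction.
INSERT INTO etl_audit.fact_load_history (table_name, load_date, rows_loaded)
VALUES ('fact_sales', CURRENT_DATE,
        (SELECT COUNT(*) FROM enriched_sales_facts));

COMMIT;  -- any failure before this point rolls the whole batch back
```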
Failure recovery scenarios:
| Failure Point | Recovery Strategy |
|---|---|
| Network failure during extract | Re-extract from checkpoint |
| Transformation error | Fix logic, reprocess failed batch |
| Constraint violation on load | Identify violating rows, fix or route to error table |
| Disk space exhaustion | Clear space, resume from checkpoint |
| Database crash mid-load | Transaction rollback automatic; restart load |
| Load completed but data is incorrect | Back out by batch ID, restore from backup, or reload from staging |
Idempotent loading:
Loads should be idempotent—running them multiple times produces the same result. This enables safe retries without data duplication:
```sql
-- Non-idempotent (running twice doubles data):
INSERT INTO fact_sales SELECT * FROM staging_sales;

-- Idempotent (running twice has the same effect as once):
MERGE INTO fact_sales t
USING staging_sales s ON t.order_id = s.order_id
WHEN MATCHED THEN UPDATE SET ...
WHEN NOT MATCHED THEN INSERT ...;

-- Or with explicit deduplication:
DELETE FROM fact_sales WHERE load_batch_id = 'batch_20240415';
INSERT INTO fact_sales SELECT * FROM staging_sales;
```
Assign each load a unique batch ID stored with every loaded row. This enables: (1) Easy identification of rows from a specific load, (2) Quick removal of a bad load, (3) Audit trail of when data arrived, (4) Idempotent reload by batch. The batch ID is your ETL's tracking mechanism.
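A sketch of the batch-tracking pieces, assuming a load_batch_id column on the fact table and a hypothetical etl_audit.load_batches table:

```sql
-- One row per load run; written by the orchestrator at batch start and end.
CREATE TABLE IF NOT EXISTS etl_audit.load_batches (
    load_batch_id  VARCHAR(50) PRIMARY KEY,
    table_name     VARCHAR(100),
    started_at     TIMESTAMP,
    finished_at    TIMESTAMP,
    rows_loaded    BIGINT,
    status         VARCHAR(20)   -- e.g. RUNNING, SUCCEEDED, FAILED
);

-- Back out a bad load by its batch ID, then mark the batch as failed.
DELETE FROM fact_sales
WHERE load_batch_id = 'batch_20240415';

UPDATE etl_audit.load_batches
SET status = 'FAILED'
WHERE load_batch_id = 'batch_20240415';
```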
Enterprise loading architectures incorporate patterns that ensure reliability, performance, and maintainability across hundreds of tables and loading jobs.
Multi-layer loading architecture:
┌───────────────────────────────────────────────────────────────────┐
│ External Sources │
│ [OLTP DBs] [APIs] [Files] [Streams] │
└────────────────────────┬──────────────────────────────────────────┘
│ Extract
▼
┌───────────────────────────────────────────────────────────────────┐
│ RAW / LANDING ZONE (Bronze) │
│ - Exact copy of source data │
│ - Full history retained │
│ - Minimal transformation │
└────────────────────────┬──────────────────────────────────────────┘
│ Cleanse & Standardize
▼
┌───────────────────────────────────────────────────────────────────┐
│ STAGING / INTEGRATION (Silver) │
│ - Cleansed and validated │
│ - Standardized formats │
│ - Business keys applied │
└────────────────────────┬──────────────────────────────────────────┘
│ Conform & Integrate
▼
┌───────────────────────────────────────────────────────────────────┐
│ WAREHOUSE / PRESENTATION (Gold) │
│ - Dimensional model │
│ - Conformed dimensions │
│ - Query-optimized │
└───────────────────────────────────────────────────────────────────┘
The typical load order is: (1) Reference/lookup tables, (2) Dimension tables (respecting inter-dimension dependencies), (3) Fact tables (requiring dimension keys). Modern orchestrators like Airflow, Dagster, and Prefect manage these dependencies as directed acyclic graphs (DAGs).
The Load phase is where data finally becomes available for analysis. Effective loading balances speed, reliability, consistency, and query performance—ensuring that the warehouse serves its analytical purpose.
What's next:
With the understanding of Extract, Transform, and Load phases complete, we turn to the tools and platforms that implement these patterns at scale. The next page explores ETL tools from traditional enterprise platforms to modern cloud-native solutions.
You now understand the Load phase: loading patterns, merge operations, slowly changing dimension techniques, fact table loading, bulk optimization, transaction management, and architectural best practices. Next, we'll survey the ETL tools and platforms that bring these concepts to life.