The Load phase represents the culmination of the ETL pipeline—the moment when extracted, cleansed, standardized, and enriched data finally reaches its target destination. This isn't merely a file copy or bulk insert; it's a carefully orchestrated process that must maintain data integrity, handle historical changes, optimize for query performance, and complete within operational windows.
Loading strategies directly impact how analysts and applications consume data. Decisions made here determine whether users see consistent snapshots or partially updated states, whether historical analysis is possible, whether queries run in seconds or minutes, and whether the warehouse can scale as data volumes grow.
The Load phase must answer critical questions: How do we handle records that already exist in the target? How do we preserve historical values when dimensions change? How do we load billions of rows within a nightly window? How do we ensure loading failures don't corrupt the warehouse?
These questions don't have universal answers—the right loading strategy depends on business requirements, data characteristics, and infrastructure capabilities. What's universal is the need to understand the trade-offs and choose deliberately.
By the end of this page, you will master the complete spectrum of data loading: initial vs. incremental loading patterns, merge/upsert operations, slowly changing dimension techniques (Type 1, Type 2, Type 3), bulk loading optimizations, transaction handling, and the architectural patterns that enable enterprise-scale data warehousing.
Loading strategies fall into several fundamental patterns, each appropriate for different scenarios. Understanding these patterns enables you to select the right approach for each table and use case.
The loading strategy decision tree:
| Pattern | Description | Use Case | Complexity |
|---|---|---|---|
| Full Refresh (Truncate/Load) | Delete all existing data, reload completely | Small tables, reference data, complete rebuilds | Low |
| Append Only | Insert new records without touching existing | Immutable event data, log tables, fact tables | Low |
| Upsert (Merge) | Insert new, update existing based on key | Dimension tables, mutable entities | Medium |
| Incremental with Deletes | Insert, update, and soft/hard delete | Complete synchronization with source | Medium-High |
| Type 2 SCD Load | Preserve history with versioning | Dimension tracking over time | High |
| Partition Swap | Load to staging partition, swap atomically | Large fact tables, minimal downtime | High |
Initial load vs. incremental load:
Most data warehouses have two distinct loading phases:
Initial Load (Backfill): a one-time (or rare) load of the complete historical dataset, typically performed when the warehouse is first built or when a table must be rebuilt from scratch.
Incremental Load (Delta): the recurring load that applies only the changes since the previous run, usually on a daily or hourly schedule.
Initial Load: [═══════════════════════════════════════════]
Full historical data (years of transactions)
Incremental: [═══] [═══] [═══] [═══] [═══] ...
Daily changes only
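A minimal sketch of a watermark-driven incremental load, assuming a hypothetical `etl_audit.load_watermark` table and an `updated_at` column on the extracted source data (table and column names are illustrative):

```sql
-- Pull only rows changed since the last successful load.
-- etl_audit.load_watermark and raw_orders are hypothetical names.
INSERT INTO staging_orders_delta
SELECT *
FROM raw_orders r
WHERE r.updated_at > (
    SELECT last_loaded_at
    FROM etl_audit.load_watermark
    WHERE table_name = 'raw_orders'
);

-- Advance the watermark only after the load commits successfully.
UPDATE etl_audit.load_watermark
SET last_loaded_at = (SELECT MAX(updated_at) FROM staging_orders_delta)
WHERE table_name = 'raw_orders';
```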
Best practice: Load transformed data to staging tables first, then merge or swap into production tables. This separation enables validation before production impact, provides rollback capability, and isolates transformation failures from the warehouse. Production tables only see validated, complete data.
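One way to realize this staging-then-swap pattern is a PostgreSQL-flavored table rename; the staging and backup table names here are illustrative:

```sql
-- Load and validate outside production, then swap in with a near-instant rename.
CREATE TABLE dim_product_stage (LIKE dim_product INCLUDING ALL);

INSERT INTO dim_product_stage
SELECT * FROM transformed_products;

-- Validation gate: check row counts, NOT NULL business keys, referential
-- spot checks here, and abort (leaving production untouched) if any fail.

BEGIN;
ALTER TABLE dim_product RENAME TO dim_product_old;
ALTER TABLE dim_product_stage RENAME TO dim_product;
COMMIT;

DROP TABLE dim_product_old;  -- or keep briefly as a rollback point
```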
Merge (also called Upsert) is the most common loading pattern for dimension tables and mutable entities. It combines insert and update logic in a single operation: if a record with the matching key exists, update it; otherwise, insert it.
Most modern databases provide native MERGE statements, though implementations vary in syntax and capabilities.
```sql
-- SQL Server MERGE syntax (Oracle's MERGE is similar,
-- but has no NOT MATCHED BY SOURCE clause)
MERGE INTO dim_product AS target
USING staging_products AS source
ON target.product_id = source.product_id

-- When record exists, update it
WHEN MATCHED THEN UPDATE SET
    target.product_name = source.product_name,
    target.category = source.category,
    target.unit_price = source.unit_price,
    target.last_updated = CURRENT_TIMESTAMP

-- When record doesn't exist, insert it
WHEN NOT MATCHED BY TARGET THEN INSERT (
    product_id, product_name, category, unit_price, created_at, last_updated
) VALUES (
    source.product_id, source.product_name, source.category,
    source.unit_price, CURRENT_TIMESTAMP, CURRENT_TIMESTAMP
)

-- Optionally handle deletes (records in target not in source)
WHEN NOT MATCHED BY SOURCE THEN UPDATE SET
    target.is_deleted = 1,
    target.deleted_at = CURRENT_TIMESTAMP;

-- PostgreSQL UPSERT (INSERT ... ON CONFLICT)
INSERT INTO dim_product (
    product_id, product_name, category, unit_price, created_at, last_updated
)
SELECT
    product_id, product_name, category, unit_price,
    CURRENT_TIMESTAMP, CURRENT_TIMESTAMP
FROM staging_products
ON CONFLICT (product_id) DO UPDATE SET
    product_name = EXCLUDED.product_name,
    category = EXCLUDED.category,
    unit_price = EXCLUDED.unit_price,
    last_updated = CURRENT_TIMESTAMP;

-- Snowflake MERGE syntax
MERGE INTO dim_product target
USING staging_products source
ON target.product_id = source.product_id

WHEN MATCHED AND (
    target.product_name != source.product_name
    OR target.category != source.category
    OR target.unit_price != source.unit_price
) THEN UPDATE SET
    product_name = source.product_name,
    category = source.category,
    unit_price = source.unit_price,
    last_updated = CURRENT_TIMESTAMP()

WHEN NOT MATCHED THEN INSERT (
    product_id, product_name, category, unit_price, created_at, last_updated
) VALUES (
    source.product_id, source.product_name, source.category,
    source.unit_price, CURRENT_TIMESTAMP(), CURRENT_TIMESTAMP()
);
```

Optimizing merge operations:
- Only update when changed: add a condition that checks whether values actually differ before updating. Unnecessary updates waste I/O and bloat transaction logs.
- Hash comparison: for wide tables, compute row-level hashes to quickly identify changed records (see the sketch after this list): `WHEN MATCHED AND target.row_hash != source.row_hash THEN UPDATE...`
- Batch sizing: large merges may cause lock contention. Break them into batches by key range or limit rows per transaction.
- Index considerations: temporarily disable non-essential indexes during large merges and rebuild them afterward.
- Statistics refresh: update table statistics after significant merges so the query optimizer retains accurate estimates.
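A PostgreSQL-flavored sketch of the hash approach, assuming a `row_hash` column on staging and an MD5 over concatenated tracked columns (column list and delimiter are illustrative):

```sql
-- Compute a change-detection hash per row while populating staging.
-- COALESCE guards against NULLs collapsing the concatenation.
UPDATE staging_products
SET row_hash = MD5(
    COALESCE(product_name, '')      || '|' ||
    COALESCE(category, '')          || '|' ||
    COALESCE(unit_price::TEXT, '')
);

-- The merge then touches only rows whose hash differs:
-- WHEN MATCHED AND target.row_hash != source.row_hash THEN UPDATE ...
```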
Watch for: (1) Non-deterministic matches—if multiple source rows match one target row, behavior is undefined or raises an error, so deduplicate staging first (see the sketch below). (2) Concurrent modifications—if the target table is modified during the merge, conflicts or lock waits may occur. (3) NULL handling—in standard SQL, NULL = NULL evaluates to unknown, so NULL keys or attributes never match unless handled with COALESCE or IS NOT DISTINCT FROM.
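A minimal deduplication sketch, assuming an `updated_at` column identifies the most recent version of each staged row:

```sql
-- Keep only the latest staged row per business key before merging.
-- Adjust the ORDER BY to whatever defines "latest" in your data.
CREATE TEMP TABLE staging_products_dedup AS
SELECT *
FROM (
    SELECT s.*,
           ROW_NUMBER() OVER (
               PARTITION BY product_id
               ORDER BY updated_at DESC
           ) AS rn
    FROM staging_products s
) ranked
WHERE rn = 1;  -- drop the rn column or list columns explicitly if preferred
```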
Slowly Changing Dimensions (SCD) address a fundamental challenge in dimensional modeling: dimension attributes change over time, but historical analysis often needs to see the values as they were when facts occurred, not the current values.
For example, when a customer moves from California to Texas, should their historical orders show 'California' (where they were when they ordered) or 'Texas' (where they are now)? The answer depends on the analytical question—and SCD techniques provide the mechanisms to support both.
SCD Type Classification:
| Type | Strategy | History | Implementation |
|---|---|---|---|
| Type 0 | Retain original value | Never changes | No update logic needed |
| Type 1 | Overwrite with current value | No history preserved | Simple UPDATE statement |
| Type 2 | Add new row for each change | Full history preserved | Expire old row, insert new |
| Type 3 | Add column for previous value | Limited history (1 previous) | Store prior value in separate column |
| Type 4 | Separate history table | Full history in mini-dimension | Current in main, history in separate |
| Type 6 | Hybrid: Type 1 + 2 + 3 | Full history + current flag | Combines multiple techniques |
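Before going deeper on Type 2, minimal sketches of Type 1 and Type 3 handling, using PostgreSQL-style UPDATE ... FROM; the columns previous_state and state_change_date are illustrative:

```sql
-- Type 1: overwrite in place, no history kept.
UPDATE dim_customer d
SET state = s.state,
    last_updated = CURRENT_TIMESTAMP
FROM staging_customers s
WHERE d.customer_id = s.customer_id
  AND d.state IS DISTINCT FROM s.state;

-- Type 3: keep the prior value in a dedicated column (one level of history).
UPDATE dim_customer d
SET previous_state = d.state,        -- d.state still refers to the old value
    state = s.state,
    state_change_date = CURRENT_DATE
FROM staging_customers s
WHERE d.customer_id = s.customer_id
  AND d.state IS DISTINCT FROM s.state;
```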
Type 2 SCD in depth:
Type 2 is the most powerful and commonly used SCD technique. It preserves complete history by creating new dimension records for each change, with metadata columns indicating validity periods.
Type 2 dimension structure:
dim_customer:
┌─────────────┬─────────────┬───────────┬───────────────┬───────────────┬────────────┐
│ customer_sk │ customer_id │ state │ effective_dt │ expiration_dt │ is_current │
├─────────────┼─────────────┼───────────┼───────────────┼───────────────┼────────────┤
│ 1001 │ C001 │ California│ 2020-01-01 │ 2023-06-14 │ FALSE │
│ 1047 │ C001 │ Texas │ 2023-06-15 │ 9999-12-31 │ TRUE │
└─────────────┴─────────────┴───────────┴───────────────┴───────────────┴────────────┘
The surrogate key (customer_sk) uniquely identifies each version. The natural key (customer_id) identifies the business entity. Effective and expiration dates define the validity window. The is_current flag enables fast filtering to current records.
```sql
-- Type 2 SCD Loading Pattern

-- Step 1: Identify changed records (records where tracked attributes differ)
CREATE TEMP TABLE scd_changes AS
SELECT
    s.customer_id, s.customer_name, s.address, s.city, s.state, s.segment,
    'UPDATE' AS change_type
FROM staging_customers s
JOIN dim_customer d
    ON s.customer_id = d.customer_id
    AND d.is_current = TRUE
WHERE (
    s.customer_name != d.customer_name
    OR s.address != d.address
    OR s.city != d.city
    OR s.state != d.state
    OR s.segment != d.segment
)

UNION ALL

-- Identify new records (inserts)
SELECT
    s.customer_id, s.customer_name, s.address, s.city, s.state, s.segment,
    'INSERT' AS change_type
FROM staging_customers s
LEFT JOIN dim_customer d
    ON s.customer_id = d.customer_id
    AND d.is_current = TRUE
WHERE d.customer_id IS NULL;

-- Step 2: Expire current records that have changes
UPDATE dim_customer
SET expiration_date = CURRENT_DATE - INTERVAL '1 day',
    is_current = FALSE
WHERE customer_id IN (
    SELECT customer_id FROM scd_changes WHERE change_type = 'UPDATE'
)
AND is_current = TRUE;

-- Step 3: Insert new versions (both updates and new inserts)
INSERT INTO dim_customer (
    customer_sk, customer_id, customer_name, address, city, state, segment,
    effective_date, expiration_date, is_current
)
SELECT
    NEXTVAL('customer_sk_seq'),
    customer_id, customer_name, address, city, state, segment,
    CURRENT_DATE,
    '9999-12-31'::DATE,
    TRUE
FROM scd_changes;

-- Verification: Count records by type
SELECT change_type, COUNT(*) AS record_count
FROM scd_changes
GROUP BY change_type;
```

When facts join to Type 2 dimensions, use the surrogate key stored with the fact at transaction time—this captures the dimension state when the fact occurred. For analyses needing current dimension values, join on the natural key WHERE is_current = TRUE. Understanding this distinction is critical for accurate historical analysis.
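Two query sketches illustrating that distinction, using the table and column names from the example above:

```sql
-- As-of analysis: the fact row carries the surrogate key that was current
-- when the transaction occurred, so a plain key join returns historical values.
SELECT d.state, SUM(f.net_amount) AS revenue
FROM fact_sales f
JOIN dim_customer d ON f.customer_key = d.customer_sk
GROUP BY d.state;

-- Current-view analysis: re-join through the natural key and restrict to the
-- current dimension row to report against today's attribute values.
SELECT cur.state, SUM(f.net_amount) AS revenue
FROM fact_sales f
JOIN dim_customer hist ON f.customer_key = hist.customer_sk
JOIN dim_customer cur
    ON cur.customer_id = hist.customer_id
    AND cur.is_current = TRUE
GROUP BY cur.state;
```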
Fact tables present unique loading challenges due to their size (often billions of rows) and performance-critical nature (the primary query target). Loading strategies must optimize for both load performance and query performance.
Fact table characteristics that impact loading: sheer row counts, append-dominant write patterns, foreign keys that must resolve to dimension surrogate keys, and date-based partitioning schemes.
Partition-based loading:
For large fact tables, partition exchange loading is the gold standard:
┌─────────────────────────────────────────────────────────┐
│ fact_sales Table │
├─────────────┬─────────────┬─────────────┬──────────────┤
│ 2024-01 │ 2024-02 │ 2024-03 │ 2024-04 │
│ (partition) │ (partition) │ (partition) │ (STAGING) │
│ ↓ │ ↓ │ ↓ │ ↓ │
│ [Complete] │ [Complete] │ [Complete] │ [Loading] ──┼──▶ [Swap In]
└─────────────┴─────────────┴─────────────┴──────────────┘
Benefits: the swap is a metadata-only operation, so it completes in seconds regardless of partition size; queries never observe a partially loaded partition; a bad load can be discarded by simply not swapping; and the heavy write work happens outside the production table.
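A PostgreSQL-flavored sketch of the swap itself, assuming fact_sales is range-partitioned by order_date and the April data is loaded into a standalone table with matching structure (Oracle's equivalent is ALTER TABLE ... EXCHANGE PARTITION):

```sql
-- Load the month into a standalone table shaped like one partition.
CREATE TABLE fact_sales_2024_04_stage
    (LIKE fact_sales INCLUDING DEFAULTS INCLUDING CONSTRAINTS);

COPY fact_sales_2024_04_stage
FROM '/data/staging/sales_2024_04.csv'
WITH (FORMAT CSV, HEADER TRUE);

-- Validate row counts, totals, and key resolution here before touching production.

-- Atomically attach the fully loaded table as the April partition.
BEGIN;
ALTER TABLE fact_sales
    ATTACH PARTITION fact_sales_2024_04_stage
    FOR VALUES FROM ('2024-04-01') TO ('2024-05-01');
COMMIT;
```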
Beyond partition management, fact loading must resolve dimension surrogate keys. The pattern below stages the facts, performs point-in-time lookups, validates, and then loads:

```sql
-- Fact table loading with dimension key lookup

-- Step 1: Load staging fact data with natural keys
CREATE TEMP TABLE staging_sales_facts (
    order_id VARCHAR(50),
    order_date DATE,
    customer_natural_key VARCHAR(50),
    product_natural_key VARCHAR(50),
    quantity INT,
    unit_price DECIMAL(10,2),
    discount DECIMAL(10,2)
);

-- Object-storage COPY syntax is platform-specific; shown here for illustration
COPY staging_sales_facts
FROM 's3://data-lake/sales/2024-04-15/*.parquet';

-- Step 2: Lookup dimension surrogate keys
CREATE TEMP TABLE enriched_sales_facts AS
SELECT
    s.order_id,
    s.order_date,

    -- Date dimension key (typically the date itself or formatted)
    TO_CHAR(s.order_date, 'YYYYMMDD')::INT AS date_key,

    -- Customer dimension surrogate key (point-in-time lookup)
    COALESCE(c.customer_sk, -1) AS customer_key,

    -- Product dimension surrogate key (current product record)
    COALESCE(p.product_sk, -1) AS product_key,

    -- Measures
    s.quantity,
    s.unit_price,
    s.discount,
    (s.quantity * s.unit_price) - s.discount AS net_amount,

    -- Audit columns
    CURRENT_TIMESTAMP AS load_timestamp
FROM staging_sales_facts s

-- Customer lookup: Find the customer version valid on the order date
LEFT JOIN dim_customer c
    ON s.customer_natural_key = c.customer_natural_key
    AND s.order_date >= c.effective_date
    AND s.order_date < c.expiration_date

-- Product lookup: Current product (could also be point-in-time)
LEFT JOIN dim_product p
    ON s.product_natural_key = p.product_natural_key
    AND p.is_current = TRUE;

-- Step 3: Validate before loading
-- Check for unresolved dimension keys
SELECT 'Unresolved customers' AS issue, COUNT(*) AS count
FROM enriched_sales_facts WHERE customer_key = -1
UNION ALL
SELECT 'Unresolved products', COUNT(*)
FROM enriched_sales_facts WHERE product_key = -1;

-- Step 4: Load to production fact table
INSERT INTO fact_sales (
    order_id, date_key, customer_key, product_key,
    quantity, unit_price, discount, net_amount, load_timestamp
)
SELECT
    order_id, date_key, customer_key, product_key,
    quantity, unit_price, discount, net_amount, load_timestamp
FROM enriched_sales_facts;

-- Step 5: Update load metadata
-- (load_start_time is assumed to be captured earlier by the orchestration job)
INSERT INTO etl_audit.fact_load_history (
    table_name, load_date, rows_loaded, load_duration_seconds
)
VALUES (
    'fact_sales',
    CURRENT_DATE,
    (SELECT COUNT(*) FROM enriched_sales_facts),
    EXTRACT(EPOCH FROM (CURRENT_TIMESTAMP - load_start_time))
);
```

Transactions sometimes arrive after their natural time period has closed (delayed feeds, corrections, restatements). Loading late-arriving facts requires: (1) the ability to insert into historical partitions, (2) correct dimension key lookup using the transaction date rather than the load date, and (3) potential aggregate recalculation if summary tables exist. Design for late arrivals from the start.
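A minimal sketch of the aggregate-recalculation step for late arrivals, assuming a hypothetical agg_daily_sales summary table and that the late batch is still available in enriched_sales_facts:

```sql
-- Rebuild only the summary rows for dates touched by the late-arriving batch.
-- agg_daily_sales is a hypothetical daily summary table.
DELETE FROM agg_daily_sales
WHERE date_key IN (SELECT DISTINCT date_key FROM enriched_sales_facts);

INSERT INTO agg_daily_sales (date_key, total_quantity, total_net_amount)
SELECT date_key, SUM(quantity), SUM(net_amount)
FROM fact_sales
WHERE date_key IN (SELECT DISTINCT date_key FROM enriched_sales_facts)
GROUP BY date_key;
```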
Loading large data volumes efficiently requires techniques that bypass normal transaction processing overhead. Bulk loading operations sacrifice some safety features for dramatic performance improvement.
Standard insert vs. bulk load: row-by-row INSERTs pay per-statement overhead for parsing, constraint checking, index maintenance, and transaction logging; bulk loaders stream data in large blocks, minimize logging, and defer index work, which is typically far faster on large volumes.
Bulk loading techniques by platform:
| Platform | Bulk Load Command | Key Options |
|---|---|---|
| PostgreSQL | COPY | FROM file, binary format |
| SQL Server | BULK INSERT, bcp | TABLOCK, ROWS_PER_BATCH |
| Oracle | SQL*Loader, External Tables | DIRECT path, PARALLEL |
| Snowflake | COPY INTO | FROM stage, FILE_FORMAT |
| BigQuery | bq load, LOAD DATA | source format (Avro/Parquet/CSV), schema autodetect |
| Redshift | COPY | FROM S3, MANIFEST, COMPROWS |
Performance optimization strategies: disable or drop non-essential indexes and triggers for the duration of the load, load pre-sorted and compressed files (columnar formats where supported), parallelize file ingestion, size batches to the platform's sweet spot, and refresh optimizer statistics afterward; the pattern below applies several of these.
```sql
-- Optimized bulk loading pattern for a large fact table
-- (syntax varies by platform: DISABLE/REBUILD index statements are
-- SQL Server-style, while COPY and ANALYZE are PostgreSQL-style)

-- Step 1: Prepare target table
ALTER TABLE fact_sales DISABLE TRIGGER ALL;
ALTER INDEX idx_fact_sales_date ON fact_sales DISABLE;
ALTER INDEX idx_fact_sales_customer ON fact_sales DISABLE;

-- Step 2: Perform bulk load with optimal settings (PostgreSQL example)
COPY fact_sales (
    date_key, customer_key, product_key,
    quantity, unit_price, net_amount
)
FROM '/data/staging/sales_20240415.csv'
WITH (
    FORMAT CSV,
    HEADER TRUE,
    NULL '',
    QUOTE '"'
);

-- Step 3: Rebuild indexes (parallel where supported)
ALTER INDEX idx_fact_sales_date ON fact_sales REBUILD;
ALTER INDEX idx_fact_sales_customer ON fact_sales REBUILD;

-- Step 4: Re-enable triggers
ALTER TABLE fact_sales ENABLE TRIGGER ALL;

-- Step 5: Update statistics for the optimizer
ANALYZE fact_sales;

-- Snowflake parallel COPY example
COPY INTO fact_sales
FROM @my_s3_stage/sales/2024-04-15/
FILE_FORMAT = (TYPE = PARQUET)
MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE
FORCE = FALSE
ON_ERROR = CONTINUE;
-- Snowflake automatically parallelizes based on warehouse size and file count
```

Loading operations must be atomic and recoverable. A failed load should leave the warehouse in a consistent state—either the load completes entirely or it's as if it never happened. This requires careful transaction management.
Atomicity patterns: wrap each batch in a single transaction where volume allows; load to staging and swap or attach atomically for larger volumes; tag every row with a batch ID so a partial or bad load can be removed cleanly; and record load status in an audit table only after the data transaction commits. A sketch of the single-transaction variant follows.
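A minimal sketch of the single-transaction pattern, assuming the batch fits comfortably in one transaction and that fact_sales carries a load_batch_id column (the batch ID literal is illustrative):

```sql
-- Either everything in the batch lands, or nothing does.
BEGIN;

-- Remove any rows from a previous, partially applied attempt of this batch.
DELETE FROM fact_sales WHERE load_batch_id = 'batch_20240415';

INSERT INTO fact_sales (
    order_id, date_key, customer_key, product_key,
    quantity, unit_price, discount, net_amount, load_batch_id
)
SELECT
    order_id, date_key, customer_key, product_key,
    quantity, unit_price, discount, net_amount, 'batch_20240415'
FROM enriched_sales_facts;

-- Record success inside the same transaction.
INSERT INTO etl_audit.fact_load_history (table_name, load_date, rows_loaded)
VALUES ('fact_sales', CURRENT_DATE,
        (SELECT COUNT(*) FROM enriched_sales_facts));

COMMIT;  -- any failure before this point rolls the whole batch back
```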
Failure recovery scenarios:
| Failure Point | Recovery Strategy |
|---|---|
| Network failure during extract | Re-extract from checkpoint |
| Transformation error | Fix logic, reprocess failed batch |
| Constraint violation on load | Identify violating rows, fix or route to error table |
| Disk space exhaustion | Clear space, resume from checkpoint |
| Database crash mid-load | Transaction rollback automatic; restart load |
| Load completed but data is incorrect | Back out by batch ID, restore from backup, or reload from staging |
Idempotent loading:
Loads should be idempotent—running them multiple times produces the same result. This enables safe retries without data duplication:
```sql
-- Non-idempotent (running twice doubles data):
INSERT INTO fact_sales SELECT * FROM staging_sales;

-- Idempotent (running twice has the same effect as once):
MERGE INTO fact_sales t
USING staging_sales s ON t.order_id = s.order_id
WHEN MATCHED THEN UPDATE SET ...
WHEN NOT MATCHED THEN INSERT ...;

-- Or with explicit deduplication:
DELETE FROM fact_sales WHERE load_batch_id = 'batch_20240415';
INSERT INTO fact_sales SELECT * FROM staging_sales;
```
Assign each load a unique batch ID stored with every loaded row. This enables: (1) Easy identification of rows from a specific load, (2) Quick removal of a bad load, (3) Audit trail of when data arrived, (4) Idempotent reload by batch. The batch ID is your ETL's tracking mechanism.
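A sketch of the batch-tracking pieces, assuming a load_batch_id column on the fact table and a hypothetical etl_audit.load_batches table:

```sql
-- One row per load run; written by the orchestrator at batch start and end.
CREATE TABLE IF NOT EXISTS etl_audit.load_batches (
    load_batch_id  VARCHAR(50) PRIMARY KEY,
    table_name     VARCHAR(100),
    started_at     TIMESTAMP,
    finished_at    TIMESTAMP,
    rows_loaded    BIGINT,
    status         VARCHAR(20)   -- e.g. RUNNING, SUCCEEDED, FAILED
);

-- Back out a bad load by its batch ID, then mark the batch as failed.
DELETE FROM fact_sales
WHERE load_batch_id = 'batch_20240415';

UPDATE etl_audit.load_batches
SET status = 'FAILED'
WHERE load_batch_id = 'batch_20240415';
```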
Enterprise loading architectures incorporate patterns that ensure reliability, performance, and maintainability across hundreds of tables and loading jobs.
Multi-layer loading architecture:
┌───────────────────────────────────────────────────────────────────┐
│ External Sources │
│ [OLTP DBs] [APIs] [Files] [Streams] │
└────────────────────────┬──────────────────────────────────────────┘
│ Extract
▼
┌───────────────────────────────────────────────────────────────────┐
│ RAW / LANDING ZONE (Bronze) │
│ - Exact copy of source data │
│ - Full history retained │
│ - Minimal transformation │
└────────────────────────┬──────────────────────────────────────────┘
│ Cleanse & Standardize
▼
┌───────────────────────────────────────────────────────────────────┐
│ STAGING / INTEGRATION (Silver) │
│ - Cleansed and validated │
│ - Standardized formats │
│ - Business keys applied │
└────────────────────────┬──────────────────────────────────────────┘
│ Conform & Integrate
▼
┌───────────────────────────────────────────────────────────────────┐
│ WAREHOUSE / PRESENTATION (Gold) │
│ - Dimensional model │
│ - Conformed dimensions │
│ - Query-optimized │
└───────────────────────────────────────────────────────────────────┘
The typical load order is: (1) Reference/lookup tables, (2) Dimension tables (respecting inter-dimension dependencies), (3) Fact tables (requiring dimension keys). Modern orchestrators like Airflow, Dagster, and Prefect manage these dependencies as directed acyclic graphs (DAGs).
The Load phase is where data finally becomes available for analysis. Effective loading balances speed, reliability, consistency, and query performance—ensuring that the warehouse serves its analytical purpose.
What's next:
With the understanding of Extract, Transform, and Load phases complete, we turn to the tools and platforms that implement these patterns at scale. The next page explores ETL tools from traditional enterprise platforms to modern cloud-native solutions.
You now understand the Load phase: loading patterns, merge operations, slowly changing dimension techniques, fact table loading, bulk optimization, transaction management, and architectural best practices. Next, we'll survey the ETL tools and platforms that bring these concepts to life.