An organization's data systems don't operate in silos. The sales record inserted into an OLTP system at 2 PM must eventually appear in the OLAP system where analysts calculate quarterly revenue. The customer who updates their preferences in an application expects those preferences to be reflected in personalized recommendations driven by analytical models.
Integrating OLTP and OLAP systems is one of the most challenging aspects of enterprise data architecture. These systems have fundamentally different data models, consistency guarantees, and performance characteristics—yet business processes require data to flow between them reliably.
This page examines the integration challenges in depth:
By the end of this page, you will understand the core integration challenges between OLTP and OLAP systems: ETL vs. ELT pipelines, batch vs. streaming data movement, consistency and freshness trade-offs, schema evolution strategies, and modern architectural patterns (Lambda, Kappa, Data Lakehouse) that address these challenges.
Data must flow from operational systems to analytical systems. This sounds simple but involves numerous challenges:
The Source Systems Challenge:
Real enterprises have multiple OLTP systems:
Each system has its own:
The Target System Challenge:
The data warehouse must:
| Challenge | Description | Impact |
|---|---|---|
| Source Diversity | Multiple systems with different schemas/technologies | Complex extraction logic per source |
| Data Volume | Hundreds of GB to TB transferred regularly | Network bandwidth, transfer time |
| Data Freshness | Business needs recent data for decisions | Trade-off: freshness vs. cost/complexity |
| Data Quality | Missing, duplicate, inconsistent source data | Garbage in, garbage out |
| Schema Drift | Source schemas change over time | Pipelines break, data misaligned |
| Dependency Management | Table B depends on Table A being loaded first | Orchestration complexity |
Data engineers often spend 80% of their time on data quality, extraction, and integration challenges—and only 20% on the "interesting" work of building analytical models and dashboards. Underestimating integration complexity is the most common cause of data warehouse project failures.
Two paradigms dominate data movement between OLTP and OLAP systems:
ETL: Extract, Transform, Load
The traditional approach:
Transformation happens before loading—data enters the warehouse already in its final analytical schema.
ELT: Extract, Load, Transform
The modern approach:
Transformation happens after loading—raw data is preserved, and transformations are performed using SQL or warehouse-native tools.
Cloud data warehouses (Snowflake, BigQuery, Redshift) have made ELT the dominant paradigm. Their elastic compute, cheap storage, and SQL-based transformation (dbt) favor loading raw data and transforming inside the warehouse. ETL remains relevant for complex transformations involving non-SQL logic or sensitive data that shouldn't reach the warehouse in raw form.
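To make the ELT pattern concrete, the sketch below assumes an extraction tool has already landed untransformed rows in a hypothetical `raw.orders` table; the cleanup then happens entirely inside the warehouse with SQL, which is the pattern dbt automates. Table and column names are illustrative only.

```sql
-- ELT: raw data is already loaded; transformation runs inside the warehouse.
-- In dbt this would be a model containing just the SELECT; here it is materialized directly.
CREATE TABLE analytics.orders_cleaned AS
SELECT
    order_id,
    customer_id,
    LOWER(TRIM(status))                              AS order_status,   -- standardize status values
    CAST(total_amount_cents AS DECIMAL(12,2)) / 100  AS total_amount,   -- cents (OLTP convention) to decimal
    CAST(created_at AS DATE)                         AS order_date
FROM raw.orders
WHERE order_id IS NOT NULL;   -- basic quality filter applied after loading, not before
```

Because the raw table is preserved, the transformation can be revised and re-run at any time without touching the source system.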
Data can flow from OLTP to OLAP in batches or as a continuous stream:
Batch Processing:
Data is extracted and loaded at regular intervals:
Advantages:
Disadvantages:
Stream Processing:
Changes flow continuously in near-real-time:
Advantages:
Disadvantages:
```sql
-- BATCH: Incremental Load Pattern
-- ================================

-- Extract changed records since last load
-- (using modified_at timestamp or CDC markers)

-- Step 1: In OLTP source (or via CDC)
SELECT *
FROM orders
WHERE modified_at > '2024-01-15 00:00:00'
  AND modified_at <= '2024-01-16 00:00:00';

-- Step 2: Stage in warehouse
INSERT INTO staging.orders_incremental
SELECT * FROM external_source.orders_changes;

-- Step 3: Merge into target (upsert pattern)
MERGE INTO warehouse.dim_orders AS target
USING staging.orders_incremental AS source
ON target.order_id = source.order_id
WHEN MATCHED THEN UPDATE SET
    status = source.status,
    total_amount = source.total_amount,
    modified_at = source.modified_at
WHEN NOT MATCHED THEN INSERT
    (order_id, customer_id, status, total_amount, created_at, modified_at)
    VALUES (source.order_id, source.customer_id, source.status,
            source.total_amount, source.created_at, source.modified_at);

-- STREAMING: CDC-based Continuous Load
-- ====================================

-- Kafka Connect captures OLTP transaction log changes
-- Stream processing applies changes to warehouse

-- Pseudo-code for stream processor:
-- ON each CDC event:
--   IF event.operation = 'INSERT' THEN
--     INSERT INTO warehouse.fact_orders ...
--   ELSE IF event.operation = 'UPDATE' THEN
--     UPDATE warehouse.fact_orders SET ... WHERE ...
--   ELSE IF event.operation = 'DELETE' THEN
--     -- Handle deletes (soft delete, archive, or hard delete)
--
-- Buffer changes and micro-batch commit every 1-5 seconds
```

| Aspect | Batch | Streaming |
|---|---|---|
| Latency | Hours to a day | Seconds to minutes |
| Complexity | Low to medium | High |
| Infrastructure | Simple (ETL tool + scheduler) | Complex (Kafka, Flink, CDC) |
| Cost | Lower (runs periodically) | Higher (always-on) |
| Error Recovery | Re-run failed batch | Complex replay/reset |
| Use Cases | Reports, historical analysis | Real-time dashboards, alerts |
Many organizations use hybrid approaches: streaming for critical metrics that need real-time visibility, and batch for comprehensive historical loads. The Lambda Architecture (batch + streaming paths) formalizes this pattern, though it adds significant complexity.
Ensuring OLAP data accurately reflects OLTP reality involves fundamental trade-offs:
The Consistency Challenge:
OLTP systems provide transactional consistency—at any moment, querying the database shows a complete, correct state. OLAP systems receive data asynchronously, creating consistency challenges:
The Freshness Challenge:
Freshness and cost are in tension:
Business requirements should drive freshness targets—not technical convenience.
| Tier | Latency | Use Cases | Implementation |
|---|---|---|---|
| Real-time | < 1 minute | Fraud detection, live dashboards, alerts | CDC streaming, Kafka, Flink |
| Near-real-time | 1-15 minutes | Operational monitoring, session analytics | Micro-batch, CDC with buffering |
| Hourly | 1 hour | Intraday reporting, executive dashboards | Scheduled batch, incremental loads |
| Daily | 24 hours | Standard BI reports, historical analysis | Nightly batch loads |
| Weekly/Monthly | 7-30 days | Financial close, regulatory reporting | Scheduled batch with reconciliation |
Strategies for Consistency:
OLAP data is always eventually consistent with OLTP sources—the only question is how eventual. Accept this reality and design systems that make consistency boundaries explicit. Users should know: 'This dashboard shows data as of 3 hours ago' rather than believing they're seeing real-time truth.
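One lightweight way to make that boundary explicit is a load watermark that every pipeline run records and every dashboard surfaces. A minimal sketch, assuming a hypothetical `warehouse.load_audit` table (the DATEDIFF form shown is Snowflake-style and varies by warehouse):

```sql
-- Each successful load appends a watermark row (audit table is hypothetical)
INSERT INTO warehouse.load_audit (table_name, loaded_through, loaded_at)
VALUES ('fact_orders', '2024-01-16 00:00:00', CURRENT_TIMESTAMP);

-- Dashboards query the watermark to show "data as of ..." and how far behind it is
SELECT
    table_name,
    MAX(loaded_through)                                         AS data_as_of,
    DATEDIFF('minute', MAX(loaded_through), CURRENT_TIMESTAMP)  AS minutes_behind_source
FROM warehouse.load_audit
WHERE table_name = 'fact_orders'
GROUP BY table_name;
```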
OLTP schemas change over time, but OLAP systems must continue functioning:
Types of Schema Changes:
The Coordination Problem:
OLTP application teams ship database changes to meet feature requirements. They may not consider (or even know about) downstream OLAP dependencies. A column rename in the source system breaks ETL pipelines at 2 AM.
Schema Evolution Strategies:
```sql
-- Defensive Schema Evolution Pattern
-- ==================================

-- 1. Extract to flexible staging (raw JSON or all-string columns)
CREATE TABLE staging.orders_raw (
    raw_data      VARIANT,        -- Snowflake VARIANT / BigQuery JSON
    extracted_at  TIMESTAMP,
    source_system VARCHAR(100)
);

-- 2. Transform with explicit mapping and defaults
CREATE TABLE warehouse.fact_orders AS
SELECT
    raw_data:order_id::INTEGER    AS order_id,
    raw_data:customer_id::INTEGER AS customer_id,
    -- Handle column rename: old='amount', new='total_amount'
    COALESCE(
        raw_data:total_amount::DECIMAL(12,2),
        raw_data:amount::DECIMAL(12,2)
    ) AS order_amount,
    -- Handle new column with default
    COALESCE(raw_data:currency::VARCHAR(3), 'USD') AS currency,
    -- Handle type change gracefully
    TRY_CAST(raw_data:order_date AS DATE) AS order_date,
    extracted_at
FROM staging.orders_raw;

-- 3. Automated schema change detection
-- Compare current source schema to expected schema
CREATE PROCEDURE detect_schema_changes()
AS $$
    WITH current_schema AS (
        SELECT column_name, data_type
        FROM source.information_schema.columns
        WHERE table_name = 'orders'
    ),
    expected_schema AS (
        SELECT column_name, data_type
        FROM warehouse.schema_registry
        WHERE table_name = 'orders' AND version = 'current'
    )
    SELECT 'MISSING' AS change_type, expected_schema.*
    FROM expected_schema
    LEFT JOIN current_schema USING (column_name)
    WHERE current_schema.column_name IS NULL
    UNION ALL
    SELECT 'NEW' AS change_type, current_schema.*
    FROM current_schema
    LEFT JOIN expected_schema USING (column_name)
    WHERE expected_schema.column_name IS NULL;
$$;
```

Leading data organizations treat data interfaces like API contracts. Source systems document their data schemas, guarantee backward compatibility for defined periods, and communicate breaking changes in advance. Tools like dbt, Great Expectations, and Monte Carlo help enforce and monitor these contracts.
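In the dbt style, a contract check can be as simple as a SQL query that returns violating rows, where any returned row fails the pipeline run. The sketch below is illustrative; the table and the allowed values are assumptions rather than a real contract:

```sql
-- Contract check in the dbt "singular test" style: returning rows means failure.
SELECT *
FROM warehouse.fact_orders
WHERE order_id IS NULL                        -- primary key must always be present
   OR order_amount < 0                        -- amounts must be non-negative
   OR currency NOT IN ('USD', 'EUR', 'GBP');  -- only currency codes agreed in the contract
```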
Several architectural patterns have emerged to address OLTP/OLAP integration challenges:
Lambda Architecture:
Dual processing paths for batch accuracy and streaming speed:
Pros: Best of both worlds (accuracy + speed)
Cons: Maintaining two codepaths, complexity, eventual inconsistency between layers
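The serving layer usually reconciles the two paths at query time. A minimal sketch, assuming a batch-built `batch.orders_daily` table and a streaming-maintained `speed.orders_today` table (both names hypothetical):

```sql
-- Lambda serving-layer view: authoritative batch history unioned with
-- the speed layer's recent, possibly approximate rows.
CREATE VIEW serving.orders_unified AS
SELECT order_id, customer_id, total_amount, order_date
FROM batch.orders_daily
WHERE order_date <  CURRENT_DATE     -- complete, recomputed history from the batch layer
UNION ALL
SELECT order_id, customer_id, total_amount, order_date
FROM speed.orders_today
WHERE order_date >= CURRENT_DATE;    -- low-latency rows from the streaming layer
```

When the nightly batch catches up, its recomputed rows supersede the speed layer's rows for that day; this handoff is where the eventual inconsistency between layers shows up.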
Kappa Architecture:
Unified stream processing for everything:
Pros: Simpler single codebase, easier consistency
Cons: Reprocessing historical data is slow, requires Kafka-like infrastructure
Data Lakehouse:
Unified storage supporting both OLTP-like and OLAP workloads:
| Pattern | Core Idea | Best For | Drawback |
|---|---|---|---|
| Lambda | Batch + Speed layers | Need both accuracy and low latency | Dual codebase complexity |
| Kappa | Streaming only | Native streaming data, simpler codebase | Historical reprocessing challenges |
| Lakehouse | Unified lake + warehouse | Consolidating data infrastructure | Maturity, some performance trade-offs |
| Data Mesh | Decentralized ownership | Large orgs with domain teams | Governance, interoperability challenges |
| HTAP | Single system for both | Moderate workloads, simplicity | Performance compromises, scalability limits |
HTAP: Hybrid Transactional/Analytical Processing:
Some modern databases attempt to serve both workloads:
HTAP is promising for organizations wanting simpler architecture, but requires careful evaluation—most HTAP systems still have trade-offs that favor one workload type.
Every architecture pattern involves trade-offs. Lambda trades complexity for capability. Kappa trades reprocessing speed for simplicity. Lakehouse trades proven warehouse performance for flexibility. Evaluate patterns against your specific requirements—org size, team skills, data volumes, freshness needs—rather than following trends.
Let's examine a practical, modern integration architecture:
Reference Architecture:
```
OLTP Systems → CDC/Events → Message Queue → Stream Processing → Data Lake
                                                                    ↓
                                                         Transformation (dbt)
                                                                    ↓
                                                       Data Warehouse/Lakehouse
                                                                    ↓
                                                        BI Tools / ML Platforms
```
Key Components:
Begin with batch ELT and daily loads. Add streaming only when business requirements demand real-time data. Premature optimization toward streaming adds enormous complexity for marginal benefit. Many successful data platforms run entirely on hourly or daily batch loads—freshness beyond that is rarely worth the cost.
We have explored the complex landscape of OLTP/OLAP integration. Let's consolidate the essential knowledge:
Module Complete:
You have now comprehensively studied the OLTP vs. OLAP dichotomy—from their individual characteristics, through detailed comparison and requirements analysis, to the practical challenges of integrating these systems. This knowledge forms the foundation for understanding data warehousing concepts covered in subsequent modules.
What's Next:
The next module explores Data Warehouse Concepts in depth—subject orientation, integration, time-variance, and non-volatility. You'll see how the OLAP requirements we've studied translate into specific warehouse design principles.
You have mastered the OLTP vs. OLAP distinction—understanding not just what differs, but why these differences exist, how they manifest in requirements, and how organizations bridge these worlds through integration patterns. This forms the essential foundation for data warehousing and analytics architecture.