Online Analytical Processing (OLAP) systems serve a fundamentally different purpose than their transactional counterparts. Where OLTP systems process individual business transactions in real-time, OLAP systems analyze vast historical datasets to uncover patterns, trends, and insights that drive strategic decision-making.
Consider the kinds of questions OLAP systems answer:
- Which product categories drove revenue growth last quarter, and in which regions?
- How does this year's holiday-season performance compare with the previous three years?
- Which customer segments are shrinking, and what did those customers buy before they left?
These questions require scanning, aggregating, and comparing millions or billions of historical records. The workload characteristics are the inverse of OLTP—and so are the optimal schema design strategies.
By the end of this page, you will understand why denormalized schemas are not just acceptable but essential for OLAP systems. You'll learn to analyze analytical workload patterns, recognize when denormalization provides performance benefits that far outweigh data redundancy costs, and understand the fundamental trade-offs that make OLAP and OLTP require opposite design strategies.
Understanding OLAP requires examining how these systems differ from OLTP. The differences are not subtle; they are fundamental inversions that affect every aspect of database design.
Defining OLAP:
OLAP systems analyze historical business data to support decision-making. Users don't process individual transactions; they explore aggregated data across multiple dimensions (time, geography, product, customer segment). These systems are characterized by:
| Characteristic | OLTP | OLAP |
|---|---|---|
| Primary Operation | INSERT, UPDATE, DELETE | SELECT with aggregation |
| Users | 1000s of concurrent customers/clerks | 10s-100s of analysts/executives |
| Query Complexity | Simple (point lookups) | Complex (multi-table, multi-group) |
| Rows per Query | 1-100 | Millions-Billions |
| Data Freshness | Real-time (current second) | Batch-refreshed (hours to days old) |
| Schema Design | Normalized (3NF/BCNF) | Denormalized (Star/Snowflake) |
| Transaction Duration | Milliseconds | Minutes to hours |
| Concurrency Control | Row-level locking | MVCC/Snapshot isolation |
| Backup Strategy | Point-in-time recovery | Full history preservation |
| Main Metric | Transactions per second (TPS) | Query response time |
The Scan vs. Seek Inversion:
This fundamental difference drives all schema design decisions:
OLTP (Seek-Oriented):
WHERE order_id = 12345
OLAP (Scan-Oriented):
WHERE order_date BETWEEN '2020-01-01' AND '2024-12-31'
A normalized OLAP query might need: JOIN orders (5 billion rows) TO order_items (20 billion rows) TO products (100K rows) TO categories (1K rows) TO customers (50 million rows) TO regions (1K rows). Each JOIN either degrades toward O(n×m) nested loops or requires hash tables holding millions of entries. Eliminating these JOINs through denormalization transforms query execution.
The alignment between denormalization and OLAP is as natural as the alignment between normalization and OLTP. When workload characteristics invert, optimal design inverts as well.
The Core Logic:
Normalization trades read complexity (JOINs) for write simplicity (single-point updates). OLAP systems have:
- Writes that arrive in controlled, scheduled batches (ETL) rather than as thousands of concurrent transactions
- Workloads dominated by reads: thousands of analytical queries against data that is loaded once
- No requirement for real-time consistency on individual records
The trade-off calculation flips: pay the write-side complexity once per batch load, and eliminate the read-side JOIN cost for every query that follows.
Quantifying the Impact:
Consider a typical analytical query: "Total revenue by product category by quarter for the past 3 years."
Against Normalized Schema (3NF):
1. Scan orders (3 years) → 50 million rows
2. JOIN to order_items → 200 million rows
3. JOIN to products → +100K lookups
4. JOIN to categories → +1K lookups
5. GROUP BY category, quarter
6. Calculate SUM(revenue)
Against Denormalized Schema (Star Schema):
1. Scan sales_fact with predicates → 200 million rows
2. GROUP BY category_name, quarter
3. Calculate SUM(revenue)
The denormalized version eliminates three JOINs, each of which requires memory, CPU, and often disk I/O for intermediate results. The performance difference is typically 10-100x.
OLAP data is loaded in batches—typically nightly ETL jobs. The extra complexity of loading denormalized data happens once per batch. If 1,000 analytical queries then execute against that data, the per-query savings vastly outweigh the one-time load overhead.
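To make the amortization concrete, here is an illustrative calculation; the 20-minute load overhead and 90-second per-query saving are assumed numbers, not measurements from any real system:
Extra nightly ETL time for denormalization: ~20 minutes
JOIN time avoided per analytical query: ~90 seconds
Analytical queries per day: 1,000
Query time saved per day: 1,000 × 90 s = 25 hours
Return on the load overhead: roughly 25 hours of query time saved for 20 minutes of load time spent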
OLAP queries are fundamentally about aggregation—transforming detailed transactional records into summary statistics that reveal business patterns. Understanding aggregation mechanics reveals why denormalization is essential.
The Aggregation Pipeline:
Every OLAP aggregation query follows a pipeline:
1. Filter the rows of interest (WHERE on date, region, and so on)
2. Retrieve the descriptive attributes needed for grouping (category name, region name, quarter)
3. Co-locate those attributes with the measures being summed
4. Group by the chosen dimensions
5. Compute the aggregates (SUM, COUNT, AVG)
6. Sort and return the summary
In a normalized schema, steps 1-3 require expensive JOINs. In a denormalized schema, all data needed for grouping is already co-located.
-- Query: Revenue by region, category, and quarter

-- === NORMALIZED SCHEMA (3NF) ===
SELECT
    r.region_name,
    c.category_name,
    DATE_TRUNC('quarter', o.order_date) AS quarter,
    SUM(oi.quantity * oi.unit_price) AS total_revenue,
    COUNT(DISTINCT o.order_id) AS order_count
FROM orders o
JOIN order_items oi ON o.order_id = oi.order_id
JOIN products p ON oi.product_id = p.product_id
JOIN categories c ON p.category_id = c.category_id
JOIN customers cu ON o.customer_id = cu.customer_id
JOIN regions r ON cu.region_id = r.region_id
WHERE o.order_date >= '2022-01-01'
  AND o.order_date < '2025-01-01'
GROUP BY r.region_name, c.category_name, DATE_TRUNC('quarter', o.order_date)
ORDER BY quarter, region_name, category_name;

-- Execution Plan (conceptual):
-- 1. Scan orders with date filter → 50M rows
-- 2. Hash join to order_items → 200M intermediate rows
-- 3. Hash join to products → 200M rows (with product_id)
-- 4. Hash join to categories → 200M rows (with category_name)
-- 5. Hash join to customers → 200M rows (with customer_id)
-- 6. Hash join to regions → 200M rows (with region_name)
-- 7. Hash aggregate by region, category, quarter
-- Memory required: Multiple large hash tables
-- I/O: Random access patterns for dimension lookups

-- === DENORMALIZED SCHEMA (Star Schema) ===
SELECT
    region_name,
    category_name,
    quarter,
    SUM(revenue) AS total_revenue,
    COUNT(DISTINCT order_id) AS order_count
FROM sales_fact
WHERE order_date >= '2022-01-01'
  AND order_date < '2025-01-01'
GROUP BY region_name, category_name, quarter
ORDER BY quarter, region_name, category_name;

-- Execution Plan (conceptual):
-- 1. Scan sales_fact with date filter → 200M rows
-- 2. Hash aggregate by region, category, quarter
-- Memory required: One aggregation hash table
-- I/O: Sequential scan of single table

-- Performance improvement: 5-50x typical

Multi-Level Aggregation (Rollups):
Business users rarely want a single level of detail. They want to drill down from year to quarter to month to day. They want to see total company revenue, then by region, then by store. This is where denormalization shines: every drill-down level is simply a different GROUP BY over the same wide fact table, with no additional JOINs.
The CUBE and ROLLUP Problem:
SQL's ROLLUP and CUBE operations compute subtotals at multiple levels. On a normalized schema, the full multi-table JOIN must be materialized before any of those levels can be computed. On a denormalized schema, every level is a direct aggregation of the base fact table.
-- On a normalized schema, the full JOIN would have to run before cubing;
-- here every level aggregates sales_fact directly
SELECT region_name, category_name, quarter, SUM(revenue) AS total_revenue
FROM sales_fact
GROUP BY CUBE(region_name, category_name, quarter);
-- Returns: All combinations including subtotals
-- region + category + quarter
-- region + category (all quarters)
-- region + quarter (all categories)
-- category + quarter (all regions)
-- etc.
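CUBE computes every combination of the grouping columns; for a strict drill-down hierarchy, ROLLUP produces the nested subtotal levels in a single pass. A minimal sketch, assuming sales_fact carries the order_year and order_quarter columns shown in the ETL example later on this page:
-- Subtotals at (year, quarter, region), (year, quarter), (year), and grand total
SELECT
    order_year,
    order_quarter,
    region_name,
    SUM(revenue) AS total_revenue
FROM sales_fact
GROUP BY ROLLUP(order_year, order_quarter, region_name)
ORDER BY order_year, order_quarter, region_name;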
Many OLAP systems take denormalization further with pre-aggregated summary tables. If users frequently query monthly revenue by region, a monthly_regional_sales table stores pre-computed results. This is extreme denormalization—trading storage and ETL complexity for instant query response.
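A minimal sketch of such a summary table, reusing the sales_fact columns from the queries above; the table name monthly_regional_sales and the monthly grain are illustrative choices, not a prescribed design:
-- Pre-aggregated summary table, rebuilt (or incrementally refreshed) by the ETL job
CREATE TABLE monthly_regional_sales AS
SELECT
    DATE_TRUNC('month', order_date) AS sales_month,
    region_name,
    SUM(revenue) AS total_revenue,
    COUNT(DISTINCT order_id) AS order_count
FROM sales_fact
GROUP BY DATE_TRUNC('month', order_date), region_name;

-- Dashboards then read the small summary table instead of scanning 200M fact rows
SELECT sales_month, region_name, total_revenue
FROM monthly_regional_sales
WHERE sales_month >= '2024-01-01';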
The performance advantages of denormalization extend beyond logical query execution to physical storage characteristics. Modern OLAP databases exploit these properties aggressively.
Columnar Storage:
Most OLAP databases use columnar storage (column-oriented rather than row-oriented). Instead of storing complete rows together, they store each column's values contiguously:
Row-Oriented (OLTP):
[order_id=1, date='2024-01-01', customer_id=100, region='East', amount=500]
[order_id=2, date='2024-01-01', customer_id=101, region='West', amount=300]
Column-Oriented (OLAP):
order_id: [1, 2, 3, 4, 5, ...]
date: ['2024-01-01', '2024-01-01', '2024-01-02', ...]
region: ['East', 'West', 'East', 'West', ...]
amount: [500, 300, 250, 800, ...]
Denormalized schemas amplify columnar storage benefits:
| Data Type | Example | Compression Ratio | Technique |
|---|---|---|---|
| Low Cardinality String | region_name (100 values) | 50:1 to 100:1 | Dictionary Encoding |
| Sorted Timestamp | order_date (sequential) | 10:1 to 20:1 | Delta Encoding |
| Repeated Values | country = 'USA' (80% of rows) | 100:1+ | Run-Length Encoding |
| Numeric with Pattern | quantity (1-100 range) | 5:1 to 10:1 | Bit Packing |
| High Cardinality String | customer_email (unique) | 2:1 to 3:1 | LZ4/Zstd Compression |
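To illustrate how a columnar engine exploits these patterns, here is a minimal ClickHouse-style table sketch; the codec and type choices are illustrative assumptions, and the column list is abbreviated:
-- Low-cardinality strings get dictionary encoding; sorted dates get delta encoding
CREATE TABLE sales_fact
(
    order_date    Date                     CODEC(Delta, ZSTD),
    region_name   LowCardinality(String),
    category_name LowCardinality(String),
    quantity      UInt16                   CODEC(T64, ZSTD),
    revenue       Decimal(18, 2)
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(order_date)
ORDER BY (order_date, region_name, category_name);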
I/O Pattern Optimization:
Denormalized schemas transform query I/O from random to sequential:
Normalized (Random I/O):
1. Scan fact table sequentially
2. For each row, lookup dimension table 1 (random access)
3. For each row, lookup dimension table 2 (random access)
4. For each row, lookup dimension table 3 (random access)
...
Denormalized (Sequential I/O):
1. Scan denormalized table sequentially
2. Done
Sequential I/O is 100-1000x faster than random I/O on spinning disks, and still 10-50x faster on SSDs. This difference is magnified when scanning billions of rows.
Systems like ClickHouse, Apache Druid, and Amazon Redshift are designed around these principles. They assume denormalized (or star schema) data, columnar storage, and vectorized processing. Running normalized schemas against these engines negates their design advantages.
The "write cost" of denormalization in OLAP is concentrated in the ETL (Extract, Transform, Load) process. Understanding this process reveals how OLAP systems absorb the complexity that OLTP systems cannot afford.
The ETL Pipeline:
OLAP data originates in OLTP systems. The ETL process:
1. Extracts new and changed records from the source OLTP databases
2. Transforms them: joins the normalized tables, cleanses values, and derives measures such as revenue and profit
3. Loads the resulting wide, denormalized rows into the fact and dimension tables
This process runs as scheduled batch jobs (nightly, hourly) rather than real-time transactions.
-- ETL: Transform normalized OLTP data to denormalized OLAP fact table

-- Source: Normalized OLTP tables
-- orders, order_items, products, categories, customers, regions

-- Target: Denormalized OLAP fact table
INSERT INTO sales_fact (
    order_id, order_item_id, order_date, order_quarter, order_year,
    customer_id, customer_name, customer_segment,
    region_id, region_name, region_zone,
    product_id, product_name, product_sku,
    category_id, category_name, category_group,
    quantity, unit_price, discount_amount, revenue, cost, profit
)
SELECT
    o.order_id,
    oi.order_item_id,
    o.order_date,
    DATE_TRUNC('quarter', o.order_date) AS order_quarter,
    EXTRACT(YEAR FROM o.order_date) AS order_year,
    c.customer_id,
    c.customer_name,
    c.segment AS customer_segment,
    r.region_id,
    r.region_name,
    r.zone AS region_zone,
    p.product_id,
    p.product_name,
    p.sku AS product_sku,
    cat.category_id,
    cat.category_name,
    cat.group_name AS category_group,
    oi.quantity,
    oi.unit_price,
    COALESCE(oi.discount, 0) AS discount_amount,
    oi.quantity * oi.unit_price - COALESCE(oi.discount, 0) AS revenue,
    oi.quantity * p.unit_cost AS cost,
    (oi.quantity * oi.unit_price - COALESCE(oi.discount, 0))
        - (oi.quantity * p.unit_cost) AS profit
FROM orders o
JOIN order_items oi ON o.order_id = oi.order_id
JOIN products p ON oi.product_id = p.product_id
JOIN categories cat ON p.category_id = cat.category_id
JOIN customers c ON o.customer_id = c.customer_id
JOIN regions r ON c.region_id = r.region_id
WHERE o.order_date >= :last_load_date  -- Incremental load
  AND o.order_date < :current_date;

-- This query runs ONCE per ETL batch
-- All analytical queries then benefit from pre-joined data

Incremental vs. Full Load:
Mature OLAP systems use incremental loading, processing only the records that are new or changed since the last load. This keeps batch windows short even as the warehouse's total history grows into billions of rows.
Change Data Capture (CDC):
Modern ETL pipelines use CDC to identify OLTP changes efficiently:
- Log-based CDC reads the database's write-ahead log or binlog, capturing every insert, update, and delete without touching the source tables
- Trigger-based CDC records changes to audit tables as they occur
- Timestamp-based CDC filters on a last-modified column, as in the sketch below
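A minimal sketch of the timestamp-based approach, assuming the source orders table carries an updated_at column (an assumption about the OLTP schema, not something shown earlier):
-- Pull only the rows touched since the previous ETL run
SELECT o.*
FROM orders o
WHERE o.updated_at >= :last_load_timestamp   -- high-water mark saved by the prior batch
  AND o.updated_at <  :current_load_timestamp;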
Slowly Changing Dimensions:
When dimension attributes change (a customer moves to a new region), OLAP systems must decide how to record history. The standard slowly changing dimension (SCD) options are:
- Type 1: overwrite the old value; history is lost
- Type 2: insert a new row with effective dates and mark the old row as expired; full history is preserved (see the sketch below)
- Type 3: keep a "previous value" column alongside the current one; only one prior state is retained
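A minimal Type 2 sketch, assuming a customer_dim table with is_current, valid_from, and valid_to columns; these column names are illustrative:
-- Close out the current version of the customer row
UPDATE customer_dim
SET valid_to = CURRENT_DATE,
    is_current = FALSE
WHERE customer_id = :customer_id
  AND is_current = TRUE;

-- Insert the new version reflecting the changed region
INSERT INTO customer_dim
    (customer_id, customer_name, region_name, valid_from, valid_to, is_current)
VALUES
    (:customer_id, :customer_name, :new_region_name, CURRENT_DATE, NULL, TRUE);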
| Load Type | Frequency | Duration | Data Latency | Use Case |
|---|---|---|---|---|
| Nightly Full Load | Daily | Hours | 1 day | Small warehouses, simple requirements |
| Nightly Incremental | Daily | Minutes to hour | 1 day | Most enterprise data warehouses |
| Hourly Micro-batch | Hourly | Minutes | 1-2 hours | Near-real-time dashboards |
| Streaming/CDC | Continuous | Seconds | Seconds | Real-time analytics |
ETL complexity is the 'cost' of OLAP denormalization. Organizations invest in ETL pipelines, data quality validation, and monitoring because the analytical query performance gains, multiplied across every query the warehouse serves, far outweigh that investment.
The database technology landscape has specialized to serve OLTP and OLAP workloads with purpose-built systems. Understanding these technologies reinforces why denormalization is essential for analytics.
Traditional Data Warehouses:
Purpose-built for star schema and denormalized data:
| Database | Architecture | Optimal Schema | Key Feature |
|---|---|---|---|
| Amazon Redshift | MPP Columnar | Star Schema | Automatic compression, distribution keys |
| Google BigQuery | Serverless Columnar | Denormalized/Nested | Auto-scaling, nested record support |
| Snowflake | Multi-cluster Columnar | Star/Denormalized | Separation of storage and compute |
| ClickHouse | Columnar OLAP | Highly Denormalized | Extreme insert throughput, real-time |
| Apache Druid | Real-time OLAP | Pre-aggregated | Sub-second queries on streaming data |
| DuckDB | Embedded OLAP | Star/Denormalized | In-process analytics, zero configuration |
| Databricks (Delta) | Lakehouse | Star/Denormalized | Combines data lake and warehouse |
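As a small illustration of the embedded end of this spectrum, here is how a denormalized fact table stored as a Parquet file might be queried with DuckDB; the file name sales_fact.parquet is an assumption for the example:
-- DuckDB: in-process analytics directly over a columnar file, no server required
SELECT
    region_name,
    SUM(revenue) AS total_revenue
FROM read_parquet('sales_fact.parquet')
WHERE order_date >= DATE '2024-01-01'
GROUP BY region_name
ORDER BY total_revenue DESC;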
Key Technical Features:
Massively Parallel Processing (MPP): Queries execute across many nodes simultaneously. Denormalized data can be evenly distributed across nodes without cross-node JOINs.
Columnar Compression: Column-oriented storage with aggressive compression. Denormalized tables with repeated dimension values compress extremely well.
Vectorized Execution: Process data in batches of thousands of values rather than row-by-row. Works best with homogeneous columnar data.
Materialized Views: Pre-computed aggregations refreshed on schedule. Extends denormalization to pre-aggregated summary levels.
Approximate Query Processing: HyperLogLog for approximate COUNT DISTINCT, sampling for approximate aggregates. Trades precision for speed on massive datasets.
Choosing a columnar OLAP database and then using normalized schemas defeats the purpose. These systems are optimized for wide, denormalized tables with columnar storage. Forcing OLTP-style normalized joins onto them produces worse performance than traditional row-store databases.
We've established the fundamental connection between denormalization and OLAP system requirements, the inverse of the OLTP-normalization relationship. The key principles:
- OLAP workloads invert OLTP workloads: few batch writes, massive aggregating reads, scans instead of seeks
- Eliminating JOINs through denormalization is what makes scanning and aggregating billions of rows practical
- The write-side cost of redundancy is paid once per ETL batch and amortized over every analytical query
- Columnar storage, compression, and MPP engines are built around wide, denormalized tables and amplify these gains
- Pre-aggregated summary tables and materialized views extend denormalization to its logical conclusion
What's Next:
With the OLTP (normalized) and OLAP (denormalized) paradigms established, we're ready to explore the most common OLAP schema pattern in depth: the Star Schema. The star schema is a formalized approach to OLAP denormalization that balances query performance with maintainability and flexibility.
The next page examines star schema design—fact tables, dimension tables, and the principles that make this pattern the gold standard for analytical data modeling.
You now understand why denormalization is the natural choice for OLAP systems. The same properties that cause problems in OLTP contexts—data redundancy, wide rows, complex updates—become advantages when reads dominate, scans replace seeks, and aggregation is the primary operation. Next, we'll formalize these principles into the star schema pattern.