Online Analytical Processing (OLAP) systems serve a fundamentally different purpose than their transactional counterparts. Where OLTP systems process individual business transactions in real-time, OLAP systems analyze vast historical datasets to uncover patterns, trends, and insights that drive strategic decision-making.
Consider the kinds of questions OLAP systems answer:
- Which product categories drove revenue growth last quarter, and in which regions?
- How does this year's holiday-season performance compare with the previous three years?
- Which customer segments are shrinking, and what did those customers buy before they left?
These questions require scanning, aggregating, and comparing millions or billions of historical records. The workload characteristics are the inverse of OLTP—and so are the optimal schema design strategies.
By the end of this page, you will understand why denormalized schemas are not just acceptable but essential for OLAP systems. You'll learn to analyze analytical workload patterns, recognize when denormalization provides performance benefits that far outweigh data redundancy costs, and understand the fundamental trade-offs that make OLAP and OLTP require opposite design strategies.
Understanding OLAP requires examining how these systems differ from OLTP. The differences are not subtle; they are fundamental inversions that affect every aspect of database design.
Defining OLAP:
OLAP systems analyze historical business data to support decision-making. Users don't process individual transactions; they explore aggregated data across multiple dimensions (time, geography, product, customer segment). These systems are characterized by:
| Characteristic | OLTP | OLAP |
|---|---|---|
| Primary Operation | INSERT, UPDATE, DELETE | SELECT with aggregation |
| Users | 1000s of concurrent customers/clerks | 10s-100s of analysts/executives |
| Query Complexity | Simple (point lookups) | Complex (multi-table, multi-group) |
| Rows per Query | 1-100 | Millions-Billions |
| Data Freshness | Real-time (current second) | Batch-refreshed (hours to days old) |
| Schema Design | Normalized (3NF/BCNF) | Denormalized (Star/Snowflake) |
| Transaction Duration | Milliseconds | Minutes to hours |
| Concurrency Control | Row-level locking | MVCC/Snapshot isolation |
| Backup Strategy | Point-in-time recovery | Full history preservation |
| Main Metric | Transactions per second (TPS) | Query response time |
The Scan vs. Seek Inversion:
This fundamental difference drives all schema design decisions:
OLTP (Seek-Oriented):
WHERE order_id = 12345
OLAP (Scan-Oriented):
WHERE order_date BETWEEN '2020-01-01' AND '2024-12-31'
A normalized OLAP query might need: JOIN orders (5 billion rows) TO order_items (20 billion rows) TO products (100K rows) TO categories (1K rows) TO customers (50 million rows) TO regions (1K rows). Each JOIN either degrades toward O(n×m) nested loops or requires hash tables holding millions of entries. Eliminating these JOINs through denormalization transforms query execution.
The alignment between denormalization and OLAP is as natural as the alignment between normalization and OLTP. When workload characteristics invert, optimal design inverts as well.
The Core Logic:
Normalization trades read complexity (JOINs) for write simplicity (single-point updates). OLAP systems have:
- Writes that arrive in controlled, scheduled batches (ETL) rather than as thousands of concurrent transactions
- Workloads dominated by reads: thousands of analytical queries against data that is loaded once
- No requirement for real-time consistency on individual records
The trade-off calculation flips: pay the write-side complexity once per batch load, and eliminate the read-side JOIN cost for every query that follows.
Quantifying the Impact:
Consider a typical analytical query: "Total revenue by product category by quarter for the past 3 years."
Against Normalized Schema (3NF):
1. Scan orders (3 years) → 50 million rows
2. JOIN to order_items → 200 million rows
3. JOIN to products → +100K lookups
4. JOIN to categories → +1K lookups
5. GROUP BY category, quarter
6. Calculate SUM(revenue)
Against Denormalized Schema (Star Schema):
1. Scan sales_fact with predicates → 200 million rows
2. GROUP BY category_name, quarter
3. Calculate SUM(revenue)
The denormalized version eliminates three JOINs, each of which requires memory, CPU, and often disk I/O for intermediate results. The performance difference is typically 10-100x.
OLAP data is loaded in batches—typically nightly ETL jobs. The extra complexity of loading denormalized data happens once per batch. If 1,000 analytical queries then execute against that data, the per-query savings vastly outweigh the one-time load overhead.
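To make the amortization concrete, here is an illustrative calculation; the 20-minute load overhead and 90-second per-query saving are assumed numbers, not measurements from any real system:
Extra nightly ETL time for denormalization: ~20 minutes
JOIN time avoided per analytical query: ~90 seconds
Analytical queries per day: 1,000
Query time saved per day: 1,000 × 90 s = 25 hours
Return on the load overhead: roughly 25 hours of query time saved for 20 minutes of load time spent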
OLAP queries are fundamentally about aggregation—transforming detailed transactional records into summary statistics that reveal business patterns. Understanding aggregation mechanics reveals why denormalization is essential.
The Aggregation Pipeline:
Every OLAP aggregation query follows a pipeline:
1. Filter the rows of interest (WHERE on date, region, and so on)
2. Retrieve the descriptive attributes needed for grouping (category name, region name, quarter)
3. Co-locate those attributes with the measures being summed
4. Group by the chosen dimensions
5. Compute the aggregates (SUM, COUNT, AVG)
6. Sort and return the summary
In a normalized schema, steps 1-3 require expensive JOINs. In a denormalized schema, all data needed for grouping is already co-located.
-- Query: Revenue by region, category, and quarter

-- === NORMALIZED SCHEMA (3NF) ===
SELECT
    r.region_name,
    c.category_name,
    DATE_TRUNC('quarter', o.order_date) AS quarter,
    SUM(oi.quantity * oi.unit_price) AS total_revenue,
    COUNT(DISTINCT o.order_id) AS order_count
FROM orders o
JOIN order_items oi ON o.order_id = oi.order_id
JOIN products p ON oi.product_id = p.product_id
JOIN categories c ON p.category_id = c.category_id
JOIN customers cu ON o.customer_id = cu.customer_id
JOIN regions r ON cu.region_id = r.region_id
WHERE o.order_date >= '2022-01-01'
  AND o.order_date < '2025-01-01'
GROUP BY r.region_name, c.category_name, DATE_TRUNC('quarter', o.order_date)
ORDER BY quarter, region_name, category_name;

-- Execution Plan (conceptual):
-- 1. Scan orders with date filter → 50M rows
-- 2. Hash join to order_items → 200M intermediate rows
-- 3. Hash join to products → 200M rows (with product_id)
-- 4. Hash join to categories → 200M rows (with category_name)
-- 5. Hash join to customers → 200M rows (with customer_id)
-- 6. Hash join to regions → 200M rows (with region_name)
-- 7. Hash aggregate by region, category, quarter
-- Memory required: Multiple large hash tables
-- I/O: Random access patterns for dimension lookups

-- === DENORMALIZED SCHEMA (Star Schema) ===
SELECT
    region_name,
    category_name,
    quarter,
    SUM(revenue) AS total_revenue,
    COUNT(DISTINCT order_id) AS order_count
FROM sales_fact
WHERE order_date >= '2022-01-01'
  AND order_date < '2025-01-01'
GROUP BY region_name, category_name, quarter
ORDER BY quarter, region_name, category_name;

-- Execution Plan (conceptual):
-- 1. Scan sales_fact with date filter → 200M rows
-- 2. Hash aggregate by region, category, quarter
-- Memory required: One aggregation hash table
-- I/O: Sequential scan of single table

-- Performance improvement: 5-50x typical

Multi-Level Aggregation (Rollups):
Business users rarely want a single level of detail. They want to drill down from year to quarter to month to day. They want to see total company revenue, then by region, then by store. This is where denormalization shines: every drill-down level is simply a different GROUP BY over the same wide fact table, with no additional JOINs.
The CUBE and ROLLUP Problem:
SQL's ROLLUP and CUBE operations compute subtotals at multiple levels. On a normalized schema, the full multi-table JOIN must be materialized before any of those levels can be computed. On a denormalized schema, every level is a direct aggregation of the base fact table.
-- On a normalized schema, the full JOIN would have to run before cubing;
-- here every level aggregates sales_fact directly
SELECT region_name, category_name, quarter, SUM(revenue) AS total_revenue
FROM sales_fact
GROUP BY CUBE(region_name, category_name, quarter);
-- Returns: All combinations including subtotals
-- region + category + quarter
-- region + category (all quarters)
-- region + quarter (all categories)
-- category + quarter (all regions)
-- etc.
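CUBE computes every combination of the grouping columns; for a strict drill-down hierarchy, ROLLUP produces the nested subtotal levels in a single pass. A minimal sketch, assuming sales_fact carries the order_year and order_quarter columns shown in the ETL example later on this page:
-- Subtotals at (year, quarter, region), (year, quarter), (year), and grand total
SELECT
    order_year,
    order_quarter,
    region_name,
    SUM(revenue) AS total_revenue
FROM sales_fact
GROUP BY ROLLUP(order_year, order_quarter, region_name)
ORDER BY order_year, order_quarter, region_name;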
Many OLAP systems take denormalization further with pre-aggregated summary tables. If users frequently query monthly revenue by region, a monthly_regional_sales table stores pre-computed results. This is extreme denormalization—trading storage and ETL complexity for instant query response.
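A minimal sketch of such a summary table, reusing the sales_fact columns from the queries above; the table name monthly_regional_sales and the monthly grain are illustrative choices, not a prescribed design:
-- Pre-aggregated summary table, rebuilt (or incrementally refreshed) by the ETL job
CREATE TABLE monthly_regional_sales AS
SELECT
    DATE_TRUNC('month', order_date) AS sales_month,
    region_name,
    SUM(revenue) AS total_revenue,
    COUNT(DISTINCT order_id) AS order_count
FROM sales_fact
GROUP BY DATE_TRUNC('month', order_date), region_name;

-- Dashboards then read the small summary table instead of scanning 200M fact rows
SELECT sales_month, region_name, total_revenue
FROM monthly_regional_sales
WHERE sales_month >= '2024-01-01';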
The performance advantages of denormalization extend beyond logical query execution to physical storage characteristics. Modern OLAP databases exploit these properties aggressively.
Columnar Storage:
Most OLAP databases use columnar storage (column-oriented rather than row-oriented). Instead of storing complete rows together, they store each column's values contiguously:
Row-Oriented (OLTP):
[order_id=1, date='2024-01-01', customer_id=100, region='East', amount=500]
[order_id=2, date='2024-01-01', customer_id=101, region='West', amount=300]
Column-Oriented (OLAP):
order_id: [1, 2, 3, 4, 5, ...]
date: ['2024-01-01', '2024-01-01', '2024-01-02', ...]
region: ['East', 'West', 'East', 'West', ...]
amount: [500, 300, 250, 800, ...]
Denormalized schemas amplify columnar storage benefits:
| Data Type | Example | Compression Ratio | Technique |
|---|---|---|---|
| Low Cardinality String | region_name (100 values) | 50:1 to 100:1 | Dictionary Encoding |
| Sorted Timestamp | order_date (sequential) | 10:1 to 20:1 | Delta Encoding |
| Repeated Values | country = 'USA' (80% of rows) | 100:1+ | Run-Length Encoding |
| Numeric with Pattern | quantity (1-100 range) | 5:1 to 10:1 | Bit Packing |
| High Cardinality String | customer_email (unique) | 2:1 to 3:1 | LZ4/Zstd Compression |
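To illustrate how a columnar engine exploits these patterns, here is a minimal ClickHouse-style table sketch; the codec and type choices are illustrative assumptions, and the column list is abbreviated:
-- Low-cardinality strings get dictionary encoding; sorted dates get delta encoding
CREATE TABLE sales_fact
(
    order_date    Date                     CODEC(Delta, ZSTD),
    region_name   LowCardinality(String),
    category_name LowCardinality(String),
    quantity      UInt16                   CODEC(T64, ZSTD),
    revenue       Decimal(18, 2)
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(order_date)
ORDER BY (order_date, region_name, category_name);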
I/O Pattern Optimization:
Denormalized schemas transform query I/O from random to sequential:
Normalized (Random I/O):
1. Scan fact table sequentially
2. For each row, lookup dimension table 1 (random access)
3. For each row, lookup dimension table 2 (random access)
4. For each row, lookup dimension table 3 (random access)
...
Denormalized (Sequential I/O):
1. Scan denormalized table sequentially
2. Done
Sequential I/O is 100-1000x faster than random I/O on spinning disks, and still 10-50x faster on SSDs. This difference is magnified when scanning billions of rows.
Systems like ClickHouse, Apache Druid, and Amazon Redshift are designed around these principles. They assume denormalized (or star schema) data, columnar storage, and vectorized processing. Running normalized schemas against these engines negates their design advantages.
The "write cost" of denormalization in OLAP is concentrated in the ETL (Extract, Transform, Load) process. Understanding this process reveals how OLAP systems absorb the complexity that OLTP systems cannot afford.
The ETL Pipeline:
OLAP data originates in OLTP systems. The ETL process:
1. Extracts new and changed records from the source OLTP databases
2. Transforms them: joins the normalized tables, cleanses values, and derives measures such as revenue and profit
3. Loads the resulting wide, denormalized rows into the fact and dimension tables
This process runs as scheduled batch jobs (nightly, hourly) rather than real-time transactions.
-- ETL: Transform normalized OLTP data to denormalized OLAP fact table

-- Source: Normalized OLTP tables
-- orders, order_items, products, categories, customers, regions

-- Target: Denormalized OLAP fact table
INSERT INTO sales_fact (
    order_id, order_item_id, order_date, order_quarter, order_year,
    customer_id, customer_name, customer_segment,
    region_id, region_name, region_zone,
    product_id, product_name, product_sku,
    category_id, category_name, category_group,
    quantity, unit_price, discount_amount, revenue, cost, profit
)
SELECT
    o.order_id,
    oi.order_item_id,
    o.order_date,
    DATE_TRUNC('quarter', o.order_date) AS order_quarter,
    EXTRACT(YEAR FROM o.order_date) AS order_year,
    c.customer_id,
    c.customer_name,
    c.segment AS customer_segment,
    r.region_id,
    r.region_name,
    r.zone AS region_zone,
    p.product_id,
    p.product_name,
    p.sku AS product_sku,
    cat.category_id,
    cat.category_name,
    cat.group_name AS category_group,
    oi.quantity,
    oi.unit_price,
    COALESCE(oi.discount, 0) AS discount_amount,
    oi.quantity * oi.unit_price - COALESCE(oi.discount, 0) AS revenue,
    oi.quantity * p.unit_cost AS cost,
    (oi.quantity * oi.unit_price - COALESCE(oi.discount, 0))
        - (oi.quantity * p.unit_cost) AS profit
FROM orders o
JOIN order_items oi ON o.order_id = oi.order_id
JOIN products p ON oi.product_id = p.product_id
JOIN categories cat ON p.category_id = cat.category_id
JOIN customers c ON o.customer_id = c.customer_id
JOIN regions r ON c.region_id = r.region_id
WHERE o.order_date >= :last_load_date  -- Incremental load
  AND o.order_date < :current_date;

-- This query runs ONCE per ETL batch
-- All analytical queries then benefit from pre-joined data

Incremental vs. Full Load:
Mature OLAP systems use incremental loading, processing only the records that are new or changed since the last load. This keeps batch windows short even as the warehouse's total history grows into billions of rows.
Change Data Capture (CDC):
Modern ETL pipelines use CDC to identify OLTP changes efficiently:
- Log-based CDC reads the database's write-ahead log or binlog, capturing every insert, update, and delete without touching the source tables
- Trigger-based CDC records changes to audit tables as they occur
- Timestamp-based CDC filters on a last-modified column, as in the sketch below
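A minimal sketch of the timestamp-based approach, assuming the source orders table carries an updated_at column (an assumption about the OLTP schema, not something shown earlier):
-- Pull only the rows touched since the previous ETL run
SELECT o.*
FROM orders o
WHERE o.updated_at >= :last_load_timestamp   -- high-water mark saved by the prior batch
  AND o.updated_at <  :current_load_timestamp;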
Slowly Changing Dimensions:
When dimension attributes change (a customer moves to a new region), OLAP systems must decide how to record history. The standard slowly changing dimension (SCD) options are:
- Type 1: overwrite the old value; history is lost
- Type 2: insert a new row with effective dates and mark the old row as expired; full history is preserved (see the sketch below)
- Type 3: keep a "previous value" column alongside the current one; only one prior state is retained
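A minimal Type 2 sketch, assuming a customer_dim table with is_current, valid_from, and valid_to columns; these column names are illustrative:
-- Close out the current version of the customer row
UPDATE customer_dim
SET valid_to = CURRENT_DATE,
    is_current = FALSE
WHERE customer_id = :customer_id
  AND is_current = TRUE;

-- Insert the new version reflecting the changed region
INSERT INTO customer_dim
    (customer_id, customer_name, region_name, valid_from, valid_to, is_current)
VALUES
    (:customer_id, :customer_name, :new_region_name, CURRENT_DATE, NULL, TRUE);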
| Load Type | Frequency | Duration | Data Latency | Use Case |
|---|---|---|---|---|
| Nightly Full Load | Daily | Hours | 1 day | Small warehouses, simple requirements |
| Nightly Incremental | Daily | Minutes to hour | 1 day | Most enterprise data warehouses |
| Hourly Micro-batch | Hourly | Minutes | 1-2 hours | Near-real-time dashboards |
| Streaming/CDC | Continuous | Seconds | Seconds | Real-time analytics |
ETL complexity is the 'cost' of OLAP denormalization. Organizations invest in ETL pipelines, data quality validation, and monitoring because the analytical query performance gains, multiplied across every query the warehouse serves, far outweigh that investment.
The database technology landscape has specialized to serve OLTP and OLAP workloads with purpose-built systems. Understanding these technologies reinforces why denormalization is essential for analytics.
Traditional Data Warehouses:
Purpose-built for star schema and denormalized data:
| Database | Architecture | Optimal Schema | Key Feature |
|---|---|---|---|
| Amazon Redshift | MPP Columnar | Star Schema | Automatic compression, distribution keys |
| Google BigQuery | Serverless Columnar | Denormalized/Nested | Auto-scaling, nested record support |
| Snowflake | Multi-cluster Columnar | Star/Denormalized | Separation of storage and compute |
| ClickHouse | Columnar OLAP | Highly Denormalized | Extreme insert throughput, real-time |
| Apache Druid | Real-time OLAP | Pre-aggregated | Sub-second queries on streaming data |
| DuckDB | Embedded OLAP | Star/Denormalized | In-process analytics, zero configuration |
| Databricks (Delta) | Lakehouse | Star/Denormalized | Combines data lake and warehouse |
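As a small illustration of the embedded end of this spectrum, here is how a denormalized fact table stored as a Parquet file might be queried with DuckDB; the file name sales_fact.parquet is an assumption for the example:
-- DuckDB: in-process analytics directly over a columnar file, no server required
SELECT
    region_name,
    SUM(revenue) AS total_revenue
FROM read_parquet('sales_fact.parquet')
WHERE order_date >= DATE '2024-01-01'
GROUP BY region_name
ORDER BY total_revenue DESC;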
Key Technical Features:
Massively Parallel Processing (MPP): Queries execute across many nodes simultaneously. Denormalized data can be evenly distributed across nodes without cross-node JOINs.
Columnar Compression: Column-oriented storage with aggressive compression. Denormalized tables with repeated dimension values compress extremely well.
Vectorized Execution: Process data in batches of thousands of values rather than row-by-row. Works best with homogeneous columnar data.
Materialized Views: Pre-computed aggregations refreshed on schedule. Extends denormalization to pre-aggregated summary levels.
Approximate Query Processing: HyperLogLog for approximate COUNT DISTINCT, sampling for approximate aggregates. Trades precision for speed on massive datasets.
Choosing a columnar OLAP database and then using normalized schemas defeats the purpose. These systems are optimized for wide, denormalized tables with columnar storage. Forcing OLTP-style normalized joins onto them produces worse performance than traditional row-store databases.
We've established the fundamental connection between denormalization and OLAP system requirements, the inverse of the OLTP-normalization relationship. The key principles:
- OLAP workloads invert OLTP workloads: few batch writes, massive aggregating reads, scans instead of seeks
- Eliminating JOINs through denormalization is what makes scanning and aggregating billions of rows practical
- The write-side cost of redundancy is paid once per ETL batch and amortized over every analytical query
- Columnar storage, compression, and MPP engines are built around wide, denormalized tables and amplify these gains
- Pre-aggregated summary tables and materialized views extend denormalization to its logical conclusion
What's Next:
With the OLTP (normalized) and OLAP (denormalized) paradigms established, we're ready to explore the most common OLAP schema pattern in depth: the Star Schema. The star schema is a formalized approach to OLAP denormalization that balances query performance with maintainability and flexibility.
The next page examines star schema design—fact tables, dimension tables, and the principles that make this pattern the gold standard for analytical data modeling.
You now understand why denormalization is the natural choice for OLAP systems. The same properties that cause problems in OLTP contexts—data redundancy, wide rows, complex updates—become advantages when reads dominate, scans replace seeks, and aggregation is the primary operation. Next, we'll formalize these principles into the star schema pattern.