While OLTP systems capture the heartbeat of daily operations, Online Analytical Processing (OLAP) systems provide the intelligence that guides strategic decisions. When an executive asks 'What were our Q4 sales by region compared to last year?' or a data scientist investigates 'Which customer segments show the highest churn risk?'—they're querying OLAP systems.
OLAP systems exist to answer questions that OLTP systems cannot efficiently address. While an OLTP system can tell you the current balance of account #12345, it struggles with 'What is the average balance across all accounts, grouped by customer segment and compared month-over-month for the past three years?' Such queries would scan millions of rows, lock critical tables, and potentially bring the operational system to its knees.
This page explores the unique characteristics of OLAP systems—designed from the ground up for complex analytical queries over massive datasets. Understanding these characteristics is essential for designing data platforms, choosing the right analytical technology, and writing queries that perform at scale.
By the end of this page, you will understand the defining characteristics of OLAP systems: their read-heavy, aggregate-intensive workloads; their columnar storage and compression techniques; their relaxed consistency models; and their architectural optimizations for scanning terabytes of historical data. You'll see why OLAP systems sacrifice OLTP's transactional guarantees in favor of analytical power.
Online Analytical Processing (OLAP) refers to a class of database systems and approaches optimized for complex analytical queries over large volumes of historical data. The term was coined by E.F. Codd in 1993 to describe multidimensional data analysis capabilities.
The Analytical Mission:
OLAP systems are purpose-built to answer business intelligence questions: trend analysis, period-over-period comparisons, customer segmentation, and ad-hoc exploration.
The BASE Model (for some OLAP systems):
While traditional data warehouses maintain ACID properties, modern distributed OLAP systems often embrace BASE semantics (Basically Available, Soft state, Eventual consistency) instead.
This relaxation is acceptable because OLAP data is typically append-only or bulk-loaded, not subject to the concurrent modification patterns that demand strict ACID in OLTP.
| Priority | Description | Trade-off Accepted |
|---|---|---|
| Query Performance | Fast response to complex analytical queries | Higher data latency (batch loading) |
| Scan Efficiency | Optimize for reading large data volumes | Slower random access and updates |
| Historical Depth | Store years of historical data | Larger storage footprint |
| Flexibility | Support ad-hoc queries that were not predefined | More complex query optimization |
| Aggregation Speed | Fast GROUP BY, SUM, AVG, COUNT operations | Data organized for aggregation, not transactions |
OLAP is the data processing layer; Business Intelligence (BI) is the presentation layer. BI tools (Tableau, Power BI, Looker) visualize data, but they query OLAP systems to get that data. Understanding OLAP helps you design systems that make BI tools performant and analysts productive.
OLAP queries exhibit dramatically different characteristics from OLTP transactions:
Read-Dominant Workloads:
OLAP systems are overwhelmingly read-oriented. Write operations are typically bulk loads, scheduled ETL batches, or streaming appends rather than interactive single-row updates.
Query-to-write ratios of 100:1 or 1000:1 are typical. This read dominance fundamentally shapes OLAP architecture.
Complex, Multi-Table Queries:
Typical OLAP queries involve multiple joins across fact and dimension tables, aggregations over millions or billions of rows, window functions, and nested subqueries.
Long-Running Queries:
Unlike OLTP's millisecond transactions, OLAP queries may run for seconds, minutes, or even hours:
```sql
-- Typical OLAP Query Examples
-- ===========================

-- 1. Aggregation with Multiple Dimensions
-- "Total sales by region and product category for 2024"
SELECT
    d_region.region_name,
    d_product.category_name,
    SUM(f_sales.amount) AS total_sales,
    COUNT(DISTINCT f_sales.customer_id) AS unique_customers,
    AVG(f_sales.amount) AS avg_transaction
FROM fact_sales f_sales
JOIN dim_date d_date ON f_sales.date_key = d_date.date_key
JOIN dim_region d_region ON f_sales.region_key = d_region.region_key
JOIN dim_product d_product ON f_sales.product_key = d_product.product_key
WHERE d_date.year = 2024
GROUP BY d_region.region_name, d_product.category_name
ORDER BY total_sales DESC;

-- 2. Period-over-Period Comparison with Window Functions
-- "Monthly sales with year-over-year growth rate"
SELECT
    year,
    month,
    monthly_sales,
    LAG(monthly_sales, 12) OVER (ORDER BY year, month) AS prev_year_sales,
    ROUND(
        (monthly_sales - LAG(monthly_sales, 12) OVER (ORDER BY year, month)) * 100.0
        / NULLIF(LAG(monthly_sales, 12) OVER (ORDER BY year, month), 0),
        2
    ) AS yoy_growth_pct
FROM (
    SELECT
        d_date.year,
        d_date.month,
        SUM(f_sales.amount) AS monthly_sales
    FROM fact_sales f_sales
    JOIN dim_date d_date ON f_sales.date_key = d_date.date_key
    GROUP BY d_date.year, d_date.month
) monthly_totals
ORDER BY year DESC, month DESC;

-- 3. Drill-Down Analysis
-- "Top 10 products in underperforming regions"
-- (QUALIFY is supported in Snowflake, BigQuery, Teradata, and DuckDB)
WITH regional_totals AS (
    SELECT
        d_region.region_key,
        d_region.region_name,
        SUM(f_sales.amount) AS total_sales
    FROM fact_sales f_sales
    JOIN dim_date d_date ON f_sales.date_key = d_date.date_key
    JOIN dim_region d_region ON f_sales.region_key = d_region.region_key
    WHERE d_date.year = 2024
    GROUP BY d_region.region_key, d_region.region_name
),
regional_performance AS (
    -- Window functions are not allowed in HAVING, so filter in a second CTE
    SELECT region_key, region_name, total_sales
    FROM regional_totals
    WHERE total_sales < (SELECT AVG(total_sales) FROM regional_totals)
)
SELECT
    rp.region_name,
    d_product.product_name,
    SUM(f_sales.amount) AS product_sales,
    RANK() OVER (PARTITION BY rp.region_name
                 ORDER BY SUM(f_sales.amount) DESC) AS sales_rank
FROM fact_sales f_sales
JOIN regional_performance rp ON f_sales.region_key = rp.region_key
JOIN dim_product d_product ON f_sales.product_key = d_product.product_key
JOIN dim_date d_date ON f_sales.date_key = d_date.date_key
WHERE d_date.year = 2024
GROUP BY rp.region_name, d_product.product_name
QUALIFY sales_rank <= 10
ORDER BY rp.region_name, sales_rank;
```

OLAP query performance depends on both query complexity and data volume. A simple aggregation over 1 billion rows may be faster than a complex multi-join over 10 million rows. Modern OLAP systems optimize for both, using techniques like predicate pushdown, join reordering, and parallel execution.
The most distinctive technical characteristic of modern OLAP systems is columnar storage—organizing data by column rather than by row:
Row-Oriented vs. Column-Oriented:
Traditional OLTP databases store data row-by-row:
Row 1: [id=1, name='Alice', amount=100, date='2024-01-01']
Row 2: [id=2, name='Bob', amount=200, date='2024-01-02']
Row 3: [id=3, name='Carol', amount=150, date='2024-01-03']
Columnar databases store data column-by-column:
id column: [1, 2, 3, ...]
name column: ['Alice', 'Bob', 'Carol', ...]
amount column: [100, 200, 150, ...]
date column: ['2024-01-01', '2024-01-02', '2024-01-03', ...]
Why Columnar Storage Benefits Analytics:
Read Efficiency: Analytical queries typically access few columns but many rows. SELECT SUM(amount) only needs the amount column—no need to read id, name, or date.
Compression: Columns contain homogeneous data types. Integer columns compress far better than mixed-type rows. Typical compression ratios: 5:1 to 10:1.
Vectorized Processing: Modern CPUs process vectors of similar values more efficiently than heterogeneous row data. SIMD instructions accelerate column operations.
Cache Efficiency: Sequential access to column data maximizes CPU cache utilization. Row-based access patterns cause cache thrashing.
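The layout difference can be sketched in a few lines of Python. This is a toy illustration, not a real storage engine: rows become a dict of lists, an aggregate touches only one column, and a run-length encoder hints at why homogeneous columns compress so well.

```python
from itertools import groupby

# Row-oriented: a list of records; each record carries every field
rows = [
    {"id": 1, "region": "EU", "amount": 100},
    {"id": 2, "region": "EU", "amount": 200},
    {"id": 3, "region": "US", "amount": 150},
]

# Column-oriented: transpose the same data into a dict of lists
columns = {key: [r[key] for r in rows] for key in rows[0]}

# SUM(amount) touches only the amount column, never id or region
total = sum(columns["amount"])  # 450

def rle_encode(column):
    """Run-length encoding: collapse runs of repeats into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(column)]

# Sorted, low-cardinality columns collapse dramatically
print(rle_encode(columns["region"]))  # [('EU', 2), ('US', 1)]
```

Real columnar engines add block-level metadata, vectorized execution, and multiple codecs on top of this basic transposition, but the access-pattern advantage is exactly the one shown here.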
| Characteristic | Row-Oriented (OLTP) | Column-Oriented (OLAP) |
|---|---|---|
| Optimal for | Single-record access (by ID) | Full-column scans, aggregations |
| Write performance | Fast single-row inserts/updates | Bulk loads preferred |
| Compression ratio | 2:1 to 3:1 typical | 5:1 to 10:1 typical |
| Query: SELECT * WHERE id=X | Very fast (direct access) | Slow (reconstruct from columns) |
| Query: SUM(amount) GROUP BY region | Slow (scan all columns) | Very fast (scan one column) |
| Storage locality | Entire row together | Each column separate |
Compression isn't just about storage savings—it directly improves query performance. Compressed data means fewer disk reads and more data fits in memory/cache. A 10:1 compression ratio effectively gives you 10x more memory and 10x faster disk scans.
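Dictionary encoding is another common columnar compression scheme: a low-cardinality string column is replaced by small integer codes plus a lookup table. A minimal Python sketch with toy data, not a production encoder:

```python
# A repetitive string column, as found in region or category fields
column = ["EU", "US", "EU", "EU", "APAC", "US", "EU"]

dictionary = {}  # value -> integer code, assigned on first sight
codes = []
for value in column:
    code = dictionary.setdefault(value, len(dictionary))
    codes.append(code)

print(dictionary)  # {'EU': 0, 'US': 1, 'APAC': 2}
print(codes)       # [0, 1, 0, 0, 2, 1, 0]

# Decoding inverts the dictionary; round-trip is lossless
reverse = {code: value for value, code in dictionary.items()}
assert [reverse[c] for c in codes] == column
```

Besides shrinking storage, the integer codes let the engine evaluate predicates like `region = 'EU'` as cheap integer comparisons over the encoded column.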
OLAP systems routinely handle data volumes that would overwhelm OLTP systems:
Historical Data Retention:
While OLTP systems maintain current state (today's inventory, current balances), OLAP systems preserve history: years of transactions, periodic snapshots, and the changing versions of dimension records.
Scale Benchmarks:
Production data warehouses commonly range from terabytes to petabytes, with fact tables holding billions of rows.
Partitioning Strategies:
Managing massive data volumes requires strategic partitioning:
```sql
-- OLAP Partitioning Example (PostgreSQL syntax)
-- =============================================

-- Fact table partitioned by month
CREATE TABLE fact_sales (
    sale_id         BIGINT NOT NULL,
    date_key        DATE NOT NULL,
    customer_key    INT NOT NULL,
    product_key     INT NOT NULL,
    store_key       INT NOT NULL,
    quantity        INT NOT NULL,
    unit_price      DECIMAL(10,2) NOT NULL,
    total_amount    DECIMAL(12,2) NOT NULL,
    discount_amount DECIMAL(10,2) DEFAULT 0
) PARTITION BY RANGE (date_key);

-- Create partitions for each month
CREATE TABLE fact_sales_2024_01 PARTITION OF fact_sales
    FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');

CREATE TABLE fact_sales_2024_02 PARTITION OF fact_sales
    FOR VALUES FROM ('2024-02-01') TO ('2024-03-01');

CREATE TABLE fact_sales_2024_03 PARTITION OF fact_sales
    FOR VALUES FROM ('2024-03-01') TO ('2024-04-01');

-- Query benefits from partition pruning
-- Only scans January and February partitions
SELECT SUM(total_amount) AS q1_partial_sales
FROM fact_sales
WHERE date_key >= '2024-01-01'
  AND date_key < '2024-03-01';

-- Maintenance: Archive old partitions
ALTER TABLE fact_sales DETACH PARTITION fact_sales_2020_01;
-- Move to archive storage or compress
```

Without proper partitioning and partition-aware queries, analytical queries scan entire multi-terabyte tables. A query that should take 5 seconds takes 5 hours. Always design partition keys aligned with common query filter patterns—date is almost always the primary partition key for OLAP systems.
OLAP systems employ sophisticated pre-computation strategies to accelerate common queries:
Materialized Views:
Pre-computed query results stored as physical tables:
CREATE MATERIALIZED VIEW mv_daily_sales_summary AS
SELECT
date_key,
region_key,
product_category,
SUM(amount) AS total_sales,
COUNT(*) AS transaction_count,
AVG(amount) AS avg_transaction
FROM fact_sales f
JOIN dim_product p ON f.product_key = p.product_key
GROUP BY date_key, region_key, product_category;
Queries against this materialized view return instantly instead of scanning billions of fact rows.
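The same idea can be demonstrated end to end with Python's built-in sqlite3 module. SQLite has no MATERIALIZED VIEW, so this sketch emulates one with a summary table built via CREATE TABLE ... AS SELECT; the table names mirror the example and the data is invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE fact_sales (date_key TEXT, region_key INT, amount REAL);
    INSERT INTO fact_sales VALUES
        ('2024-01-01', 1, 100.0),
        ('2024-01-01', 1, 50.0),
        ('2024-01-02', 2, 75.0);

    -- Emulated materialized view: precompute the daily summary once
    CREATE TABLE mv_daily_sales_summary AS
    SELECT date_key, region_key,
           SUM(amount) AS total_sales,
           COUNT(*)    AS transaction_count
    FROM fact_sales
    GROUP BY date_key, region_key;
""")

# Dashboards now read the tiny summary instead of re-aggregating the fact table
rows = conn.execute(
    "SELECT date_key, total_sales FROM mv_daily_sales_summary ORDER BY date_key"
).fetchall()
print(rows)  # [('2024-01-01', 150.0), ('2024-01-02', 75.0)]
```

The trade-off is staleness: the summary reflects the facts as of its last build, so real systems pair materialized views with a refresh schedule or incremental maintenance.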
OLAP Cubes:
Multidimensional data structures that pre-aggregate data across multiple dimensions, for example time × region × product.
Cubes enable instant drill-down, roll-up, slice, and dice operations without recalculating aggregates.
Aggregate Tables:
Pre-computed summary tables at various granularities:
fact_sales        → Raw transactions   (1B rows)
agg_daily_sales   → Daily summaries    (10M rows)
agg_monthly_sales → Monthly summaries  (100K rows)
agg_yearly_sales  → Yearly summaries   (1K rows)
Query routers automatically select the appropriate aggregate table based on query granularity.
When multiple aggregate tables exist, the system must choose the right one for each query. This 'aggregate navigation' problem is solved by metadata-driven query routing, materialized view rewrite optimization, or intelligent BI tools that understand the aggregate hierarchy.
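Aggregate navigation reduces to a coverage check: pick the smallest table whose grain includes every column the query groups by. A metadata-driven sketch in Python, with illustrative table names and grains:

```python
# Aggregate tables ordered coarse -> fine, with the grain each preserves
AGGREGATES = [
    ("agg_yearly_sales",  {"year"}),
    ("agg_monthly_sales", {"year", "month"}),
    ("agg_daily_sales",   {"year", "month", "day"}),
    ("fact_sales",        {"year", "month", "day", "sale_id"}),
]

def route_query(group_by_columns):
    """Pick the first (smallest) table whose grain covers the GROUP BY set."""
    needed = set(group_by_columns)
    for table, grain in AGGREGATES:  # coarse tables first = fewest rows scanned
        if needed <= grain:
            return table
    raise ValueError("no table covers this grain")

print(route_query({"year"}))            # agg_yearly_sales
print(route_query({"year", "month"}))   # agg_monthly_sales
print(route_query({"day", "month", "year"}))  # agg_daily_sales
```

Production systems extend this check to filters and measures as well, but the core decision is the same subset test over grain metadata.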
OLAP systems exploit parallelism at every level to achieve acceptable performance over massive datasets:
Intra-Query Parallelism:
A single query is decomposed into parallel tasks: parallel partition scans, per-worker partial aggregations, and a final merge of intermediate results.
Massively Parallel Processing (MPP):
Distributed OLAP architectures spread data and processing across many nodes: a coordinator parses and plans each query, then distributes fragments to worker nodes that each own a slice of the data.
Popular MPP systems include Amazon Redshift, Snowflake, Google BigQuery, Azure Synapse Analytics, Teradata, and Greenplum.
| Strategy | Description | Best For | Limitation |
|---|---|---|---|
| Partition-Wise Scan | Each worker scans its partition | Large table scans | Requires good partitioning |
| Hash Distribution | Redistribute data by join key | Large-to-large joins | Network shuffle overhead |
| Broadcast Join | Replicate small table to all nodes | Small dimension joins | Memory limits on table size |
| Sort-Merge Parallel | Parallel sort, then merge | ORDER BY, DISTINCT | Merge step is sequential |
| Aggregation Rollup | Local agg → global agg | GROUP BY with aggregates | High-cardinality groups costly |
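The local-aggregate-then-merge pattern behind the Aggregation Rollup strategy can be sketched with Python's standard library. The shard data is hypothetical, and real MPP systems run phase one on separate nodes rather than threads:

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical partitioned fact data: each shard holds (region, amount) rows
shards = [
    [("EU", 100), ("US", 50)],
    [("EU", 25), ("APAC", 10)],
    [("US", 5)],
]

def local_aggregate(shard):
    """Phase 1: each worker aggregates only its own shard, no coordination."""
    totals = Counter()
    for region, amount in shard:
        totals[region] += amount
    return totals

# Phase 2: merge the small partial results into the global answer
with ThreadPoolExecutor() as pool:
    partials = list(pool.map(local_aggregate, shards))

global_totals = Counter()
for partial in partials:
    global_totals += partial

print(dict(global_totals))  # {'EU': 125, 'US': 55, 'APAC': 10}
```

The merge step only touches one small record per worker per group, which is why the table above flags high-cardinality GROUP BY columns as the costly case: the partials stop being small.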
Modern cloud OLAP systems (Snowflake, BigQuery) separate storage from compute. Data lives in cheap object storage (S3, GCS); compute resources are provisioned on-demand. This enables independent scaling—store petabytes affordably while spinning up compute only when running queries.
OLAP query optimizers employ specialized techniques beyond traditional OLTP optimization:
Statistics-Based Optimization:
The optimizer relies on table and column statistics (row counts, distinct-value cardinalities, histograms, per-block min/max values) to estimate costs and choose execution plans.
OLAP-Specific Optimizations:
Predicate Pushdown: Move filters as close to storage as possible. Filter during scan, not after loading into memory.
Projection Pushdown: Read only columns needed by the query. Critical for columnar storage.
Partition Pruning: Use query predicates to eliminate entire partitions from scanning.
Join Order Optimization: With many joins (star schema), finding optimal join order is NP-hard. OLAP optimizers use heuristics (smallest table first, fact table last) and cost-based search.
Aggregate Pushdown: Push aggregation below joins when legal, reducing intermediate result sizes.
Materialized View Rewriting: Automatically rewrite queries to use pre-computed aggregates when they satisfy the query.
Runtime Filters: Generate bloom filters from small table during join, apply to large table scan to skip non-matching data.
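A runtime Bloom filter needs nothing more than a bit array and a few hash functions. A minimal Python sketch with illustrative sizes; production filters tune size and hash count to a target false-positive rate:

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: answers 'maybe present' or 'definitely absent'."""

    def __init__(self, size=1024, hashes=3):
        self.size = size
        self.hashes = hashes
        self.bits = bytearray(size)

    def _positions(self, item):
        # Derive several bit positions by salting a cryptographic hash
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        return all(self.bits[pos] for pos in self._positions(item))

# Build the filter from the small, already-filtered dimension side of a join
bf = BloomFilter()
for date_key in ["2024-01-01", "2024-01-02", "2024-01-03"]:
    bf.add(date_key)

# During the fact-table scan, skip rows or whole blocks that cannot match
print(bf.might_contain("2024-01-02"))  # True
print(bf.might_contain("2019-06-30"))  # False (with high probability)
```

Because false positives only cost a wasted probe while false negatives never occur, the filter can be applied aggressively at scan time without risking incorrect results.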
```sql
-- Original Query
SELECT
    d_region.region_name,
    SUM(f_sales.amount) AS total_sales
FROM fact_sales f_sales
JOIN dim_date d_date ON f_sales.date_key = d_date.date_key
JOIN dim_region d_region ON f_sales.region_key = d_region.region_key
WHERE d_date.year = 2024
  AND d_region.country = 'USA'
GROUP BY d_region.region_name;

-- Optimizer Transformations Applied:
-- ==================================

-- 1. Predicate Pushdown: Filter dimensions first
--    - d_date filtered to 2024 before join (~365 rows from millions)
--    - d_region filtered to USA before join (~50 rows from thousands)

-- 2. Partition Pruning
--    - fact_sales scans only 2024 partitions (12 months)
--    - Skips years 2020, 2021, 2022, 2023 entirely

-- 3. Projection Pushdown
--    - Only reads: date_key, region_key, amount from fact_sales
--    - Ignores: customer_key, product_key, quantity, etc.

-- 4. Join Order: Small → Large
--    - Join dim_date (365 rows) first
--    - Then dim_region (50 rows)
--    - Finally scan matching fact_sales

-- 5. Runtime Filter (Bloom Filter)
--    - Build bloom filter from dim_date.date_key (2024 dates)
--    - Apply to fact_sales.date_key during scan
--    - Skip blocks without matching date_keys
```

OLAP query optimizers depend heavily on accurate statistics. After bulk loads, run ANALYZE/COMPUTE STATISTICS. Stale statistics lead to poor plan choices—a query that should take 10 seconds takes 10 minutes because the optimizer chose the wrong join order.
We have now explored the defining characteristics of Online Analytical Processing systems: read-heavy analytical workloads, columnar storage and compression, massive historical scale, pre-aggregation strategies, parallel execution, and specialized query optimization.
What's Next:
Now that we understand both OLTP and OLAP characteristics in depth, we'll directly compare them side-by-side. You'll see the fundamental trade-offs that make it impossible to optimally serve both workloads with a single system architecture—and understand why organizations maintain separate transactional and analytical databases.
You now possess a comprehensive understanding of OLAP system characteristics—the analytical workload patterns, columnar storage innovations, compression techniques, parallel processing architectures, and query optimization strategies that enable insights from massive datasets. This knowledge prepares you to understand why OLTP and OLAP systems require fundamentally different architectural approaches.