Operational databases are in constant motion. Every second brings new orders, updated customer profiles, changed inventory levels, deleted records. This volatility is essential—it reflects the living, changing state of business operations.
Now imagine trying to analyze a moving target:
You start building a monthly revenue report at 2:00 PM. By 3:00 PM, when you run the query again to verify, the numbers have changed—new transactions arrived, adjustments were posted, records were deleted. Which version is 'correct'? Neither—both were true at their moment of execution.
This is the volatility problem in analytical systems. When data changes during analysis, results become inconsistent, irreproducible, and untrustworthy.
Non-volatility is the data warehouse characteristic that addresses this fundamental challenge. Warehouse data, once loaded, is stable. It doesn't change during analytical windows. Reports run today and tomorrow on the same data produce identical results. The analytical foundation is solid.
By the end of this page, you will deeply understand non-volatility: what it means operationally, why it's essential for trustworthy analytics, how it's implemented through load patterns and isolation mechanisms, and how it differs from operational ACID properties. You'll learn to design warehouse load strategies that maintain stability while enabling fresh data.
Non-volatility means that once data is correctly loaded into the data warehouse, it is not updated or deleted as part of normal operations. The warehouse is primarily read-only between load operations, and loads themselves are controlled, scheduled events—not continuous streams.
"Data in the operational environment is regularly updated, changed, and deleted. Data in the data warehouse is loaded and accessed, but is not updated (in the general sense of the word update) in the normal course of processing." — Bill Inmon
What Non-Volatility Provides:

- Consistency: the same query returns the same result at any point within a stable window.
- Reproducibility: a report run today and re-run tomorrow against the same loaded data matches exactly.
- Auditability: because existing records are never overwritten, the loaded history doubles as an audit trail.
- Optimization freedom: read-only data can be indexed and pre-aggregated far more aggressively than volatile data.
Non-volatility is implemented through the load-only pattern: data enters the warehouse through controlled load processes, and once loaded, existing data is not modified.
The Three Operations:

- Load: controlled insertion of new data during scheduled load windows.
- Access: unrestricted reads, the dominant operation during stable windows.
- Update/Delete: prohibited in normal processing, permitted only as governed exceptions.
Load Windows and Stable Windows:
The warehouse alternates between two states:
Loading Window — The controlled period when ETL processes run, inserting new data and (rarely) correcting errors. This window is typically scheduled (nightly, hourly) and brief relative to the stable window.
Stable Window — The analytical window between loads. Data is read-only. All queries see consistent data. This is when users interact with the warehouse.
The clear separation between loading and querying enables both data freshness and query stability—a balance operational systems cannot achieve.
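One way to picture this alternation is a published-snapshot swap: analysts always query an immutable published snapshot, while the load process accumulates new rows off to the side and swaps them in atomically at the end of the loading window. A minimal Python sketch (class and method names are illustrative, not from any specific platform):

```python
class SnapshotWarehouse:
    """Toy model: queries read an immutable published snapshot; loads build the next one."""

    def __init__(self):
        self._published = ()   # immutable tuple = contents of the current stable window
        self._staging = []     # loading-window workspace, invisible to queries

    def load(self, rows):
        """Loading window: accumulate new rows in staging; readers are unaffected."""
        self._staging.extend(rows)

    def publish(self):
        """End of loading window: atomically swap in the new snapshot."""
        self._published = self._published + tuple(self._staging)
        self._staging = []

    def query(self, predicate=lambda r: True):
        """Stable window: every query sees the same published snapshot."""
        return [r for r in self._published if predicate(r)]


wh = SnapshotWarehouse()
wh.load([{"sale": 1, "revenue": 100}, {"sale": 2, "revenue": 250}])
assert wh.query() == []                  # still in the load window: nothing published yet
wh.publish()
before = wh.query()
wh.load([{"sale": 3, "revenue": 75}])    # next load in progress...
after = wh.query()
assert before == after                   # ...but queries stay stable until publish()
```

The design choice worth noting is that "stability" here costs nothing at query time: readers never coordinate with the loader, because they only ever touch the published, immutable snapshot.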
The simplest implementation of non-volatility is append-only design: never update or delete existing rows. New records are inserted; corrections are handled by inserting correction/adjustment records. This pattern naturally preserves audit trails and maximizes query stability. Many modern data platforms optimize heavily for append-only workloads.
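The append-only idea fits in a few lines: corrections are new rows, and "current" values are derived by aggregation rather than by UPDATE. A hypothetical sketch (record fields are invented for illustration):

```python
from dataclasses import dataclass

@dataclass(frozen=True)               # frozen: a row is immutable once created
class SaleRecord:
    sale_id: int
    quantity: int
    record_type: str = "ORIGINAL"     # or "ADJUSTMENT"

facts = []                            # append-only: no update or delete, ever

def record_sale(sale_id, quantity):
    facts.append(SaleRecord(sale_id, quantity))

def adjust_sale(sale_id, delta):
    """Corrections are inserted as adjustment rows, preserving the audit trail."""
    facts.append(SaleRecord(sale_id, delta, "ADJUSTMENT"))

def net_quantity(sale_id):
    """Current value = sum of original plus adjustments."""
    return sum(r.quantity for r in facts if r.sale_id == sale_id)

record_sale(1, 10)
adjust_sale(1, -2)                    # correct an over-count without touching row 1
assert net_quantity(1) == 8
assert len(facts) == 2                # both rows survive as the audit trail
```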
While non-volatility is the default, certain legitimate scenarios require controlled modifications to warehouse data. These are exceptions, not the norm:
| Exception Type | Description | Handling Approach |
|---|---|---|
| Error Correction | Source data was loaded incorrectly due to ETL bugs or source system errors | Controlled correction process with audit logging; may require fact table updates or dimension version corrections |
| Late-Arriving Dimensions | Dimension information arrives after related facts were loaded | Update dimension records to fill missing attributes; may trigger reprocessing of affected aggregations |
| Late-Arriving Facts | Transaction data arrives days or weeks after the event occurred | Insert with original transaction date (not load date); handle in aggregation refresh |
| SCD Type 1 Corrections | Attribute corrections that don't require history (e.g., fixing typos) | Direct update to current dimension row; by definition, these don't require historical tracking |
| Regulatory Deletion | GDPR 'right to be forgotten' or similar compliance requirements | Controlled deletion or anonymization with audit trail; may require cascading to all related records |
| Aggregation Refreshes | Pre-computed aggregates need recalculation when source data changes | Scheduled or triggered rebuild of affected aggregate tables/materialized views |
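The late-arriving-facts row in the table above hinges on the distinction between transaction date and load date: the row is inserted under its original event date, and only the aggregates for that (possibly old) period are refreshed. A hypothetical sketch of that flow:

```python
from datetime import date
from collections import defaultdict

facts = []
daily_revenue = defaultdict(float)    # pre-computed aggregate, keyed by event date

def load_fact(transaction_date, revenue, load_date):
    """Insert with the ORIGINAL transaction date, not the load date."""
    facts.append({"transaction_date": transaction_date,
                  "revenue": revenue,
                  "load_date": load_date})
    return transaction_date           # the period whose aggregate must be refreshed

def refresh_period(period):
    """Rebuild the aggregate for one affected period only."""
    daily_revenue[period] = sum(f["revenue"] for f in facts
                                if f["transaction_date"] == period)

# Normal load on Jan 2 for a Jan 1 sale, then a late arrival weeks afterward
refresh_period(load_fact(date(2024, 1, 1), 100.0, load_date=date(2024, 1, 2)))
refresh_period(load_fact(date(2024, 1, 1), 40.0, load_date=date(2024, 1, 25)))

assert daily_revenue[date(2024, 1, 1)] == 140.0   # late fact lands in its true period
```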
Every exception to non-volatility should be governed. Who can approve data corrections? What audit trail is created? How are downstream consumers notified? When is it safe to make changes? Without governance, exceptions become backdoors that undermine warehouse reliability.
Correction Patterns:
When corrections are necessary, several patterns preserve as much non-volatility benefit as possible:
Reversal Records — Instead of updating a fact row, insert a reversing entry (negative quantity) followed by a corrected entry. The audit trail is preserved.
Logical Deletion — Instead of physical DELETE, set an 'is_deleted' flag. Row remains for historical queries but is excluded from current views.
Version Increment — For dimensions, create a new version (Type 2) rather than updating in place, backfilling the effective date.
Restatement Tables — Maintain separate 'restated' fact tables for corrected data while preserving original tables for audit.
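Two of these patterns, logical deletion and version increment, can be sketched together. All column names here are illustrative, and the "tables" are plain Python lists standing in for warehouse tables:

```python
from datetime import date

# Logical deletion: rows carry an is_deleted flag instead of being removed.
customers = [
    {"customer_id": 1, "name": "Acme", "is_deleted": False},
    {"customer_id": 2, "name": "Beta", "is_deleted": False},
]

def logical_delete(customer_id):
    for row in customers:
        if row["customer_id"] == customer_id:
            row["is_deleted"] = True        # row survives for historical queries

def current_view():
    return [r for r in customers if not r["is_deleted"]]

logical_delete(2)
assert len(customers) == 2                  # history preserved
assert len(current_view()) == 1             # current view excludes the deleted row

# Version increment (Type 2): a correction becomes a new dimension version.
dim_store = [
    {"store_key": 100, "store_id": "S1", "region": "Wst",
     "effective_from": date(2023, 1, 1), "is_current": True},
]

def correct_region(store_id, new_region, effective_from):
    for row in dim_store:
        if row["store_id"] == store_id and row["is_current"]:
            row["is_current"] = False       # close the old version instead of updating it
    dim_store.append({"store_key": max(r["store_key"] for r in dim_store) + 1,
                      "store_id": store_id, "region": new_region,
                      "effective_from": effective_from, "is_current": True})

correct_region("S1", "West", date(2024, 1, 1))
assert [r["region"] for r in dim_store] == ["Wst", "West"]   # both versions retained
assert sum(r["is_current"] for r in dim_store) == 1          # exactly one current row
```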
Non-volatility isn't just about data integrity—it enables significant performance optimizations that would be impossible in volatile, update-heavy environments:
```sql
-- ==============================================
-- EXAMPLE: Aggressive Indexing in Non-Volatile Warehouse
-- ==============================================

-- In a volatile OLTP system, this many indexes would kill write performance.
-- In a non-volatile warehouse, they optimize diverse query patterns with minimal cost.

CREATE TABLE fact_sales (
    sale_key         BIGINT PRIMARY KEY,
    customer_key     INT,
    product_key      INT,
    store_key        INT,
    date_key         INT,
    customer_segment VARCHAR(20),  -- denormalized low-cardinality attribute
    quantity         INT,
    revenue          DECIMAL(12,2),
    cost             DECIMAL(12,2),
    load_date        DATE
);

-- Indexes for common access patterns
CREATE INDEX idx_sales_customer ON fact_sales(customer_key);
CREATE INDEX idx_sales_product  ON fact_sales(product_key);
CREATE INDEX idx_sales_store    ON fact_sales(store_key);
CREATE INDEX idx_sales_date     ON fact_sales(date_key);

-- Composite indexes for common query combinations
CREATE INDEX idx_sales_date_store    ON fact_sales(date_key, store_key);
CREATE INDEX idx_sales_customer_date ON fact_sales(customer_key, date_key);
CREATE INDEX idx_sales_product_date  ON fact_sales(product_key, date_key);

-- Covering index for frequent aggregation queries (SQL Server / PostgreSQL 11+ syntax)
CREATE INDEX idx_sales_date_revenue ON fact_sales(date_key) INCLUDE (revenue, quantity);

-- Bitmap index (in databases that support them, e.g. Oracle) for low-cardinality columns
CREATE BITMAP INDEX idx_sales_customer_seg ON fact_sales(customer_segment);

-- ==============================================
-- MATERIALIZED VIEW: Pre-computed daily aggregates
-- Only refreshed on load; not real-time maintained
-- ==============================================

CREATE MATERIALIZED VIEW mv_daily_sales_summary AS
SELECT
    d.date_key,
    d.calendar_date,
    d.year_month,
    st.store_key,
    st.region,
    SUM(f.revenue)                 AS total_revenue,
    SUM(f.quantity)                AS total_units,
    COUNT(DISTINCT f.customer_key) AS unique_customers
FROM fact_sales f
JOIN dim_date  d  ON f.date_key  = d.date_key
JOIN dim_store st ON f.store_key = st.store_key
GROUP BY d.date_key, d.calendar_date, d.year_month, st.store_key, st.region;

-- Refresh after each load cycle, not continuously
-- REFRESH MATERIALIZED VIEW mv_daily_sales_summary;
```

A common tension in warehouse design is balancing non-volatility (stability) with the desire for real-time or near-real-time data. How do we maintain analytical consistency while reducing latency?
The Spectrum of Latency:
| Pattern | Latency | Non-Volatility Approach | Use Case |
|---|---|---|---|
| Batch/Nightly | 12-24 hours | Full non-volatility. Load once per day. Stable for 24 hours. | Traditional BI, historical analysis, classic reporting |
| Micro-Batch | 5-60 minutes | Near-non-volatile. Frequent small loads. Short stable windows. | Operational reporting, dashboards requiring freshness |
| Near-Real-Time | Seconds-minutes | Streaming ingestion with periodic snapshots for stability. | Live dashboards, operational monitoring |
| Real-Time | Sub-second | Hybrid architecture: streaming layer + stable warehouse. | Fraud detection, real-time personalization, alerting |
Lambda and Kappa Architectures:
Two architectural patterns address real-time requirements while preserving non-volatility benefits:
Lambda Architecture: maintains two parallel paths. A batch layer periodically loads an immutable, non-volatile store that serves as the system of record, while a speed layer processes recent events with low latency; a serving layer merges both to answer queries.

Kappa Architecture: treats an append-only event log as the single source of truth and derives all views through stream processing. Reprocessing means replaying the log, so the historical record itself is never mutated.
Both architectures recognize that real-time freshness and stable analysis serve different needs—and sometimes both are required.
Many organizations discover that 15-minute latency satisfies most 'real-time' requirements. True sub-second needs are rarer than assumed. Design for the latency actually needed—not imagined. Over-engineering for real-time often sacrifices the reliability benefits of non-volatility without delivering proportional value.
Non-volatility is closely related to—but distinct from—snapshot isolation in database systems. Understanding their relationship clarifies how warehouse consistency is achieved:
Snapshot Isolation (Transaction Level)
A database isolation level where each transaction sees a consistent snapshot of data as of transaction start. Changes made by other concurrent transactions are invisible.
Non-Volatility (Warehouse Level)
An architectural property where data is loaded periodically and remains stable between loads. The warehouse as a whole has extended stability windows.
Practical Implications:
In a non-volatile warehouse:

- A query run at 2:00 PM and re-run at 3:00 PM within the same stable window returns identical results.
- Long-running analytical queries need no special isolation machinery, because the data beneath them does not change.
- Analysts querying at different times within a window all work from the same consistent picture.
Non-volatility provides stronger guarantees than snapshot isolation for analytical workloads, with simpler implementation.
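The distinction can be made concrete with a toy model (not real database code): snapshot isolation pins a snapshot per transaction at its start time, while non-volatility pins one snapshot per stable window, because nothing changes between loads.

```python
# Toy model: (load batch, total revenue after that load)
versions = [("v1", 100), ("v2", 140)]

def snapshot_isolation_reads(start_versions):
    """OLTP: each transaction sees whatever version was current when IT started."""
    return [dict(versions)[v] for v in start_versions]

def non_volatile_reads(window_version, n_queries):
    """Warehouse: every query in the stable window sees the SAME version."""
    return [dict(versions)[window_version]] * n_queries

# Two transactions starting around a concurrent load may see different totals:
assert snapshot_isolation_reads(["v1", "v2"]) == [100, 140]

# Any number of warehouse queries in one stable window always agree:
assert non_volatile_reads("v1", 2) == [100, 100]
```

Both behaviors are "consistent", but only the warehouse's window-level stability makes two reports run an hour apart comparable.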
Modern cloud warehouses (Snowflake, BigQuery, Databricks) combine both: they use snapshot isolation for transaction consistency AND encourage non-volatile loading patterns for analytical consistency. Continuous streaming ingest is supported, but best practices still recommend controlled load windows for stable analytical workloads.
Implementing non-volatility requires deliberate architectural and operational choices:
```sql
-- ==============================================
-- AUDIT TABLE: Track all data modifications
-- ==============================================

CREATE TABLE etl_data_modification_log (
    modification_id    BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    modification_time  TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    table_name         VARCHAR(128) NOT NULL,
    operation          VARCHAR(10)  NOT NULL,  -- INSERT, UPDATE, DELETE
    row_count          INT          NOT NULL,
    performed_by       VARCHAR(100) NOT NULL,
    -- For approved exceptions
    is_standard_load   BOOLEAN DEFAULT TRUE,
    exception_type     VARCHAR(50),            -- 'ERROR_CORRECTION', 'COMPLIANCE', etc.
    approval_ticket    VARCHAR(50),
    -- For troubleshooting
    batch_id           VARCHAR(100),
    source_description VARCHAR(500)
);

-- ==============================================
-- TRIGGER: Alert on unexpected modifications
-- ==============================================

-- Example: trigger to log and alert on fact table updates (PostgreSQL syntax)
CREATE OR REPLACE FUNCTION log_unexpected_modification()
RETURNS TRIGGER AS $$
BEGIN
    -- Only INSERT is expected for fact tables
    IF TG_OP IN ('UPDATE', 'DELETE') THEN
        INSERT INTO etl_data_modification_log (
            table_name, operation, row_count,
            performed_by, is_standard_load, exception_type
        ) VALUES (
            TG_TABLE_NAME, TG_OP, 1,
            CURRENT_USER, FALSE, 'UNEXPECTED_MODIFICATION'
        );
        -- In production: also send alert to operations team
        -- PERFORM send_alert('Unexpected ' || TG_OP || ' on ' || TG_TABLE_NAME);
    END IF;
    RETURN COALESCE(NEW, OLD);
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trg_fact_sales_modification
BEFORE UPDATE OR DELETE ON fact_sales
FOR EACH ROW EXECUTE FUNCTION log_unexpected_modification();
```

Non-volatility completes the four defining characteristics of a data warehouse. Let's consolidate what we've learned about this final pillar:
| Characteristic | Core Meaning | Key Implementation |
|---|---|---|
| Subject-Oriented | Organized around business subjects, not applications | Dimensional modeling, fact-dimension schemas, business naming |
| Integrated | Unified from heterogeneous sources with consistent semantics | ETL standardization, entity resolution, conformed dimensions |
| Time-Variant | Historical depth preserved across extended time horizons | Slowly Changing Dimensions, temporal keys, date dimensions |
| Non-Volatile | Stable between loads; append-dominant operations | Load windows, read-only periods, change governance |
Module Complete: Data Warehouse Concepts
You have now mastered the foundational concepts that define what a data warehouse is and why each characteristic matters. These aren't abstract principles: they're design requirements that shape every decision in warehouse architecture.
With these concepts internalized, you're prepared to explore the practical implementation: star schemas, snowflake schemas, ETL processes, and OLAP operations in the modules ahead.
You now understand all four defining characteristics of a data warehouse as articulated by Bill Inmon. Subject-Orientation organizes data around business subjects. Integration unifies heterogeneous sources. Time-Variance preserves historical depth. Non-Volatility ensures analytical stability. These pillars provide the foundation for all warehouse design decisions ahead.