Snowflake Schema - Learning Module

Loading content...

0/241

Normalized Dimensions

Bringing Normalization to the Warehouse

In the previous module, we explored the star schema—a design pattern that places a central fact table surrounded by denormalized dimension tables. The star schema's simplicity and query performance made it the workhorse of data warehousing for decades. But simplicity comes with trade-offs.

What if your dimension tables grow to contain millions of rows with significant redundancy? What if storage costs become prohibitive, or data quality suffers from update anomalies? What if your dimensions have natural hierarchies that would benefit from explicit structural representation?

Enter the snowflake schema—a normalized variant that applies relational database normalization principles to dimension tables, transforming the simple star into a more complex, multi-level structure resembling a snowflake.

What You Will Learn

By the end of this page, you will understand what normalized dimensions are, why they're used in snowflake schemas, how they differ structurally from denormalized dimensions, and the normalization principles that guide their design. You'll be able to identify dimension hierarchies and understand how they're decomposed into related tables.

The Normalization Foundation

Before we dive into normalized dimensions specifically, let's recall why normalization matters in database design. Normalization is the process of organizing a relational database to reduce data redundancy and improve data integrity.

In OLTP (Online Transaction Processing) systems, normalization to at least Third Normal Form (3NF) is standard practice. This eliminates:

Insertion anomalies: Inability to add data without related data
Update anomalies: Inconsistencies when updating repeated data
Deletion anomalies: Unintended loss of data when deleting records

The snowflake schema applies these same principles to dimension tables in a data warehouse context.

Quick Normalization Refresher

1NF: Atomic values, no repeating groups. 2NF: 1NF plus no partial dependencies (all non-key attributes depend on the entire primary key). 3NF: 2NF plus no transitive dependencies (non-key attributes don't depend on other non-key attributes). The snowflake schema typically normalizes dimensions to 3NF.

Why consider normalization in a warehouse?

Data warehouses were traditionally designed with denormalized dimensions (star schema) for several reasons:

Query simplicity: Fewer joins make SQL easier to write and understand
Aggregation performance: Wide tables with pre-joined data scan efficiently
Historical context: When disk was expensive, but star schemas were already optimized for read patterns

However, modern data challenges have shifted some of these assumptions:

Massive dimension cardinality: Some dimensions contain millions of distinct values
Rapid dimension growth: Product catalogs, customer bases, and geographic hierarchies expand rapidly
Storage costs at scale: Even with cheap storage, petabyte-scale redundancy adds up
Data governance requirements: Maintaining single sources of truth for hierarchical data

These factors have renewed interest in normalized dimension designs—hence the snowflake schema.

What Are Normalized Dimensions?

A normalized dimension is a dimension table that has been decomposed into multiple related tables according to normal form principles, typically Third Normal Form (3NF). Instead of storing all dimension attributes in a single wide table, hierarchical or repeating attributes are extracted into separate tables linked by foreign keys.

The core concept:

In a star schema, a DIM_PRODUCT table might contain:

product_id	product_name	category_name	category_desc	subcategory	brand	brand_country

Notice the redundancy: every product in "Electronics" repeats "Electronics" and its description. Every Apple product repeats "Apple" and "USA".

In a snowflake schema, this becomes multiple related tables:

DIM_PRODUCT: product_id, product_name, subcategory_id (FK), brand_id (FK)
DIM_SUBCATEGORY: subcategory_id, subcategory_name, category_id (FK)
DIM_CATEGORY: category_id, category_name, category_description
DIM_BRAND: brand_id, brand_name, country_id (FK)
DIM_COUNTRY: country_id, country_name

Converting Mermaid diagram...

The 'Snowflake' Visual

The name 'snowflake schema' comes from the visual appearance when you diagram these relationships. Each dimension table branches into multiple related tables, creating a pattern that resembles a snowflake crystal radiating from the central fact table.

Dimension Hierarchies

The key insight driving normalized dimensions is that dimensions often contain natural hierarchies. A hierarchy represents a parent-child relationship where higher-level entities contain or categorize lower-level entities.

Common dimensional hierarchies:

Typical Dimension Hierarchies in Data Warehouses
Dimension	Hierarchy Levels	Example Path
Time/Date	Year → Quarter → Month → Week → Day	2024 → Q1 → January → Week 3 → Jan 15
Geography	Region → Country → State → City → Store	EMEA → Germany → Bavaria → Munich → Store #472
Product	Division → Category → Subcategory → Product	Consumer Electronics → Computers → Laptops → MacBook Pro 16"
Organization	Company → Division → Department → Team	Acme Corp → Engineering → Backend → API Team
Customer	Segment → Industry → Company → Account	Enterprise → Finance → Goldman Sachs → Trading Desk #7

Why hierarchies matter for normalization:

In a denormalized dimension, hierarchy levels are stored as columns in a single table. Every row repeats the higher-level values:

store_id | store_name | city | state | country | region
1        | Downtown   | LA   | CA    | USA     | Americas
2        | Westside   | LA   | CA    | USA     | Americas
3        | Midtown    | NYC  | NY    | USA     | Americas
4        | Soho       | NYC  | NY    | USA     | Americas
...

With 10,000 stores, you store "USA" and "Americas" 10,000 times. With normalization, you store each geographic level once and reference it:

DIM_STORE: store_id, store_name, city_id
DIM_CITY: city_id, city_name, state_id  
DIM_STATE: state_id, state_name, country_id
DIM_COUNTRY: country_id, country_name, region_id
DIM_REGION: region_id, region_name

Depth Varies by Hierarchy

Not all hierarchies have the same depth. Time dimensions typically have 5-6 levels, while a simple status dimension might have only 2 (status category → status). The snowflake schema adapts to represent whatever natural hierarchies exist in your business domain.

Normalization in Practice

Let's walk through normalizing a dimension step by step to understand the transformation process. We'll use a Customer dimension as our example.

Original Denormalized Customer Dimension (Star Schema):

denormalized_customer_dimension.sql
SQL
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
-- Star Schema: Single wide table with all customer attributes
CREATE TABLE DIM_CUSTOMER (
    customer_id         INT PRIMARY KEY,
    customer_name       VARCHAR(100),
    customer_email      VARCHAR(255),
    customer_phone      VARCHAR(20),
    
    -- Address hierarchy (denormalized)
    street_address      VARCHAR(200),
    city_name           VARCHAR(100),
    city_population     INT,          -- Repeated for every customer in this city
    state_name          VARCHAR(100),
    state_code          CHAR(2),
    country_name        VARCHAR(100),
    country_iso_code    CHAR(3),
    region_name         VARCHAR(50),
    
    -- Customer segment hierarchy (denormalized)
    segment_name        VARCHAR(50),  -- "Enterprise", "SMB", "Consumer"
    segment_discount    DECIMAL(5,2), -- Same for all customers in segment
    industry_name       VARCHAR(100),
    industry_code       VARCHAR(10),
    
    -- Account management
    account_manager     VARCHAR(100),
    manager_email       VARCHAR(255),
    manager_region      VARCHAR(50)
);
 
-- Sample data showing redundancy:
-- customer_id | ... | city_name | city_population | state_name | segment_name | segment_discount | ...
-- 1           | ... | Seattle   | 750000          | Washington | Enterprise   | 0.15            | ...
-- 2           | ... | Seattle   | 750000          | Washington | Enterprise   | 0.15            | ...
-- 3           | ... | Seattle   | 750000          | Washington | Consumer     | 0.05            | ...
-- 4           | ... | Portland  | 650000          | Oregon     | Enterprise   | 0.15            | ...
 
-- Every Seattle customer stores 750000 redundantly
-- Every Enterprise customer stores 0.15 redundantly

Identifying Normalization Opportunities:

Analyzing this table, we can identify several functional dependencies that violate 3NF:

city_name → city_population (non-key determines non-key)
state_name → state_code (non-key determines non-key)
country_name → country_iso_code (non-key determines non-key)
segment_name → segment_discount (non-key determines non-key)
account_manager → manager_email, manager_region (non-key determines non-keys)
industry_name → industry_code (non-key determines non-key)

Each of these represents a transitive dependency that normalization will eliminate.

normalized_customer_dimension.sql
SQL
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
-- Snowflake Schema: Normalized dimension tables
 
-- Level 5: Region (highest level of geographic hierarchy)
CREATE TABLE DIM_REGION (
    region_id       INT PRIMARY KEY,
    region_name     VARCHAR(50) NOT NULL
);
 
-- Level 4: Country
CREATE TABLE DIM_COUNTRY (
    country_id      INT PRIMARY KEY,
    country_name    VARCHAR(100) NOT NULL,
    country_iso_code CHAR(3) NOT NULL,
    region_id       INT NOT NULL REFERENCES DIM_REGION(region_id)
);
 
-- Level 3: State/Province
CREATE TABLE DIM_STATE (
    state_id        INT PRIMARY KEY,
    state_name      VARCHAR(100) NOT NULL,
    state_code      CHAR(2),
    country_id      INT NOT NULL REFERENCES DIM_COUNTRY(country_id)
);
 
-- Level 2: City
CREATE TABLE DIM_CITY (
    city_id         INT PRIMARY KEY,
    city_name       VARCHAR(100) NOT NULL,
    city_population INT,
    state_id        INT NOT NULL REFERENCES DIM_STATE(state_id)
);
 
-- Industry reference table
CREATE TABLE DIM_INDUSTRY (
    industry_id     INT PRIMARY KEY,
    industry_name   VARCHAR(100) NOT NULL,
    industry_code   VARCHAR(10) NOT NULL
);
 
-- Customer segment reference table
CREATE TABLE DIM_SEGMENT (
    segment_id      INT PRIMARY KEY,
    segment_name    VARCHAR(50) NOT NULL,
    segment_discount DECIMAL(5,2) NOT NULL
);
 
-- Account manager reference table
CREATE TABLE DIM_ACCOUNT_MANAGER (
    manager_id      INT PRIMARY KEY,
    manager_name    VARCHAR(100) NOT NULL,
    manager_email   VARCHAR(255) NOT NULL,
    manager_region  VARCHAR(50)
);
 
-- Core customer dimension (now slim and normalized)
CREATE TABLE DIM_CUSTOMER (
    customer_id     INT PRIMARY KEY,
    customer_name   VARCHAR(100) NOT NULL,
    customer_email  VARCHAR(255),
    customer_phone  VARCHAR(20),
    street_address  VARCHAR(200),
    
    -- Foreign keys to normalized reference tables
    city_id         INT REFERENCES DIM_CITY(city_id),
    segment_id      INT REFERENCES DIM_SEGMENT(segment_id),
    industry_id     INT REFERENCES DIM_INDUSTRY(industry_id),
    manager_id      INT REFERENCES DIM_ACCOUNT_MANAGER(manager_id)
);

Normalization Results

The original single table has been decomposed into 8 tables. Each table now stores unique data without redundancy. City population is stored once per city. Segment discount is stored once per segment. Manager email is stored once per manager. This eliminates all transitive dependencies.

Structural Implications

Normalizing dimensions has significant implications for the overall warehouse structure. Let's examine the key structural changes.

Structural Changes from Normalization

•Table Count Increases: A single dimension may become 3-8 related tables depending on hierarchy depth. A warehouse with 8 dimensions might have 30-50 dimension tables instead of 8.
•Foreign Key Chains: Rather than direct references from fact to dimension, you get chains: FACT → DIM_PRODUCT → DIM_SUBCATEGORY → DIM_CATEGORY. Each link adds a join.
•Surrogate Key Proliferation: Each normalized table needs its own primary key. This increases the number of surrogate keys to manage in ETL processes.
•Referential Integrity: More foreign key relationships mean more constraints to maintain. This adds overhead to data loading but improves data quality.
•Index Requirements: Each foreign key should typically have an index. More tables mean more indexes to create and maintain.

Star Schema Structure

•Fact table with direct FKs to dimensions
•Each dimension is a single table
•Maximum 1 join from fact to any attribute
•Wide dimension tables (many columns)
•Simple, predictable structure

Snowflake Schema Structure

•Fact table → base dimension → hierarchy tables
•Dimensions decomposed into multiple tables
•Multiple joins for higher hierarchy levels
•Narrow, focused tables (few columns each)
•Complex structure reflecting domain hierarchy

The Join Multiplication Effect:

Consider a query that aggregates sales by region. In a star schema:

SELECT region_name, SUM(amount) 
FROM fact_sales f JOIN dim_store s ON f.store_id = s.store_id
GROUP BY region_name;
-- 1 join

In a snowflake schema:

SELECT r.region_name, SUM(f.amount)
FROM fact_sales f 
  JOIN dim_store st ON f.store_id = st.store_id
  JOIN dim_city c ON st.city_id = c.city_id
  JOIN dim_state s ON c.state_id = s.state_id
  JOIN dim_country co ON s.country_id = co.country_id
  JOIN dim_region r ON co.region_id = r.region_id
GROUP BY r.region_name;
-- 5 joins

This join multiplication is the primary query complexity cost of snowflake schemas.

Data Integrity Benefits

While query complexity increases, normalized dimensions provide significant data integrity advantages that can be critical for certain use cases.

Data Integrity Advantages

•Single Point of Truth: Each piece of information is stored exactly once. A country's ISO code exists in one row of DIM_COUNTRY, not repeated across millions of customer or store records.
•Consistent Updates: When a segment's discount changes, you update one row in DIM_SEGMENT. In denormalized design, you'd update millions of rows in DIM_CUSTOMER and risk inconsistencies.
•Enforced Hierarchies: Foreign key constraints ensure valid relationships. You can't have a city referencing a non-existent state. Referential integrity is structurally enforced.
•Reduced Anomalies: Insertion, update, and deletion anomalies are eliminated or minimized. You don't lose category information when deleting the last product in that category.
•Easier Auditing: With single sources of truth, auditing changes is straightforward. You can track when a city's population was updated without scanning millions of rows.

Real-World Impact

Consider a scenario where a company restructures its sales regions. In a denormalized star schema, this might require updating millions of customer and transaction records. In a normalized snowflake schema, you update the region assignments in DIM_COUNTRY and the change propagates through the hierarchy automatically in queries.

Example: Handling a City Rename

Suppose Bombay is renamed to Mumbai (an actual historical event).

In Star Schema:

UPDATE dim_customer SET city_name = 'Mumbai' WHERE city_name = 'Bombay';
UPDATE dim_store SET city_name = 'Mumbai' WHERE city_name = 'Bombay';
UPDATE dim_supplier SET city_name = 'Mumbai' WHERE city_name = 'Bombay';
-- Potentially millions of rows across multiple tables
-- Risk of inconsistency if any update fails

In Snowflake Schema:

UPDATE dim_city SET city_name = 'Mumbai' WHERE city_name = 'Bombay';
-- One row updated
-- All referencing tables see the change immediately through joins
-- Zero risk of inconsistency

Practical Considerations

When implementing normalized dimensions, several practical factors require attention to ensure a successful design.

ETL Complexity Increases

Normalized dimensions require more sophisticated ETL (Extract, Transform, Load) processes:

Load Order Matters: Parent tables must be loaded before children. DIM_REGION before DIM_COUNTRY before DIM_STATE, etc.
Key Lookups: Each dimension level needs surrogate key lookups during transformation.
Cascading Updates: Changes to hierarchy may need to cascade through multiple levels.
Referential Integrity Checks: More foreign keys mean more constraint validation.

ETL Loading Sequence Example:

1. Load DIM_REGION (no dependencies)
2. Load DIM_COUNTRY (lookup region_id)
3. Load DIM_STATE (lookup country_id)
4. Load DIM_CITY (lookup state_id)
5. Load DIM_INDUSTRY, DIM_SEGMENT, DIM_ACCOUNT_MANAGER (parallel, no dependencies)
6. Load DIM_CUSTOMER (lookup city_id, segment_id, industry_id, manager_id)
7. Load FACT_SALES (lookup customer_id, product_id, date_id, etc.)

Design Principle

The depth of normalization should match the depth of natural hierarchies in your business domain. Don't create artificial hierarchy levels just for normalization—the goal is to represent real-world structures accurately while eliminating true redundancy.

Summary: Normalized Dimensions

We've established the foundation for understanding the snowflake schema by exploring normalized dimensions in depth. Let's consolidate the key concepts:

Key Takeaways

•Normalized dimensions apply 3NF principles to warehouse dimension tables, extracting hierarchical and repeating data into separate related tables.
•Dimension hierarchies are the primary driver — natural structures like Product → Category → Department or Customer → City → State → Country become explicit tables.
•Each hierarchy level becomes its own table with a primary key and foreign key to the parent level, creating chains of relationships.
•Data integrity improves significantly — single sources of truth eliminate update anomalies and ensure consistency across the warehouse.
•Query complexity increases — accessing higher hierarchy levels requires joining through intermediate tables (join multiplication).
•ETL processes become more sophisticated — loading sequence matters, and key lookups multiply.
•Indexing foreign keys is essential — without proper indexes, join performance suffers dramatically.

What's Next:

Now that we understand what normalized dimensions are and how they're structured, the next page will compare snowflake and star schemas directly—examining their trade-offs in terms of query performance, storage efficiency, and operational complexity.

Page Complete

You now understand normalized dimensions—the building blocks of snowflake schema design. You can identify dimension hierarchies, understand the normalization process, and appreciate both the data integrity benefits and structural complexity that result from this approach.