Loading content...
In the previous module, we explored the star schema—a design pattern that places a central fact table surrounded by denormalized dimension tables. The star schema's simplicity and query performance made it the workhorse of data warehousing for decades. But simplicity comes with trade-offs.
What if your dimension tables grow to contain millions of rows with significant redundancy? What if storage costs become prohibitive, or data quality suffers from update anomalies? What if your dimensions have natural hierarchies that would benefit from explicit structural representation?
Enter the snowflake schema—a normalized variant that applies relational database normalization principles to dimension tables, transforming the simple star into a more complex, multi-level structure resembling a snowflake.
By the end of this page, you will understand what normalized dimensions are, why they're used in snowflake schemas, how they differ structurally from denormalized dimensions, and the normalization principles that guide their design. You'll be able to identify dimension hierarchies and understand how they're decomposed into related tables.
Before we dive into normalized dimensions specifically, let's recall why normalization matters in database design. Normalization is the process of organizing a relational database to reduce data redundancy and improve data integrity.
In OLTP (Online Transaction Processing) systems, normalization to at least Third Normal Form (3NF) is standard practice. This eliminates:
The snowflake schema applies these same principles to dimension tables in a data warehouse context.
1NF: Atomic values, no repeating groups. 2NF: 1NF plus no partial dependencies (all non-key attributes depend on the entire primary key). 3NF: 2NF plus no transitive dependencies (non-key attributes don't depend on other non-key attributes). The snowflake schema typically normalizes dimensions to 3NF.
Why consider normalization in a warehouse?
Data warehouses were traditionally designed with denormalized dimensions (star schema) for several reasons:
However, modern data challenges have shifted some of these assumptions:
These factors have renewed interest in normalized dimension designs—hence the snowflake schema.
A normalized dimension is a dimension table that has been decomposed into multiple related tables according to normal form principles, typically Third Normal Form (3NF). Instead of storing all dimension attributes in a single wide table, hierarchical or repeating attributes are extracted into separate tables linked by foreign keys.
The core concept:
In a star schema, a DIM_PRODUCT table might contain:
| product_id | product_name | category_name | category_desc | subcategory | brand | brand_country |
|---|
Notice the redundancy: every product in "Electronics" repeats "Electronics" and its description. Every Apple product repeats "Apple" and "USA".
In a snowflake schema, this becomes multiple related tables:
DIM_PRODUCT: product_id, product_name, subcategory_id (FK), brand_id (FK)DIM_SUBCATEGORY: subcategory_id, subcategory_name, category_id (FK)DIM_CATEGORY: category_id, category_name, category_descriptionDIM_BRAND: brand_id, brand_name, country_id (FK)DIM_COUNTRY: country_id, country_nameThe name 'snowflake schema' comes from the visual appearance when you diagram these relationships. Each dimension table branches into multiple related tables, creating a pattern that resembles a snowflake crystal radiating from the central fact table.
The key insight driving normalized dimensions is that dimensions often contain natural hierarchies. A hierarchy represents a parent-child relationship where higher-level entities contain or categorize lower-level entities.
Common dimensional hierarchies:
| Dimension | Hierarchy Levels | Example Path |
|---|---|---|
| Time/Date | Year → Quarter → Month → Week → Day | 2024 → Q1 → January → Week 3 → Jan 15 |
| Geography | Region → Country → State → City → Store | EMEA → Germany → Bavaria → Munich → Store #472 |
| Product | Division → Category → Subcategory → Product | Consumer Electronics → Computers → Laptops → MacBook Pro 16" |
| Organization | Company → Division → Department → Team | Acme Corp → Engineering → Backend → API Team |
| Customer | Segment → Industry → Company → Account | Enterprise → Finance → Goldman Sachs → Trading Desk #7 |
Why hierarchies matter for normalization:
In a denormalized dimension, hierarchy levels are stored as columns in a single table. Every row repeats the higher-level values:
store_id | store_name | city | state | country | region
1 | Downtown | LA | CA | USA | Americas
2 | Westside | LA | CA | USA | Americas
3 | Midtown | NYC | NY | USA | Americas
4 | Soho | NYC | NY | USA | Americas
...
With 10,000 stores, you store "USA" and "Americas" 10,000 times. With normalization, you store each geographic level once and reference it:
DIM_STORE: store_id, store_name, city_id
DIM_CITY: city_id, city_name, state_id
DIM_STATE: state_id, state_name, country_id
DIM_COUNTRY: country_id, country_name, region_id
DIM_REGION: region_id, region_name
Not all hierarchies have the same depth. Time dimensions typically have 5-6 levels, while a simple status dimension might have only 2 (status category → status). The snowflake schema adapts to represent whatever natural hierarchies exist in your business domain.
Let's walk through normalizing a dimension step by step to understand the transformation process. We'll use a Customer dimension as our example.
Original Denormalized Customer Dimension (Star Schema):
1234567891011121314151617181920212223242526272829303132333435363738
-- Star Schema: Single wide table with all customer attributesCREATE TABLE DIM_CUSTOMER ( customer_id INT PRIMARY KEY, customer_name VARCHAR(100), customer_email VARCHAR(255), customer_phone VARCHAR(20), -- Address hierarchy (denormalized) street_address VARCHAR(200), city_name VARCHAR(100), city_population INT, -- Repeated for every customer in this city state_name VARCHAR(100), state_code CHAR(2), country_name VARCHAR(100), country_iso_code CHAR(3), region_name VARCHAR(50), -- Customer segment hierarchy (denormalized) segment_name VARCHAR(50), -- "Enterprise", "SMB", "Consumer" segment_discount DECIMAL(5,2), -- Same for all customers in segment industry_name VARCHAR(100), industry_code VARCHAR(10), -- Account management account_manager VARCHAR(100), manager_email VARCHAR(255), manager_region VARCHAR(50)); -- Sample data showing redundancy:-- customer_id | ... | city_name | city_population | state_name | segment_name | segment_discount | ...-- 1 | ... | Seattle | 750000 | Washington | Enterprise | 0.15 | ...-- 2 | ... | Seattle | 750000 | Washington | Enterprise | 0.15 | ...-- 3 | ... | Seattle | 750000 | Washington | Consumer | 0.05 | ...-- 4 | ... | Portland | 650000 | Oregon | Enterprise | 0.15 | ... -- Every Seattle customer stores 750000 redundantly-- Every Enterprise customer stores 0.15 redundantlyIdentifying Normalization Opportunities:
Analyzing this table, we can identify several functional dependencies that violate 3NF:
city_name → city_population (non-key determines non-key)state_name → state_code (non-key determines non-key)country_name → country_iso_code (non-key determines non-key)segment_name → segment_discount (non-key determines non-key)account_manager → manager_email, manager_region (non-key determines non-keys)industry_name → industry_code (non-key determines non-key)Each of these represents a transitive dependency that normalization will eliminate.
1234567891011121314151617181920212223242526272829303132333435363738394041424344454647484950515253545556575859606162636465666768
-- Snowflake Schema: Normalized dimension tables -- Level 5: Region (highest level of geographic hierarchy)CREATE TABLE DIM_REGION ( region_id INT PRIMARY KEY, region_name VARCHAR(50) NOT NULL); -- Level 4: CountryCREATE TABLE DIM_COUNTRY ( country_id INT PRIMARY KEY, country_name VARCHAR(100) NOT NULL, country_iso_code CHAR(3) NOT NULL, region_id INT NOT NULL REFERENCES DIM_REGION(region_id)); -- Level 3: State/ProvinceCREATE TABLE DIM_STATE ( state_id INT PRIMARY KEY, state_name VARCHAR(100) NOT NULL, state_code CHAR(2), country_id INT NOT NULL REFERENCES DIM_COUNTRY(country_id)); -- Level 2: CityCREATE TABLE DIM_CITY ( city_id INT PRIMARY KEY, city_name VARCHAR(100) NOT NULL, city_population INT, state_id INT NOT NULL REFERENCES DIM_STATE(state_id)); -- Industry reference tableCREATE TABLE DIM_INDUSTRY ( industry_id INT PRIMARY KEY, industry_name VARCHAR(100) NOT NULL, industry_code VARCHAR(10) NOT NULL); -- Customer segment reference tableCREATE TABLE DIM_SEGMENT ( segment_id INT PRIMARY KEY, segment_name VARCHAR(50) NOT NULL, segment_discount DECIMAL(5,2) NOT NULL); -- Account manager reference tableCREATE TABLE DIM_ACCOUNT_MANAGER ( manager_id INT PRIMARY KEY, manager_name VARCHAR(100) NOT NULL, manager_email VARCHAR(255) NOT NULL, manager_region VARCHAR(50)); -- Core customer dimension (now slim and normalized)CREATE TABLE DIM_CUSTOMER ( customer_id INT PRIMARY KEY, customer_name VARCHAR(100) NOT NULL, customer_email VARCHAR(255), customer_phone VARCHAR(20), street_address VARCHAR(200), -- Foreign keys to normalized reference tables city_id INT REFERENCES DIM_CITY(city_id), segment_id INT REFERENCES DIM_SEGMENT(segment_id), industry_id INT REFERENCES DIM_INDUSTRY(industry_id), manager_id INT REFERENCES DIM_ACCOUNT_MANAGER(manager_id));The original single table has been decomposed into 8 tables. Each table now stores unique data without redundancy. City population is stored once per city. Segment discount is stored once per segment. Manager email is stored once per manager. This eliminates all transitive dependencies.
Normalizing dimensions has significant implications for the overall warehouse structure. Let's examine the key structural changes.
The Join Multiplication Effect:
Consider a query that aggregates sales by region. In a star schema:
SELECT region_name, SUM(amount)
FROM fact_sales f JOIN dim_store s ON f.store_id = s.store_id
GROUP BY region_name;
-- 1 join
In a snowflake schema:
SELECT r.region_name, SUM(f.amount)
FROM fact_sales f
JOIN dim_store st ON f.store_id = st.store_id
JOIN dim_city c ON st.city_id = c.city_id
JOIN dim_state s ON c.state_id = s.state_id
JOIN dim_country co ON s.country_id = co.country_id
JOIN dim_region r ON co.region_id = r.region_id
GROUP BY r.region_name;
-- 5 joins
This join multiplication is the primary query complexity cost of snowflake schemas.
While query complexity increases, normalized dimensions provide significant data integrity advantages that can be critical for certain use cases.
Consider a scenario where a company restructures its sales regions. In a denormalized star schema, this might require updating millions of customer and transaction records. In a normalized snowflake schema, you update the region assignments in DIM_COUNTRY and the change propagates through the hierarchy automatically in queries.
Example: Handling a City Rename
Suppose Bombay is renamed to Mumbai (an actual historical event).
In Star Schema:
UPDATE dim_customer SET city_name = 'Mumbai' WHERE city_name = 'Bombay';
UPDATE dim_store SET city_name = 'Mumbai' WHERE city_name = 'Bombay';
UPDATE dim_supplier SET city_name = 'Mumbai' WHERE city_name = 'Bombay';
-- Potentially millions of rows across multiple tables
-- Risk of inconsistency if any update fails
In Snowflake Schema:
UPDATE dim_city SET city_name = 'Mumbai' WHERE city_name = 'Bombay';
-- One row updated
-- All referencing tables see the change immediately through joins
-- Zero risk of inconsistency
When implementing normalized dimensions, several practical factors require attention to ensure a successful design.
ETL Complexity Increases
Normalized dimensions require more sophisticated ETL (Extract, Transform, Load) processes:
ETL Loading Sequence Example:
1. Load DIM_REGION (no dependencies)
2. Load DIM_COUNTRY (lookup region_id)
3. Load DIM_STATE (lookup country_id)
4. Load DIM_CITY (lookup state_id)
5. Load DIM_INDUSTRY, DIM_SEGMENT, DIM_ACCOUNT_MANAGER (parallel, no dependencies)
6. Load DIM_CUSTOMER (lookup city_id, segment_id, industry_id, manager_id)
7. Load FACT_SALES (lookup customer_id, product_id, date_id, etc.)
The depth of normalization should match the depth of natural hierarchies in your business domain. Don't create artificial hierarchy levels just for normalization—the goal is to represent real-world structures accurately while eliminating true redundancy.
We've established the foundation for understanding the snowflake schema by exploring normalized dimensions in depth. Let's consolidate the key concepts:
What's Next:
Now that we understand what normalized dimensions are and how they're structured, the next page will compare snowflake and star schemas directly—examining their trade-offs in terms of query performance, storage efficiency, and operational complexity.
You now understand normalized dimensions—the building blocks of snowflake schema design. You can identify dimension hierarchies, understand the normalization process, and appreciate both the data integrity benefits and structural complexity that result from this approach.