The star vs. snowflake debate often implies a binary choice: pick one or the other. In practice, the most effective data warehouse designs are hybrid—selectively applying normalization where it provides value while maintaining denormalization where simplicity and performance matter.
This page explores hybrid patterns that experienced architects use to build pragmatic, high-performance data warehouses that satisfy diverse stakeholder requirements. You'll learn to design schemas that capture the integrity benefits of snowflake where needed and the query simplicity of star where it matters most.
By the end of this page, you'll understand the major hybrid patterns (starflake, galaxy, and layered approaches), know how to implement materialized views for performance, and be able to design pragmatic schemas that balance competing requirements.
The starflake schema (or partial snowflake) is the most common hybrid pattern. It selectively normalizes some dimensions while leaving others denormalized, based on the specific characteristics of each dimension.
Design rationale for this starflake example:
Each dimension is analyzed independently:
| Dimension | Cardinality | Hierarchy Depth | Update Frequency | Self-Service Access | Decision |
|---|---|---|---|---|---|
| Product | 1M rows | 4 levels | Weekly | Specialists only | Normalize ✓ |
| Customer | 500K rows | 2 levels | Rare | Heavy self-service | Denormalize ✓ |
| Store | 10K rows | 3 levels | Monthly | Moderate | Normalize ✓ |
| Time | 10K rows | 5 levels | Never | Everyone | Denormalize ✓ |
Apply the decision criteria from the previous page to each dimension independently. Normalize dimensions with high cardinality, deep hierarchies, and frequent updates. Denormalize dimensions with flat structures, heavy self-service access, or static values.
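The per-dimension decision can be sketched as a simple scoring function. This is an illustration only: the thresholds below (100K rows, 3 hierarchy levels, monthly updates) are assumptions chosen to match the table above, not fixed rules.

```python
# Illustrative sketch of the per-dimension normalize/denormalize decision.
# The thresholds are assumptions for demonstration, not fixed rules.

def recommend(cardinality, hierarchy_depth, updates_per_year, heavy_self_service):
    """Score a dimension: positive -> normalize, otherwise denormalize."""
    score = 0
    score += 1 if cardinality >= 100_000 else -1   # high cardinality favors normalizing
    score += 1 if hierarchy_depth >= 3 else -1     # deep hierarchies favor normalizing
    score += 1 if updates_per_year >= 12 else -1   # frequent updates favor normalizing
    score += -2 if heavy_self_service else 0       # self-service strongly favors flat dims
    return "normalize" if score > 0 else "denormalize"

# The four dimensions from the table above:
print(recommend(1_000_000, 4, 52, False))  # Product  -> normalize
print(recommend(500_000, 2, 1, True))      # Customer -> denormalize
print(recommend(10_000, 3, 12, False))     # Store    -> normalize
print(recommend(10_000, 5, 0, True))       # Time     -> denormalize
```

Note how Time is denormalized despite its deep hierarchy: the self-service and static-data factors outweigh it, which is exactly the judgment the table records.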
Let's walk through the SQL implementation of a starflake schema, highlighting the design choices for each dimension type.
```sql
-- ================================================================
-- STARFLAKE SCHEMA IMPLEMENTATION
-- Hybrid: Some dimensions normalized, others denormalized
-- ================================================================

-- ================================================================
-- NORMALIZED DIMENSION: Product (high cardinality, frequent updates)
-- ================================================================

CREATE TABLE dim_brand (
    brand_id SERIAL PRIMARY KEY,
    brand_name VARCHAR(100) NOT NULL,
    brand_country VARCHAR(50),
    brand_tier VARCHAR(20)        -- luxury, premium, standard
);

CREATE TABLE dim_category (
    category_id SERIAL PRIMARY KEY,
    category_name VARCHAR(100) NOT NULL,
    category_code VARCHAR(10) NOT NULL,
    tax_class VARCHAR(20)
);

CREATE TABLE dim_subcategory (
    subcategory_id SERIAL PRIMARY KEY,
    subcategory_name VARCHAR(100) NOT NULL,
    category_id INT NOT NULL REFERENCES dim_category(category_id)
);

CREATE TABLE dim_product (
    product_id SERIAL PRIMARY KEY,
    product_name VARCHAR(200) NOT NULL,
    product_sku VARCHAR(50) NOT NULL,
    unit_price DECIMAL(10,2),
    -- Foreign keys to normalized hierarchy
    subcategory_id INT NOT NULL REFERENCES dim_subcategory(subcategory_id),
    brand_id INT NOT NULL REFERENCES dim_brand(brand_id),
    -- Product-specific attributes (not hierarchical)
    weight_kg DECIMAL(8,2),
    is_active BOOLEAN DEFAULT true
);

-- Indexes for FK joins
CREATE INDEX idx_product_subcategory ON dim_product(subcategory_id);
CREATE INDEX idx_product_brand ON dim_product(brand_id);
CREATE INDEX idx_subcategory_category ON dim_subcategory(category_id);

-- ================================================================
-- NORMALIZED DIMENSION: Store (geographic hierarchy)
-- ================================================================

CREATE TABLE dim_region (
    region_id SERIAL PRIMARY KEY,
    region_name VARCHAR(50) NOT NULL,
    region_manager VARCHAR(100)
);

CREATE TABLE dim_district (
    district_id SERIAL PRIMARY KEY,
    district_name VARCHAR(100) NOT NULL,
    district_code VARCHAR(10) NOT NULL,
    region_id INT NOT NULL REFERENCES dim_region(region_id)
);

CREATE TABLE dim_store (
    store_id SERIAL PRIMARY KEY,
    store_name VARCHAR(100) NOT NULL,
    store_code VARCHAR(10) NOT NULL,
    address VARCHAR(200),
    city VARCHAR(100),
    state VARCHAR(50),
    -- Foreign key to normalized hierarchy
    district_id INT NOT NULL REFERENCES dim_district(district_id),
    -- Store-specific attributes
    square_footage INT,
    open_date DATE,
    is_active BOOLEAN DEFAULT true
);

CREATE INDEX idx_store_district ON dim_store(district_id);
CREATE INDEX idx_district_region ON dim_district(region_id);

-- ================================================================
-- DENORMALIZED DIMENSION: Customer (flat, self-service friendly)
-- ================================================================

CREATE TABLE dim_customer (
    customer_id SERIAL PRIMARY KEY,
    customer_name VARCHAR(100) NOT NULL,
    email VARCHAR(255),
    phone VARCHAR(20),
    -- Demographics (flat, not hierarchical)
    age_band VARCHAR(20),
    gender VARCHAR(10),
    -- Address (denormalized for simplicity)
    street_address VARCHAR(200),
    city VARCHAR(100),
    state VARCHAR(50),
    country VARCHAR(50),
    postal_code VARCHAR(20),
    -- Segmentation (flat)
    segment VARCHAR(50),          -- Consumer, Business, Enterprise
    tier VARCHAR(20),             -- Bronze, Silver, Gold, Platinum
    acquisition_channel VARCHAR(50),
    first_purchase_date DATE
);

-- ================================================================
-- DENORMALIZED DIMENSION: Time (static, no updates ever)
-- ================================================================

CREATE TABLE dim_time (
    time_id INT PRIMARY KEY,
    date_actual DATE NOT NULL,
    -- Day level (all in one table for simplicity)
    day_of_week SMALLINT,
    day_name VARCHAR(10),
    day_of_month SMALLINT,
    day_of_year SMALLINT,
    is_weekend BOOLEAN,
    is_holiday BOOLEAN,
    -- Week level
    week_of_year SMALLINT,
    -- Month level
    month_of_year SMALLINT,
    month_name VARCHAR(10),
    month_start_date DATE,
    month_end_date DATE,
    -- Quarter level
    quarter_of_year SMALLINT,
    quarter_name VARCHAR(10),     -- Q1, Q2, Q3, Q4
    -- Year level
    year_actual SMALLINT,
    -- Fiscal calendar (company-specific)
    fiscal_month SMALLINT,
    fiscal_quarter SMALLINT,
    fiscal_year SMALLINT
);

-- ================================================================
-- FACT TABLE: References all dimension types uniformly
-- ================================================================

CREATE TABLE fact_sales (
    sale_id BIGSERIAL PRIMARY KEY,
    -- Foreign keys (same pattern for normalized and denormalized dims)
    product_id INT NOT NULL REFERENCES dim_product(product_id),
    customer_id INT NOT NULL REFERENCES dim_customer(customer_id),
    store_id INT NOT NULL REFERENCES dim_store(store_id),
    time_id INT NOT NULL REFERENCES dim_time(time_id),
    -- Measures
    quantity INT NOT NULL,
    unit_price DECIMAL(10,2) NOT NULL,
    discount_pct DECIMAL(5,2) DEFAULT 0,
    sales_amount DECIMAL(12,2) NOT NULL,
    cost_amount DECIMAL(12,2)
);

-- Fact table indexes
CREATE INDEX idx_fact_product ON fact_sales(product_id);
CREATE INDEX idx_fact_customer ON fact_sales(customer_id);
CREATE INDEX idx_fact_store ON fact_sales(store_id);
CREATE INDEX idx_fact_time ON fact_sales(time_id);
```

Notice that the fact table has the same structure regardless of whether dimensions are normalized or denormalized. It always references the base dimension table. The 'snowflaking' happens on the dimension side of the schema, not in the fact-to-dimension relationship.
A galaxy schema (also called fact constellation or multi-star schema) contains multiple fact tables that share common dimension tables. This pattern naturally leads to hybrid designs as different fact tables have different requirements.
Galaxy schemas are characterized by shared dimensions (used by multiple fact tables, which must therefore stay conformed across subject areas) alongside fact-specific dimensions that only one fact table references. The key design consideration is keeping those shared dimensions consistent as each subject area evolves.
Most enterprise data warehouses are galaxy schemas in practice. Separate subject areas (sales, inventory, HR, finance) have their own fact tables but share common dimensions like time, geography, and organization. The hybrid approach (normalize some, denormalize others) applies to these shared dimensions.
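A minimal runnable sketch of the galaxy pattern, using Python's sqlite3 so the whole thing fits in memory. Two fact tables share conformed `dim_time` and `dim_product` dimensions; `fact_inventory` and all data values are invented for illustration, and the drill-across query aggregates each fact separately before joining on the shared dimensions.

```python
# Galaxy schema sketch: two fact tables sharing conformed dimensions.
# fact_inventory and the sample data are hypothetical illustrations.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE dim_time (time_id INTEGER PRIMARY KEY, month_name TEXT);
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, product_name TEXT);

    -- Two fact tables referencing the SAME dimension tables
    CREATE TABLE fact_sales (
        product_id INT REFERENCES dim_product(product_id),
        time_id INT REFERENCES dim_time(time_id),
        sales_amount REAL);
    CREATE TABLE fact_inventory (
        product_id INT REFERENCES dim_product(product_id),
        time_id INT REFERENCES dim_time(time_id),
        units_on_hand INT);

    INSERT INTO dim_time VALUES (202401, 'January');
    INSERT INTO dim_product VALUES (1, 'Widget');
    INSERT INTO fact_sales VALUES (1, 202401, 500.0), (1, 202401, 250.0);
    INSERT INTO fact_inventory VALUES (1, 202401, 40);
""")

# Drill-across: aggregate each fact at the same grain, then join on shared dims
row = con.execute("""
    SELECT p.product_name, t.month_name, s.total_sales, i.total_on_hand
    FROM (SELECT product_id, time_id, SUM(sales_amount) AS total_sales
          FROM fact_sales GROUP BY product_id, time_id) s
    JOIN (SELECT product_id, time_id, SUM(units_on_hand) AS total_on_hand
          FROM fact_inventory GROUP BY product_id, time_id) i
      ON s.product_id = i.product_id AND s.time_id = i.time_id
    JOIN dim_product p ON s.product_id = p.product_id
    JOIN dim_time t ON s.time_id = t.time_id
""").fetchone()
print(row)  # ('Widget', 'January', 750.0, 40)
```

Because both facts reference identical dimension keys, the drill-across join is trivial; this is exactly why conformance of shared dimensions matters in a galaxy schema.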
Perhaps the most powerful hybrid pattern is the layered architecture: use snowflake schema as the normalized source layer and materialize star schema views for consumption. This captures benefits of both approaches.
Layer Responsibilities:
1. Raw/Staging Layer
2. Snowflake Layer (Source of Truth)
3. Star Layer (Consumption)
```sql
-- ================================================================
-- SNOWFLAKE LAYER: Normalized source of truth
-- ================================================================

-- (These are the actual tables storing data)
-- dim_product, dim_subcategory, dim_category, dim_brand
-- dim_customer, dim_city, dim_state, dim_country
-- fact_sales

-- ================================================================
-- STAR LAYER: Denormalized views for consumption
-- ================================================================

-- Wide product dimension view (joins all hierarchy levels)
CREATE OR REPLACE VIEW v_dim_product_wide AS
SELECT
    p.product_id,
    p.product_name,
    p.product_sku,
    p.unit_price,
    sc.subcategory_id,
    sc.subcategory_name,
    c.category_id,
    c.category_name,
    c.category_code,
    c.tax_class,
    b.brand_id,
    b.brand_name,
    b.brand_country,
    b.brand_tier
FROM dim_product p
JOIN dim_subcategory sc ON p.subcategory_id = sc.subcategory_id
JOIN dim_category c ON sc.category_id = c.category_id
JOIN dim_brand b ON p.brand_id = b.brand_id;

-- Wide customer dimension view
CREATE OR REPLACE VIEW v_dim_customer_wide AS
SELECT
    cu.customer_id,
    cu.customer_name,
    cu.email,
    cu.phone,
    cu.street_address,
    ci.city_name,
    ci.city_population,
    s.state_name,
    s.state_code,
    co.country_name,
    co.country_iso_code,
    r.region_name,
    cu.segment,
    cu.tier
FROM dim_customer cu
JOIN dim_city ci ON cu.city_id = ci.city_id
JOIN dim_state s ON ci.state_id = s.state_id
JOIN dim_country co ON s.country_id = co.country_id
JOIN dim_region r ON co.region_id = r.region_id;

-- ================================================================
-- MATERIALIZED VIEWS for performance-critical paths
-- ================================================================

-- Materialized wide product (refreshed hourly)
CREATE MATERIALIZED VIEW mv_dim_product_wide AS
SELECT * FROM v_dim_product_wide;

CREATE UNIQUE INDEX idx_mv_product_id ON mv_dim_product_wide(product_id);

-- Pre-aggregated sales by category and month (refreshed daily)
CREATE MATERIALIZED VIEW mv_sales_category_monthly AS
SELECT
    c.category_id,
    c.category_name,
    t.year_actual,
    t.month_of_year,
    SUM(f.sales_amount) AS total_sales,
    SUM(f.quantity) AS total_quantity,
    COUNT(*) AS transaction_count
FROM fact_sales f
JOIN dim_product p ON f.product_id = p.product_id
JOIN dim_subcategory sc ON p.subcategory_id = sc.subcategory_id
JOIN dim_category c ON sc.category_id = c.category_id
JOIN dim_time t ON f.time_id = t.time_id
GROUP BY c.category_id, c.category_name, t.year_actual, t.month_of_year;

-- Refresh schedule
-- REFRESH MATERIALIZED VIEW CONCURRENTLY mv_dim_product_wide;
-- REFRESH MATERIALIZED VIEW CONCURRENTLY mv_sales_category_monthly;
```

The layered approach provides: (1) data integrity through normalized source tables, (2) query performance through denormalized views, (3) flexibility to add new consumption patterns without changing the source schema, and (4) clear separation of concerns between data modeling and performance optimization.
Hybrid schemas require thoughtful ETL design. Let's examine practical patterns for loading and maintaining hybrid architectures.
dbt (data build tool) is ideal for layered architectures:
```yaml
# dbt_project.yml
models:
  warehouse:
    staging:            # Raw data cleaning
      +materialized: view
    intermediate:       # Snowflake normalized layer
      +materialized: table
    marts:              # Star consumption layer
      core:
        +materialized: table
      finance:
        +materialized: table
```
dbt Model Example:
```sql
-- models/intermediate/dim_product.sql
-- Normalized product dimension
{{ config(materialized='table') }}

SELECT
    product_id,
    product_name,
    subcategory_id,
    brand_id
FROM {{ ref('stg_products') }}
```

```sql
-- models/marts/core/dim_product_wide.sql
-- Denormalized for consumption
{{ config(materialized='table') }}

SELECT
    p.*,
    sc.subcategory_name,
    c.category_name,
    c.category_code,
    b.brand_name,
    b.brand_tier
FROM {{ ref('dim_product') }} p
JOIN {{ ref('dim_subcategory') }} sc
  ON p.subcategory_id = sc.subcategory_id
JOIN {{ ref('dim_category') }} c
  ON sc.category_id = c.category_id
JOIN {{ ref('dim_brand') }} b
  ON p.brand_id = b.brand_id
```
dbt automatically manages dependencies and build order.
dbt has become the de facto standard for transformation in modern data stacks. Its dependency tracking, testing, and documentation features make it particularly well-suited for hybrid architectures where the relationship between source tables and consumption views must be carefully managed.
With multiple schema layers available, how do you route queries to the optimal layer? Let's examine patterns for query routing in hybrid architectures.
```sql
-- ================================================================
-- QUERY ROUTING IMPLEMENTATION
-- ================================================================

-- Option 1: Semantic Layer (LookML-like abstraction)
-- Users never write SQL; semantic layer routes to optimal source

-- Option 2: View-Based Abstraction
-- Create a 'public' schema with denormalized views
-- Create a 'core' schema with normalized tables

CREATE SCHEMA public_reporting;

-- Denormalized product view in public schema
CREATE OR REPLACE VIEW public_reporting.products AS
SELECT * FROM core.mv_dim_product_wide;

-- Users query 'public_reporting.products'
-- and automatically get optimized, denormalized data

-- Option 3: Materialized View Matching (Database Feature)
-- PostgreSQL example - automatic query rewriting

-- If user queries the normalized view...
EXPLAIN
SELECT category_name, SUM(sales_amount)
FROM core.fact_sales f
JOIN core.dim_product p ON f.product_id = p.product_id
JOIN core.dim_subcategory sc ON p.subcategory_id = sc.subcategory_id
JOIN core.dim_category c ON sc.category_id = c.category_id
JOIN core.dim_time t ON f.time_id = t.time_id
WHERE t.year_actual = 2024
GROUP BY category_name;

-- PostgreSQL can rewrite to use mv_sales_category_monthly instead
-- (if configured with appropriate materialized view support)

-- Option 4: Stored Procedure Router
CREATE OR REPLACE FUNCTION route_sales_query(
    p_granularity VARCHAR,  -- 'daily', 'monthly', 'yearly'
    p_grouping VARCHAR      -- 'product', 'category', 'region'
)
RETURNS TABLE (...) AS $$
BEGIN
    IF p_granularity = 'monthly' AND p_grouping = 'category' THEN
        -- Use pre-aggregated materialized view
        RETURN QUERY SELECT * FROM mv_sales_category_monthly;
    ELSIF p_granularity = 'daily' THEN
        -- Use detailed fact with denormalized views
        RETURN QUERY
        SELECT * FROM fact_sales f
        JOIN v_dim_product_wide p ON f.product_id = p.product_id
        ...;
    ELSE
        -- Use normalized tables for complex queries
        RETURN QUERY
        SELECT * FROM fact_sales f
        JOIN dim_product p ON f.product_id = p.product_id
        JOIN dim_subcategory sc ON p.subcategory_id = sc.subcategory_id
        ...;
    END IF;
END;
$$ LANGUAGE plpgsql;
```

The best query routing is invisible to end users. They should query what they understand (products, customers, sales) and the infrastructure should route to the optimal physical structure. This is the promise of semantic layers and the reason they're becoming standard in modern data stacks.
Let's consolidate the hybrid patterns into a practical implementation guide you can follow for any data warehouse project.
| Use Case | Normalized Dims | Denormalized Dims | Aggregation Strategy |
|---|---|---|---|
| Retail Analytics | Product, Store | Customer, Time | Daily sales by store; Monthly by category |
| Financial Reporting | Account, Cost Center | Time, Geography | Trial balance by period; GL by account |
| Healthcare Analytics | Diagnosis, Provider | Patient, Time | Encounters by facility; Claims by quarter |
| E-commerce | Product, Inventory | Customer, Session | Orders by product; Conversion by funnel step |
| SaaS Metrics | Feature, Plan | User, Time | Usage by feature; ARR by customer segment |
Begin with a pure star schema for fast delivery. Identify pain points (storage costs, update complexity) over time. Selectively normalize dimensions where problems occur. This evolutionary approach is lower risk than designing a complex hybrid schema upfront.
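That evolutionary path can be demonstrated end to end with sqlite3: start with a flat star dimension, extract the repeated category attribute into its own table, and leave a compatibility view with the old name so existing queries keep working. All table, column, and data values here are illustrative, not taken from the schemas above.

```python
# Evolutionary migration sketch: flat star dimension -> normalized
# tables + compatibility view. Names and data are illustrative.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- Phase 1: pure star -- flat product dimension
    CREATE TABLE dim_product_flat (
        product_id INTEGER PRIMARY KEY,
        product_name TEXT,
        category_name TEXT);
    INSERT INTO dim_product_flat VALUES
        (1, 'Laptop', 'Electronics'),
        (2, 'Phone',  'Electronics'),
        (3, 'Desk',   'Furniture');
""")

con.executescript("""
    -- Phase 2: pain point identified -- normalize the repeated category
    CREATE TABLE dim_category (
        category_id INTEGER PRIMARY KEY AUTOINCREMENT,
        category_name TEXT UNIQUE);
    INSERT INTO dim_category (category_name)
        SELECT DISTINCT category_name FROM dim_product_flat;

    CREATE TABLE dim_product (
        product_id INTEGER PRIMARY KEY,
        product_name TEXT,
        category_id INT REFERENCES dim_category(category_id));
    INSERT INTO dim_product
        SELECT f.product_id, f.product_name, c.category_id
        FROM dim_product_flat f
        JOIN dim_category c ON f.category_name = c.category_name;

    DROP TABLE dim_product_flat;

    -- Compatibility view: existing queries still see the flat shape
    CREATE VIEW dim_product_flat AS
        SELECT p.product_id, p.product_name, c.category_name
        FROM dim_product p
        JOIN dim_category c ON p.category_id = c.category_id;
""")

rows = con.execute(
    "SELECT product_id, category_name FROM dim_product_flat ORDER BY product_id"
).fetchall()
print(rows)  # [(1, 'Electronics'), (2, 'Electronics'), (3, 'Furniture')]
```

The compatibility view is what makes the migration low-risk: consumers are repointed gradually, and the normalized tables become the source of truth without a big-bang cutover.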
We've explored hybrid schema patterns that combine the strengths of star and snowflake designs. Let's consolidate the key insights from this module:
Module Complete:
You've now mastered snowflake schema design—from understanding normalized dimensions to comparing with star schemas, evaluating trade-offs, identifying when to use each approach, and implementing practical hybrid patterns. You're equipped to make informed, defensible schema decisions for any data warehousing scenario.
Congratulations! You've completed the Snowflake Schema module. You now understand normalized dimensions, can compare star and snowflake schemas across multiple dimensions, evaluate trade-offs for specific contexts, identify when snowflake is the right choice, and implement hybrid patterns that capture the benefits of both approaches. This knowledge will serve you well in designing data warehouses that balance performance, maintainability, and data quality.