The star vs. snowflake debate often implies a binary choice: pick one or the other. In practice, the most effective data warehouse designs are hybrid—selectively applying normalization where it provides value while maintaining denormalization where simplicity and performance matter.
This page explores hybrid patterns that experienced architects use to build pragmatic, high-performance data warehouses that satisfy diverse stakeholder requirements. You'll learn to design schemas that capture the integrity benefits of snowflake where needed and the query simplicity of star where it matters most.
By the end of this page, you'll understand the major hybrid patterns (starflake, galaxy, and layered approaches), know how to implement materialized views for performance, and be able to design pragmatic schemas that balance competing requirements.
The starflake schema (or partial snowflake) is the most common hybrid pattern. It selectively normalizes some dimensions while leaving others denormalized, based on the specific characteristics of each dimension.
Design rationale for this starflake example:
Each dimension is analyzed independently:
| Dimension | Cardinality | Hierarchy Depth | Update Frequency | Self-Service Access | Decision |
|---|---|---|---|---|---|
| Product | 1M rows | 4 levels | Weekly | Specialists only | Normalize ✓ |
| Customer | 500K rows | 2 levels | Rare | Heavy self-service | Denormalize ✓ |
| Store | 10K rows | 3 levels | Monthly | Moderate | Normalize ✓ |
| Time | 10K rows | 5 levels | Never | Everyone | Denormalize ✓ |
Apply the decision criteria from the previous page to each dimension independently. Normalize dimensions with high cardinality, deep hierarchies, and frequent updates. Denormalize dimensions with flat structures, heavy self-service access, or static values.
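The per-dimension decision can be sketched as a simple scoring function. This is an illustration only: the thresholds below (100K rows, 3 hierarchy levels, monthly updates) are assumptions chosen to match the table above, not fixed rules.

```python
# Illustrative sketch of the per-dimension normalize/denormalize decision.
# The thresholds are assumptions for demonstration, not fixed rules.

def recommend(cardinality, hierarchy_depth, updates_per_year, heavy_self_service):
    """Score a dimension: positive -> normalize, otherwise denormalize."""
    score = 0
    score += 1 if cardinality >= 100_000 else -1   # high cardinality favors normalizing
    score += 1 if hierarchy_depth >= 3 else -1     # deep hierarchies favor normalizing
    score += 1 if updates_per_year >= 12 else -1   # frequent updates favor normalizing
    score += -2 if heavy_self_service else 0       # self-service strongly favors flat dims
    return "normalize" if score > 0 else "denormalize"

# The four dimensions from the table above:
print(recommend(1_000_000, 4, 52, False))  # Product  -> normalize
print(recommend(500_000, 2, 1, True))      # Customer -> denormalize
print(recommend(10_000, 3, 12, False))     # Store    -> normalize
print(recommend(10_000, 5, 0, True))       # Time     -> denormalize
```

Note how Time is denormalized despite its deep hierarchy: the self-service and static-data factors outweigh it, which is exactly the judgment the table records.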
Let's walk through the SQL implementation of a starflake schema, highlighting the design choices for each dimension type.
```sql
-- ================================================================
-- STARFLAKE SCHEMA IMPLEMENTATION
-- Hybrid: Some dimensions normalized, others denormalized
-- ================================================================

-- ================================================================
-- NORMALIZED DIMENSION: Product (high cardinality, frequent updates)
-- ================================================================

CREATE TABLE dim_brand (
    brand_id SERIAL PRIMARY KEY,
    brand_name VARCHAR(100) NOT NULL,
    brand_country VARCHAR(50),
    brand_tier VARCHAR(20)        -- luxury, premium, standard
);

CREATE TABLE dim_category (
    category_id SERIAL PRIMARY KEY,
    category_name VARCHAR(100) NOT NULL,
    category_code VARCHAR(10) NOT NULL,
    tax_class VARCHAR(20)
);

CREATE TABLE dim_subcategory (
    subcategory_id SERIAL PRIMARY KEY,
    subcategory_name VARCHAR(100) NOT NULL,
    category_id INT NOT NULL REFERENCES dim_category(category_id)
);

CREATE TABLE dim_product (
    product_id SERIAL PRIMARY KEY,
    product_name VARCHAR(200) NOT NULL,
    product_sku VARCHAR(50) NOT NULL,
    unit_price DECIMAL(10,2),
    -- Foreign keys to normalized hierarchy
    subcategory_id INT NOT NULL REFERENCES dim_subcategory(subcategory_id),
    brand_id INT NOT NULL REFERENCES dim_brand(brand_id),
    -- Product-specific attributes (not hierarchical)
    weight_kg DECIMAL(8,2),
    is_active BOOLEAN DEFAULT true
);

-- Indexes for FK joins
CREATE INDEX idx_product_subcategory ON dim_product(subcategory_id);
CREATE INDEX idx_product_brand ON dim_product(brand_id);
CREATE INDEX idx_subcategory_category ON dim_subcategory(category_id);

-- ================================================================
-- NORMALIZED DIMENSION: Store (geographic hierarchy)
-- ================================================================

CREATE TABLE dim_region (
    region_id SERIAL PRIMARY KEY,
    region_name VARCHAR(50) NOT NULL,
    region_manager VARCHAR(100)
);

CREATE TABLE dim_district (
    district_id SERIAL PRIMARY KEY,
    district_name VARCHAR(100) NOT NULL,
    district_code VARCHAR(10) NOT NULL,
    region_id INT NOT NULL REFERENCES dim_region(region_id)
);

CREATE TABLE dim_store (
    store_id SERIAL PRIMARY KEY,
    store_name VARCHAR(100) NOT NULL,
    store_code VARCHAR(10) NOT NULL,
    address VARCHAR(200),
    city VARCHAR(100),
    state VARCHAR(50),
    -- Foreign key to normalized hierarchy
    district_id INT NOT NULL REFERENCES dim_district(district_id),
    -- Store-specific attributes
    square_footage INT,
    open_date DATE,
    is_active BOOLEAN DEFAULT true
);

CREATE INDEX idx_store_district ON dim_store(district_id);
CREATE INDEX idx_district_region ON dim_district(region_id);

-- ================================================================
-- DENORMALIZED DIMENSION: Customer (flat, self-service friendly)
-- ================================================================

CREATE TABLE dim_customer (
    customer_id SERIAL PRIMARY KEY,
    customer_name VARCHAR(100) NOT NULL,
    email VARCHAR(255),
    phone VARCHAR(20),
    -- Demographics (flat, not hierarchical)
    age_band VARCHAR(20),
    gender VARCHAR(10),
    -- Address (denormalized for simplicity)
    street_address VARCHAR(200),
    city VARCHAR(100),
    state VARCHAR(50),
    country VARCHAR(50),
    postal_code VARCHAR(20),
    -- Segmentation (flat)
    segment VARCHAR(50),          -- Consumer, Business, Enterprise
    tier VARCHAR(20),             -- Bronze, Silver, Gold, Platinum
    acquisition_channel VARCHAR(50),
    first_purchase_date DATE
);

-- ================================================================
-- DENORMALIZED DIMENSION: Time (static, no updates ever)
-- ================================================================

CREATE TABLE dim_time (
    time_id INT PRIMARY KEY,
    date_actual DATE NOT NULL,
    -- Day level (all in one table for simplicity)
    day_of_week SMALLINT,
    day_name VARCHAR(10),
    day_of_month SMALLINT,
    day_of_year SMALLINT,
    is_weekend BOOLEAN,
    is_holiday BOOLEAN,
    -- Week level
    week_of_year SMALLINT,
    -- Month level
    month_of_year SMALLINT,
    month_name VARCHAR(10),
    month_start_date DATE,
    month_end_date DATE,
    -- Quarter level
    quarter_of_year SMALLINT,
    quarter_name VARCHAR(10),     -- Q1, Q2, Q3, Q4
    -- Year level
    year_actual SMALLINT,
    -- Fiscal calendar (company-specific)
    fiscal_month SMALLINT,
    fiscal_quarter SMALLINT,
    fiscal_year SMALLINT
);

-- ================================================================
-- FACT TABLE: References all dimension types uniformly
-- ================================================================

CREATE TABLE fact_sales (
    sale_id BIGSERIAL PRIMARY KEY,
    -- Foreign keys (same pattern for normalized and denormalized dims)
    product_id INT NOT NULL REFERENCES dim_product(product_id),
    customer_id INT NOT NULL REFERENCES dim_customer(customer_id),
    store_id INT NOT NULL REFERENCES dim_store(store_id),
    time_id INT NOT NULL REFERENCES dim_time(time_id),
    -- Measures
    quantity INT NOT NULL,
    unit_price DECIMAL(10,2) NOT NULL,
    discount_pct DECIMAL(5,2) DEFAULT 0,
    sales_amount DECIMAL(12,2) NOT NULL,
    cost_amount DECIMAL(12,2)
);

-- Fact table indexes
CREATE INDEX idx_fact_product ON fact_sales(product_id);
CREATE INDEX idx_fact_customer ON fact_sales(customer_id);
CREATE INDEX idx_fact_store ON fact_sales(store_id);
CREATE INDEX idx_fact_time ON fact_sales(time_id);
```

Notice that the fact table has the same structure regardless of whether dimensions are normalized or denormalized. It always references the base dimension table. The 'snowflaking' happens on the dimension side of the schema, not in the fact-to-dimension relationship.
A galaxy schema (also called fact constellation or multi-star schema) contains multiple fact tables that share common dimension tables. This pattern naturally leads to hybrid designs as different fact tables have different requirements.
Galaxy schemas are characterized by shared dimensions (used by multiple fact tables, which must therefore stay conformed across subject areas) alongside fact-specific dimensions that only one fact table references. The key design consideration is keeping those shared dimensions consistent as each subject area evolves.
Most enterprise data warehouses are galaxy schemas in practice. Separate subject areas (sales, inventory, HR, finance) have their own fact tables but share common dimensions like time, geography, and organization. The hybrid approach (normalize some, denormalize others) applies to these shared dimensions.
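A minimal runnable sketch of the galaxy pattern, using Python's sqlite3 so the whole thing fits in memory. Two fact tables share conformed `dim_time` and `dim_product` dimensions; `fact_inventory` and all data values are invented for illustration, and the drill-across query aggregates each fact separately before joining on the shared dimensions.

```python
# Galaxy schema sketch: two fact tables sharing conformed dimensions.
# fact_inventory and the sample data are hypothetical illustrations.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE dim_time (time_id INTEGER PRIMARY KEY, month_name TEXT);
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, product_name TEXT);

    -- Two fact tables referencing the SAME dimension tables
    CREATE TABLE fact_sales (
        product_id INT REFERENCES dim_product(product_id),
        time_id INT REFERENCES dim_time(time_id),
        sales_amount REAL);
    CREATE TABLE fact_inventory (
        product_id INT REFERENCES dim_product(product_id),
        time_id INT REFERENCES dim_time(time_id),
        units_on_hand INT);

    INSERT INTO dim_time VALUES (202401, 'January');
    INSERT INTO dim_product VALUES (1, 'Widget');
    INSERT INTO fact_sales VALUES (1, 202401, 500.0), (1, 202401, 250.0);
    INSERT INTO fact_inventory VALUES (1, 202401, 40);
""")

# Drill-across: aggregate each fact at the same grain, then join on shared dims
row = con.execute("""
    SELECT p.product_name, t.month_name, s.total_sales, i.total_on_hand
    FROM (SELECT product_id, time_id, SUM(sales_amount) AS total_sales
          FROM fact_sales GROUP BY product_id, time_id) s
    JOIN (SELECT product_id, time_id, SUM(units_on_hand) AS total_on_hand
          FROM fact_inventory GROUP BY product_id, time_id) i
      ON s.product_id = i.product_id AND s.time_id = i.time_id
    JOIN dim_product p ON s.product_id = p.product_id
    JOIN dim_time t ON s.time_id = t.time_id
""").fetchone()
print(row)  # ('Widget', 'January', 750.0, 40)
```

Because both facts reference identical dimension keys, the drill-across join is trivial; this is exactly why conformance of shared dimensions matters in a galaxy schema.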
Perhaps the most powerful hybrid pattern is the layered architecture: use snowflake schema as the normalized source layer and materialize star schema views for consumption. This captures benefits of both approaches.
Layer Responsibilities:
1. Raw/Staging Layer
2. Snowflake Layer (Source of Truth)
3. Star Layer (Consumption)
```sql
-- ================================================================
-- SNOWFLAKE LAYER: Normalized source of truth
-- ================================================================

-- (These are the actual tables storing data)
-- dim_product, dim_subcategory, dim_category, dim_brand
-- dim_customer, dim_city, dim_state, dim_country
-- fact_sales

-- ================================================================
-- STAR LAYER: Denormalized views for consumption
-- ================================================================

-- Wide product dimension view (joins all hierarchy levels)
CREATE OR REPLACE VIEW v_dim_product_wide AS
SELECT
    p.product_id,
    p.product_name,
    p.product_sku,
    p.unit_price,
    sc.subcategory_id,
    sc.subcategory_name,
    c.category_id,
    c.category_name,
    c.category_code,
    c.tax_class,
    b.brand_id,
    b.brand_name,
    b.brand_country,
    b.brand_tier
FROM dim_product p
JOIN dim_subcategory sc ON p.subcategory_id = sc.subcategory_id
JOIN dim_category c ON sc.category_id = c.category_id
JOIN dim_brand b ON p.brand_id = b.brand_id;

-- Wide customer dimension view
CREATE OR REPLACE VIEW v_dim_customer_wide AS
SELECT
    cu.customer_id,
    cu.customer_name,
    cu.email,
    cu.phone,
    cu.street_address,
    ci.city_name,
    ci.city_population,
    s.state_name,
    s.state_code,
    co.country_name,
    co.country_iso_code,
    r.region_name,
    cu.segment,
    cu.tier
FROM dim_customer cu
JOIN dim_city ci ON cu.city_id = ci.city_id
JOIN dim_state s ON ci.state_id = s.state_id
JOIN dim_country co ON s.country_id = co.country_id
JOIN dim_region r ON co.region_id = r.region_id;

-- ================================================================
-- MATERIALIZED VIEWS for performance-critical paths
-- ================================================================

-- Materialized wide product (refreshed hourly)
CREATE MATERIALIZED VIEW mv_dim_product_wide AS
SELECT * FROM v_dim_product_wide;

CREATE UNIQUE INDEX idx_mv_product_id ON mv_dim_product_wide(product_id);

-- Pre-aggregated sales by category and month (refreshed daily)
CREATE MATERIALIZED VIEW mv_sales_category_monthly AS
SELECT
    c.category_id,
    c.category_name,
    t.year_actual,
    t.month_of_year,
    SUM(f.sales_amount) AS total_sales,
    SUM(f.quantity) AS total_quantity,
    COUNT(*) AS transaction_count
FROM fact_sales f
JOIN dim_product p ON f.product_id = p.product_id
JOIN dim_subcategory sc ON p.subcategory_id = sc.subcategory_id
JOIN dim_category c ON sc.category_id = c.category_id
JOIN dim_time t ON f.time_id = t.time_id
GROUP BY c.category_id, c.category_name, t.year_actual, t.month_of_year;

-- Refresh schedule
-- REFRESH MATERIALIZED VIEW CONCURRENTLY mv_dim_product_wide;
-- REFRESH MATERIALIZED VIEW CONCURRENTLY mv_sales_category_monthly;
```

The layered approach provides: (1) data integrity through normalized source tables, (2) query performance through denormalized views, (3) flexibility to add new consumption patterns without changing the source schema, and (4) clear separation of concerns between data modeling and performance optimization.
Hybrid schemas require thoughtful ETL design. Let's examine practical patterns for loading and maintaining hybrid architectures.
dbt (data build tool) is ideal for layered architectures:
```yaml
# dbt_project.yml
models:
  warehouse:
    staging:            # Raw data cleaning
      +materialized: view
    intermediate:       # Snowflake normalized layer
      +materialized: table
    marts:              # Star consumption layer
      core:
        +materialized: table
      finance:
        +materialized: table
```
dbt Model Example:
```sql
-- models/intermediate/dim_product.sql
-- Normalized product dimension
{{ config(materialized='table') }}

SELECT
    product_id,
    product_name,
    subcategory_id,
    brand_id
FROM {{ ref('stg_products') }}
```

```sql
-- models/marts/core/dim_product_wide.sql
-- Denormalized for consumption
{{ config(materialized='table') }}

SELECT
    p.*,
    sc.subcategory_name,
    c.category_name,
    c.category_code,
    b.brand_name,
    b.brand_tier
FROM {{ ref('dim_product') }} p
JOIN {{ ref('dim_subcategory') }} sc
  ON p.subcategory_id = sc.subcategory_id
JOIN {{ ref('dim_category') }} c
  ON sc.category_id = c.category_id
JOIN {{ ref('dim_brand') }} b
  ON p.brand_id = b.brand_id
```
dbt automatically manages dependencies and build order.
dbt has become the de facto standard for transformation in modern data stacks. Its dependency tracking, testing, and documentation features make it particularly well-suited for hybrid architectures where the relationship between source tables and consumption views must be carefully managed.
With multiple schema layers available, how do you route queries to the optimal layer? Let's examine patterns for query routing in hybrid architectures.
```sql
-- ================================================================
-- QUERY ROUTING IMPLEMENTATION
-- ================================================================

-- Option 1: Semantic Layer (LookML-like abstraction)
-- Users never write SQL; semantic layer routes to optimal source

-- Option 2: View-Based Abstraction
-- Create a 'public' schema with denormalized views
-- Create a 'core' schema with normalized tables

CREATE SCHEMA public_reporting;

-- Denormalized product view in public schema
CREATE OR REPLACE VIEW public_reporting.products AS
SELECT * FROM core.mv_dim_product_wide;

-- Users query 'public_reporting.products'
-- and automatically get optimized, denormalized data

-- Option 3: Materialized View Matching (Database Feature)
-- PostgreSQL example - automatic query rewriting

-- If user queries the normalized view...
EXPLAIN
SELECT category_name, SUM(sales_amount)
FROM core.fact_sales f
JOIN core.dim_product p ON f.product_id = p.product_id
JOIN core.dim_subcategory sc ON p.subcategory_id = sc.subcategory_id
JOIN core.dim_category c ON sc.category_id = c.category_id
JOIN core.dim_time t ON f.time_id = t.time_id
WHERE t.year_actual = 2024
GROUP BY category_name;

-- PostgreSQL can rewrite to use mv_sales_category_monthly instead
-- (if configured with appropriate materialized view support)

-- Option 4: Stored Procedure Router
CREATE OR REPLACE FUNCTION route_sales_query(
    p_granularity VARCHAR,  -- 'daily', 'monthly', 'yearly'
    p_grouping VARCHAR      -- 'product', 'category', 'region'
)
RETURNS TABLE (...) AS $$
BEGIN
    IF p_granularity = 'monthly' AND p_grouping = 'category' THEN
        -- Use pre-aggregated materialized view
        RETURN QUERY SELECT * FROM mv_sales_category_monthly;
    ELSIF p_granularity = 'daily' THEN
        -- Use detailed fact with denormalized views
        RETURN QUERY
        SELECT * FROM fact_sales f
        JOIN v_dim_product_wide p ON f.product_id = p.product_id
        ...;
    ELSE
        -- Use normalized tables for complex queries
        RETURN QUERY
        SELECT * FROM fact_sales f
        JOIN dim_product p ON f.product_id = p.product_id
        JOIN dim_subcategory sc ON p.subcategory_id = sc.subcategory_id
        ...;
    END IF;
END;
$$ LANGUAGE plpgsql;
```

The best query routing is invisible to end users. They should query what they understand (products, customers, sales) and the infrastructure should route to the optimal physical structure. This is the promise of semantic layers and the reason they're becoming standard in modern data stacks.
Let's consolidate the hybrid patterns into a practical implementation guide you can follow for any data warehouse project.
| Use Case | Normalized Dims | Denormalized Dims | Aggregation Strategy |
|---|---|---|---|
| Retail Analytics | Product, Store | Customer, Time | Daily sales by store; Monthly by category |
| Financial Reporting | Account, Cost Center | Time, Geography | Trial balance by period; GL by account |
| Healthcare Analytics | Diagnosis, Provider | Patient, Time | Encounters by facility; Claims by quarter |
| E-commerce | Product, Inventory | Customer, Session | Orders by product; Conversion by funnel step |
| SaaS Metrics | Feature, Plan | User, Time | Usage by feature; ARR by customer segment |
Begin with a pure star schema for fast delivery. Identify pain points (storage costs, update complexity) over time. Selectively normalize dimensions where problems occur. This evolutionary approach is lower risk than designing a complex hybrid schema upfront.
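That evolutionary path can be demonstrated end to end with sqlite3: start with a flat star dimension, extract the repeated category attribute into its own table, and leave a compatibility view with the old name so existing queries keep working. All table, column, and data values here are illustrative, not taken from the schemas above.

```python
# Evolutionary migration sketch: flat star dimension -> normalized
# tables + compatibility view. Names and data are illustrative.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- Phase 1: pure star -- flat product dimension
    CREATE TABLE dim_product_flat (
        product_id INTEGER PRIMARY KEY,
        product_name TEXT,
        category_name TEXT);
    INSERT INTO dim_product_flat VALUES
        (1, 'Laptop', 'Electronics'),
        (2, 'Phone',  'Electronics'),
        (3, 'Desk',   'Furniture');
""")

con.executescript("""
    -- Phase 2: pain point identified -- normalize the repeated category
    CREATE TABLE dim_category (
        category_id INTEGER PRIMARY KEY AUTOINCREMENT,
        category_name TEXT UNIQUE);
    INSERT INTO dim_category (category_name)
        SELECT DISTINCT category_name FROM dim_product_flat;

    CREATE TABLE dim_product (
        product_id INTEGER PRIMARY KEY,
        product_name TEXT,
        category_id INT REFERENCES dim_category(category_id));
    INSERT INTO dim_product
        SELECT f.product_id, f.product_name, c.category_id
        FROM dim_product_flat f
        JOIN dim_category c ON f.category_name = c.category_name;

    DROP TABLE dim_product_flat;

    -- Compatibility view: existing queries still see the flat shape
    CREATE VIEW dim_product_flat AS
        SELECT p.product_id, p.product_name, c.category_name
        FROM dim_product p
        JOIN dim_category c ON p.category_id = c.category_id;
""")

rows = con.execute(
    "SELECT product_id, category_name FROM dim_product_flat ORDER BY product_id"
).fetchall()
print(rows)  # [(1, 'Electronics'), (2, 'Electronics'), (3, 'Furniture')]
```

The compatibility view is what makes the migration low-risk: consumers are repointed gradually, and the normalized tables become the source of truth without a big-bang cutover.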
We've explored hybrid schema patterns that combine the strengths of star and snowflake designs. Let's consolidate the key insights from this module:
Module Complete:
You've now mastered snowflake schema design—from understanding normalized dimensions to comparing with star schemas, evaluating trade-offs, identifying when to use each approach, and implementing practical hybrid patterns. You're equipped to make informed, defensible schema decisions for any data warehousing scenario.
Congratulations! You've completed the Snowflake Schema module. You now understand normalized dimensions, can compare star and snowflake schemas across multiple dimensions, evaluate trade-offs for specific contexts, identify when snowflake is the right choice, and implement hybrid patterns that capture the benefits of both approaches. This knowledge will serve you well in designing data warehouses that balance performance, maintainability, and data quality.