Designing a star schema is not guesswork. It follows a rigorous methodology that begins with understanding business processes and concludes with validated, documented designs ready for implementation. The gap between a novice designer and an expert is not creativity—it's the systematic application of proven techniques.
This page teaches you the dimensional modeling design process—a methodology refined over decades of enterprise data warehouse implementations. You'll learn to extract requirements from business stakeholders, identify the right facts and dimensions, make sound design decisions, and produce schemas that remain useful for years.
The design decisions you make at this stage cascade through everything that follows: ETL development, query performance, user adoption, and maintainability. Time invested in careful design pays compound returns.
By the end of this page, you will understand the dimensional modeling design process, techniques for gathering business requirements, the four-step design process (process → grain → dimensions → facts), common design patterns and anti-patterns, and how to document and validate designs before implementation.
The dimensional modeling design process follows four fundamental steps, each building on the previous. Skipping steps or reversing the order leads to flawed designs.
Step 1: Select the Business Process
Identify which business process the data mart will model. Examples: retail sales, order fulfillment, claims processing, website visits, call center interactions.
Step 2: Declare the Grain
Specify exactly what a single row in the fact table represents. This is the most critical decision in the entire design.
Step 3: Identify the Dimensions
Determine which dimensions provide context for the facts. Ask: "How do business users want to analyze this process?"
Step 4: Identify the Facts
Select the numeric measurements that the business process produces. These must be consistent with the declared grain.
| Step | Question Answered | Outcome | Common Mistakes |
|---|---|---|---|
| 1. Select the business process | What process are we modeling? | Process selection | Modeling entities instead of processes |
| 2. Declare the grain | What does one row represent? | Grain declaration statement | Mixing grains, being vague |
| 3. Identify the dimensions | How will users analyze? | List of dimension tables | Missing critical dimensions, too few attributes |
| 4. Identify the facts | What are we measuring? | List of fact measures | Including non-additive ratios, wrong grain facts |
A common mistake is modeling entities (customers, products, employees) rather than processes (sales, orders, hires). Entities become dimensions. Processes become fact tables. Always ask: 'What business activity are we measuring?' not 'What things exist in our business?'
Effective schema design begins with understanding business needs. This understanding comes from structured conversations with business stakeholders, not from studying source system schemas.
Before examining any database, conduct interviews with business users. Focus on understanding:
1. Business Processes
2. Key Performance Indicators (KPIs)
3. Analysis Patterns
```markdown
# Dimensional Modeling Interview Guide

## Process Discovery Questions
- Walk me through [process name] from start to finish.
- What triggers this process? What ends it?
- What gets recorded when this happens?
- How often does this occur? (Per second? Daily? Monthly?)

## Measurement Questions
- What numbers matter most in this process?
- How do you know if [process] is performing well?
- What would you measure if you could measure anything?
- Show me your most important reports—what's on them?

## Analysis Pattern Questions
- When you analyze this data, what dimensions do you "slice by"?
- Do you compare time periods? (This week vs last? YoY?)
- What drill-downs do you perform? (Region → Country → City?)
- Show me an analysis you wish you could do but can't.

## Dimension Discovery Questions
- Who are the key actors in this process?
- When does this happen? Does time of day matter?
- Where does this occur? Does location matter?
- What products/services are involved?
- Are there promotions, campaigns, or special conditions?
- What categories or classifications are important?

## Red Flag Questions (to avoid scope creep)
- Is this analysis needed now, or is it a "nice to have"?
- How often would you actually run this report?
- Who else needs this data? (Validate priority)
```

Business requirements map directly to schema components:
| Business Requirement | Schema Component | Example |
|---|---|---|
| "We need to track sales" | Business process / fact table | sales_fact |
| "For each line item" | Grain | One row per line item |
| "By product category" | Dimension | product_dim with category attribute |
| "Total revenue" | Additive fact | sales_amount column |
| "Average price" | Non-additive (store components) | quantity and extended_amount separately |
| "Compare to last year" | Date dimension + query pattern | Prior year date keys, YoY queries |
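The last row of the mapping deserves a concrete illustration. A minimal sketch, using sqlite3 with made-up table and column names (`sales_fact`, `date_dim`, `calendar_year`): "compare to last year" becomes a join to the date dimension plus conditional aggregation, rather than date arithmetic buried in the fact table.

```python
import sqlite3

# Illustrative schema: a tiny fact table and date dimension (names are
# assumptions for this sketch, not a prescribed design).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE date_dim  (date_key INTEGER PRIMARY KEY, calendar_year INTEGER);
CREATE TABLE sales_fact (date_key INTEGER, sales_amount REAL);
INSERT INTO date_dim  VALUES (20230115, 2023), (20240115, 2024);
INSERT INTO sales_fact VALUES (20230115, 100.0), (20240115, 130.0);
""")

# YoY comparison via the date dimension: filter both years in one pass.
this_year, last_year = conn.execute("""
    SELECT SUM(CASE WHEN d.calendar_year = 2024 THEN f.sales_amount END),
           SUM(CASE WHEN d.calendar_year = 2023 THEN f.sales_amount END)
    FROM sales_fact f
    JOIN date_dim d ON f.date_key = d.date_key
""").fetchone()

yoy_growth_pct = (this_year - last_year) / last_year * 100
print(round(yoy_growth_pct, 1))  # 30.0
```

The date dimension keeps period logic (fiscal years, prior-year keys) in one place instead of re-deriving it in every query.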
The first design step is selecting which business process to model. This determines the scope of your star schema.
A business process is an operational activity that produces measurable events. Look for activities that:

- Generate numeric measurements (amounts, quantities, counts, durations)
- Occur at discrete, identifiable points in time
- Are already captured by an operational system

Examples of Business Processes: retail sales transactions, order fulfillment, insurance claims processing, website visits, call center interactions, inventory movements.
One effective technique is to trace your organization's value chain—the sequence of activities from raw materials (or customer acquisition) to delivered value. Each major step often represents a modelable business process: procurement, inventory, manufacturing, sales, distribution, service.
For enterprise-wide planning, create a business process matrix showing which processes share which dimensions. This reveals opportunities for conformed dimensions and helps sequence development.
| Dimension → | Date | Product | Customer | Store | Employee | Supplier |
|---|---|---|---|---|---|---|
| Point of Sale | ✓ | ✓ | ✓ | ✓ | ✓ | |
| Inventory | ✓ | ✓ | | ✓ | | ✓ |
| Purchasing | ✓ | ✓ | | | | ✓ |
| Returns | ✓ | ✓ | ✓ | ✓ | ✓ | |
| HR Payroll | ✓ | | | ✓ | ✓ | |
Processes sharing dimensions (Date, Product) are candidates for conformed dimension design.
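The matrix itself can be treated as data. A small sketch (process and dimension names are illustrative, mirroring a few rows of the matrix above): represent each process's dimensions as a set, then flag every dimension used by two or more processes as a conformed-dimension candidate.

```python
from collections import Counter

# A slice of the bus matrix as plain data (illustrative contents).
bus_matrix = {
    "Point of Sale": {"Date", "Product", "Customer", "Store", "Employee"},
    "Inventory":     {"Date", "Product", "Store", "Supplier"},
    "Purchasing":    {"Date", "Product", "Supplier"},
}

# Count how many processes use each dimension; any dimension shared by
# two or more processes should be built once and conformed.
usage = Counter(dim for dims in bus_matrix.values() for dim in dims)
conformed_candidates = sorted(d for d, n in usage.items() if n >= 2)
print(conformed_candidates)  # ['Date', 'Product', 'Store', 'Supplier']
```

Sequencing development around these shared dimensions is what lets later data marts drill across processes.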
The grain declaration is the single most important design decision. Everything else—dimensions, facts, table size, query possibilities—flows from this choice.
A grain statement should be:

- Specific: a reader can tell exactly what one row represents
- Atomic where possible: the lowest level of detail the process captures
- Stated in business terms, not table or column names

Well-formed grain statements:

- "One row per retail sales transaction line item"
- "One row per product per store per day"
- "One row per insurance claim line"

Poorly-formed grain statements:

- "Sales data" (not a row definition at all)
- "Customer information" (an entity, not a measurable event)
- "Transactions, plus daily summaries" (two grains in one table)
```sql
-- Different grain choices for the same business process

-- ATOMIC GRAIN: One row per line item
-- Maximum detail, maximum flexibility
CREATE TABLE sales_fact_line_item (
    sale_line_key  BIGINT PRIMARY KEY,
    transaction_id VARCHAR(20),  -- Degenerate dim
    line_number    INT,
    date_key       INT,
    product_key    INT,
    customer_key   INT,
    store_key      INT,
    quantity       INT,
    unit_price     DECIMAL(10,2),
    line_amount    DECIMAL(12,2)
);
-- Row count: ~10 billion (5 years of retail)
-- Size: ~1 TB

-- TRANSACTION GRAIN: One row per transaction (header level)
-- Less detail, cannot analyze individual products
CREATE TABLE sales_fact_transaction (
    transaction_key BIGINT PRIMARY KEY,
    transaction_id  VARCHAR(20),
    date_key        INT,
    customer_key    INT,
    store_key       INT,
    cashier_key     INT,
    total_items     INT,           -- Count of lines
    total_amount    DECIMAL(12,2)  -- Sum of all lines
);
-- Row count: ~2 billion (fewer rows)
-- Size: ~150 GB
-- LIMITATION: Cannot answer "What products sold together?"

-- DAILY SUMMARY GRAIN: One row per product per store per day
-- Aggregated, no individual transactions
CREATE TABLE sales_fact_daily (
    date_key           INT,
    product_key        INT,
    store_key          INT,
    total_transactions INT,
    total_quantity     INT,
    total_amount       DECIMAL(12,2),
    PRIMARY KEY (date_key, product_key, store_key)
);
-- Row count: ~50 million
-- Size: ~5 GB
-- LIMITATION: Cannot answer "What time of day?" or "Which customers?"
```

A single fact table must have one and only one grain. If some rows are line items and others are transaction headers, SUM() produces nonsense (double-counting or under-counting). If you need both grains, create two separate fact tables.
With the grain declared, identify the dimensions that provide context for analysis. Dimensions answer the "who, what, when, where, why" questions about each fact row.
For each candidate dimension, apply this test:
"For a single fact row at the declared grain, is there exactly one value for this dimension?"
Example: At the "line item" grain:

- Product: passes (exactly one product per line item)
- Customer: passes (one customer per transaction, hence per line)
- Cashier: passes (one cashier per transaction)
- A multi-valued attribute such as "applicable promotions": fails if a line can carry several; it needs a bridge table or a redefined dimension
1. Who Dimensions (customers, employees, suppliers, carriers)
2. What Dimensions (products, services, accounts)
3. When Dimensions (date, time of day)
4. Where Dimensions (stores, warehouses, geographies)
5. Why/How Dimensions (promotions, payment methods, channels)
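The grain test above can be automated against source data. A sketch with invented claim data (table and column names are assumptions): provider passes the "exactly one value per fact row" test, while diagnosis fails it, signaling the need for a bridge table.

```python
import sqlite3

# Illustrative source data: claim lines and a claim-level diagnosis list.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE claim_lines     (claim_no TEXT, line_no INTEGER, provider_id TEXT);
CREATE TABLE claim_diagnoses (claim_no TEXT, diagnosis_code TEXT);
INSERT INTO claim_lines     VALUES ('C1', 1, 'P9'), ('C1', 2, 'P9');
INSERT INTO claim_diagnoses VALUES ('C1', 'E11.9'), ('C1', 'I10');
""")

# Provider: at most one value per claim line -> plain FK in the fact table.
max_providers = conn.execute("""
    SELECT MAX(n) FROM (SELECT COUNT(DISTINCT provider_id) AS n
                        FROM claim_lines GROUP BY claim_no, line_no)
""").fetchone()[0]

# Diagnosis: many values per claim -> fails the test, needs a bridge.
max_diagnoses = conn.execute("""
    SELECT MAX(n) FROM (SELECT COUNT(DISTINCT diagnosis_code) AS n
                        FROM claim_diagnoses GROUP BY claim_no)
""").fetchone()[0]

print(max_providers, max_diagnoses)  # 1 2
```

Running this style of profiling query before committing to the design catches many-to-many relationships early, while they are still cheap to model.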
```sql
-- Systematic dimension identification example
-- Business Process: E-commerce Order Fulfillment
-- Grain: One row per order line item

-- Identified dimensions and their analysis purpose:

-- WHO dimensions
CREATE TABLE customer_dim (...);
-- Analysis: Sales by customer segment, geography, lifetime value

CREATE TABLE shipping_carrier_dim (...);
-- Analysis: Delivery performance by carrier

-- WHAT dimensions
CREATE TABLE product_dim (...);
-- Analysis: Sales by category, brand, seasonal patterns

-- WHEN dimensions
CREATE TABLE date_dim (...);
-- Analysis: Trends, seasonality, YoY comparisons

CREATE TABLE time_of_day_dim (...);
-- Analysis: Peak ordering hours, shift performance

-- WHERE dimensions
CREATE TABLE warehouse_dim (...);
-- Analysis: Fulfillment efficiency by warehouse

CREATE TABLE ship_to_geography_dim (...);
-- Analysis: Delivery times by region

-- WHY/HOW dimensions
CREATE TABLE promotion_dim (...);
-- Analysis: Promotion effectiveness, discount impact

CREATE TABLE payment_method_dim (...);
-- Analysis: Payment preferences, fraud patterns

CREATE TABLE order_source_dim (...);
-- Analysis: Web vs mobile vs call center, channel mix

-- Junk dimension for flags
CREATE TABLE order_flags_dim (...);
-- Contains: is_gift, is_expedited, is_international, etc.
```

When in doubt, include a dimension. A dimension not used in queries costs little (it's just a foreign key in the fact table). A missing dimension can prevent critical analysis. It's much easier to ignore an unneeded dimension than to add a missing one later.
The final step identifies the numeric measurements—the facts—that the business process produces. Facts must be consistent with the declared grain.
For each candidate measurement, apply this test:

"Is this measurement meaningful for a single row at the declared grain?"
At line-item grain:

- Quantity sold: passes (meaningful per line)
- Extended line amount: passes (meaningful per line)
- Transaction total: fails (belongs to the header grain)
- Store daily revenue: fails (wrong grain entirely, an aggregate)
1. Prefer Additive Facts
Additive facts (quantities, amounts) enable the widest range of analysis. They can be summed across any dimension.
2. Store Components, Not Ratios
Instead of storing profit_margin (20%), store profit_amount ($10) and revenue_amount ($50). The margin can be calculated; the components cannot be reverse-engineered.
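The arithmetic behind this rule is worth seeing once. A minimal sketch with fabricated numbers: averaging a stored margin column weights every row equally, while the correct margin weights rows by revenue; the two answers diverge badly.

```python
# Two sale lines with very different sizes (fabricated example data).
rows = [
    {"revenue": 50.0,  "profit": 10.0},   # 20% margin
    {"revenue": 500.0, "profit": 250.0},  # 50% margin
]

# WRONG: average of per-row margin ratios (what a stored ratio invites).
avg_of_margins = sum(r["profit"] / r["revenue"] for r in rows) / len(rows)

# RIGHT: ratio of summed components -- revenue-weighted, additive inputs.
true_margin = sum(r["profit"] for r in rows) / sum(r["revenue"] for r in rows)

print(round(avg_of_margins, 3), round(true_margin, 3))  # 0.35 0.473
```

Storing only `profit_amount` and `revenue_amount` makes the wrong calculation impossible to perform by accident with a plain SUM.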
3. Include All Potentially Useful Measurements
Better to have a column you don't use than to wish you had it later. Storage is cheap; missing data is expensive.
```sql
-- Systematic fact identification example
-- Business Process: Retail Sales
-- Grain: One row per line item

CREATE TABLE sales_fact (
    -- Keys (surrogate + degenerate)
    sale_line_key      BIGINT PRIMARY KEY,
    transaction_number VARCHAR(20),
    line_number        INT,

    -- Foreign keys to dimensions
    date_key      INT NOT NULL,
    time_key      INT NOT NULL,
    product_key   INT NOT NULL,
    customer_key  INT NOT NULL,
    store_key     INT NOT NULL,
    cashier_key   INT NOT NULL,
    promotion_key INT,

    -- ADDITIVE FACTS (can sum across all dimensions)
    quantity_sold            INT NOT NULL,           -- Units sold
    unit_list_price          DECIMAL(10,2),          -- Price before discount
    unit_discount_amount     DECIMAL(10,2),          -- Discount per unit
    extended_sales_amount    DECIMAL(12,2) NOT NULL, -- Revenue = qty × net price
    extended_cost_amount     DECIMAL(12,2) NOT NULL, -- COGS
    extended_profit_amount   DECIMAL(12,2) NOT NULL, -- Revenue - Cost
    extended_discount_amount DECIMAL(12,2)           -- Total discount given

    -- SEMI-ADDITIVE FACTS (if this were a snapshot)
    -- (Not applicable here—transaction grain, not snapshot)

    -- DERIVED FACTS (calculated at query time, not stored)
    -- profit_margin_pct → calculate as profit/revenue
    -- price_per_unit   → calculate as sales_amount/quantity

    -- FACTS TO AVOID (wrong grain or non-additive)
    -- transaction_total → Not line-item grain
    -- avg_item_price    → Non-additive, calculate from components
    -- yoy_growth_pct    → Derived across time, not a raw fact
);

-- Example fact calculation at query time
SELECT
    p.category_name,
    SUM(f.extended_sales_amount)  AS revenue,
    SUM(f.extended_profit_amount) AS profit,
    SUM(f.extended_profit_amount)
        / NULLIF(SUM(f.extended_sales_amount), 0) * 100
        AS profit_margin_pct,  -- Calculated, not stored
    SUM(f.extended_discount_amount)
        / NULLIF(SUM(f.extended_sales_amount + f.extended_discount_amount), 0) * 100
        AS discount_pct        -- Calculated from components
FROM sales_fact f
JOIN product_dim p ON f.product_key = p.product_key
GROUP BY p.category_name;
```

| Measurement | Additive? | Store As | Notes |
|---|---|---|---|
| Revenue | Yes | Fact column | Core additive fact |
| Quantity | Yes | Fact column | Core additive fact |
| Profit margin % | No | Calculate from profit/revenue | Never store ratios |
| Average price | No | Calculate from amount/quantity | Store components |
| Account balance | Semi (time) | Fact column | Don't sum across time |
| Duration (hours) | Yes | Fact column | Additive within grain |
| Weight (kg) | Yes | Fact column | Additive within grain |
| YoY growth % | No | Query-time calculation | Requires multi-period query |
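The "Account balance" row deserves a worked example. A sketch with invented balances: a semi-additive fact sums correctly across accounts but not across time, so a naive SUM over a two-month snapshot counts the same money twice; the usual fix is to take each account's period-end balance first.

```python
import sqlite3

# Illustrative monthly balance snapshot (names and values are made up).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE balance_fact (account TEXT, date_key INTEGER, balance REAL)"
)
conn.executemany("INSERT INTO balance_fact VALUES (?, ?, ?)", [
    ("A", 20240131, 100.0), ("A", 20240229, 120.0),
    ("B", 20240131, 200.0), ("B", 20240229, 180.0),
])

# WRONG: summing across both months double-counts the balances.
naive = conn.execute("SELECT SUM(balance) FROM balance_fact").fetchone()[0]

# RIGHT: latest snapshot per account, then sum across accounts.
correct = conn.execute("""
    SELECT SUM(balance) FROM balance_fact b
    WHERE date_key = (SELECT MAX(date_key) FROM balance_fact
                      WHERE account = b.account)
""").fetchone()[0]

print(naive, correct)  # 600.0 300.0
```

Averaging across time periods is also legitimate for semi-additive facts; only straight summation across the time dimension is forbidden.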
Let's apply the four-step process to a complete example: designing a star schema for healthcare insurance claims.
Selected Process: Insurance Claim Processing
This process begins when a claim is submitted (by a provider or member) and concludes when the claim is paid (or denied). It's a well-defined, measurable business activity with clear operational tracking.
Grain: "One row per claim line (a single service on a single claim)"
A healthcare claim typically contains multiple lines—each representing a distinct service (e.g., office visit, lab test, prescription). The line level is the atomic grain.
| Dimension | Grain Test | Analysis Purpose |
|---|---|---|
| Date (service) | ✓ One date per line | Trend analysis, seasonality |
| Date (payment) | ✓ One date per line | Cash flow, payment timing |
| Member | ✓ One member per claim | Member demographics, utilization |
| Provider | ✓ One provider per line | Provider performance, network |
| Procedure | ✓ One procedure per line | Service mix, coding patterns |
| Diagnosis | Bridge needed (many per claim) | Disease analysis, risk |
| Facility | ✓ One facility per line | Cost by location, quality |
| Plan | ✓ One plan per member | Benefit analysis |
| Claim Status | ✓ One status per line | Pipeline analysis |
```sql
-- Complete star schema: Healthcare Claims
-- Grain: One row per claim line (service)

CREATE TABLE claims_fact (
    -- Surrogate key
    claim_line_key BIGINT PRIMARY KEY,

    -- Degenerate dimensions
    claim_number      VARCHAR(20) NOT NULL,
    claim_line_number INT NOT NULL,

    -- Foreign keys to dimensions
    service_date_key    INT NOT NULL, -- When service occurred
    payment_date_key    INT,          -- When claim was paid (NULL if unpaid)
    submit_date_key     INT NOT NULL, -- When claim was submitted
    member_key          INT NOT NULL, -- Who received service
    provider_key        INT NOT NULL, -- Who provided service
    facility_key        INT,          -- Where service occurred
    procedure_key       INT NOT NULL, -- What service (CPT code)
    diagnosis_group_key INT NOT NULL, -- Bridge to diagnoses
    plan_key            INT NOT NULL, -- Insurance plan
    claim_status_key    INT NOT NULL, -- Current status (paid, denied, etc.)
    denial_reason_key   INT,          -- If denied, why

    -- Additive facts
    charge_amount           DECIMAL(12,2) NOT NULL, -- Provider's billed amount
    allowed_amount          DECIMAL(12,2),          -- Amount eligible for payment
    plan_paid_amount        DECIMAL(12,2),          -- What insurance paid
    member_liability_amount DECIMAL(12,2),          -- What member owes
    copay_amount            DECIMAL(12,2),          -- Fixed copay
    coinsurance_amount      DECIMAL(12,2),          -- % coinsurance
    deductible_amount       DECIMAL(12,2),          -- Applied to deductible
    service_units           DECIMAL(10,2),          -- Units of service

    -- Calculated fact supporting columns
    processing_days INT, -- Submit to payment days

    -- Flags (could be junk dimension)
    is_in_network       BIT,
    is_emergency        BIT,
    is_prior_authorized BIT
);

-- Diagnosis Bridge (many diagnoses per claim)
CREATE TABLE diagnosis_bridge (
    diagnosis_group_key INT NOT NULL,
    diagnosis_key       INT NOT NULL, -- FK to diagnosis_dim
    diagnosis_sequence  INT NOT NULL, -- Primary=1, Secondary=2, etc.
    is_principal        BIT NOT NULL,
    PRIMARY KEY (diagnosis_group_key, diagnosis_key)
);

-- Supporting dimensions (abbreviated)
CREATE TABLE member_dim (
    member_key      INT PRIMARY KEY,
    member_id       VARCHAR(20),
    member_name     VARCHAR(100),
    birth_date      DATE,
    age_band        VARCHAR(20), -- '0-17', '18-34', '35-54', '55-64', '65+'
    gender          VARCHAR(10),
    member_state    VARCHAR(2),
    member_region   VARCHAR(20),
    plan_type       VARCHAR(30),
    enrollment_date DATE,
    is_current      BIT
);

CREATE TABLE procedure_dim (
    procedure_key         INT PRIMARY KEY,
    procedure_code        VARCHAR(10),  -- CPT/HCPCS code
    procedure_description VARCHAR(200),
    procedure_category    VARCHAR(50),
    service_type          VARCHAR(30),  -- 'Inpatient', 'Outpatient', 'Professional'
    specialty_group       VARCHAR(50)
);
```
```sql
-- Query 1: Claims by member age and procedure category
SELECT
    m.age_band,
    p.procedure_category,
    COUNT(*) AS claim_count,
    SUM(f.charge_amount) AS total_charged,
    SUM(f.plan_paid_amount) AS total_paid,
    AVG(f.processing_days) AS avg_processing_days
FROM claims_fact f
JOIN member_dim m ON f.member_key = m.member_key
JOIN procedure_dim p ON f.procedure_key = p.procedure_key
JOIN date_dim d ON f.service_date_key = d.date_key
WHERE d.calendar_year = 2024
GROUP BY m.age_band, p.procedure_category
ORDER BY total_paid DESC;

-- Query 2: In-network vs out-of-network utilization
SELECT
    CASE WHEN f.is_in_network = 1 THEN 'In-Network'
         ELSE 'Out-of-Network' END AS network_status,
    COUNT(*) AS claim_count,
    SUM(f.charge_amount) AS total_charged,
    SUM(f.allowed_amount) AS total_allowed,
    SUM(f.plan_paid_amount) AS total_paid,
    SUM(f.charge_amount) - SUM(f.allowed_amount) AS amount_not_covered
FROM claims_fact f
JOIN date_dim d ON f.service_date_key = d.date_key
WHERE d.calendar_year = 2024
GROUP BY f.is_in_network;

-- Query 3: Denial rate by procedure category
SELECT
    p.procedure_category,
    COUNT(*) AS total_claims,
    SUM(CASE WHEN cs.claim_status = 'Denied' THEN 1 ELSE 0 END) AS denied_claims,
    SUM(CASE WHEN cs.claim_status = 'Denied' THEN 1 ELSE 0 END) * 100.0
        / COUNT(*) AS denial_rate_pct
FROM claims_fact f
JOIN procedure_dim p ON f.procedure_key = p.procedure_key
JOIN claim_status_dim cs ON f.claim_status_key = cs.claim_status_key
GROUP BY p.procedure_category
ORDER BY denial_rate_pct DESC;
```

What's Next:
With design methodology mastered, we conclude this module by exploring the benefits of star schema architecture—why this approach dominates analytical database design and what advantages it provides over alternatives.
You now understand the systematic process for designing star schemas. This methodology—process selection, grain declaration, dimension identification, and fact selection—provides a repeatable approach for any business domain.