Designing a star schema is not guesswork. It follows a rigorous methodology that begins with understanding business processes and concludes with validated, documented designs ready for implementation. The gap between a novice designer and an expert is not creativity—it's the systematic application of proven techniques.
This page teaches you the dimensional modeling design process—a methodology refined over decades of enterprise data warehouse implementations. You'll learn to extract requirements from business stakeholders, identify the right facts and dimensions, make sound design decisions, and produce schemas that remain useful for years.
The design decisions you make at this stage cascade through everything that follows: ETL development, query performance, user adoption, and maintainability. Time invested in careful design pays compound returns.
By the end of this page, you will understand the dimensional modeling design process, techniques for gathering business requirements, the four-step design process (process → grain → dimensions → facts), common design patterns and anti-patterns, and how to document and validate designs before implementation.
The dimensional modeling design process follows four fundamental steps, each building on the previous. Skipping steps or reversing the order leads to flawed designs.
Step 1: Select the Business Process
Identify which business process the data mart will model. Examples: retail sales, order fulfillment, claims processing, website visits, call center interactions.
Step 2: Declare the Grain
Specify exactly what a single row in the fact table represents. This is the most critical decision in the entire design.
Step 3: Identify the Dimensions
Determine which dimensions provide context for the facts. Ask: "How do business users want to analyze this process?"
Step 4: Identify the Facts
Select the numeric measurements that the business process produces. These must be consistent with the declared grain.
| Step | Question Answered | Outcome | Common Mistakes |
|---|---|---|---|
| 1. Select the business process | What process are we modeling? | Process selection | Modeling entities instead of processes |
| 2. Declare the grain | What does one row represent? | Grain declaration statement | Mixing grains, being vague |
| 3. Identify the dimensions | How will users analyze? | List of dimension tables | Missing critical dimensions, too few attributes |
| 4. Identify the facts | What are we measuring? | List of fact measures | Including non-additive ratios, wrong grain facts |
A common mistake is modeling entities (customers, products, employees) rather than processes (sales, orders, hires). Entities become dimensions. Processes become fact tables. Always ask: 'What business activity are we measuring?' not 'What things exist in our business?'
Effective schema design begins with understanding business needs. This understanding comes from structured conversations with business stakeholders, not from studying source system schemas.
Before examining any database, conduct interviews with business users. Focus on understanding:
1. Business Processes
2. Key Performance Indicators (KPIs)
3. Analysis Patterns
```markdown
# Dimensional Modeling Interview Guide

## Process Discovery Questions
- Walk me through [process name] from start to finish.
- What triggers this process? What ends it?
- What gets recorded when this happens?
- How often does this occur? (Per second? Daily? Monthly?)

## Measurement Questions
- What numbers matter most in this process?
- How do you know if [process] is performing well?
- What would you measure if you could measure anything?
- Show me your most important reports—what's on them?

## Analysis Pattern Questions
- When you analyze this data, what dimensions do you "slice by"?
- Do you compare time periods? (This week vs last? YoY?)
- What drill-downs do you perform? (Region → Country → City?)
- Show me an analysis you wish you could do but can't.

## Dimension Discovery Questions
- Who are the key actors in this process?
- When does this happen? Does time of day matter?
- Where does this occur? Does location matter?
- What products/services are involved?
- Are there promotions, campaigns, or special conditions?
- What categories or classifications are important?

## Red Flag Questions (to avoid scope creep)
- Is this analysis needed now, or is it a "nice to have"?
- How often would you actually run this report?
- Who else needs this data? (Validate priority)
```

Business requirements map directly to schema components:
| Business Requirement | Schema Component | Example |
|---|---|---|
| "We need to track sales" | Business process / fact table | sales_fact |
| "For each line item" | Grain | One row per line item |
| "By product category" | Dimension | product_dim with category attribute |
| "Total revenue" | Additive fact | sales_amount column |
| "Average price" | Non-additive (store components) | quantity and extended_amount separately |
| "Compare to last year" | Date dimension + query pattern | Prior year date keys, YoY queries |
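The last row of the mapping deserves a concrete illustration. A minimal sketch, using sqlite3 with made-up table and column names (`sales_fact`, `date_dim`, `calendar_year`): "compare to last year" becomes a join to the date dimension plus conditional aggregation, rather than date arithmetic buried in the fact table.

```python
import sqlite3

# Illustrative schema: a tiny fact table and date dimension (names are
# assumptions for this sketch, not a prescribed design).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE date_dim  (date_key INTEGER PRIMARY KEY, calendar_year INTEGER);
CREATE TABLE sales_fact (date_key INTEGER, sales_amount REAL);
INSERT INTO date_dim  VALUES (20230115, 2023), (20240115, 2024);
INSERT INTO sales_fact VALUES (20230115, 100.0), (20240115, 130.0);
""")

# YoY comparison via the date dimension: filter both years in one pass.
this_year, last_year = conn.execute("""
    SELECT SUM(CASE WHEN d.calendar_year = 2024 THEN f.sales_amount END),
           SUM(CASE WHEN d.calendar_year = 2023 THEN f.sales_amount END)
    FROM sales_fact f
    JOIN date_dim d ON f.date_key = d.date_key
""").fetchone()

yoy_growth_pct = (this_year - last_year) / last_year * 100
print(round(yoy_growth_pct, 1))  # 30.0
```

The date dimension keeps period logic (fiscal years, prior-year keys) in one place instead of re-deriving it in every query.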
The first design step is selecting which business process to model. This determines the scope of your star schema.
A business process is an operational activity that produces measurable events. Look for activities that:

- Generate numeric measurements (amounts, quantities, counts, durations)
- Occur at discrete, identifiable points in time
- Are already captured by an operational system

Examples of Business Processes: retail sales transactions, order fulfillment, insurance claims processing, website visits, call center interactions, inventory movements.
One effective technique is to trace your organization's value chain—the sequence of activities from raw materials (or customer acquisition) to delivered value. Each major step often represents a modelable business process: procurement, inventory, manufacturing, sales, distribution, service.
For enterprise-wide planning, create a business process matrix showing which processes share which dimensions. This reveals opportunities for conformed dimensions and helps sequence development.
| Dimension → | Date | Product | Customer | Store | Employee | Supplier |
|---|---|---|---|---|---|---|
| Point of Sale | ✓ | ✓ | ✓ | ✓ | ✓ | |
| Inventory | ✓ | ✓ | | ✓ | | ✓ |
| Purchasing | ✓ | ✓ | | | | ✓ |
| Returns | ✓ | ✓ | ✓ | ✓ | ✓ | |
| HR Payroll | ✓ | | | ✓ | ✓ | |
Processes sharing dimensions (Date, Product) are candidates for conformed dimension design.
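The matrix itself can be treated as data. A small sketch (process and dimension names are illustrative, mirroring a few rows of the matrix above): represent each process's dimensions as a set, then flag every dimension used by two or more processes as a conformed-dimension candidate.

```python
from collections import Counter

# A slice of the bus matrix as plain data (illustrative contents).
bus_matrix = {
    "Point of Sale": {"Date", "Product", "Customer", "Store", "Employee"},
    "Inventory":     {"Date", "Product", "Store", "Supplier"},
    "Purchasing":    {"Date", "Product", "Supplier"},
}

# Count how many processes use each dimension; any dimension shared by
# two or more processes should be built once and conformed.
usage = Counter(dim for dims in bus_matrix.values() for dim in dims)
conformed_candidates = sorted(d for d, n in usage.items() if n >= 2)
print(conformed_candidates)  # ['Date', 'Product', 'Store', 'Supplier']
```

Sequencing development around these shared dimensions is what lets later data marts drill across processes.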
The grain declaration is the single most important design decision. Everything else—dimensions, facts, table size, query possibilities—flows from this choice.
A grain statement should be:

- Specific: a reader can tell exactly what one row represents
- Atomic where possible: the lowest level of detail the process captures
- Stated in business terms, not table or column names

Well-formed grain statements:

- "One row per retail sales transaction line item"
- "One row per product per store per day"
- "One row per insurance claim line"

Poorly-formed grain statements:

- "Sales data" (not a row definition at all)
- "Customer information" (an entity, not a measurable event)
- "Transactions, plus daily summaries" (two grains in one table)
```sql
-- Different grain choices for the same business process

-- ATOMIC GRAIN: One row per line item
-- Maximum detail, maximum flexibility
CREATE TABLE sales_fact_line_item (
    sale_line_key  BIGINT PRIMARY KEY,
    transaction_id VARCHAR(20),  -- Degenerate dim
    line_number    INT,
    date_key       INT,
    product_key    INT,
    customer_key   INT,
    store_key      INT,
    quantity       INT,
    unit_price     DECIMAL(10,2),
    line_amount    DECIMAL(12,2)
);
-- Row count: ~10 billion (5 years of retail)
-- Size: ~1 TB

-- TRANSACTION GRAIN: One row per transaction (header level)
-- Less detail, cannot analyze individual products
CREATE TABLE sales_fact_transaction (
    transaction_key BIGINT PRIMARY KEY,
    transaction_id  VARCHAR(20),
    date_key        INT,
    customer_key    INT,
    store_key       INT,
    cashier_key     INT,
    total_items     INT,           -- Count of lines
    total_amount    DECIMAL(12,2)  -- Sum of all lines
);
-- Row count: ~2 billion (fewer rows)
-- Size: ~150 GB
-- LIMITATION: Cannot answer "What products sold together?"

-- DAILY SUMMARY GRAIN: One row per product per store per day
-- Aggregated, no individual transactions
CREATE TABLE sales_fact_daily (
    date_key           INT,
    product_key        INT,
    store_key          INT,
    total_transactions INT,
    total_quantity     INT,
    total_amount       DECIMAL(12,2),
    PRIMARY KEY (date_key, product_key, store_key)
);
-- Row count: ~50 million
-- Size: ~5 GB
-- LIMITATION: Cannot answer "What time of day?" or "Which customers?"
```

A single fact table must have one and only one grain. If some rows are line items and others are transaction headers, SUM() produces nonsense (double-counting or under-counting). If you need both grains, create two separate fact tables.
With the grain declared, identify the dimensions that provide context for analysis. Dimensions answer the "who, what, when, where, why" questions about each fact row.
For each candidate dimension, apply this test:
"For a single fact row at the declared grain, is there exactly one value for this dimension?"
Example: At the "line item" grain:

- Product: passes (exactly one product per line item)
- Customer: passes (one customer per transaction, hence per line)
- Cashier: passes (one cashier per transaction)
- A multi-valued attribute such as "applicable promotions": fails if a line can carry several; it needs a bridge table or a redefined dimension
1. Who Dimensions (customers, employees, suppliers, carriers)
2. What Dimensions (products, services, accounts)
3. When Dimensions (date, time of day)
4. Where Dimensions (stores, warehouses, geographies)
5. Why/How Dimensions (promotions, payment methods, channels)
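The grain test above can be automated against source data. A sketch with invented claim data (table and column names are assumptions): provider passes the "exactly one value per fact row" test, while diagnosis fails it, signaling the need for a bridge table.

```python
import sqlite3

# Illustrative source data: claim lines and a claim-level diagnosis list.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE claim_lines     (claim_no TEXT, line_no INTEGER, provider_id TEXT);
CREATE TABLE claim_diagnoses (claim_no TEXT, diagnosis_code TEXT);
INSERT INTO claim_lines     VALUES ('C1', 1, 'P9'), ('C1', 2, 'P9');
INSERT INTO claim_diagnoses VALUES ('C1', 'E11.9'), ('C1', 'I10');
""")

# Provider: at most one value per claim line -> plain FK in the fact table.
max_providers = conn.execute("""
    SELECT MAX(n) FROM (SELECT COUNT(DISTINCT provider_id) AS n
                        FROM claim_lines GROUP BY claim_no, line_no)
""").fetchone()[0]

# Diagnosis: many values per claim -> fails the test, needs a bridge.
max_diagnoses = conn.execute("""
    SELECT MAX(n) FROM (SELECT COUNT(DISTINCT diagnosis_code) AS n
                        FROM claim_diagnoses GROUP BY claim_no)
""").fetchone()[0]

print(max_providers, max_diagnoses)  # 1 2
```

Running this style of profiling query before committing to the design catches many-to-many relationships early, while they are still cheap to model.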
```sql
-- Systematic dimension identification example
-- Business Process: E-commerce Order Fulfillment
-- Grain: One row per order line item

-- Identified dimensions and their analysis purpose:

-- WHO dimensions
CREATE TABLE customer_dim (...);
-- Analysis: Sales by customer segment, geography, lifetime value

CREATE TABLE shipping_carrier_dim (...);
-- Analysis: Delivery performance by carrier

-- WHAT dimensions
CREATE TABLE product_dim (...);
-- Analysis: Sales by category, brand, seasonal patterns

-- WHEN dimensions
CREATE TABLE date_dim (...);
-- Analysis: Trends, seasonality, YoY comparisons

CREATE TABLE time_of_day_dim (...);
-- Analysis: Peak ordering hours, shift performance

-- WHERE dimensions
CREATE TABLE warehouse_dim (...);
-- Analysis: Fulfillment efficiency by warehouse

CREATE TABLE ship_to_geography_dim (...);
-- Analysis: Delivery times by region

-- WHY/HOW dimensions
CREATE TABLE promotion_dim (...);
-- Analysis: Promotion effectiveness, discount impact

CREATE TABLE payment_method_dim (...);
-- Analysis: Payment preferences, fraud patterns

CREATE TABLE order_source_dim (...);
-- Analysis: Web vs mobile vs call center, channel mix

-- Junk dimension for flags
CREATE TABLE order_flags_dim (...);
-- Contains: is_gift, is_expedited, is_international, etc.
```

When in doubt, include a dimension. A dimension not used in queries costs little (it's just a foreign key in the fact table). A missing dimension can prevent critical analysis. It's much easier to ignore an unneeded dimension than to add a missing one later.
The final step identifies the numeric measurements—the facts—that the business process produces. Facts must be consistent with the declared grain.
For each candidate measurement, apply this test:

"Is this measurement meaningful for a single row at the declared grain?"
At line-item grain:

- Quantity sold: passes (meaningful per line)
- Extended line amount: passes (meaningful per line)
- Transaction total: fails (belongs to the header grain)
- Store daily revenue: fails (wrong grain entirely, an aggregate)
1. Prefer Additive Facts
Additive facts (quantities, amounts) enable the widest range of analysis. They can be summed across any dimension.
2. Store Components, Not Ratios
Instead of storing profit_margin (20%), store profit_amount ($10) and revenue_amount ($50). The margin can be calculated; the components cannot be reverse-engineered.
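The arithmetic behind this rule is worth seeing once. A minimal sketch with fabricated numbers: averaging a stored margin column weights every row equally, while the correct margin weights rows by revenue; the two answers diverge badly.

```python
# Two sale lines with very different sizes (fabricated example data).
rows = [
    {"revenue": 50.0,  "profit": 10.0},   # 20% margin
    {"revenue": 500.0, "profit": 250.0},  # 50% margin
]

# WRONG: average of per-row margin ratios (what a stored ratio invites).
avg_of_margins = sum(r["profit"] / r["revenue"] for r in rows) / len(rows)

# RIGHT: ratio of summed components -- revenue-weighted, additive inputs.
true_margin = sum(r["profit"] for r in rows) / sum(r["revenue"] for r in rows)

print(round(avg_of_margins, 3), round(true_margin, 3))  # 0.35 0.473
```

Storing only `profit_amount` and `revenue_amount` makes the wrong calculation impossible to perform by accident with a plain SUM.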
3. Include All Potentially Useful Measurements
Better to have a column you don't use than to wish you had it later. Storage is cheap; missing data is expensive.
```sql
-- Systematic fact identification example
-- Business Process: Retail Sales
-- Grain: One row per line item

CREATE TABLE sales_fact (
    -- Keys (surrogate + degenerate)
    sale_line_key      BIGINT PRIMARY KEY,
    transaction_number VARCHAR(20),
    line_number        INT,

    -- Foreign keys to dimensions
    date_key      INT NOT NULL,
    time_key      INT NOT NULL,
    product_key   INT NOT NULL,
    customer_key  INT NOT NULL,
    store_key     INT NOT NULL,
    cashier_key   INT NOT NULL,
    promotion_key INT,

    -- ADDITIVE FACTS (can sum across all dimensions)
    quantity_sold            INT NOT NULL,           -- Units sold
    unit_list_price          DECIMAL(10,2),          -- Price before discount
    unit_discount_amount     DECIMAL(10,2),          -- Discount per unit
    extended_sales_amount    DECIMAL(12,2) NOT NULL, -- Revenue = qty × net price
    extended_cost_amount     DECIMAL(12,2) NOT NULL, -- COGS
    extended_profit_amount   DECIMAL(12,2) NOT NULL, -- Revenue - Cost
    extended_discount_amount DECIMAL(12,2)           -- Total discount given

    -- SEMI-ADDITIVE FACTS (if this were a snapshot)
    -- (Not applicable here—transaction grain, not snapshot)

    -- DERIVED FACTS (calculated at query time, not stored)
    -- profit_margin_pct → calculate as profit/revenue
    -- price_per_unit   → calculate as sales_amount/quantity

    -- FACTS TO AVOID (wrong grain or non-additive)
    -- transaction_total → Not line-item grain
    -- avg_item_price    → Non-additive, calculate from components
    -- yoy_growth_pct    → Derived across time, not a raw fact
);

-- Example fact calculation at query time
SELECT
    p.category_name,
    SUM(f.extended_sales_amount)  AS revenue,
    SUM(f.extended_profit_amount) AS profit,
    SUM(f.extended_profit_amount)
        / NULLIF(SUM(f.extended_sales_amount), 0) * 100
        AS profit_margin_pct,  -- Calculated, not stored
    SUM(f.extended_discount_amount)
        / NULLIF(SUM(f.extended_sales_amount + f.extended_discount_amount), 0) * 100
        AS discount_pct        -- Calculated from components
FROM sales_fact f
JOIN product_dim p ON f.product_key = p.product_key
GROUP BY p.category_name;
```

| Measurement | Additive? | Store As | Notes |
|---|---|---|---|
| Revenue | Yes | Fact column | Core additive fact |
| Quantity | Yes | Fact column | Core additive fact |
| Profit margin % | No | Calculate from profit/revenue | Never store ratios |
| Average price | No | Calculate from amount/quantity | Store components |
| Account balance | Semi (time) | Fact column | Don't sum across time |
| Duration (hours) | Yes | Fact column | Additive within grain |
| Weight (kg) | Yes | Fact column | Additive within grain |
| YoY growth % | No | Query-time calculation | Requires multi-period query |
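The "Account balance" row deserves a worked example. A sketch with invented balances: a semi-additive fact sums correctly across accounts but not across time, so a naive SUM over a two-month snapshot counts the same money twice; the usual fix is to take each account's period-end balance first.

```python
import sqlite3

# Illustrative monthly balance snapshot (names and values are made up).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE balance_fact (account TEXT, date_key INTEGER, balance REAL)"
)
conn.executemany("INSERT INTO balance_fact VALUES (?, ?, ?)", [
    ("A", 20240131, 100.0), ("A", 20240229, 120.0),
    ("B", 20240131, 200.0), ("B", 20240229, 180.0),
])

# WRONG: summing across both months double-counts the balances.
naive = conn.execute("SELECT SUM(balance) FROM balance_fact").fetchone()[0]

# RIGHT: latest snapshot per account, then sum across accounts.
correct = conn.execute("""
    SELECT SUM(balance) FROM balance_fact b
    WHERE date_key = (SELECT MAX(date_key) FROM balance_fact
                      WHERE account = b.account)
""").fetchone()[0]

print(naive, correct)  # 600.0 300.0
```

Averaging across time periods is also legitimate for semi-additive facts; only straight summation across the time dimension is forbidden.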
Let's apply the four-step process to a complete example: designing a star schema for healthcare insurance claims.
Selected Process: Insurance Claim Processing
This process begins when a claim is submitted (by a provider or member) and concludes when the claim is paid (or denied). It's a well-defined, measurable business activity with clear operational tracking.
Grain: "One row per claim line (a single service on a single claim)"
A healthcare claim typically contains multiple lines—each representing a distinct service (e.g., office visit, lab test, prescription). The line level is the atomic grain.
| Dimension | Grain Test | Analysis Purpose |
|---|---|---|
| Date (service) | ✓ One date per line | Trend analysis, seasonality |
| Date (payment) | ✓ One date per line | Cash flow, payment timing |
| Member | ✓ One member per claim | Member demographics, utilization |
| Provider | ✓ One provider per line | Provider performance, network |
| Procedure | ✓ One procedure per line | Service mix, coding patterns |
| Diagnosis | Bridge needed (many per claim) | Disease analysis, risk |
| Facility | ✓ One facility per line | Cost by location, quality |
| Plan | ✓ One plan per member | Benefit analysis |
| Claim Status | ✓ One status per line | Pipeline analysis |
```sql
-- Complete star schema: Healthcare Claims
-- Grain: One row per claim line (service)

CREATE TABLE claims_fact (
    -- Surrogate key
    claim_line_key BIGINT PRIMARY KEY,

    -- Degenerate dimensions
    claim_number      VARCHAR(20) NOT NULL,
    claim_line_number INT NOT NULL,

    -- Foreign keys to dimensions
    service_date_key    INT NOT NULL, -- When service occurred
    payment_date_key    INT,          -- When claim was paid (NULL if unpaid)
    submit_date_key     INT NOT NULL, -- When claim was submitted
    member_key          INT NOT NULL, -- Who received service
    provider_key        INT NOT NULL, -- Who provided service
    facility_key        INT,          -- Where service occurred
    procedure_key       INT NOT NULL, -- What service (CPT code)
    diagnosis_group_key INT NOT NULL, -- Bridge to diagnoses
    plan_key            INT NOT NULL, -- Insurance plan
    claim_status_key    INT NOT NULL, -- Current status (paid, denied, etc.)
    denial_reason_key   INT,          -- If denied, why

    -- Additive facts
    charge_amount           DECIMAL(12,2) NOT NULL, -- Provider's billed amount
    allowed_amount          DECIMAL(12,2),          -- Amount eligible for payment
    plan_paid_amount        DECIMAL(12,2),          -- What insurance paid
    member_liability_amount DECIMAL(12,2),          -- What member owes
    copay_amount            DECIMAL(12,2),          -- Fixed copay
    coinsurance_amount      DECIMAL(12,2),          -- % coinsurance
    deductible_amount       DECIMAL(12,2),          -- Applied to deductible
    service_units           DECIMAL(10,2),          -- Units of service

    -- Calculated fact supporting columns
    processing_days INT, -- Submit to payment days

    -- Flags (could be junk dimension)
    is_in_network       BIT,
    is_emergency        BIT,
    is_prior_authorized BIT
);

-- Diagnosis Bridge (many diagnoses per claim)
CREATE TABLE diagnosis_bridge (
    diagnosis_group_key INT NOT NULL,
    diagnosis_key       INT NOT NULL, -- FK to diagnosis_dim
    diagnosis_sequence  INT NOT NULL, -- Primary=1, Secondary=2, etc.
    is_principal        BIT NOT NULL,
    PRIMARY KEY (diagnosis_group_key, diagnosis_key)
);

-- Supporting dimensions (abbreviated)
CREATE TABLE member_dim (
    member_key      INT PRIMARY KEY,
    member_id       VARCHAR(20),
    member_name     VARCHAR(100),
    birth_date      DATE,
    age_band        VARCHAR(20), -- '0-17', '18-34', '35-54', '55-64', '65+'
    gender          VARCHAR(10),
    member_state    VARCHAR(2),
    member_region   VARCHAR(20),
    plan_type       VARCHAR(30),
    enrollment_date DATE,
    is_current      BIT
);

CREATE TABLE procedure_dim (
    procedure_key         INT PRIMARY KEY,
    procedure_code        VARCHAR(10),  -- CPT/HCPCS code
    procedure_description VARCHAR(200),
    procedure_category    VARCHAR(50),
    service_type          VARCHAR(30),  -- 'Inpatient', 'Outpatient', 'Professional'
    specialty_group       VARCHAR(50)
);
```
```sql
-- Query 1: Claims by member age and procedure category
SELECT
    m.age_band,
    p.procedure_category,
    COUNT(*) AS claim_count,
    SUM(f.charge_amount) AS total_charged,
    SUM(f.plan_paid_amount) AS total_paid,
    AVG(f.processing_days) AS avg_processing_days
FROM claims_fact f
JOIN member_dim m ON f.member_key = m.member_key
JOIN procedure_dim p ON f.procedure_key = p.procedure_key
JOIN date_dim d ON f.service_date_key = d.date_key
WHERE d.calendar_year = 2024
GROUP BY m.age_band, p.procedure_category
ORDER BY total_paid DESC;

-- Query 2: In-network vs out-of-network utilization
SELECT
    CASE WHEN f.is_in_network = 1 THEN 'In-Network'
         ELSE 'Out-of-Network' END AS network_status,
    COUNT(*) AS claim_count,
    SUM(f.charge_amount) AS total_charged,
    SUM(f.allowed_amount) AS total_allowed,
    SUM(f.plan_paid_amount) AS total_paid,
    SUM(f.charge_amount) - SUM(f.allowed_amount) AS amount_not_covered
FROM claims_fact f
JOIN date_dim d ON f.service_date_key = d.date_key
WHERE d.calendar_year = 2024
GROUP BY f.is_in_network;

-- Query 3: Denial rate by procedure category
SELECT
    p.procedure_category,
    COUNT(*) AS total_claims,
    SUM(CASE WHEN cs.claim_status = 'Denied' THEN 1 ELSE 0 END) AS denied_claims,
    SUM(CASE WHEN cs.claim_status = 'Denied' THEN 1 ELSE 0 END) * 100.0
        / COUNT(*) AS denial_rate_pct
FROM claims_fact f
JOIN procedure_dim p ON f.procedure_key = p.procedure_key
JOIN claim_status_dim cs ON f.claim_status_key = cs.claim_status_key
GROUP BY p.procedure_category
ORDER BY denial_rate_pct DESC;
```

What's Next:
With design methodology mastered, we conclude this module by exploring the benefits of star schema architecture—why this approach dominates analytical database design and what advantages it provides over alternatives.
You now understand the systematic process for designing star schemas. This methodology—process selection, grain declaration, dimension identification, and fact selection—provides a repeatable approach for any business domain.