Database Management SystemsStar Schema

Star Schema: The Foundation of Dimensional Modeling

LevelIntermediate

Duration75 mins

TopicStar Schema

1 / 5

Fact Table: The Heart of the Star Schema

The Central Table That Tells Your Business Story

Every business generates measurements—transactions, events, metrics, quantities. Somewhere in your organization right now, orders are being placed, calls are being logged, website pages are being viewed, and products are being shipped. Each of these events represents a fact—a quantifiable occurrence in your business that, when aggregated and analyzed, tells the story of how your organization operates.

The fact table stands at the center of the star schema, serving as the repository for these business measurements. It is the single most important table in any data warehouse, representing the primary subject of analysis around which all other structures orbit.

Understanding fact tables is not merely an academic exercise—it is the gateway to designing analytical systems that can answer the questions executives ask: How much did we sell last quarter? Which products are trending? Where are we losing customers? The quality of your fact table design directly determines whether these questions can be answered in seconds or never at all.

What You Will Learn

By the end of this page, you will understand the anatomy of fact tables, distinguish between different fact types (transactional, periodic, accumulating), master granularity decisions that make or break warehouse designs, and apply expert-level patterns for designing fact tables that scale to billions of rows while remaining query-performant.

Understanding Fact Tables

A fact table is a database table that stores measurements, metrics, or facts about a business process. Unlike operational tables that capture complete entity states, fact tables are specifically designed to record what happened—the events, transactions, or periodic snapshots that constitute your business activity.

The Fundamental Characteristics

Fact tables possess several distinguishing characteristics that separate them from other database tables:

1. Numeric Measurements (Facts)

The primary content of a fact table consists of numeric, quantitative values called facts or measures. These are the numbers that analysts aggregate, summarize, and compare. Examples include:

Sales amount (revenue dollars)
Quantity sold (unit count)
Duration (time in seconds)
Weight (kilograms shipped)
Count (number of clicks, views, events)

2. Foreign Keys to Dimensions

Fact tables contain foreign keys that reference dimension tables, establishing the context for each measurement. A sales fact row might contain keys pointing to:

Which customer made the purchase
Which product was purchased
When the purchase occurred
Where the purchase happened
Which employee processed the sale

3. Narrow and Deep Structure

Fact tables are typically narrow (relatively few columns—mostly foreign keys and a handful of fact columns) but deep (containing millions or billions of rows). This contrasts sharply with dimension tables, which are wide and shallow.

Anatomy of a Sales Fact Table Row
Column	Type	Example Value	Purpose
sale_key	Surrogate Key	7492031	Unique identifier for this fact row
date_key	Foreign Key	20240115	Links to date dimension
product_key	Foreign Key	4521	Links to product dimension
customer_key	Foreign Key	89234	Links to customer dimension
store_key	Foreign Key	127	Links to store dimension
promotion_key	Foreign Key	18	Links to promotion dimension
quantity_sold	Fact (Additive)	3	Units purchased in this transaction
unit_price	Fact (Non-Additive)	29.99	Price per unit at time of sale
sales_amount	Fact (Additive)	89.97	Total revenue = quantity × price
cost_amount	Fact (Additive)	54.00	Total cost of goods sold
profit_amount	Fact (Derived)	35.97	Revenue minus cost

The Grain Is Everything

Notice that each row represents a single line item in a single transaction. This is the grain of the fact table—the level of detail at which facts are recorded. The grain decision is the most important design choice you will make, as we explore in detail later in this page.

Types of Facts: Understanding Additivity

Not all numeric measures behave the same way when aggregated. Understanding additivity—which facts can be meaningfully summed across which dimensions—is crucial for designing useful analytical systems.

Additive Facts

Additive facts can be summed across all dimensions without producing meaningless results. These are the most useful facts because they support the full range of analytical operations.

Examples of Additive Facts:

Sales Revenue: You can sum revenue across time (total for the year), across products (total for all products), across customers (total for all customers), or any combination.
Units Sold: Quantities are additive—100 units sold yesterday plus 150 today equals 250 total.
Cost of Goods Sold: Costs aggregate meaningfully across all dimensions.

The mathematical property that makes these facts additive is that the same unit of measure applies regardless of the dimension. A dollar is a dollar whether we're counting across days or across products.

Design for Additivity

When designing fact tables, prioritize additive facts wherever possible. They provide the maximum analytical flexibility. If you have a choice between storing a ratio or storing the numerator and denominator separately, store both components—the ratio can always be calculated, but component values cannot be recovered from stored ratios.

Semi-Additive Facts

Semi-additive facts can be summed across some dimensions but not others—typically not across time. These usually represent balance or level measurements at a point in time.

Examples of Semi-Additive Facts:

Account Balance: You can sum balances across all accounts at a single point in time to get total assets, but summing Monday's balance plus Tuesday's balance is meaningless.
Inventory Quantity: Current stock levels across all warehouses sum to total inventory, but summing today's level plus yesterday's level produces nonsense.
Headcount: Employee count across all departments makes sense at a single date, but not summed across dates.

For semi-additive facts, typical analytical operations use AVG, MIN, MAX, or last value across the time dimension rather than SUM.

Aggregation Behavior by Fact Type
Fact Type	Sum Across Products	Sum Across Customers	Sum Across Time	Typical Operations
Sales Revenue (Additive)	✓ Valid	✓ Valid	✓ Valid	SUM, AVG, COUNT
Account Balance (Semi-Additive)	✓ Valid	✓ Valid	✗ Invalid	AVG, MIN, MAX, LAST
Unit Price (Non-Additive)	✗ Invalid	✗ Invalid	✗ Invalid	AVG, MIN, MAX, weighted calculations
Percentage (Non-Additive)	✗ Invalid	✗ Invalid	✗ Invalid	Recalculate from components

Non-Additive Facts

Non-Additive facts cannot be meaningfully summed across any dimension. These are typically ratios, percentages, averages, or other derived calculations.

Examples of Non-Additive Facts:

Unit Price: Summing prices across products or time produces meaningless numbers. Average price requires a weighted calculation.
Profit Margin Percentage: A 10% margin plus a 20% margin does not equal a 30% margin.
Temperature / Rate / Ratio: Any measurement where the unit of measure changes when aggregated.

The Solution: Store Components, Not Ratios

The expert approach to handling non-additive facts is to store the components rather than the calculated ratio. Instead of storing profit_margin_percent, store profit_amount and revenue_amount. The percentage can be calculated at query time:

SELECT 
    product_name,
    SUM(profit_amount) / SUM(revenue_amount) * 100 AS profit_margin_percent
FROM sales_fact f
JOIN product_dim p ON f.product_key = p.product_key
GROUP BY product_name;

This approach enables correct weighted averages across any dimension combination.

Types of Fact Tables

Different business processes require different fact table structures. The three fundamental types—transactional, periodic snapshot, and accumulating snapshot—each serve distinct analytical needs.

Transaction Fact Tables

Transaction fact tables (also called transaction grain fact tables) record individual events at the most atomic level. Each row represents a single, instantaneous occurrence in the business process.

Characteristics:

One row per discrete event
Sparse data—rows only exist when events occur
Most flexible for drill-down analysis
Typically the largest tables in the warehouse

Examples:

Point-of-sale transactions (one row per line item purchased)
Web clickstream events (one row per page view or click)
Call center records (one row per call)
Financial transactions (one row per debit or credit)

transaction_fact_example.sql
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
-- Transaction Fact Table: Retail Sales
-- Grain: One row per line item per transaction
 
CREATE TABLE sales_transaction_fact (
    -- Surrogate key
    transaction_line_key    BIGINT PRIMARY KEY,
    
    -- Degenerate dimension (transaction header info)
    transaction_number      VARCHAR(20) NOT NULL,
    line_number             INT NOT NULL,
    
    -- Foreign keys to dimensions
    date_key                INT NOT NULL,        -- When
    time_key                INT NOT NULL,        -- What time of day
    product_key             INT NOT NULL,        -- What
    customer_key            INT NOT NULL,        -- Who
    store_key               INT NOT NULL,        -- Where
    cashier_key             INT NOT NULL,        -- By whom
    promotion_key           INT NOT NULL,        -- Under what promotion
    payment_method_key      INT NOT NULL,        -- How paid
    
    -- Additive facts
    quantity_sold           INT NOT NULL,
    unit_list_price         DECIMAL(10,2),
    unit_discount_amount    DECIMAL(10,2),
    extended_sales_amount   DECIMAL(12,2) NOT NULL,
    extended_cost_amount    DECIMAL(12,2) NOT NULL,
    extended_profit_amount  DECIMAL(12,2) NOT NULL,
    
    -- Timestamps for ETL tracking
    etl_load_timestamp      TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
 
-- Example query: Daily sales by product category
SELECT 
    d.calendar_date,
    p.category_name,
    SUM(f.quantity_sold) AS total_units,
    SUM(f.extended_sales_amount) AS total_revenue,
    SUM(f.extended_profit_amount) AS total_profit
FROM sales_transaction_fact f
JOIN date_dim d ON f.date_key = d.date_key
JOIN product_dim p ON f.product_key = p.product_key
WHERE d.calendar_year = 2024
GROUP BY d.calendar_date, p.category_name
ORDER BY d.calendar_date, total_revenue DESC;

Periodic Snapshot Fact Tables

Periodic snapshot fact tables capture the state of a measurement at regular, predictable intervals. Unlike transaction tables that record events as they occur, periodic snapshots take a "photograph" of cumulative metrics at the end of each period.

Characteristics:

One row per entity per period (day, week, month)
Dense data—every row exists whether or not activity occurred
Contains semi-additive facts (balances, levels)
Excellent for trend analysis and period-over-period comparisons

Examples:

Daily inventory levels (one row per product per warehouse per day)
Monthly account balances (one row per account per month)
Weekly sales summaries (one row per product per store per week)
Daily website traffic aggregates

periodic_snapshot_example.sql
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
-- Periodic Snapshot Fact Table: Daily Inventory
-- Grain: One row per product per warehouse per day
 
CREATE TABLE inventory_daily_snapshot_fact (
    -- Composite key (could also use surrogate)
    date_key                INT NOT NULL,
    product_key             INT NOT NULL,
    warehouse_key           INT NOT NULL,
    
    -- Semi-additive facts (point-in-time levels)
    quantity_on_hand        INT NOT NULL,           -- Current stock
    quantity_on_order       INT NOT NULL,           -- Pending receipts
    quantity_reserved       INT NOT NULL,           -- Allocated to orders
    quantity_available      INT AS (quantity_on_hand + quantity_on_order - quantity_reserved),
    
    -- Additive facts (activity during the day)
    quantity_received       INT DEFAULT 0,          -- Units received today
    quantity_shipped        INT DEFAULT 0,          -- Units shipped today
    quantity_returned       INT DEFAULT 0,          -- Units returned today
    quantity_adjusted       INT DEFAULT 0,          -- Inventory adjustments
    
    -- Value facts
    unit_cost               DECIMAL(10,2),
    total_inventory_cost    DECIMAL(14,2),          -- on_hand × unit_cost
    
    PRIMARY KEY (date_key, product_key, warehouse_key)
);
 
-- Example query: Average inventory levels over time
SELECT 
    p.category_name,
    d.month_name,
    AVG(f.quantity_on_hand) AS avg_daily_inventory,
    MAX(f.quantity_on_hand) AS peak_inventory,
    MIN(f.quantity_on_hand) AS trough_inventory
FROM inventory_daily_snapshot_fact f
JOIN date_dim d ON f.date_key = d.date_key
JOIN product_dim p ON f.product_key = p.product_key
WHERE d.calendar_year = 2024
GROUP BY p.category_name, d.month_name, d.month_number
ORDER BY p.category_name, d.month_number;

Accumulating Snapshot Fact Tables

Accumulating snapshot fact tables track the complete lifecycle of a process that has a well-defined beginning, intermediate steps, and end. Unlike transaction tables (which create new rows for each event) or periodic snapshots (which are replaced each period), accumulating snapshots have rows that are updated as the process progresses.

Characteristics:

One row per process instance (e.g., one row per order, one row per loan application)
Rows are updated multiple times as milestones are reached
Contains multiple date keys (one for each significant milestone)
Tracks lag times between process stages

Examples:

Order fulfillment (order received → picked → shipped → delivered)
Loan processing (application → review → approval → funding → closing)
Manufacturing (work order → started → QA → completed)
Insurance claims (filed → reviewed → approved → paid)

accumulating_snapshot_example.sql
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
-- Accumulating Snapshot Fact Table: Order Fulfillment
-- Grain: One row per order (updated as order progresses)
 
CREATE TABLE order_fulfillment_fact (
    order_key               BIGINT PRIMARY KEY,
    order_number            VARCHAR(20) NOT NULL,       -- Degenerate dimension
    
    -- Foreign keys
    customer_key            INT NOT NULL,
    product_key             INT NOT NULL,
    source_warehouse_key    INT,
    shipping_carrier_key    INT,
    
    -- Milestone date keys (updated as process progresses)
    order_date_key          INT NOT NULL,               -- Order placed
    warehouse_assign_date_key INT,                      -- Warehouse assigned
    pick_start_date_key     INT,                        -- Picking began
    pick_complete_date_key  INT,                        -- Picking finished
    pack_date_key           INT,                        -- Packed
    ship_date_key           INT,                        -- Shipped
    delivery_date_key       INT,                        -- Delivered
    
    -- Lag measurements (in days or hours)
    order_to_assign_days    INT,                        -- Assignment delay
    pick_duration_hours     DECIMAL(6,2),               -- Time to pick
    assign_to_ship_days     INT,                        -- Warehouse processing time
    ship_to_delivery_days   INT,                        -- Transit time
    total_fulfillment_days  INT,                        -- End-to-end cycle time
    
    -- Facts
    quantity_ordered        INT NOT NULL,
    quantity_shipped        INT,
    quantity_delivered      INT,
    extended_amount         DECIMAL(12,2),
    shipping_cost           DECIMAL(10,2),
    
    -- Status tracking
    current_status          VARCHAR(20),                -- 'ordered','picking','shipped','delivered'
    is_late                 BOOLEAN DEFAULT FALSE,
    promised_delivery_date_key INT,
    
    -- ETL metadata
    last_update_timestamp   TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
 
-- Example query: Order fulfillment performance analysis
SELECT 
    d.month_name,
    AVG(total_fulfillment_days) AS avg_cycle_time,
    AVG(assign_to_ship_days) AS avg_warehouse_time,
    AVG(ship_to_delivery_days) AS avg_transit_time,
    COUNT(*) AS total_orders,
    SUM(CASE WHEN is_late THEN 1 ELSE 0 END) AS late_orders,
    SUM(CASE WHEN is_late THEN 1 ELSE 0 END) * 100.0 / COUNT(*) AS late_pct
FROM order_fulfillment_fact f
JOIN date_dim d ON f.order_date_key = d.date_key
WHERE f.delivery_date_key IS NOT NULL  -- Only delivered orders
  AND d.calendar_year = 2024
GROUP BY d.month_name, d.month_number
ORDER BY d.month_number;

ETL Complexity Trade-off

Accumulating snapshot tables require UPDATE operations during ETL, which is more complex than the INSERT-only pattern of transaction and periodic snapshot tables. The operational overhead is justified when analyzing pipeline efficiency, cycle times, and bottleneck detection are critical business needs.

Granularity: The Single Most Important Design Decision

Granularity (also called grain) defines the level of detail stored in a fact table. It answers the question: What does a single row in this fact table represent?

The grain declaration is the most critical decision in fact table design. Every other design choice—which dimensions are needed, which facts make sense, how large the table will grow—follows directly from this single decision.

The Grain Declaration Statement

Every fact table should have a clear, unambiguous grain statement. This statement should be understandable to business users, not just technical staff:

Good grain statements:

"One row per line item on a retail transaction"
"One row per product per store per day"
"One row per website session"
"One row per patient visit to a healthcare facility"

Poor (ambiguous) grain statements:

"Sales data" (What level of detail?)
"Customer transactions" (Per transaction? Per line item? Per customer per day?)
"Financial information" (Meaningless without specificity)

The Cardinal Rule of Grain

Never, under any circumstances, mix grains in a single fact table. A table where some rows represent individual transactions while other rows represent daily summaries corrupts algebraic properties. SUM, AVG, and COUNT become unreliable. This mistake is nearly impossible to correct without completely rebuilding the table.

Choosing the Right Grain

The grain choice involves balancing several competing concerns:

Favor Atomic Grain

The atomic grain is the lowest level of detail captured by a business process. For a retail point-of-sale system, this is typically the individual line item (one product, one quantity, one price) within a transaction.

Why atomic grain is preferred:

Maximum analytical flexibility: You can always aggregate atomic data upward (daily sales from transactions), but you cannot decompose aggregated data downward (transactions from daily summaries).
Handles unforeseen questions: Business users will inevitably ask questions you didn't anticipate. Atomic grain data lets you answer them.
Supports drill-through: When investigating anomalies, analysts need to see the underlying transactions.

Arguments for Atomic Grain

•Enables any level of aggregation
•Answers questions not yet imagined
•Allows drill-down to source data
•Preserves full audit trail
•Supports complex analytical queries
•Natural alignment with operational data

Considerations for Higher Grain

•Smaller table size (storage savings)
•Faster query performance for summaries
•Less ETL complexity
•May be sufficient for limited use cases
•Aggregate tables supplement atomic tables
•Privacy/security may require aggregation

Grain and Dimensions

The grain declaration determines which dimensions are applicable. If the grain is "one row per line item on a retail transaction," then valid dimensions include:

Date (when the transaction occurred)
Product (what was purchased)
Customer (who made the purchase)
Store (where the transaction occurred)
Employee (who processed the sale)
Promotion (what discount applied)

However, a dimension like "supplier" might not be directly applicable—while products have suppliers, the transaction itself doesn't have a single supplier if multiple products from different suppliers were purchased.

The test: Ask yourself, "For a single fact row, is there exactly one value for this dimension?" If yes, the dimension is compatible with the grain. If no, either the dimension doesn't belong, or you need a bridge table or different grain.

Degenerate Dimensions

Degenerate dimensions are dimension keys that exist in the fact table but have no corresponding dimension table. They are "degenerate" because they have been stripped of any descriptive attributes—only the key value itself remains meaningful.

Why Degenerate Dimensions Exist

Consider a retail fact table at the line-item grain. Each line item belongs to a transaction, identified by a transaction number. But what attributes would a "transaction dimension" contain?

Transaction date? Already in the date dimension.
Store? Already in the store dimension.
Customer? Already in the customer dimension.
Cashier? Already in the employee dimension.

After referencing all the "real" dimensions, the transaction number itself has no remaining descriptive attributes. It's just an identifier. Creating a dimension table with a single column (the transaction number) wastes space and adds join overhead for no benefit.

Solution: Store the transaction number directly in the fact table as a degenerate dimension.

degenerate_dimension_example.sql
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
-- Fact table with degenerate dimensions
CREATE TABLE retail_sales_fact (
    sale_line_key           BIGINT PRIMARY KEY,
    
    -- Degenerate dimensions (stored in fact, no separate dimension table)
    invoice_number          VARCHAR(15) NOT NULL,    -- Transaction identifier
    line_number             INT NOT NULL,            -- Line within transaction
    po_number               VARCHAR(20),             -- Purchase order reference
    
    -- Regular foreign keys to dimensions
    date_key                INT NOT NULL,
    customer_key            INT NOT NULL,
    product_key             INT NOT NULL,
    store_key               INT NOT NULL,
    
    -- Facts
    quantity                INT NOT NULL,
    sales_amount            DECIMAL(12,2) NOT NULL,
    cost_amount             DECIMAL(12,2) NOT NULL
);
 
-- Query using degenerate dimension for drill-through
-- "Show me all line items from invoice ABC-2024-00042"
SELECT 
    f.invoice_number,
    f.line_number,
    p.product_name,
    f.quantity,
    f.sales_amount
FROM retail_sales_fact f
JOIN product_dim p ON f.product_key = p.product_key
WHERE f.invoice_number = 'ABC-2024-00042'
ORDER BY f.line_number;
 
-- Transaction-level aggregation using degenerate dimension
-- "Average items per transaction"
SELECT 
    d.calendar_month,
    COUNT(DISTINCT f.invoice_number) AS transaction_count,
    COUNT(*) AS line_item_count,
    COUNT(*) * 1.0 / COUNT(DISTINCT f.invoice_number) AS avg_items_per_transaction
FROM retail_sales_fact f
JOIN date_dim d ON f.date_key = d.date_key
GROUP BY d.calendar_month
ORDER BY d.calendar_month;

Common Degenerate Dimensions

Invoice numbers, order numbers, confirmation codes, ticket numbers, claim numbers, and transaction IDs are all common candidates for degenerate dimensions. Any operational system identifier that groups fact rows together but has no additional attributes can be stored directly in the fact table.

Factless Fact Tables

A factless fact table is a fact table that contains only dimension foreign keys—no numeric fact columns at all. This might seem paradoxical (a fact table without facts?), but it serves crucial analytical purposes.

Why Factless Fact Tables Are Useful

Type 1: Event Tracking

Some business events are meaningful simply because they occurred, even if there's no associated measurement. The fact worth recording is the intersection of dimensions—that these particular dimension values occurred together.

Examples:

Class attendance: Student A attended Class B on Date C (no numeric fact, just the occurrence)
Website browsing: User A viewed Page B at Time C on Device D
Coverage/association: Product A is covered by Promotion B during Period C

factless_event_example.sql
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
-- Factless Fact Table: Student Attendance
-- Grain: One row per student per class per day attended
 
CREATE TABLE student_attendance_fact (
    date_key                INT NOT NULL,
    student_key             INT NOT NULL,
    class_key               INT NOT NULL,
    instructor_key          INT NOT NULL,
    room_key                INT NOT NULL,
    
    -- Optionally include a dummy fact for counting
    attendance_count        INT DEFAULT 1,
    
    PRIMARY KEY (date_key, student_key, class_key)
);
 
-- Query: Attendance rate by class
SELECT 
    c.class_name,
    COUNT(*) AS days_attended,
    (SELECT COUNT(DISTINCT date_key) FROM date_dim 
     WHERE calendar_year = 2024 AND is_school_day = TRUE) AS school_days,
    COUNT(*) * 100.0 / 
        (SELECT COUNT(DISTINCT date_key) FROM date_dim 
         WHERE calendar_year = 2024 AND is_school_day = TRUE) AS attendance_rate
FROM student_attendance_fact f
JOIN class_dim c ON f.class_key = c.class_key
JOIN date_dim d ON f.date_key = d.date_key
WHERE d.calendar_year = 2024
GROUP BY c.class_name;

Type 2: Coverage Analysis

The second use of factless fact tables is to answer "what could have happened but didn't?" questions. These are coverage or eligibility tables that define all possible combinations.

The problem:

Your sales fact table records what products were sold at each store. But you also want to analyze what products were available at each store but weren't sold. The absence of a fact row means "no event," but you need a baseline to know what was possible.

factless_coverage_example.sql
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
-- Factless Coverage Table: Product-Store-Promotion Coverage
-- "What promotions were available for what products at what stores?"
 
CREATE TABLE promotion_coverage_fact (
    date_key                INT NOT NULL,
    product_key             INT NOT NULL,
    store_key               INT NOT NULL,
    promotion_key           INT NOT NULL,
    
    PRIMARY KEY (date_key, product_key, store_key, promotion_key)
);
 
-- Query: Products with promotions that had ZERO sales
-- (Identify underperforming promotions)
SELECT 
    prom.promotion_name,
    p.product_name,
    s.store_name,
    d.calendar_date
FROM promotion_coverage_fact cov
JOIN product_dim p ON cov.product_key = p.product_key
JOIN store_dim s ON cov.store_key = s.store_key
JOIN promotion_dim prom ON cov.promotion_key = prom.promotion_key
JOIN date_dim d ON cov.date_key = d.date_key
-- LEFT JOIN to sales fact to find non-matches
LEFT JOIN sales_fact sf 
    ON cov.date_key = sf.date_key
    AND cov.product_key = sf.product_key
    AND cov.store_key = sf.store_key
    AND cov.promotion_key = sf.promotion_key
WHERE sf.sale_key IS NULL  -- No matching sale
ORDER BY d.calendar_date, prom.promotion_name;

The Dummy Count Fact

Some factless fact tables include a constant value column (always 1) to simplify COUNT operations. This is a stylistic choice. With or without it, COUNT(*) works the same, but some query tools work better when at least one fact column exists.

Fact Table Design Best Practices

After decades of data warehouse implementations, a set of proven design practices has emerged. Following these guidelines leads to fact tables that are performant, maintainable, and analytically powerful.

Essential Fact Table Design Practices

•Declare grain before adding columns — Begin every fact table design by writing down the grain statement. Only add dimensions and facts that are consistent with that grain.
•Use surrogate keys for dimension references — Use integer surrogate keys (dimension_key), not natural keys. This enables slowly changing dimensions and provides consistent join performance.
•Store measured values, not calculated values — Store quantities, amounts, and durations. Derive ratios, percentages, and rankings at query time from stored components.
•Include audit columns — Add ETL_load_timestamp and source_system columns for debugging, reconciliation, and incremental loading.
•Partition large tables — For tables exceeding tens of millions of rows, partition by date (year/month) to enable partition elimination in queries.
•Maintain NULL discipline — Fact measurements should never be NULL; use 0 or create explicit 'unknown' dimension entries. NULL foreign keys indicate ETL problems.
•Create useful indexes — Index foreign keys individually and create composite indexes for common query patterns (e.g., date_key + product_key).
•Document the grain — Include a comment or metadata entry that explicitly states the grain. Future developers will thank you.

Sizing Considerations

Fact tables dominate warehouse storage. A retail organization with 1,000 stores, 50,000 products, and 5 years of history might have:

1,000 stores × 1,000 customers/day × 5 items/customer × 365 days × 5 years
= 9.125 billion rows in the line-item grain fact table

At 100 bytes per row (typical for a fact table with 5-10 foreign keys and 4-6 fact columns), this equals:

9.125B × 100 bytes = ~850 GB of raw fact data

Add indexes and you might reach 2-3 TB for this single table. This scale is routine for enterprise warehouses, but it demands:

Appropriate partitioning strategy
Columnar storage formats for analytics (Parquet, ORC)
Query engines designed for large-scale aggregation

Summary: Mastering Fact Tables

The fact table is the nucleus of dimensional modeling. Every analytical query ultimately aggregates or filters fact table data. Mastering fact table design means understanding not just the technical structure, but the business process it represents.

Key Takeaways

•Fact tables store measurements — Numeric, quantitative data about business events, surrounded by dimension foreign keys that provide context.
•Additivity determines aggregation behavior — Additive facts (revenue, quantity) sum across all dimensions; semi-additive facts (balances) don't sum across time; non-additive facts (ratios) require special handling.
•Three types serve different needs — Transaction tables record events, periodic snapshots capture point-in-time states, accumulating snapshots track process pipelines.
•Grain is the foundation — The grain declaration precedes all other design decisions. Never mix grains in a single table.
•Degenerate dimensions optimize storage — Header-level identifiers (invoice numbers) that have no attributes live directly in the fact table.
•Factless facts capture occurrences — Event tracking and coverage analysis tables contain only dimension keys, enabling analysis of what happened and what could have happened.

What's Next:

With fact tables understood, we turn to dimension tables—the structures that give meaning to fact table measurements. Where fact tables answer "how much?" and "how many?", dimensions answer "who?", "what?", "when?", "where?", and "why?" Understanding dimensions completes your foundation for star schema design.

Page Complete

You now understand fact tables at the expert level—their structure, types, granularity, and design patterns. This foundation enables you to design analytical databases that support the complex queries business intelligence demands.

1 / 5

Loading learning content...

Database Management SystemsStar Schema

Star Schema: The Foundation of Dimensional Modeling

LevelIntermediate

Duration75 mins

TopicStar Schema

1 / 5

Fact Table: The Heart of the Star Schema

The Central Table That Tells Your Business Story

What You Will Learn

Understanding Fact Tables

The Fundamental Characteristics

Fact tables possess several distinguishing characteristics that separate them from other database tables:

1. Numeric Measurements (Facts)

The primary content of a fact table consists of numeric, quantitative values called facts or measures. These are the numbers that analysts aggregate, summarize, and compare. Examples include:

Sales amount (revenue dollars)
Quantity sold (unit count)
Duration (time in seconds)
Weight (kilograms shipped)
Count (number of clicks, views, events)

2. Foreign Keys to Dimensions

Fact tables contain foreign keys that reference dimension tables, establishing the context for each measurement. A sales fact row might contain keys pointing to:

Which customer made the purchase
Which product was purchased
When the purchase occurred
Where the purchase happened
Which employee processed the sale

3. Narrow and Deep Structure

Anatomy of a Sales Fact Table Row
Column	Type	Example Value	Purpose
sale_key	Surrogate Key	7492031	Unique identifier for this fact row
date_key	Foreign Key	20240115	Links to date dimension
product_key	Foreign Key	4521	Links to product dimension
customer_key	Foreign Key	89234	Links to customer dimension
store_key	Foreign Key	127	Links to store dimension
promotion_key	Foreign Key	18	Links to promotion dimension
quantity_sold	Fact (Additive)	3	Units purchased in this transaction
unit_price	Fact (Non-Additive)	29.99	Price per unit at time of sale
sales_amount	Fact (Additive)	89.97	Total revenue = quantity × price
cost_amount	Fact (Additive)	54.00	Total cost of goods sold
profit_amount	Fact (Derived)	35.97	Revenue minus cost

The Grain Is Everything

Types of Facts: Understanding Additivity

Additive Facts

Additive facts can be summed across all dimensions without producing meaningless results. These are the most useful facts because they support the full range of analytical operations.

Examples of Additive Facts:

Sales Revenue: You can sum revenue across time (total for the year), across products (total for all products), across customers (total for all customers), or any combination.
Units Sold: Quantities are additive—100 units sold yesterday plus 150 today equals 250 total.
Cost of Goods Sold: Costs aggregate meaningfully across all dimensions.

Design for Additivity

Semi-Additive Facts

Semi-additive facts can be summed across some dimensions but not others—typically not across time. These usually represent balance or level measurements at a point in time.

Examples of Semi-Additive Facts:

Account Balance: You can sum balances across all accounts at a single point in time to get total assets, but summing Monday's balance plus Tuesday's balance is meaningless.
Inventory Quantity: Current stock levels across all warehouses sum to total inventory, but summing today's level plus yesterday's level produces nonsense.
Headcount: Employee count across all departments makes sense at a single date, but not summed across dates.

For semi-additive facts, typical analytical operations use AVG, MIN, MAX, or last value across the time dimension rather than SUM.

Aggregation Behavior by Fact Type
Fact Type	Sum Across Products	Sum Across Customers	Sum Across Time	Typical Operations
Sales Revenue (Additive)	✓ Valid	✓ Valid	✓ Valid	SUM, AVG, COUNT
Account Balance (Semi-Additive)	✓ Valid	✓ Valid	✗ Invalid	AVG, MIN, MAX, LAST
Unit Price (Non-Additive)	✗ Invalid	✗ Invalid	✗ Invalid	AVG, MIN, MAX, weighted calculations
Percentage (Non-Additive)	✗ Invalid	✗ Invalid	✗ Invalid	Recalculate from components

Non-Additive Facts

Non-Additive facts cannot be meaningfully summed across any dimension. These are typically ratios, percentages, averages, or other derived calculations.

Examples of Non-Additive Facts:

Unit Price: Summing prices across products or time produces meaningless numbers. Average price requires a weighted calculation.
Profit Margin Percentage: A 10% margin plus a 20% margin does not equal a 30% margin.
Temperature / Rate / Ratio: Any measurement where the unit of measure changes when aggregated.

The Solution: Store Components, Not Ratios

SELECT 
    product_name,
    SUM(profit_amount) / SUM(revenue_amount) * 100 AS profit_margin_percent
FROM sales_fact f
JOIN product_dim p ON f.product_key = p.product_key
GROUP BY product_name;

This approach enables correct weighted averages across any dimension combination.

Types of Fact Tables

Different business processes require different fact table structures. The three fundamental types—transactional, periodic snapshot, and accumulating snapshot—each serve distinct analytical needs.

Transaction Fact Tables

Characteristics:

One row per discrete event
Sparse data—rows only exist when events occur
Most flexible for drill-down analysis
Typically the largest tables in the warehouse

Examples:

Point-of-sale transactions (one row per line item purchased)
Web clickstream events (one row per page view or click)
Call center records (one row per call)
Financial transactions (one row per debit or credit)

transaction_fact_example.sql
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
-- Transaction Fact Table: Retail Sales
-- Grain: One row per line item per transaction
 
CREATE TABLE sales_transaction_fact (
    -- Surrogate key
    transaction_line_key    BIGINT PRIMARY KEY,
    
    -- Degenerate dimension (transaction header info)
    transaction_number      VARCHAR(20) NOT NULL,
    line_number             INT NOT NULL,
    
    -- Foreign keys to dimensions
    date_key                INT NOT NULL,        -- When
    time_key                INT NOT NULL,        -- What time of day
    product_key             INT NOT NULL,        -- What
    customer_key            INT NOT NULL,        -- Who
    store_key               INT NOT NULL,        -- Where
    cashier_key             INT NOT NULL,        -- By whom
    promotion_key           INT NOT NULL,        -- Under what promotion
    payment_method_key      INT NOT NULL,        -- How paid
    
    -- Additive facts
    quantity_sold           INT NOT NULL,
    unit_list_price         DECIMAL(10,2),
    unit_discount_amount    DECIMAL(10,2),
    extended_sales_amount   DECIMAL(12,2) NOT NULL,
    extended_cost_amount    DECIMAL(12,2) NOT NULL,
    extended_profit_amount  DECIMAL(12,2) NOT NULL,
    
    -- Timestamps for ETL tracking
    etl_load_timestamp      TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
 
-- Example query: Daily sales by product category
SELECT 
    d.calendar_date,
    p.category_name,
    SUM(f.quantity_sold) AS total_units,
    SUM(f.extended_sales_amount) AS total_revenue,
    SUM(f.extended_profit_amount) AS total_profit
FROM sales_transaction_fact f
JOIN date_dim d ON f.date_key = d.date_key
JOIN product_dim p ON f.product_key = p.product_key
WHERE d.calendar_year = 2024
GROUP BY d.calendar_date, p.category_name
ORDER BY d.calendar_date, total_revenue DESC;

Periodic Snapshot Fact Tables

Characteristics:

One row per entity per period (day, week, month)
Dense data—every row exists whether or not activity occurred
Contains semi-additive facts (balances, levels)
Excellent for trend analysis and period-over-period comparisons

Examples:

Daily inventory levels (one row per product per warehouse per day)
Monthly account balances (one row per account per month)
Weekly sales summaries (one row per product per store per week)
Daily website traffic aggregates

periodic_snapshot_example.sql
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
-- Periodic Snapshot Fact Table: Daily Inventory
-- Grain: One row per product per warehouse per day
 
CREATE TABLE inventory_daily_snapshot_fact (
    -- Composite key (could also use surrogate)
    date_key                INT NOT NULL,
    product_key             INT NOT NULL,
    warehouse_key           INT NOT NULL,
    
    -- Semi-additive facts (point-in-time levels)
    quantity_on_hand        INT NOT NULL,           -- Current stock
    quantity_on_order       INT NOT NULL,           -- Pending receipts
    quantity_reserved       INT NOT NULL,           -- Allocated to orders
    quantity_available      INT AS (quantity_on_hand + quantity_on_order - quantity_reserved),
    
    -- Additive facts (activity during the day)
    quantity_received       INT DEFAULT 0,          -- Units received today
    quantity_shipped        INT DEFAULT 0,          -- Units shipped today
    quantity_returned       INT DEFAULT 0,          -- Units returned today
    quantity_adjusted       INT DEFAULT 0,          -- Inventory adjustments
    
    -- Value facts
    unit_cost               DECIMAL(10,2),
    total_inventory_cost    DECIMAL(14,2),          -- on_hand × unit_cost
    
    PRIMARY KEY (date_key, product_key, warehouse_key)
);
 
-- Example query: Average inventory levels over time
SELECT 
    p.category_name,
    d.month_name,
    AVG(f.quantity_on_hand) AS avg_daily_inventory,
    MAX(f.quantity_on_hand) AS peak_inventory,
    MIN(f.quantity_on_hand) AS trough_inventory
FROM inventory_daily_snapshot_fact f
JOIN date_dim d ON f.date_key = d.date_key
JOIN product_dim p ON f.product_key = p.product_key
WHERE d.calendar_year = 2024
GROUP BY p.category_name, d.month_name, d.month_number
ORDER BY p.category_name, d.month_number;

Accumulating Snapshot Fact Tables

Characteristics:

One row per process instance (e.g., one row per order, one row per loan application)
Rows are updated multiple times as milestones are reached
Contains multiple date keys (one for each significant milestone)
Tracks lag times between process stages

Examples:

Order fulfillment (order received → picked → shipped → delivered)
Loan processing (application → review → approval → funding → closing)
Manufacturing (work order → started → QA → completed)
Insurance claims (filed → reviewed → approved → paid)

accumulating_snapshot_example.sql
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
-- Accumulating Snapshot Fact Table: Order Fulfillment
-- Grain: One row per order (updated as order progresses)
 
CREATE TABLE order_fulfillment_fact (
    order_key               BIGINT PRIMARY KEY,
    order_number            VARCHAR(20) NOT NULL,       -- Degenerate dimension
    
    -- Foreign keys
    customer_key            INT NOT NULL,
    product_key             INT NOT NULL,
    source_warehouse_key    INT,
    shipping_carrier_key    INT,
    
    -- Milestone date keys (updated as process progresses)
    order_date_key          INT NOT NULL,               -- Order placed
    warehouse_assign_date_key INT,                      -- Warehouse assigned
    pick_start_date_key     INT,                        -- Picking began
    pick_complete_date_key  INT,                        -- Picking finished
    pack_date_key           INT,                        -- Packed
    ship_date_key           INT,                        -- Shipped
    delivery_date_key       INT,                        -- Delivered
    
    -- Lag measurements (in days or hours)
    order_to_assign_days    INT,                        -- Assignment delay
    pick_duration_hours     DECIMAL(6,2),               -- Time to pick
    assign_to_ship_days     INT,                        -- Warehouse processing time
    ship_to_delivery_days   INT,                        -- Transit time
    total_fulfillment_days  INT,                        -- End-to-end cycle time
    
    -- Facts
    quantity_ordered        INT NOT NULL,
    quantity_shipped        INT,
    quantity_delivered      INT,
    extended_amount         DECIMAL(12,2),
    shipping_cost           DECIMAL(10,2),
    
    -- Status tracking
    current_status          VARCHAR(20),                -- 'ordered','picking','shipped','delivered'
    is_late                 BOOLEAN DEFAULT FALSE,
    promised_delivery_date_key INT,
    
    -- ETL metadata
    last_update_timestamp   TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
 
-- Example query: Order fulfillment performance analysis
SELECT 
    d.month_name,
    AVG(total_fulfillment_days) AS avg_cycle_time,
    AVG(assign_to_ship_days) AS avg_warehouse_time,
    AVG(ship_to_delivery_days) AS avg_transit_time,
    COUNT(*) AS total_orders,
    SUM(CASE WHEN is_late THEN 1 ELSE 0 END) AS late_orders,
    SUM(CASE WHEN is_late THEN 1 ELSE 0 END) * 100.0 / COUNT(*) AS late_pct
FROM order_fulfillment_fact f
JOIN date_dim d ON f.order_date_key = d.date_key
WHERE f.delivery_date_key IS NOT NULL  -- Only delivered orders
  AND d.calendar_year = 2024
GROUP BY d.month_name, d.month_number
ORDER BY d.month_number;

ETL Complexity Trade-off

Granularity: The Single Most Important Design Decision

Granularity (also called grain) defines the level of detail stored in a fact table. It answers the question: What does a single row in this fact table represent?

The Grain Declaration Statement

Every fact table should have a clear, unambiguous grain statement. This statement should be understandable to business users, not just technical staff:

Good grain statements:

"One row per line item on a retail transaction"
"One row per product per store per day"
"One row per website session"
"One row per patient visit to a healthcare facility"

Poor (ambiguous) grain statements:

"Sales data" (What level of detail?)
"Customer transactions" (Per transaction? Per line item? Per customer per day?)
"Financial information" (Meaningless without specificity)

The Cardinal Rule of Grain

Choosing the Right Grain

The grain choice involves balancing several competing concerns:

Favor Atomic Grain

Why atomic grain is preferred:

Maximum analytical flexibility: You can always aggregate atomic data upward (daily sales from transactions), but you cannot decompose aggregated data downward (transactions from daily summaries).
Handles unforeseen questions: Business users will inevitably ask questions you didn't anticipate. Atomic grain data lets you answer them.
Supports drill-through: When investigating anomalies, analysts need to see the underlying transactions.

Arguments for Atomic Grain

•Enables any level of aggregation
•Answers questions not yet imagined
•Allows drill-down to source data
•Preserves full audit trail
•Supports complex analytical queries
•Natural alignment with operational data

Considerations for Higher Grain

•Smaller table size (storage savings)
•Faster query performance for summaries
•Less ETL complexity
•May be sufficient for limited use cases
•Aggregate tables supplement atomic tables
•Privacy/security may require aggregation

Grain and Dimensions

The grain declaration determines which dimensions are applicable. If the grain is "one row per line item on a retail transaction," then valid dimensions include:

Date (when the transaction occurred)
Product (what was purchased)
Customer (who made the purchase)
Store (where the transaction occurred)
Employee (who processed the sale)
Promotion (what discount applied)

Degenerate Dimensions

Why Degenerate Dimensions Exist

Consider a retail fact table at the line-item grain. Each line item belongs to a transaction, identified by a transaction number. But what attributes would a "transaction dimension" contain?

Transaction date? Already in the date dimension.
Store? Already in the store dimension.
Customer? Already in the customer dimension.
Cashier? Already in the employee dimension.

Solution: Store the transaction number directly in the fact table as a degenerate dimension.

degenerate_dimension_example.sql
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
-- Fact table with degenerate dimensions
CREATE TABLE retail_sales_fact (
    sale_line_key           BIGINT PRIMARY KEY,
    
    -- Degenerate dimensions (stored in fact, no separate dimension table)
    invoice_number          VARCHAR(15) NOT NULL,    -- Transaction identifier
    line_number             INT NOT NULL,            -- Line within transaction
    po_number               VARCHAR(20),             -- Purchase order reference
    
    -- Regular foreign keys to dimensions
    date_key                INT NOT NULL,
    customer_key            INT NOT NULL,
    product_key             INT NOT NULL,
    store_key               INT NOT NULL,
    
    -- Facts
    quantity                INT NOT NULL,
    sales_amount            DECIMAL(12,2) NOT NULL,
    cost_amount             DECIMAL(12,2) NOT NULL
);
 
-- Query using degenerate dimension for drill-through
-- "Show me all line items from invoice ABC-2024-00042"
SELECT 
    f.invoice_number,
    f.line_number,
    p.product_name,
    f.quantity,
    f.sales_amount
FROM retail_sales_fact f
JOIN product_dim p ON f.product_key = p.product_key
WHERE f.invoice_number = 'ABC-2024-00042'
ORDER BY f.line_number;
 
-- Transaction-level aggregation using degenerate dimension
-- "Average items per transaction"
SELECT 
    d.calendar_month,
    COUNT(DISTINCT f.invoice_number) AS transaction_count,
    COUNT(*) AS line_item_count,
    COUNT(*) * 1.0 / COUNT(DISTINCT f.invoice_number) AS avg_items_per_transaction
FROM retail_sales_fact f
JOIN date_dim d ON f.date_key = d.date_key
GROUP BY d.calendar_month
ORDER BY d.calendar_month;

Common Degenerate Dimensions

Factless Fact Tables

Why Factless Fact Tables Are Useful

Type 1: Event Tracking

Examples:

Class attendance: Student A attended Class B on Date C (no numeric fact, just the occurrence)
Website browsing: User A viewed Page B at Time C on Device D
Coverage/association: Product A is covered by Promotion B during Period C

factless_event_example.sql
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
-- Factless Fact Table: Student Attendance
-- Grain: One row per student per class per day attended
 
CREATE TABLE student_attendance_fact (
    date_key                INT NOT NULL,
    student_key             INT NOT NULL,
    class_key               INT NOT NULL,
    instructor_key          INT NOT NULL,
    room_key                INT NOT NULL,
    
    -- Optionally include a dummy fact for counting
    attendance_count        INT DEFAULT 1,
    
    PRIMARY KEY (date_key, student_key, class_key)
);
 
-- Query: Attendance rate by class
SELECT 
    c.class_name,
    COUNT(*) AS days_attended,
    (SELECT COUNT(DISTINCT date_key) FROM date_dim 
     WHERE calendar_year = 2024 AND is_school_day = TRUE) AS school_days,
    COUNT(*) * 100.0 / 
        (SELECT COUNT(DISTINCT date_key) FROM date_dim 
         WHERE calendar_year = 2024 AND is_school_day = TRUE) AS attendance_rate
FROM student_attendance_fact f
JOIN class_dim c ON f.class_key = c.class_key
JOIN date_dim d ON f.date_key = d.date_key
WHERE d.calendar_year = 2024
GROUP BY c.class_name;

Type 2: Coverage Analysis

The second use of factless fact tables is to answer "what could have happened but didn't?" questions. These are coverage or eligibility tables that define all possible combinations.

The problem:

factless_coverage_example.sql
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
-- Factless Coverage Table: Product-Store-Promotion Coverage
-- "What promotions were available for what products at what stores?"
 
CREATE TABLE promotion_coverage_fact (
    date_key                INT NOT NULL,
    product_key             INT NOT NULL,
    store_key               INT NOT NULL,
    promotion_key           INT NOT NULL,
    
    PRIMARY KEY (date_key, product_key, store_key, promotion_key)
);
 
-- Query: Products with promotions that had ZERO sales
-- (Identify underperforming promotions)
SELECT 
    prom.promotion_name,
    p.product_name,
    s.store_name,
    d.calendar_date
FROM promotion_coverage_fact cov
JOIN product_dim p ON cov.product_key = p.product_key
JOIN store_dim s ON cov.store_key = s.store_key
JOIN promotion_dim prom ON cov.promotion_key = prom.promotion_key
JOIN date_dim d ON cov.date_key = d.date_key
-- LEFT JOIN to sales fact to find non-matches
LEFT JOIN sales_fact sf 
    ON cov.date_key = sf.date_key
    AND cov.product_key = sf.product_key
    AND cov.store_key = sf.store_key
    AND cov.promotion_key = sf.promotion_key
WHERE sf.sale_key IS NULL  -- No matching sale
ORDER BY d.calendar_date, prom.promotion_name;

The Dummy Count Fact

Fact Table Design Best Practices

Essential Fact Table Design Practices

•Declare grain before adding columns — Begin every fact table design by writing down the grain statement. Only add dimensions and facts that are consistent with that grain.
•Use surrogate keys for dimension references — Use integer surrogate keys (dimension_key), not natural keys. This enables slowly changing dimensions and provides consistent join performance.
•Store measured values, not calculated values — Store quantities, amounts, and durations. Derive ratios, percentages, and rankings at query time from stored components.
•Include audit columns — Add ETL_load_timestamp and source_system columns for debugging, reconciliation, and incremental loading.
•Partition large tables — For tables exceeding tens of millions of rows, partition by date (year/month) to enable partition elimination in queries.
•Maintain NULL discipline — Fact measurements should never be NULL; use 0 or create explicit 'unknown' dimension entries. NULL foreign keys indicate ETL problems.
•Create useful indexes — Index foreign keys individually and create composite indexes for common query patterns (e.g., date_key + product_key).
•Document the grain — Include a comment or metadata entry that explicitly states the grain. Future developers will thank you.

Sizing Considerations

Fact tables dominate warehouse storage. A retail organization with 1,000 stores, 50,000 products, and 5 years of history might have:

1,000 stores × 1,000 customers/day × 5 items/customer × 365 days × 5 years
= 9.125 billion rows in the line-item grain fact table

At 100 bytes per row (typical for a fact table with 5-10 foreign keys and 4-6 fact columns), this equals:

9.125B × 100 bytes = ~850 GB of raw fact data

Add indexes and you might reach 2-3 TB for this single table. This scale is routine for enterprise warehouses, but it demands:

Appropriate partitioning strategy
Columnar storage formats for analytics (Parquet, ORC)
Query engines designed for large-scale aggregation

Summary: Mastering Fact Tables

Key Takeaways

•Fact tables store measurements — Numeric, quantitative data about business events, surrounded by dimension foreign keys that provide context.
•Additivity determines aggregation behavior — Additive facts (revenue, quantity) sum across all dimensions; semi-additive facts (balances) don't sum across time; non-additive facts (ratios) require special handling.
•Three types serve different needs — Transaction tables record events, periodic snapshots capture point-in-time states, accumulating snapshots track process pipelines.
•Grain is the foundation — The grain declaration precedes all other design decisions. Never mix grains in a single table.
•Degenerate dimensions optimize storage — Header-level identifiers (invoice numbers) that have no attributes live directly in the fact table.
•Factless facts capture occurrences — Event tracking and coverage analysis tables contain only dimension keys, enabling analysis of what happened and what could have happened.

What's Next:

Page Complete

1 / 5