Normalization separates data into distinct tables connected by foreign key relationships. This elegant design eliminates redundancy but introduces a fundamental performance cost: every time you need related data, you must perform a join.
Consider an order management system with properly normalized tables:
orders → customers (to display customer name)
orders → shipping_addresses (to display delivery info)
order_items → products (to display product name)
order_items → categories (to display category)
Displaying a complete order with all its line items requires joining five or six tables. At scale, these joins become the dominant factor in query latency: for every request, the database must scan the relevant indexes, match rows across each pair of tables, and materialize the combined result.
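To make the cost concrete, here is a minimal TypeScript sketch (all names hypothetical) that counts rows touched by a naive nested-loop join versus reading pre-joined rows from a single table. Real databases use indexes and hash joins, so the gap is smaller in practice, but the shape of the difference is the same:

```typescript
// Hypothetical sketch: cost of a nested-loop join vs. a single-table read.
// "Cost" here is rows touched -- a rough proxy for work done.

interface Order { orderId: number; customerId: number }
interface Customer { customerId: number; name: string }

// Nested-loop join: for each order, scan customers until a match is found.
function nestedLoopJoinCost(orders: Order[], customers: Customer[]): number {
  let rowsTouched = 0;
  for (const o of orders) {
    for (const c of customers) {
      rowsTouched++;
      if (c.customerId === o.customerId) break;
    }
  }
  return rowsTouched;
}

// Denormalized read: every order row already carries the customer data.
function denormalizedReadCost(orderCount: number): number {
  return orderCount; // one row touched per order, no second table
}

const customers: Customer[] = Array.from({ length: 1000 }, (_, i) => ({
  customerId: i,
  name: `customer-${i}`,
}));
const orders: Order[] = Array.from({ length: 100 }, () => ({
  orderId: 0,
  customerId: 999, // worst case: the match is the last customer scanned
}));

const joinCost = nestedLoopJoinCost(orders, customers); // 100 orders * 1000 scans
const flatCost = denormalizedReadCost(orders.length);   // 100 rows
```

With an index on `customer_id` the join drops to roughly one probe per order, but it still pays per-probe overhead that the single-table read avoids entirely.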
Duplicating foreign key data addresses this by copying frequently-accessed attributes from related tables into the referencing table, eliminating the need for joins in common queries.
By the end of this page, you will understand when and how to duplicate foreign key data for performance optimization. You'll learn to identify high-value duplication candidates, implement synchronization strategies, and manage the consistency challenges that duplication introduces.
Data duplication (also called data copying, embedding, or inlining) means storing copies of data from referenced tables directly in the referencing table. The foreign key relationship remains for referential integrity, but commonly-accessed attributes are duplicated to avoid joins.
Before (normalized):
orders table:
| order_id | customer_id | order_date | status |
customers table:
| customer_id | name | email | loyalty_tier |
Query: SELECT o.*, c.name, c.email
FROM orders o JOIN customers c ON o.customer_id = c.customer_id
After (with duplicated data):
orders table:
| order_id | customer_id | customer_name | customer_email | order_date | status |
Query: SELECT * FROM orders
The duplicated customer_name and customer_email columns eliminate the join for the most common query pattern while customer_id maintains referential integrity for less frequent operations.
| Aspect | Join (Normalized) | Duplication (Denormalized) |
|---|---|---|
| Storage | Minimal - single copy of each value | Increased - values repeated across rows |
| Read Performance | Slower - must combine data from multiple tables | Faster - all needed data in single table |
| Write Performance | Faster - update source once | Slower - must update all copies |
| Data Consistency | Guaranteed - single source of truth | Must be actively maintained |
| Schema Complexity | More tables, cleaner design | Fewer joins, more columns |
| Query Complexity | Explicit joins required | Simpler SELECT statements |
Most production systems are heavily read-biased. A typical web application might perform 100-1000 reads for every write. If duplicating data eliminates joins on 99% of reads, the write overhead of maintaining duplicates is often negligible.
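A back-of-the-envelope model makes this trade-off explicit. The sketch below (all numbers and names are illustrative assumptions, not measurements) compares the join cost avoided across reads against the fan-out cost paid on writes:

```typescript
// Hypothetical cost model: is duplication a net win for a given workload?
// Units are arbitrary "cost units" per operation; all inputs are assumptions.

interface WorkloadModel {
  readsPerWrite: number;   // e.g. 100 reads for every source update
  joinCostPerRead: number; // extra cost each read pays for the join
  fanOut: number;          // duplicated rows touched per source update
  writeCostPerRow: number; // cost to update one duplicated row
}

// Net saving per write cycle: join cost avoided across the reads that
// happen between writes, minus the cost of propagating the write.
function netSavingPerCycle(m: WorkloadModel): number {
  const readSaving = m.readsPerWrite * m.joinCostPerRead;
  const writePenalty = m.fanOut * m.writeCostPerRow;
  return readSaving - writePenalty;
}

// Read-heavy workload with modest fan-out: duplication wins.
const readHeavy = netSavingPerCycle({
  readsPerWrite: 500, joinCostPerRead: 2, fanOut: 50, writeCostPerRow: 1,
}); // 500*2 - 50*1 = 950

// Write-heavy workload with huge fan-out: duplication loses badly.
const writeHeavy = netSavingPerCycle({
  readsPerWrite: 5, joinCostPerRead: 2, fanOut: 100_000, writeCostPerRow: 1,
}); // 5*2 - 100000*1 = -99990
```

The sign of the result flips on the read/write ratio and the fan-out, which is exactly why the next section focuses on measuring both before duplicating anything.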
Not all foreign key attributes are worth duplicating. The decision requires analyzing query patterns, update frequencies, and consistency requirements.
Evaluation criteria for duplication candidates: how often the attribute is read via joins, how often it changes at the source, how strict its consistency requirements are, and how much extra storage the copies would consume. The following queries help quantify each:
```sql
-- Analysis queries to identify duplication candidates

-- 1. Find most frequently joined column combinations
-- (Requires query logging/monitoring setup)
SELECT query_pattern, join_tables, execution_count, avg_execution_time_ms
FROM query_performance_log
WHERE query_pattern LIKE '%JOIN%'
ORDER BY execution_count * avg_execution_time_ms DESC
LIMIT 20;

-- 2. Analyze update frequency of referenced table columns
SELECT column_name,
       COUNT(*) AS update_count,
       MAX(updated_at) AS last_update
FROM audit_log
WHERE table_name = 'customers'
  AND operation = 'UPDATE'
  AND updated_at > CURRENT_DATE - INTERVAL '30 days'
GROUP BY column_name
ORDER BY update_count DESC;

-- 3. Measure join cost for specific query patterns
EXPLAIN ANALYZE
SELECT o.order_id, o.order_date, o.status,
       c.name AS customer_name,
       c.email AS customer_email
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
WHERE o.order_date > CURRENT_DATE - INTERVAL '7 days'
ORDER BY o.order_date DESC
LIMIT 100;

-- Check output for:
-- - Nested Loop vs Hash Join (algorithm choice)
-- - Index usage on join conditions
-- - Total execution time
-- - Rows processed vs rows returned

-- 4. Estimate storage impact of duplication
SELECT COUNT(*) AS row_count,
       COUNT(*) * 50 AS estimated_bytes_for_name,   -- ~50 bytes per name
       COUNT(*) * 100 AS estimated_bytes_for_email,
       pg_size_pretty((COUNT(*) * 150)::bigint) AS total_additional_storage
FROM orders;
```

Never duplicate data based on assumptions. Profile your actual query workload, measure join costs, and quantify the expected benefit. Premature duplication adds complexity without guaranteed performance gains.
There are several patterns for implementing foreign key data duplication, each suited to different scenarios and consistency requirements.
Pattern 1: Point-in-Time Capture
Capture the value at transaction time and never update it. This is appropriate when the historical state is meaningful—what the value was when the action occurred, not what it is now.
```sql
-- Point-in-Time Capture: Values frozen at creation time
-- Example: Order captures product price at time of order

CREATE TABLE order_items (
    order_item_id SERIAL PRIMARY KEY,
    order_id INT NOT NULL REFERENCES orders(order_id),
    product_id INT NOT NULL REFERENCES products(product_id),
    quantity INT NOT NULL,
    -- Captured at order time - intentionally NOT updated when product changes
    unit_price DECIMAL(10, 2) NOT NULL,  -- Price at time of order
    product_name VARCHAR(255) NOT NULL,  -- Name at time of order
    product_sku VARCHAR(50) NOT NULL,    -- SKU at time of order
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Application code captures values during order creation
INSERT INTO order_items (order_id, product_id, quantity, unit_price, product_name, product_sku)
SELECT :order_id,
       p.product_id,
       :quantity,
       p.current_price,  -- Capture current price
       p.name,           -- Capture current name
       p.sku             -- Capture current SKU
FROM products p
WHERE p.product_id = :product_id;

-- Benefits:
-- 1. Historical accuracy - order shows what customer actually saw/paid
-- 2. No synchronization needed - values are intentionally static
-- 3. Audit trail - original values preserved regardless of product changes
-- 4. Legal/compliance - invoices reflect actual transaction terms
```

Pattern 2: Live Synchronized Copy
Maintain a copy that stays synchronized with the source. This requires active maintenance but ensures the duplicate always reflects current state.
```sql
-- Live Synchronized Copy: Duplicates updated when source changes
-- Example: Order displays current customer name

CREATE TABLE orders (
    order_id SERIAL PRIMARY KEY,
    customer_id INT NOT NULL REFERENCES customers(customer_id),
    -- Synchronized copies - updated when customer data changes
    customer_name VARCHAR(100) NOT NULL,
    customer_email VARCHAR(255) NOT NULL,
    customer_phone VARCHAR(20),
    order_date TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    status VARCHAR(20) DEFAULT 'pending'
);

-- Trigger to synchronize duplicates when source changes
CREATE OR REPLACE FUNCTION sync_customer_data_to_orders()
RETURNS TRIGGER AS $$
BEGIN
    -- Only update if the duplicated fields actually changed
    IF OLD.name IS DISTINCT FROM NEW.name
       OR OLD.email IS DISTINCT FROM NEW.email
       OR OLD.phone IS DISTINCT FROM NEW.phone THEN

        UPDATE orders
        SET customer_name = NEW.name,
            customer_email = NEW.email,
            customer_phone = NEW.phone
        WHERE customer_id = NEW.customer_id;

        -- Log the sync operation for monitoring
        INSERT INTO sync_audit_log (table_name, source_id, operation, synced_at)
        VALUES ('orders', NEW.customer_id, 'customer_update', CURRENT_TIMESTAMP);
    END IF;

    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER customer_data_sync_trigger
AFTER UPDATE ON customers
FOR EACH ROW EXECUTE FUNCTION sync_customer_data_to_orders();

-- Also sync on initial order creation
CREATE OR REPLACE FUNCTION populate_order_customer_data()
RETURNS TRIGGER AS $$
BEGIN
    SELECT name, email, phone
    INTO NEW.customer_name, NEW.customer_email, NEW.customer_phone
    FROM customers
    WHERE customer_id = NEW.customer_id;

    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER order_customer_populate_trigger
BEFORE INSERT ON orders
FOR EACH ROW EXECUTE FUNCTION populate_order_customer_data();
```

Pattern 3: Lazy Refresh with Staleness Tolerance
Allow duplicates to become stale and refresh periodically. Appropriate when exact consistency isn't required and refresh cost should be batched.
```sql
-- Lazy Refresh: Periodic batch synchronization
-- Example: Product catalog data in recommendations table

CREATE TABLE product_recommendations (
    recommendation_id SERIAL PRIMARY KEY,
    user_id INT NOT NULL,
    product_id INT NOT NULL REFERENCES products(product_id),
    -- Duplicated product data (may be slightly stale)
    product_name VARCHAR(255),
    product_image_url VARCHAR(500),
    product_category VARCHAR(100),
    product_price DECIMAL(10, 2),
    -- Staleness tracking
    product_data_synced_at TIMESTAMP,
    last_displayed_at TIMESTAMP,  -- used by the selective refresh below
    score DECIMAL(5, 4) NOT NULL,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Batch refresh job - run every 15 minutes or hourly
CREATE OR REPLACE FUNCTION refresh_recommendation_product_data()
RETURNS INTEGER AS $$
DECLARE
    rows_updated INTEGER;
BEGIN
    UPDATE product_recommendations pr
    SET product_name = p.name,
        product_image_url = p.image_url,
        product_category = c.name,
        product_price = p.current_price,
        product_data_synced_at = CURRENT_TIMESTAMP
    FROM products p
    JOIN categories c ON p.category_id = c.category_id
    WHERE pr.product_id = p.product_id
      AND (pr.product_data_synced_at IS NULL
           OR pr.product_data_synced_at < p.updated_at);

    GET DIAGNOSTICS rows_updated = ROW_COUNT;

    -- Log refresh statistics
    INSERT INTO batch_sync_log (job_name, rows_affected, completed_at)
    VALUES ('refresh_recommendation_product_data', rows_updated, CURRENT_TIMESTAMP);

    RETURN rows_updated;
END;
$$ LANGUAGE plpgsql;

-- Selective refresh: Only sync high-visibility rows
CREATE OR REPLACE FUNCTION refresh_active_recommendation_data()
RETURNS INTEGER AS $$
DECLARE
    rows_updated INTEGER;
BEGIN
    -- Only refresh recommendations shown in last 24 hours
    UPDATE product_recommendations pr
    SET product_name = p.name,
        product_price = p.current_price,
        product_data_synced_at = CURRENT_TIMESTAMP
    FROM products p
    WHERE pr.product_id = p.product_id
      AND pr.last_displayed_at > CURRENT_TIMESTAMP - INTERVAL '24 hours'
      AND pr.product_data_synced_at < p.updated_at;

    GET DIAGNOSTICS rows_updated = ROW_COUNT;
    RETURN rows_updated;
END;
$$ LANGUAGE plpgsql;
```

| Pattern | Use When | Examples |
|---|---|---|
| Point-in-Time Capture | Historical accuracy matters; legal/audit requirements; values represent transaction terms | Order prices, invoice addresses, contract terms |
| Live Synchronized | Current values always needed; low update frequency at source; strong consistency required | Customer contact info, product availability status |
| Lazy Refresh | Slight staleness acceptable; high volume tables; batch updates preferred | Recommendations, cached listings, search results |
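The decision logic in the table above can be encoded as a small helper. This is a hypothetical TypeScript sketch (names and flags are assumptions, not a standard API), useful as a starting point for documenting per-attribute choices in a schema review:

```typescript
// Hypothetical helper encoding the pattern-selection table above.
type DuplicationPattern = "point-in-time" | "live-synchronized" | "lazy-refresh";

interface AttributeProfile {
  historicalAccuracyRequired: boolean; // legal/audit: value at transaction time
  stalenessTolerable: boolean;         // can readers see slightly old values?
}

function choosePattern(p: AttributeProfile): DuplicationPattern {
  // Historical meaning wins: freeze the value at transaction time.
  if (p.historicalAccuracyRequired) return "point-in-time";
  // Staleness acceptable: cheap batch refresh is enough.
  if (p.stalenessTolerable) return "lazy-refresh";
  // Otherwise the duplicate must track the source: trigger-based sync.
  return "live-synchronized";
}

// Order prices are the legal terms of the transaction -> frozen snapshot.
const forPrices = choosePattern({ historicalAccuracyRequired: true, stalenessTolerable: false });
// Recommendation card data tolerates slight staleness -> batch refresh.
const forRecs = choosePattern({ historicalAccuracyRequired: false, stalenessTolerable: true });
// Customer contact info must be current -> live synchronization.
const forContact = choosePattern({ historicalAccuracyRequired: false, stalenessTolerable: false });
```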
Data duplication inherently creates the possibility of inconsistency—the duplicate may not match the source. Robust implementations require multiple layers of consistency protection.
Layer 1: Transactional Consistency
Wrap source updates and duplicate updates in the same database transaction:
```sql
-- Transactional consistency: All updates in single transaction
BEGIN;

-- Update source table
UPDATE customers
SET name = 'Jane Smith',
    email = 'jane.smith@example.com'
WHERE customer_id = 12345;

-- Update all duplicates in same transaction
UPDATE orders
SET customer_name = 'Jane Smith',
    customer_email = 'jane.smith@example.com'
WHERE customer_id = 12345;

UPDATE support_tickets
SET customer_name = 'Jane Smith',
    customer_email = 'jane.smith@example.com'
WHERE customer_id = 12345;

COMMIT;  -- All changes succeed or all fail

-- Better: Let triggers handle this automatically (see previous examples)
-- The trigger executes within the same transaction as the source UPDATE
```

Layer 2: Consistency Verification
Scheduled jobs to detect and alert on inconsistencies:
```sql
-- Consistency verification: Detect drift between source and duplicates

CREATE TABLE consistency_check_results (
    check_id SERIAL PRIMARY KEY,
    check_name VARCHAR(100) NOT NULL,
    table_name VARCHAR(100) NOT NULL,
    inconsistent_rows INT NOT NULL,
    sample_ids TEXT,
    checked_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Check orders.customer_name matches customers.name
CREATE OR REPLACE FUNCTION check_order_customer_consistency()
RETURNS void AS $$
DECLARE
    inconsistent_count INT;
    sample_order_ids TEXT;
BEGIN
    -- Count inconsistencies; keep the first ten IDs as a sample
    -- (STRING_AGG does not accept LIMIT, so slice an aggregated array instead)
    SELECT COUNT(*),
           (ARRAY_AGG(o.order_id ORDER BY o.order_id))[1:10]::TEXT
    INTO inconsistent_count, sample_order_ids
    FROM orders o
    JOIN customers c ON o.customer_id = c.customer_id
    WHERE o.customer_name != c.name
       OR o.customer_email != c.email;

    -- Log results
    INSERT INTO consistency_check_results (check_name, table_name, inconsistent_rows, sample_ids)
    VALUES ('order_customer_data', 'orders', inconsistent_count, sample_order_ids);

    -- Alert if threshold exceeded
    IF inconsistent_count > 0 THEN
        PERFORM pg_notify('consistency_alert',
            FORMAT('Found %s inconsistent rows in orders.customer_* fields', inconsistent_count));
    END IF;
END;
$$ LANGUAGE plpgsql;

-- Schedule to run hourly (requires the pg_cron extension)
SELECT cron.schedule('order-customer-consistency-check', '0 * * * *',
    'SELECT check_order_customer_consistency()');
```

Layer 3: Automated Repair
When inconsistencies are detected, repair them automatically or flag for review:
```sql
-- Automated repair: Fix detected inconsistencies

CREATE OR REPLACE FUNCTION repair_order_customer_inconsistencies()
RETURNS TABLE (
    order_id INT,
    old_customer_name VARCHAR,
    new_customer_name VARCHAR,
    repaired_at TIMESTAMP
) AS $$
DECLARE
    rows_repaired INT;
BEGIN
    RETURN QUERY
    WITH inconsistent AS (
        SELECT o.order_id,
               o.customer_name AS old_name,
               c.name AS correct_name,
               o.customer_email AS old_email,
               c.email AS correct_email
        FROM orders o
        JOIN customers c ON o.customer_id = c.customer_id
        WHERE o.customer_name != c.name
           OR o.customer_email != c.email
    ),
    repaired AS (
        UPDATE orders o
        SET customer_name = i.correct_name,
            customer_email = i.correct_email
        FROM inconsistent i
        WHERE o.order_id = i.order_id
        RETURNING o.order_id, i.old_name, i.correct_name
    )
    SELECT r.order_id, r.old_name, r.correct_name, CURRENT_TIMESTAMP::TIMESTAMP
    FROM repaired r;

    -- Log repair action (ROW_COUNT reflects the query run by RETURN QUERY;
    -- the CTE above is out of scope once that statement completes)
    GET DIAGNOSTICS rows_repaired = ROW_COUNT;
    INSERT INTO data_repair_log (repair_type, rows_repaired, repaired_at)
    VALUES ('order_customer_sync', rows_repaired, CURRENT_TIMESTAMP);
END;
$$ LANGUAGE plpgsql;

-- Run repair with limits for safety
CREATE OR REPLACE FUNCTION safe_repair_order_customers(max_repairs INT DEFAULT 1000)
RETURNS INT AS $$
DECLARE
    repaired_count INT;
BEGIN
    WITH limited_inconsistent AS (
        SELECT o.order_id
        FROM orders o
        JOIN customers c ON o.customer_id = c.customer_id
        WHERE o.customer_name != c.name
        LIMIT max_repairs
    ),
    repaired AS (
        UPDATE orders o
        SET customer_name = c.name,
            customer_email = c.email
        FROM customers c
        WHERE o.customer_id = c.customer_id
          AND o.order_id IN (SELECT order_id FROM limited_inconsistent)
        RETURNING o.order_id
    )
    SELECT COUNT(*) INTO repaired_count FROM repaired;

    RETURN repaired_count;
END;
$$ LANGUAGE plpgsql;
```

No single consistency mechanism is foolproof. Triggers can be bypassed by direct SQL. Batch jobs can fail. Network issues can interrupt transactions. Layer multiple protections: transactional integrity as primary, verification as detection, repair as correction, and alerting as awareness.
When source data changes, duplicated data must be updated—but this can impact many rows. Managing this update fan-out is crucial for write performance.
Understanding update fan-out:
If a customer has 1,000 orders, updating the customer's name means updating 1,000 order rows. If a product is in 100,000 order items, a product name change affects 100,000 rows. This fan-out ratio determines the write amplification cost of duplication.
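Fan-out also determines how a propagation job behaves once you batch it. The sketch below (hypothetical numbers; batch timing and pause values are assumptions) estimates how many batches a given fan-out produces and roughly how long the batched update runs:

```typescript
// Hypothetical sketch: batch count and duration for a fan-out update.

// Rows rewritten when one source row changes equals its fan-out;
// batching splits that into ceil(fanOut / batchSize) transactions.
function batchesNeeded(fanOut: number, batchSize: number): number {
  return Math.ceil(fanOut / batchSize);
}

// Rough wall-clock estimate: per-batch work plus an inter-batch pause
// (the pause reduces lock pressure on concurrent transactions).
function estimatedDurationMs(
  fanOut: number, batchSize: number, perBatchMs: number, pauseMs: number
): number {
  const n = batchesNeeded(fanOut, batchSize);
  return n * perBatchMs + Math.max(0, n - 1) * pauseMs;
}

// A popular product appearing in 100,000 order items, batch size 1,000:
const popularProductBatches = batchesNeeded(100_000, 1_000); // 100 batches
// Assuming ~50 ms per batch and a 100 ms pause between batches:
const popularProductMs = estimatedDurationMs(100_000, 1_000, 50, 100); // ~15 s
// A typical customer with 12 orders fits in a single batch:
const typicalCustomerBatches = batchesNeeded(12, 1_000); // 1 batch
```

Numbers like these tell you whether a rename propagates in milliseconds or minutes, and therefore whether synchronous, batched, or queued propagation is appropriate.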
```sql
-- Analyze fan-out ratio for different source tables

-- Customer fan-out: How many orders per customer?
SELECT PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY order_count) AS median_orders,
       PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY order_count) AS p95_orders,
       MAX(order_count) AS max_orders,
       AVG(order_count) AS avg_orders
FROM (
    SELECT customer_id, COUNT(*) AS order_count
    FROM orders
    GROUP BY customer_id
) customer_orders;

-- Product fan-out: How many order items per product?
SELECT PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY item_count) AS median_items,
       PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY item_count) AS p95_items,
       MAX(item_count) AS max_items
FROM (
    SELECT product_id, COUNT(*) AS item_count
    FROM order_items
    GROUP BY product_id
) product_items;

-- Identify high-fanout entities that would cause massive updates
SELECT p.product_id, p.name, COUNT(*) AS order_item_count
FROM products p
JOIN order_items oi ON p.product_id = oi.product_id
GROUP BY p.product_id, p.name
HAVING COUNT(*) > 10000
ORDER BY order_item_count DESC;
```

Strategy 1: Batched Updates
For high fan-out updates, process in batches to avoid long-running transactions and lock contention:
```sql
-- Batched update: Process in manageable chunks
-- Note: COMMIT is not allowed inside a function, only inside a procedure
-- (PostgreSQL 11+), so this is written as a procedure and invoked with CALL.

CREATE OR REPLACE PROCEDURE update_product_name_in_order_items(
    p_product_id INT,
    p_new_name VARCHAR,
    p_batch_size INT DEFAULT 1000
)
LANGUAGE plpgsql AS $$
DECLARE
    total_updated INT := 0;
    batch_updated INT;
BEGIN
    LOOP
        -- Update one batch
        WITH batch AS (
            SELECT order_item_id
            FROM order_items
            WHERE product_id = p_product_id
              AND product_name != p_new_name
            LIMIT p_batch_size
            FOR UPDATE SKIP LOCKED  -- Avoid blocking concurrent transactions
        )
        UPDATE order_items
        SET product_name = p_new_name
        WHERE order_item_id IN (SELECT order_item_id FROM batch);

        GET DIAGNOSTICS batch_updated = ROW_COUNT;
        total_updated := total_updated + batch_updated;

        -- Exit when no more rows to update
        EXIT WHEN batch_updated = 0;

        -- Commit each batch to keep transactions short
        COMMIT;

        -- Brief pause between batches to reduce lock pressure
        PERFORM pg_sleep(0.1);  -- 100ms
    END LOOP;

    RAISE NOTICE 'Updated % rows in total', total_updated;
END;
$$;
```

Strategy 2: Async Queue-Based Updates
Decouple source updates from duplicate propagation using a message queue:
```typescript
// Async queue-based duplicate propagation
// (MessageQueue and Database are application-provided abstractions)

interface DataChangeEvent {
  sourceTable: string;
  sourceId: number;
  changedFields: Record<string, { old: any; new: any }>;
  timestamp: Date;
}

class DuplicateDataSynchronizer {
  private queue: MessageQueue;
  private db: Database;

  // Source table update triggers this
  async onSourceDataChange(event: DataChangeEvent) {
    // Enqueue for async processing
    await this.queue.publish('duplicate-sync', event);
  }

  // Worker processes queue messages
  async processDuplicateSync(event: DataChangeEvent) {
    const syncConfig = this.getSyncConfig(event.sourceTable);

    for (const target of syncConfig.targets) {
      await this.batchUpdateTarget(
        target.table,
        target.foreignKey,
        event.sourceId,
        this.mapFields(event.changedFields, target.fieldMapping)
      );
    }
  }

  private async batchUpdateTarget(
    table: string,
    foreignKey: string,
    sourceId: number,
    updates: Record<string, any>,
    batchSize: number = 500
  ) {
    // Keyset pagination: track the last id updated rather than an offset.
    // UPDATE has no ORDER BY/LIMIT in PostgreSQL, so the batch is
    // selected in a subquery.
    let lastId = 0;
    let updated = 0;

    do {
      const setClause = Object.keys(updates)
        .map((k, i) => `${k} = $${i + 3}`)
        .join(', ');

      const result = await this.db.execute(`
        UPDATE ${table}
        SET ${setClause}, sync_updated_at = NOW()
        WHERE id IN (
          SELECT id FROM ${table}
          WHERE ${foreignKey} = $1 AND id > $2
          ORDER BY id
          LIMIT ${batchSize}
        )
        RETURNING id
      `, [sourceId, lastId, ...Object.values(updates)]);

      updated = result.rowCount;
      if (updated > 0) {
        lastId = Math.max(...result.rows.map((r: any) => r.id));
      }

      // Yield to allow other operations
      await new Promise(resolve => setImmediate(resolve));
    } while (updated === batchSize);
  }
}
```

For display-only duplicated data, eventual consistency is often acceptable. A customer's name showing as 'John' for a few seconds after they changed it to 'Jonathan' rarely causes problems. Design your synchronization for the actual consistency requirements, not theoretical perfection.
Let's examine a complete implementation of foreign key data duplication for an e-commerce order display system.
Requirement: Display order history with customer name, shipping address, and product details without joins for every page load.
Before optimization:
```sql
-- Original normalized query for order display
-- Requires 5 table joins

SELECT o.order_id,
       o.order_date,
       o.status,
       c.name AS customer_name,
       c.email AS customer_email,
       sa.street AS shipping_street,
       sa.city AS shipping_city,
       sa.state AS shipping_state,
       sa.postal_code AS shipping_postal,
       oi.quantity,
       oi.unit_price,
       p.name AS product_name,
       p.sku AS product_sku,
       cat.name AS category_name
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
JOIN shipping_addresses sa ON o.shipping_address_id = sa.address_id
JOIN order_items oi ON o.order_id = oi.order_id
JOIN products p ON oi.product_id = p.product_id
JOIN categories cat ON p.category_id = cat.category_id
WHERE o.customer_id = $1
ORDER BY o.order_date DESC
LIMIT 50;

-- EXPLAIN ANALYZE result:
-- Hash Join (cost=1234.56..5678.90)
--   -> Nested Loop (cost=...)
--     -> Index Scan on orders...
--   -> Hash (cost=...)
--     -> Seq Scan on customers...
-- Planning Time: 8.234 ms
-- Execution Time: 156.789 ms
```

After optimization with duplicated data:
```sql
-- Schema with duplicated foreign key data

CREATE TABLE orders (
    order_id SERIAL PRIMARY KEY,
    customer_id INT NOT NULL REFERENCES customers(customer_id),
    shipping_address_id INT NOT NULL REFERENCES shipping_addresses(address_id),
    -- Duplicated customer data (point-in-time for legal/audit)
    customer_name_snapshot VARCHAR(100) NOT NULL,
    customer_email_snapshot VARCHAR(255) NOT NULL,
    -- Duplicated shipping address (point-in-time - where it was shipped)
    shipping_street VARCHAR(255) NOT NULL,
    shipping_city VARCHAR(100) NOT NULL,
    shipping_state VARCHAR(50) NOT NULL,
    shipping_postal_code VARCHAR(20) NOT NULL,
    order_date TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    status VARCHAR(20) DEFAULT 'pending'
);

CREATE TABLE order_items (
    order_item_id SERIAL PRIMARY KEY,
    order_id INT NOT NULL REFERENCES orders(order_id),
    product_id INT NOT NULL REFERENCES products(product_id),
    quantity INT NOT NULL,
    -- Duplicated product data (point-in-time - what was ordered)
    unit_price_snapshot DECIMAL(10, 2) NOT NULL,
    product_name_snapshot VARCHAR(255) NOT NULL,
    product_sku_snapshot VARCHAR(50) NOT NULL,
    category_name_snapshot VARCHAR(100) NOT NULL
);

-- Optimized query: single table + child table, no external joins
SELECT o.order_id,
       o.order_date,
       o.status,
       o.customer_name_snapshot AS customer_name,
       o.customer_email_snapshot AS customer_email,
       o.shipping_street,
       o.shipping_city,
       o.shipping_state,
       o.shipping_postal_code,
       oi.quantity,
       oi.unit_price_snapshot AS unit_price,
       oi.product_name_snapshot AS product_name,
       oi.product_sku_snapshot AS product_sku,
       oi.category_name_snapshot AS category_name
FROM orders o
JOIN order_items oi ON o.order_id = oi.order_id
WHERE o.customer_id = $1
ORDER BY o.order_date DESC
LIMIT 50;

-- EXPLAIN ANALYZE result:
-- Nested Loop (cost=0.85..123.45)
--   -> Index Scan using idx_orders_customer_date on orders
--   -> Index Scan using idx_order_items_order_id on order_items
-- Planning Time: 0.456 ms
-- Execution Time: 2.345 ms  -- 67x faster!
```

```sql
-- Order creation procedure: Captures point-in-time snapshots

CREATE OR REPLACE FUNCTION create_order(
    p_customer_id INT,
    p_shipping_address_id INT,
    p_items JSONB  -- Array of {product_id, quantity}
)
RETURNS INT AS $$
DECLARE
    v_order_id INT;
    v_customer RECORD;
    v_address RECORD;
    v_item JSONB;
    v_product RECORD;
BEGIN
    -- Fetch customer data for snapshot
    SELECT name, email INTO v_customer
    FROM customers
    WHERE customer_id = p_customer_id;

    -- Fetch shipping address for snapshot
    SELECT street, city, state, postal_code INTO v_address
    FROM shipping_addresses
    WHERE address_id = p_shipping_address_id;

    -- Create order with snapshots
    INSERT INTO orders (
        customer_id, shipping_address_id,
        customer_name_snapshot, customer_email_snapshot,
        shipping_street, shipping_city, shipping_state, shipping_postal_code
    ) VALUES (
        p_customer_id, p_shipping_address_id,
        v_customer.name, v_customer.email,
        v_address.street, v_address.city, v_address.state, v_address.postal_code
    )
    RETURNING order_id INTO v_order_id;

    -- Create order items with product snapshots
    FOR v_item IN SELECT * FROM jsonb_array_elements(p_items)
    LOOP
        SELECT p.name, p.sku, p.current_price, c.name AS category_name
        INTO v_product
        FROM products p
        JOIN categories c ON p.category_id = c.category_id
        WHERE p.product_id = (v_item->>'product_id')::INT;

        INSERT INTO order_items (
            order_id, product_id, quantity,
            unit_price_snapshot, product_name_snapshot,
            product_sku_snapshot, category_name_snapshot
        ) VALUES (
            v_order_id,
            (v_item->>'product_id')::INT,
            (v_item->>'quantity')::INT,
            v_product.current_price,
            v_product.name,
            v_product.sku,
            v_product.category_name
        );
    END LOOP;

    RETURN v_order_id;
END;
$$ LANGUAGE plpgsql;
```

Query time dropped from 157 ms to 2.3 ms, a 67x improvement. The e-commerce platform can now serve order history pages instantly, improving user experience and reducing database load. The trade-off: slightly larger order tables and snapshot capture during order creation.
Duplicating foreign key data is a powerful technique for eliminating join overhead in read-heavy workloads. When implemented carefully, it can transform query performance while maintaining data integrity.
What's Next:
Duplicating foreign key data eliminates individual joins. But sometimes the right solution is even more aggressive: fully merging related tables to eliminate the relationship entirely. The next page explores merging tables—combining normalized tables into wider structures for maximum read efficiency.
You now understand foreign key data duplication as a denormalization technique. You can identify duplication candidates, implement appropriate patterns, maintain consistency, and handle update fan-out. Next, we explore merging tables for even greater denormalization.