Traditional data modeling teaches us to start with entities and their logical relationships. Customer has Orders. Order contains Products. This entity-first approach produces clean, normalized schemas that accurately represent the domain.
But there's a problem: the database doesn't care about your domain model. The database cares about the queries it must execute. A beautifully normalized schema that requires 8-table JOINs for your most common query is a poorly designed schema.
Access pattern-driven design flips the methodology. Instead of asking 'What entities exist?', we ask 'What questions will we ask of this data?' The queries come first; the schema follows.
This isn't a rejection of proper modeling—it's an evolution. We still understand entities and relationships, but we optimize the physical structure for the access patterns that matter most.
By the end of this page, you will understand how to identify and document access patterns, how to design schemas that optimize for specific query types, how to balance competing access patterns, and how to apply these principles across both SQL and NoSQL databases.
Entity-first design feels intuitive. You examine the domain, identify entities, define their attributes, map relationships, and normalize. The result is a schema that accurately represents the business domain.
But accuracy and performance are different goals.
The Reality of Production Queries:
Consider an e-commerce application. The entity-first schema might look like:
customers, orders, order_items, products, categories, addresses, payments.

This is logically correct. But what happens when you need to display a customer's order history page? You need data from orders, customers, order_items, products, and addresses, all at once.
That's a 5-table JOIN for every page load, on one of your most common user flows.
```sql
-- Entity-first schema: Logically correct, operationally expensive
SELECT o.order_id, o.status, o.created_at,
       c.name AS customer_name, c.email,
       oi.quantity, oi.unit_price,
       p.name AS product_name, p.image_url,
       a.street, a.city, a.state, a.zip_code
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
JOIN order_items oi ON o.order_id = oi.order_id
JOIN products p ON oi.product_id = p.product_id
JOIN addresses a ON o.shipping_address_id = a.address_id
WHERE c.customer_id = 'cust-uuid'
ORDER BY o.created_at DESC
LIMIT 50;

-- Query plan analysis might show:
-- - Seq Scan on orders (if customer_id index missing)
-- - Hash Join on order_items
-- - Hash Join on products
-- - Nested Loop on addresses and customers
-- - Total: 200ms+ on moderate data volumes
```

The entity-first trap is insidious because the schema looks 'clean' and passes all normalization checks. The problem only surfaces in production when query latency climbs. By then, you're dealing with migrations on live data, which is far more expensive than designing for access patterns upfront.
Before designing a schema, you must systematically identify and document your access patterns. This is not guesswork—it requires careful analysis of how the system will be used.
Access Pattern Documentation:
For each major operation, document:
| Operation | Frequency | Latency | Pattern Type | Priority |
|---|---|---|---|---|
| Get product details by ID | 10K/sec | <50ms | Point lookup | Critical |
| List products by category | 5K/sec | <100ms | Range scan + pagination | Critical |
| Search products by keyword | 2K/sec | <200ms | Full-text search | High |
| Get customer order history | 500/sec | <150ms | Range scan | High |
| Calculate daily revenue | 10/day | <30sec | Aggregation | Medium |
| Find products below restock level | 1/hour | <5sec | Range scan | Low |
Priority-Based Design:
Not all access patterns are equal. A query that runs 10,000 times per second with a 50ms SLA demands schema optimization. A nightly batch report can tolerate suboptimal query plans.
Rank access patterns by how frequently they run and how latency-sensitive they are, weighted by their business impact.
Optimize ruthlessly for the top 3-5 patterns. Accept that lower-priority patterns may require more expensive queries—this is an intentional trade-off, not a failure.
```typescript
// Formal access pattern documentation
interface AccessPattern {
  id: string;
  description: string;
  frequency: {
    value: number;
    unit: 'per_second' | 'per_minute' | 'per_hour' | 'per_day';
  };
  latencySLA: {
    p50: number; // milliseconds
    p99: number; // milliseconds
  };
  inputs: string[];
  outputs: string[];
  patternType: 'point_lookup' | 'range_scan' | 'aggregation' | 'search';
  priority: 'critical' | 'high' | 'medium' | 'low';
}

const accessPatterns: AccessPattern[] = [
  {
    id: 'get-product-by-id',
    description: 'Fetch product details for product display page',
    frequency: { value: 10000, unit: 'per_second' },
    latencySLA: { p50: 20, p99: 50 },
    inputs: ['product_id'],
    outputs: ['name', 'description', 'price', 'images', 'category', 'inventory_count'],
    patternType: 'point_lookup',
    priority: 'critical',
  },
  {
    id: 'customer-order-history',
    description: 'List orders with details for customer account page',
    frequency: { value: 500, unit: 'per_second' },
    latencySLA: { p50: 100, p99: 200 },
    inputs: ['customer_id', 'page_number', 'page_size'],
    outputs: ['order_id', 'date', 'status', 'total', 'item_summaries'],
    patternType: 'range_scan',
    priority: 'high',
  },
  // ... more patterns
];
```

Access patterns emerge from use cases, which emerge from product requirements. Don't design in isolation. Work with product managers to understand user flows. Work with frontend engineers to understand what data each page or screen needs. These conversations reveal the true access patterns, not the ones you imagine.
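Once patterns are documented in this form, prioritization can be made mechanical. Below is a minimal sketch that builds on the AccessPattern interface and accessPatterns list above; the specific weighting (requests per second divided by the p99 budget) is an illustrative heuristic of mine, not a standard formula, so adjust it to your own context:

```typescript
// Hypothetical helper: rank documented access patterns by importance.
// "Importance" here is a rough frequency x latency-sensitivity score.

const PER_SECOND: Record<AccessPattern['frequency']['unit'], number> = {
  per_second: 1,
  per_minute: 1 / 60,
  per_hour: 1 / 3600,
  per_day: 1 / 86400,
};

function importanceScore(pattern: AccessPattern): number {
  // Normalize frequency to requests per second
  const requestsPerSecond =
    pattern.frequency.value * PER_SECOND[pattern.frequency.unit];
  // Tighter p99 SLAs count as more latency-sensitive; 1000ms is treated as relaxed
  const latencySensitivity = 1000 / Math.max(pattern.latencySLA.p99, 1);
  return requestsPerSecond * latencySensitivity;
}

// Rank patterns and focus schema work on the top few
const ranked = [...accessPatterns].sort(
  (a, b) => importanceScore(b) - importanceScore(a)
);
console.log(ranked.map((p) => `${p.id}: ${importanceScore(p).toFixed(1)}`));
```

The exact weights matter less than the exercise itself: making the top handful of patterns explicit, and agreed upon, before any schema work begins.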
Each access pattern type suggests specific schema design strategies. Let's examine how to optimize for the most common patterns.
Pattern 1: Point Lookups (Get by ID)
The most common pattern: fetch a single record by its identifier. This should be O(1) or O(log n).
Design principles:

- Key the table on the lookup identifier (primary key or unique index) so retrieval is a single index traversal.
- Denormalize the data the page actually needs (category info, aggregates, main image) into the row, so one query serves the display path.
```sql
-- POINT LOOKUP: Get product by ID
-- Optimize by denormalizing category and aggregates into product row

CREATE TABLE products (
    product_id UUID PRIMARY KEY,
    sku VARCHAR(100) NOT NULL UNIQUE,
    name VARCHAR(255) NOT NULL,
    description TEXT,
    price DECIMAL(10, 2) NOT NULL,

    -- Denormalized: category info embedded
    category_id UUID NOT NULL,
    category_name VARCHAR(100) NOT NULL,
    category_path TEXT,  -- '/electronics/phones/smartphones'

    -- Denormalized: aggregates
    average_rating DECIMAL(3, 2),
    review_count INTEGER DEFAULT 0,
    inventory_count INTEGER DEFAULT 0,

    -- Denormalized: main image for listing pages
    main_image_url TEXT,

    created_at TIMESTAMPTZ DEFAULT NOW(),
    updated_at TIMESTAMPTZ DEFAULT NOW()
);

-- Single query returns everything needed for product page
SELECT * FROM products WHERE product_id = 'uuid';

-- No JOINs needed for the critical product display path
```

Pattern 2: Range Scans (List by Filter)
Fetch multiple records matching criteria, often with pagination. Examples: orders by customer, products by category, messages by thread.
Design principles:

- Build a composite index that matches the query's filter and sort exactly (equality columns first, then the sort column).
- Denormalize the summary fields the list view displays, so pagination never fans out into JOINs.
- Bound result size with LIMIT-based pagination.
```sql
-- RANGE SCAN: List customer orders with pagination
-- Design index to match access pattern exactly

CREATE TABLE orders (
    order_id UUID PRIMARY KEY,
    customer_id UUID NOT NULL,
    status VARCHAR(50) NOT NULL,
    total_amount DECIMAL(12, 2) NOT NULL,
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),

    -- Denormalized for order list display
    item_count INTEGER NOT NULL,
    first_item_name VARCHAR(255),  -- "MacBook Pro and 2 more items"
    first_item_image TEXT
);

-- Composite index matching the exact access pattern
-- customer_id (equality) + created_at (range/sort)
CREATE INDEX idx_orders_customer_date
ON orders (customer_id, created_at DESC);

-- Query uses the index efficiently
SELECT order_id, status, total_amount, created_at,
       item_count, first_item_name, first_item_image
FROM orders
WHERE customer_id = 'cust-uuid'
ORDER BY created_at DESC
LIMIT 20 OFFSET 0;

-- Index-only scan possible if all columns in index or few enough
```

Pattern 3: Aggregations (Analytics Queries)
Compute sums, averages, counts across many records. Examples: daily revenue, user growth, conversion rates.
Design principles:

- Partition large fact tables by time so date-range scans touch only the relevant partitions.
- Precompute aggregates into materialized views or summary tables and refresh them on a schedule.
- Keep heavy analytical work off the transactional hot path.
```sql
-- AGGREGATION: Daily revenue reporting
-- Use partitioning + materialized views for efficient aggregation

-- Partition orders by month for efficient date-range queries
CREATE TABLE orders (
    order_id UUID NOT NULL,
    customer_id UUID NOT NULL,
    total_amount DECIMAL(12, 2) NOT NULL,
    created_at TIMESTAMPTZ NOT NULL,
    PRIMARY KEY (order_id, created_at)
) PARTITION BY RANGE (created_at);

-- Create monthly partitions
CREATE TABLE orders_2024_01 PARTITION OF orders
    FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');
CREATE TABLE orders_2024_02 PARTITION OF orders
    FOR VALUES FROM ('2024-02-01') TO ('2024-03-01');
-- etc.

-- Materialized view for daily aggregates
CREATE MATERIALIZED VIEW daily_revenue AS
SELECT
    DATE(created_at) AS date,
    COUNT(*) AS order_count,
    SUM(total_amount) AS revenue,
    AVG(total_amount) AS average_order_value
FROM orders
WHERE created_at >= CURRENT_DATE - INTERVAL '90 days'
GROUP BY DATE(created_at);

-- CONCURRENTLY refresh requires a unique index on the view
CREATE UNIQUE INDEX idx_daily_revenue_date ON daily_revenue (date);

-- Refresh nightly
REFRESH MATERIALIZED VIEW CONCURRENTLY daily_revenue;

-- Fast analytics queries
SELECT * FROM daily_revenue WHERE date >= CURRENT_DATE - INTERVAL '30 days';
```

If your system requires both transactional (OLTP) and analytical (OLAP) query patterns, consider separating them. Keep a normalized OLTP database for writes and real-time reads, and replicate to a denormalized data warehouse (Snowflake, BigQuery, Redshift) for analytics, with each store optimized for its own access patterns.
Indexes are the primary mechanism for optimizing access patterns without changing table structure. Expert index design can often eliminate the need for denormalization.
The Compound Index Ordering Rule:
For composite indexes, column order matters dramatically. The rule:
Place columns tested with equality (=) first, and columns used for range conditions (<, >, BETWEEN) or sorting last.

This order allows the database to use the index for the maximum number of conditions.
```sql
-- ACCESS PATTERN: Find active products in category, sorted by price
-- Query: WHERE category_id = ? AND status = 'active' ORDER BY price ASC

-- WRONG: Range column (price) before equality columns
CREATE INDEX idx_bad ON products (price, category_id, status);
-- Index can't be used efficiently for equality on category_id

-- CORRECT: Equality first, then sort
CREATE INDEX idx_good ON products (category_id, status, price);
-- Index scan: Find category_id, filter status, already sorted by price

-- QUERY USES INDEX PERFECTLY
EXPLAIN ANALYZE
SELECT product_id, name, price
FROM products
WHERE category_id = 'cat-uuid' AND status = 'active'
ORDER BY price ASC
LIMIT 20;

-- Result: Index Scan using idx_good, no sort operation needed
```

Covering Indexes:
A covering index contains all columns needed by a query, eliminating the need to access the table itself (heap/clustered index). This dramatically improves performance for frequently-run queries.
```sql
-- ACCESS PATTERN: List order summaries for customer
-- Query needs: order_id, status, total, created_at
-- Filter: customer_id = ?

-- Non-covering index: requires heap lookup for each row
CREATE INDEX idx_non_covering ON orders (customer_id);
-- Scan index → get row IDs → lookup heap for each row

-- Covering index: INCLUDE columns in the index leaf nodes
CREATE INDEX idx_covering ON orders (customer_id, created_at DESC)
INCLUDE (order_id, status, total_amount);
-- Scan index → return data directly, no heap access

-- PostgreSQL EXPLAIN shows "Index Only Scan" when covering works
EXPLAIN ANALYZE
SELECT order_id, status, total_amount, created_at
FROM orders
WHERE customer_id = 'cust-uuid'
ORDER BY created_at DESC
LIMIT 10;

-- "Index Only Scan using idx_covering"
-- Heap Fetches: 0 ← This is the goal
```

Never guess about index effectiveness. Run EXPLAIN ANALYZE on your queries to see exactly how the database executes them. Look for 'Seq Scan' on large tables (bad), 'Index Scan' or 'Index Only Scan' (good), and check the actual versus estimated row counts.
Access pattern-driven design becomes even more critical in NoSQL databases, where query flexibility is limited by design. Unlike SQL databases, which can JOIN anything at runtime (at a performance cost), NoSQL databases must have data pre-arranged for the queries you need.
DynamoDB: The Extreme Case
Amazon DynamoDB exemplifies access pattern-driven design. You must define your access patterns before writing a single line of code, because:

- Queries can only target a partition key (optionally filtered and sorted by the sort key); there are no JOINs.
- Every additional query shape needs a secondary index designed around that exact pattern.
- Anything else falls back to a full-table Scan, which is slow and expensive at scale.
This forces rigorous upfront analysis but delivers consistent single-digit millisecond performance at any scale.
```typescript
// DynamoDB single-table design for e-commerce
// All access patterns served by one table with composite keys

interface DynamoDBRecord {
  PK: string;      // Partition key
  SK: string;      // Sort key
  GSI1PK?: string; // Global Secondary Index 1 partition key
  GSI1SK?: string; // Global Secondary Index 1 sort key
  // ... entity-specific attributes
}

// ACCESS PATTERNS AND KEY DESIGN:

// 1. Get customer by ID
// PK: CUSTOMER#<customer_id>   SK: PROFILE
const customerProfile = {
  PK: 'CUSTOMER#cust-123',
  SK: 'PROFILE',
  name: 'John Doe',
  email: 'john@example.com',
  entityType: 'Customer',
};

// 2. List orders for customer (sorted by date)
// PK: CUSTOMER#<customer_id>   SK: ORDER#<timestamp>
const customerOrder = {
  PK: 'CUSTOMER#cust-123',
  SK: 'ORDER#2024-01-15T10:30:00Z#ord-456',
  orderId: 'ord-456',
  status: 'shipped',
  total: 129.99,
  entityType: 'Order',
};

// Query: Get all orders for customer
// Key condition: PK = 'CUSTOMER#cust-123' AND begins_with(SK, 'ORDER#')
// Returns orders sorted by date (SK is sortable)

// 3. Get order details (including items)
// PK: ORDER#<order_id>   SK: METADATA or ITEM#<product_id>
const orderMetadata = {
  PK: 'ORDER#ord-456',
  SK: 'METADATA',
  customerId: 'cust-123',
  status: 'shipped',
  shippingAddress: { /* embedded */ },
  entityType: 'OrderMetadata',
};
const orderItem = {
  PK: 'ORDER#ord-456',
  SK: 'ITEM#prod-789',
  productName: 'Widget Pro',
  quantity: 2,
  unitPrice: 49.99,
  entityType: 'OrderItem',
};

// Query: Get order with all items
// Key condition: PK = 'ORDER#ord-456'
// Returns metadata + all items in one query
```

MongoDB: Embedded Documents
MongoDB offers more flexibility than DynamoDB but still rewards access pattern-driven design. The key question: should related data be embedded (denormalized) or referenced (normalized)?
```typescript
// MongoDB document design based on access patterns
import { ObjectId } from 'mongodb';

// PATTERN: Always fetch user with their preferences
// DESIGN: Embed preferences in user document
interface UserDocument {
  _id: ObjectId;
  email: string;
  name: string;

  // Embedded (1:1, always accessed together)
  preferences: {
    theme: 'light' | 'dark';
    notifications: boolean;
    language: string;
  };

  // Embedded (1:few, bounded growth)
  addresses: Array<{
    type: 'home' | 'work' | 'shipping';
    street: string;
    city: string;
    zipCode: string;
  }>;
}

// PATTERN: Fetch blog post with author name (not full profile)
// DESIGN: Embed author summary, reference full author
interface BlogPostDocument {
  _id: ObjectId;
  title: string;
  content: string;

  // Denormalized author summary for display
  author: {
    _id: ObjectId; // Reference for full profile link
    name: string;
    avatarUrl: string;
  };

  // Referenced (unbounded, accessed separately)
  commentIds: ObjectId[]; // Only IDs, fetch comments separately

  // Or for small numbers, embed recent comments
  recentComments: Array<{
    _id: ObjectId;
    authorName: string;
    content: string;
    createdAt: Date;
  }>;
}

// PATTERN: Show order history with basic details
// DESIGN: Embed line item summaries, reference products
interface OrderDocument {
  _id: ObjectId;
  customerId: ObjectId;
  status: string;
  totalAmount: number;
  createdAt: Date;

  // Embedded snapshot (historical accuracy)
  items: Array<{
    productId: ObjectId;
    productName: string; // Snapshot at order time
    quantity: number;
    unitPrice: number;   // Price at order time
  }>;

  shippingAddress: {
    // Full embedded address snapshot
  };
}
```

With SQL databases, you can often add indexes or rewrite queries to support new access patterns. With NoSQL, changing access patterns may require data migrations, new tables, or complete redesigns. Invest heavily in access pattern analysis before choosing NoSQL.
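To make the payoff of the embedded design concrete, here is a minimal sketch of how the order-history pattern is served with the Node.js MongoDB driver. The connection string, database name ('shop'), collection name ('orders'), and the getOrderHistory helper are placeholders for illustration; OrderDocument is the interface defined above:

```typescript
import { MongoClient, ObjectId } from 'mongodb';

// Hypothetical connection details for this sketch
const client = new MongoClient('mongodb://localhost:27017');
const db = client.db('shop');

// ACCESS PATTERN: customer order history, newest first, one page at a time.
// Because line-item summaries are embedded in each order document,
// a single indexed find() serves the page -- no $lookup / join needed.
async function getOrderHistory(customerId: ObjectId, page = 0, pageSize = 20) {
  return db
    .collection<OrderDocument>('orders')
    .find({ customerId })
    .sort({ createdAt: -1 })
    .skip(page * pageSize)
    .limit(pageSize)
    .project({ status: 1, totalAmount: 1, createdAt: 1, items: 1 })
    .toArray();
}

// The supporting index mirrors the SQL composite index from the
// range-scan pattern (mongosh syntax):
//   db.orders.createIndex({ customerId: 1, createdAt: -1 })
```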
Real systems have multiple access patterns that may conflict. Optimizing for one pattern often suboptimizes another. How do you resolve these tensions?
Strategy 1: Priority-Based Optimization
Rank patterns by importance (frequency × latency sensitivity). Optimize fully for top patterns; accept suboptimal performance for lower-priority patterns.
Strategy 2: Separate Read Models (CQRS)
Command Query Responsibility Segregation maintains separate data stores for reads and writes. The write store is normalized for integrity; read stores are denormalized for specific queries.
```typescript
// CQRS: Separate write and read models

// WRITE MODEL: Normalized, handles all mutations
// Stored in PostgreSQL for ACID guarantees
interface OrderWriteModel {
  orderId: string;
  customerId: string;
  status: string;
  items: Array<{
    productId: string;
    quantity: number;
    unitPrice: number;
  }>;
  createdAt: Date;
}

// READ MODELS: Denormalized for specific use cases

// Read model 1: Customer order list
// Stored in Redis or Elasticsearch
interface CustomerOrdersReadModel {
  customerId: string;
  orders: Array<{
    orderId: string;
    status: string;
    totalAmount: number;
    itemCount: number;
    createdAt: Date;
  }>;
}

// Read model 2: Order detail view
interface OrderDetailReadModel {
  orderId: string;
  customerName: string;
  customerEmail: string;
  status: string;
  items: Array<{
    productName: string;
    productImage: string;
    quantity: number;
    unitPrice: number;
  }>;
  shippingAddress: FullAddress;
  totalAmount: number;
}

// Event handler: Sync read models when orders change
async function onOrderCreated(event: OrderCreatedEvent) {
  // Update write store
  await orderRepository.save(event.order);

  // Project to read model 1
  await customerOrdersCache.addOrder(
    event.order.customerId,
    summarizeOrder(event.order)
  );

  // Project to read model 2
  await orderDetailCache.save(
    event.order.orderId,
    await enrichOrderForDisplay(event.order)
  );
}
```

Strategy 3: Multiple Indexes / GSIs
For SQL databases, multiple indexes can support different access patterns on the same data. For DynamoDB, Global Secondary Indexes (GSIs) allow querying on different key combinations.
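As a sketch of the DynamoDB side, a Global Secondary Index can overlay a second access pattern, such as "recent shipped orders across all customers", onto the single-table design shown earlier. The table name ('app-table'), index name ('GSI1'), the STATUS#-prefixed GSI keys, and the listRecentShippedOrders helper are assumptions for illustration, not part of the earlier design:

```typescript
import { DynamoDBClient } from '@aws-sdk/client-dynamodb';
import { DynamoDBDocumentClient, QueryCommand } from '@aws-sdk/lib-dynamodb';

const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));

// Order items also carry GSI1 keys so the same data answers a second question:
//   GSI1PK: 'STATUS#shipped'
//   GSI1SK: '2024-01-15T10:30:00Z#ord-456'   (sortable by date)
// Attribute names follow the DynamoDBRecord interface above.

// NEW ACCESS PATTERN: recent shipped orders across all customers
async function listRecentShippedOrders(limit = 25) {
  const result = await ddb.send(
    new QueryCommand({
      TableName: 'app-table',
      IndexName: 'GSI1',
      KeyConditionExpression: 'GSI1PK = :status',
      ExpressionAttributeValues: { ':status': 'STATUS#shipped' },
      ScanIndexForward: false, // newest first via the sortable GSI1SK
      Limit: limit,
    })
  );
  return result.Items ?? [];
}
```

One caveat: keying a GSI partition on a single status value concentrates traffic on one partition. At high write volume you would shard GSI1PK (for example, by appending a date bucket), but the querying mechanism stays the same.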
Strategy 4: Polyglot Persistence
Use different database technologies for different access patterns: for example, a relational database for transactional writes, Elasticsearch for full-text search, Redis for low-latency key-value lookups, and a columnar data warehouse for analytics.
| Strategy | Best For | Trade-offs |
|---|---|---|
| Priority-based optimization | Clear hierarchy of importance | Low-priority queries may be slow |
| CQRS | Complex domains, event sourcing | Eventual consistency, complexity |
| Multiple indexes | SQL with varied query patterns | Write overhead, storage cost |
| Polyglot persistence | Extreme scale, specialized requirements | Operational complexity, sync challenges |
Don't adopt CQRS or polyglot persistence prematurely. Start with a well-indexed SQL database. Only add complexity when you have evidence that the simpler approach fails. Many successful systems run entirely on PostgreSQL with thoughtful index design.
Access patterns change over time. New features require new queries. User behavior evolves. How do you design schemas that can adapt?
Principle: Anticipate Change, But Don't Over-Engineer
The goal is a schema that:

- Serves today's highest-priority access patterns efficiently.
- Can absorb new patterns through added indexes or projections rather than full rewrites.
- Avoids speculative complexity for patterns that may never materialize.

Techniques for Evolvable Schemas:

- Flexible attribute columns (JSONB) for category-specific or sparse fields.
- A schema_version column so records can be migrated lazily.
- Expression indexes added later as new query patterns appear.
- Views that give the application a stable interface over a changing physical structure.
```sql
-- EVOLVABLE SCHEMA: Products with flexible attributes

CREATE TABLE products (
    product_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    sku VARCHAR(100) NOT NULL UNIQUE,
    name VARCHAR(255) NOT NULL,
    description TEXT,
    price DECIMAL(10, 2) NOT NULL,
    category_id UUID NOT NULL,

    -- JSONB for category-specific attributes
    -- Electronics: { "screenSize": "15.6", "cpu": "Intel i7", "ram": "16GB" }
    -- Clothing:    { "size": "M", "color": "blue", "material": "cotton" }
    attributes JSONB DEFAULT '{}',

    -- Metadata for record evolution
    schema_version INTEGER DEFAULT 1,

    created_at TIMESTAMPTZ DEFAULT NOW(),
    updated_at TIMESTAMPTZ DEFAULT NOW()
);

-- GIN index for JSONB queries
CREATE INDEX idx_products_attributes ON products USING GIN (attributes);

-- Query specific attributes
SELECT * FROM products
WHERE attributes->>'screenSize' = '15.6'
  AND category_id = 'electronics-uuid';

-- Later: Add new access pattern with just an index
-- "Find products by color across all categories"
CREATE INDEX idx_products_color ON products ((attributes->>'color'));

-- VIEW abstraction: Application sees stable interface
CREATE VIEW product_catalog AS
SELECT
    product_id,
    sku,
    name,
    price,
    category_id,
    attributes->>'mainImage' AS main_image,
    COALESCE((attributes->>'reviewCount')::integer, 0) AS review_count
FROM products;

-- Application queries the view
-- Later: change underlying structure without breaking queries
```

Don't build for access patterns you might need. Build for patterns you know you need, with reasonable flexibility for evolution. The cost of over-engineering is real: complexity, maintenance burden, and slower development velocity. When new patterns emerge, adapt; migrations are a normal part of system evolution.
Access pattern-driven design is the discipline of designing data structures around the queries they must serve, not just the entities they represent.
What's Next:
You now understand how to design schemas that efficiently serve their queries. Next, we'll explore Schema Evolution: the practices and patterns for changing your data model safely in production systems with millions of records, as systems grow and requirements change.