Schema design is where abstract system requirements become concrete data structures. In system design interviews, your schema design reveals your understanding of data modeling principles, performance optimization, and long-term maintainability. A poor schema creates technical debt that compounds over years; a well-designed schema enables feature velocity and operational simplicity.
Unlike code refactoring, schema changes are expensive and risky. Adding a column to a billion-row table can take hours and require careful orchestration. Changing a primary key structure might require migrating the entire dataset. These constraints make upfront schema design disproportionately important—and interviewers know it.
By the end of this page, you will master schema design principles for system design interviews. You'll understand how to model entities and relationships, make informed normalization decisions, design for performance, plan for evolution, and adapt principles across different database types.
Regardless of database type, schema design follows universal principles rooted in clarity, efficiency, and adaptability. Let's establish the foundational concepts before diving into specific patterns.
The Three Schema Design Goals
Every schema should optimize for these three (often competing) goals:
Data Integrity: The schema prevents invalid states. Constraints, types, and relationships ensure data correctness at the database level rather than relying on application logic.
Query Performance: The schema supports efficient access patterns. Data is organized, indexed, and (where appropriate) denormalized to minimize I/O and computation for common queries.
Evolvability: The schema can change without catastrophic migrations. New features can be added without rewriting the data layer or requiring downtime.
These goals often conflict. Maximum integrity (full normalization) can hurt performance. Maximum performance (aggressive denormalization) can hurt integrity and evolvability. Great schema design finds the right balance for your specific use case.
Entity Identification
The first step in schema design is identifying the entities your system manages. An entity is a distinct 'thing' with its own identity, its own attributes, and a lifecycle independent of other entities; other parts of the system reference it by that identity.
In an E-commerce System: users, products, orders, order items, and payments are natural entities.
In a Social Media System: users, posts, comments, and follow relationships are natural entities.
The normalization-denormalization spectrum is the most critical schema design decision. Understanding when to normalize (eliminate redundancy) and when to denormalize (introduce controlled redundancy) separates junior from senior engineers.
Denormalization Patterns
Denormalization isn't random data duplication—it's strategic redundancy to optimize specific access patterns.
1. Embedding (for Documents)
// Instead of separate orders and items tables
{
"orderId": "ord_123",
"userId": "usr_456",
"items": [
{ "productId": "prd_789", "name": "Widget", "quantity": 2, "price": 29.99 },
{ "productId": "prd_012", "name": "Gadget", "quantity": 1, "price": 49.99 }
],
"totalAmount": 109.97
}
2. Computed Columns
-- Store computed values that would require expensive aggregation
ALTER TABLE users ADD COLUMN post_count INT DEFAULT 0;
ALTER TABLE users ADD COLUMN follower_count INT DEFAULT 0;
-- Updated asynchronously via triggers or application logic
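One way to keep a computed column consistent is a database trigger. A minimal runnable sketch using SQLite (the `users`/`posts` schema here is a simplified stand-in for the tables above):

```python
import sqlite3

# Keep the denormalized users.post_count in sync via a trigger,
# so reads never pay for a COUNT(*) aggregation.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, post_count INT DEFAULT 0);
CREATE TABLE posts (id INTEGER PRIMARY KEY, user_id INT REFERENCES users(id), body TEXT);

-- Trigger updates the counter on every insert into posts
CREATE TRIGGER posts_count_inc AFTER INSERT ON posts
BEGIN
    UPDATE users SET post_count = post_count + 1 WHERE id = NEW.user_id;
END;
""")
conn.execute("INSERT INTO users (id, name) VALUES (1, 'alice')")
conn.execute("INSERT INTO posts (user_id, body) VALUES (1, 'first post')")
conn.execute("INSERT INTO posts (user_id, body) VALUES (1, 'second post')")
count = conn.execute("SELECT post_count FROM users WHERE id = 1").fetchone()[0]
print(count)  # 2
```

Triggers pay the consistency tax synchronously on the write path; the async alternatives (queues, batch reconciliation) trade freshness for write latency.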
3. Materialized Views
-- Pre-join tables for read-heavy reporting
CREATE MATERIALIZED VIEW order_summary AS
SELECT
u.name as customer_name,
o.id as order_id,
SUM(i.quantity * i.price) as total,
COUNT(i.id) as item_count
FROM users u
JOIN orders o ON u.id = o.user_id
JOIN order_items i ON o.id = i.order_id
GROUP BY u.id, o.id;
-- Refresh periodically or on-change
REFRESH MATERIALIZED VIEW order_summary;
Every piece of denormalized data requires a mechanism to stay consistent with its source of truth. Evaluate this 'consistency tax' when deciding to denormalize: triggers, application logic, async workers, or eventual consistency acceptance.
Relational databases remain the most common choice for transactional systems. Mastering relational schema patterns is essential for system design interviews.
Primary Key Selection
The primary key choice has cascading effects on performance, scalability, and operations.
Auto-Increment Integers
CREATE TABLE users (
id SERIAL PRIMARY KEY, -- PostgreSQL
-- or
id INT AUTO_INCREMENT PRIMARY KEY -- MySQL
);
UUIDs
CREATE TABLE users (
id UUID PRIMARY KEY DEFAULT gen_random_uuid()
);
ULIDs / Snowflake IDs
-- ULID: Lexicographically sortable, time-ordered
CREATE TABLE events (
id VARCHAR(26) PRIMARY KEY -- ULID string format
);
-- Snowflake: 64-bit time-sortable
CREATE TABLE tweets (
id BIGINT PRIMARY KEY -- Snowflake ID
);
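To make the trade-off concrete, here is a minimal Snowflake-style generator sketch. The bit layout (41-bit millisecond timestamp, 10-bit worker ID, 12-bit sequence) matches Twitter's published scheme, but the custom epoch and worker ID here are illustrative assumptions:

```python
import threading
import time

EPOCH_MS = 1_600_000_000_000  # custom epoch (assumption)

class SnowflakeGen:
    """64-bit time-sortable IDs: 41b timestamp | 10b worker | 12b sequence."""

    def __init__(self, worker_id: int):
        assert 0 <= worker_id < 1024  # must fit in 10 bits
        self.worker_id = worker_id
        self.seq = 0
        self.last_ms = -1
        self.lock = threading.Lock()

    def next_id(self) -> int:
        with self.lock:
            now = int(time.time() * 1000) - EPOCH_MS
            if now == self.last_ms:
                self.seq = (self.seq + 1) & 0xFFF  # 12-bit sequence within one ms
            else:
                self.seq = 0
                self.last_ms = now
            return (now << 22) | (self.worker_id << 12) | self.seq

gen = SnowflakeGen(worker_id=7)
a, b = gen.next_id(), gen.next_id()
print(a < b)  # time-sortable: later IDs compare greater -> True
```

Because the timestamp occupies the high bits, these IDs sort by creation time, which keeps B-tree inserts append-friendly, unlike random UUIDv4s.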
In interviews, recommend UUIDs or ULIDs for distributed systems and auto-increment for simpler, single-node deployments. Always explain the trade-offs to demonstrate depth.
NoSQL databases require different schema thinking. Instead of normalizing, you design around access patterns. This section covers patterns for document, key-value, and wide-column stores.
Document Database Design Principles
// Blog post with embedded comments (access pattern: read post with comments)
{
"_id": "post_123",
"title": "Schema Design Best Practices",
"author": {
"_id": "user_456",
"name": "Jane Smith", // Denormalized at write time
"avatar": "https://..."
},
"comments": [
{ "userId": "user_789", "text": "Great post!", "createdAt": "..." },
{ "userId": "user_012", "text": "Very helpful", "createdAt": "..." }
],
"commentCount": 2 // Denormalized counter
}
// Product with reference to category (category changes independently)
{
"_id": "prod_123",
"name": "Widget Pro",
"categoryId": "cat_456", // Reference, not embedding
"price": 29.99
}
// Category document
{
"_id": "cat_456",
"name": "Electronics",
"description": "..."
}
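The reference pattern trades one read for two: fetch the product, then resolve its `categoryId` with a second lookup. A sketch of this application-side join, with plain dicts standing in for document collections (the data mirrors the example above):

```python
# In-memory stand-ins for two document collections
products = {"prod_123": {"name": "Widget Pro", "categoryId": "cat_456", "price": 29.99}}
categories = {"cat_456": {"name": "Electronics", "description": "..."}}

def get_product_with_category(product_id: str) -> dict:
    """Application-side 'join': one extra lookup per referenced document."""
    product = dict(products[product_id])          # copy so we don't mutate the store
    product["category"] = categories[product["categoryId"]]
    return product

result = get_product_with_category("prod_123")
print(result["category"]["name"])  # Electronics
```

The upside of the reference: renaming the category is a single-document update. The embedded-author pattern in the blog-post example above would instead require touching every post by that author.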
// Instead of one document per sensor reading
// Bucket readings by hour
{
"_id": "sensor_123_2024011514", // sensor + hour bucket
"sensorId": "sensor_123",
"bucketStart": "2024-01-15T14:00:00Z",
"readings": [
{ "t": "2024-01-15T14:00:12Z", "value": 23.5 },
{ "t": "2024-01-15T14:00:24Z", "value": 23.7 },
// ... up to 60 readings per document
],
"count": 60,
"sum": 1412.3, // Pre-aggregated
"min": 22.1,
"max": 24.8
}
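The write path for the bucket pattern can be sketched as follows: each reading is appended to its sensor-plus-hour document, and the pre-aggregates are maintained incrementally on write (field names follow the document above; the in-memory dict stands in for a collection with upserts):

```python
from datetime import datetime, timezone

buckets: dict[str, dict] = {}  # stands in for a document collection

def record(sensor_id: str, ts: datetime, value: float) -> None:
    key = f"{sensor_id}_{ts:%Y%m%d%H}"  # sensor + hour bucket
    doc = buckets.setdefault(key, {
        "_id": key, "sensorId": sensor_id,
        "bucketStart": ts.replace(minute=0, second=0, microsecond=0),
        "readings": [], "count": 0, "sum": 0.0,
        "min": float("inf"), "max": float("-inf"),
    })
    # Append the raw reading and maintain pre-aggregates incrementally
    doc["readings"].append({"t": ts, "value": value})
    doc["count"] += 1
    doc["sum"] += value
    doc["min"] = min(doc["min"], value)
    doc["max"] = max(doc["max"], value)

t = datetime(2024, 1, 15, 14, 0, 12, tzinfo=timezone.utc)
record("sensor_123", t, 23.5)
record("sensor_123", t.replace(second=24), 23.7)
doc = buckets["sensor_123_2024011514"]
print(doc["count"], doc["min"], doc["max"])  # 2 23.5 23.7
```

In MongoDB the same logic would be a single upsert with `$push` and `$inc`, turning thousands of tiny documents into a handful of hourly ones.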
Key-Value Access Pattern Modeling
Key-value stores require encoding access patterns into key structure:
# User profile lookup
Key: user:{userId}
Value: {entire user JSON}
# User's posts (sorted)
Key: user:{userId}:posts
Value: [sorted set of post IDs]
# Session storage
Key: session:{sessionId}
Value: {session data}
TTL: 3600
# Rate limiting
Key: ratelimit:{userId}:{minute}
Value: request count (integer)
TTL: 60
# Leaderboard
Key: leaderboard:daily:{date}
Value: sorted set (userId -> score)
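The `ratelimit:{userId}:{minute}` key above implements a fixed-window counter. A minimal in-process sketch of the same logic (a production deployment would use Redis `INCR` plus `EXPIRE`; the limit of 5 requests per minute is an illustrative assumption):

```python
import time
from collections import defaultdict

counters: dict[str, int] = defaultdict(int)  # stands in for Redis
LIMIT_PER_MINUTE = 5  # illustrative limit

def allow_request(user_id: str, now: float = None) -> bool:
    if now is None:
        now = time.time()
    minute = int(now // 60)
    key = f"ratelimit:{user_id}:{minute}"   # Redis: INCR key; EXPIRE key 60
    counters[key] += 1
    return counters[key] <= LIMIT_PER_MINUTE

fixed_now = 1_700_000_000.0  # pin the clock so the window can't roll over mid-run
results = [allow_request("usr_456", now=fixed_now) for _ in range(6)]
print(results)  # [True, True, True, True, True, False]
```

The TTL makes cleanup automatic: once the minute window passes, the key simply expires, so stale counters never accumulate.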
DynamoDB Single-Table Design
DynamoDB best practices often involve storing multiple entity types in one table:
| PK | SK | Type | Data |
|---|---|---|---|
| USER#123 | USER#123 | User | {name, email, ...} |
| USER#123 | ORDER#456 | Order | {total, status, ...} |
| USER#123 | ORDER#456#ITEM#1 | OrderItem | {productId, qty, ...} |
| ORDER#456 | ORDER#456 | Order | {duplicate for order lookup} |
| PRODUCT#789 | PRODUCT#789 | Product | {name, price, ...} |
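The table above supports "get a user and all their order data" with a single query on the partition key. A sketch of that access pattern over an in-memory item list, where `startswith` mimics DynamoDB's `begins_with` sort-key condition (the rows mirror the table above):

```python
# Items mirror the single-table rows above: one (PK, SK) pair per item
items = [
    {"PK": "USER#123", "SK": "USER#123", "Type": "User", "Data": {"name": "Ann"}},
    {"PK": "USER#123", "SK": "ORDER#456", "Type": "Order", "Data": {"total": 109}},
    {"PK": "USER#123", "SK": "ORDER#456#ITEM#1", "Type": "OrderItem", "Data": {"qty": 2}},
    {"PK": "PRODUCT#789", "SK": "PRODUCT#789", "Type": "Product", "Data": {"price": 29}},
]

def query(pk: str, sk_prefix: str = "") -> list[dict]:
    """Mimics Query(PK = :pk AND begins_with(SK, :prefix))."""
    return [i for i in items if i["PK"] == pk and i["SK"].startswith(sk_prefix)]

# One request fetches the user plus all order-related items in that partition
user_partition = query("USER#123")
orders_only = query("USER#123", sk_prefix="ORDER#")
print(len(user_partition), len(orders_only))  # 3 2
```

The hierarchical sort key (`ORDER#456#ITEM#1`) is what lets one prefix query return an order together with its items, replacing a relational join.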
When using NoSQL in interviews, demonstrate that you understand the trade-offs. Show that you're modeling for specific access patterns and acknowledge what becomes harder (ad-hoc queries, joins, consistency).
Indexes are the most impactful performance lever in database design. A schema without indexing strategy is incomplete. In interviews, proactively discussing indexes demonstrates production experience.
| Index Type | Use Case | Example |
|---|---|---|
| B-Tree (default) | Equality and range queries | WHERE created_at > '2024-01-01' |
| Hash | Exact equality only | WHERE id = 123 |
| GiST/SP-GiST | Geometric, full-text, range types | WHERE location && box |
| GIN | Array contains, full-text | WHERE tags @> '{postgresql}' |
| BRIN | Large, naturally ordered tables | Append-only log tables |
| Partial | Subset of rows matching condition | WHERE status = 'active' |
| Composite | Multi-column queries | WHERE user_id = ? AND created_at > ? |
| Covering | Index contains all query columns | SELECT id, name FROM users |
Composite Index Column Order
Column order in composite indexes is critical. Follow the ESR Rule (Equality, Sort, Range):
-- Query pattern
SELECT * FROM orders
WHERE status = 'shipped' -- Equality
AND region = 'US' -- Equality
AND created_at > '2024-01-01' -- Range
ORDER BY created_at DESC; -- Sort
-- Optimal index
CREATE INDEX idx_orders_status_region_created
ON orders(status, region, created_at DESC);
-- Why this order?
-- 1. status (E): Exact match, narrows to subset
-- 2. region (E): Exact match, further narrows
-- 3. created_at (S/R): Supports both sort and range
Covering Indexes (Index-Only Scans)
-- Query only needs these columns
SELECT user_id, created_at FROM posts WHERE user_id = 123;
-- Covering index avoids table access entirely
CREATE INDEX idx_posts_user_covering
ON posts(user_id) INCLUDE (created_at);
-- All data comes from index, no heap fetch
A common pitfall: WHERE LOWER(email) = ? won't use a plain index on email; create an expression index on LOWER(email) instead.
Production schemas must evolve without downtime. Designing for evolution from day one prevents painful migrations later. Interviewers appreciate candidates who think about operational realities.
Backward-Compatible Changes (Safe)
These changes can be applied without coordinated deployment:
-- Adding nullable column (safe)
ALTER TABLE users ADD COLUMN middle_name VARCHAR(100);
-- Adding column with default (safe, but may lock in some DBs)
ALTER TABLE users ADD COLUMN status VARCHAR(20) DEFAULT 'active';
-- Adding new index (safe, may be slow on large tables)
CREATE INDEX CONCURRENTLY idx_users_status ON users(status);
-- Adding new table (safe)
CREATE TABLE user_preferences (...);
Breaking Changes (Require Coordination)
-- Removing a column (breaks existing queries)
ALTER TABLE users DROP COLUMN legacy_field;
-- Renaming column (breaks existing queries)
ALTER TABLE users RENAME COLUMN name TO full_name;
-- Changing type (may fail or truncate)
ALTER TABLE users ALTER COLUMN age TYPE SMALLINT;
-- Adding NOT NULL (fails if NULLs exist)
ALTER TABLE users ALTER COLUMN email SET NOT NULL;
The Expand-Contract Pattern
For breaking changes, use the expand-contract (parallel change) pattern:
Phase 1: EXPAND
├── Add new column (nullable)
├── Deploy code that writes to BOTH columns
└── Backfill old data to new column
Phase 2: MIGRATE
├── Deploy code that reads from NEW column
├── Verify correctness
└── Monitor for issues
Phase 3: CONTRACT
├── Deploy code that only writes to NEW column
├── Drop old column (or defer indefinitely)
└── Clean up
Example: Renaming 'name' to 'full_name'
-- Phase 1: Expand
ALTER TABLE users ADD COLUMN full_name VARCHAR(255);
UPDATE users SET full_name = name WHERE full_name IS NULL;
-- Phase 2: Application writes to both columns
-- Application reads from full_name (with fallback to name)
-- Phase 3: Contract (after verification)
ALTER TABLE users DROP COLUMN name;
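The expand and migrate phases can be rehearsed end-to-end in code: add the nullable column, backfill, then dual-write while old readers keep working. A runnable SQLite sketch of phases 1 and 2 (the phase-3 drop is deferred; the row data is illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO users (name) VALUES ('Ada Lovelace')")

# Phase 1: Expand -- add nullable column, backfill existing rows
conn.execute("ALTER TABLE users ADD COLUMN full_name TEXT")
conn.execute("UPDATE users SET full_name = name WHERE full_name IS NULL")

# Phase 2: Migrate -- new code dual-writes both columns and
# reads full_name with a fallback to name
conn.execute("INSERT INTO users (name, full_name) VALUES ('Grace Hopper', 'Grace Hopper')")
rows = conn.execute(
    "SELECT COALESCE(full_name, name) FROM users ORDER BY id"
).fetchall()
names = [r[0] for r in rows]
print(names)  # ['Ada Lovelace', 'Grace Hopper']
```

The COALESCE fallback is what makes the migration safe: even if the backfill is still running, every reader sees a correct value.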
When designing schemas in interviews, mention: 'I'm using nullable columns with defaults to ensure we can add fields without breaking existing code, and I'd use the expand-contract pattern for any breaking changes.' This shows production maturity.
Let's apply schema design principles to a realistic interview question.
Design the database schema for a ride-sharing application like Uber. Consider drivers, riders, trips, locations, payments, and ratings.
Step 1: Identify Entities and Relationships
Relationships:
- A user can be a rider, a driver, or both (flags on the users table)
- A driver owns one or more vehicles
- A trip links one rider to one driver and one vehicle
- A trip has many location breadcrumbs and exactly one payment
- Each trip allows up to two ratings: rider rates driver, driver rates rider
-- Users table (both drivers and riders)
CREATE TABLE users (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    email VARCHAR(255) UNIQUE NOT NULL,
    phone VARCHAR(20) UNIQUE NOT NULL,
    name VARCHAR(255) NOT NULL,
    profile_photo_url TEXT,
    is_driver BOOLEAN DEFAULT FALSE,
    driver_license_verified BOOLEAN DEFAULT FALSE,
    created_at TIMESTAMP DEFAULT NOW(),
    updated_at TIMESTAMP DEFAULT NOW()
);

-- Vehicles (for drivers)
CREATE TABLE vehicles (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    driver_id UUID NOT NULL REFERENCES users(id),
    make VARCHAR(50),
    model VARCHAR(50),
    year INT,
    color VARCHAR(30),
    license_plate VARCHAR(20) UNIQUE,
    vehicle_type VARCHAR(20), -- 'economy', 'comfort', 'premium'
    is_active BOOLEAN DEFAULT TRUE,
    created_at TIMESTAMP DEFAULT NOW()
);
CREATE INDEX idx_vehicles_driver ON vehicles(driver_id);

-- Trips
CREATE TABLE trips (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    rider_id UUID NOT NULL REFERENCES users(id),
    driver_id UUID REFERENCES users(id), -- NULL until matched
    vehicle_id UUID REFERENCES vehicles(id),
    status VARCHAR(20) NOT NULL DEFAULT 'requested',
    -- 'requested', 'matched', 'driver_arriving', 'in_progress', 'completed', 'cancelled'
    pickup_location GEOGRAPHY(POINT, 4326),
    pickup_address TEXT,
    dropoff_location GEOGRAPHY(POINT, 4326),
    dropoff_address TEXT,
    requested_at TIMESTAMP DEFAULT NOW(),
    matched_at TIMESTAMP,
    started_at TIMESTAMP,
    completed_at TIMESTAMP,
    cancelled_at TIMESTAMP,
    cancelled_by UUID REFERENCES users(id),
    estimated_fare_cents INT,
    actual_fare_cents INT,
    distance_meters INT,
    duration_seconds INT
);
CREATE INDEX idx_trips_rider ON trips(rider_id, requested_at DESC);
CREATE INDEX idx_trips_driver ON trips(driver_id, requested_at DESC);
CREATE INDEX idx_trips_status ON trips(status)
    WHERE status IN ('requested', 'matched', 'in_progress');

-- Trip location breadcrumbs (time-series data)
CREATE TABLE trip_locations (
    id BIGSERIAL PRIMARY KEY,
    trip_id UUID NOT NULL REFERENCES trips(id),
    location GEOGRAPHY(POINT, 4326),
    recorded_at TIMESTAMP DEFAULT NOW(),
    speed_mph DECIMAL(5,2),
    heading INT
);
CREATE INDEX idx_trip_locations_trip_time ON trip_locations(trip_id, recorded_at);

-- Payments
CREATE TABLE payments (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    trip_id UUID UNIQUE NOT NULL REFERENCES trips(id),
    rider_id UUID NOT NULL REFERENCES users(id),
    amount_cents INT NOT NULL,
    currency VARCHAR(3) DEFAULT 'USD',
    payment_method_id UUID, -- References payment methods table
    status VARCHAR(20) DEFAULT 'pending',
    stripe_charge_id VARCHAR(255),
    created_at TIMESTAMP DEFAULT NOW(),
    completed_at TIMESTAMP
);

-- Ratings
CREATE TABLE ratings (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    trip_id UUID NOT NULL REFERENCES trips(id),
    rater_id UUID NOT NULL REFERENCES users(id),
    ratee_id UUID NOT NULL REFERENCES users(id),
    rating INT NOT NULL CHECK (rating >= 1 AND rating <= 5),
    comment TEXT,
    created_at TIMESTAMP DEFAULT NOW(),
    UNIQUE(trip_id, rater_id) -- One rating per trip per rater
);
CREATE INDEX idx_ratings_ratee ON ratings(ratee_id);

-- Denormalized: User aggregate ratings (updated async)
ALTER TABLE users ADD COLUMN avg_rating DECIMAL(3,2);
ALTER TABLE users ADD COLUMN total_ratings INT DEFAULT 0;
ALTER TABLE users ADD COLUMN total_trips INT DEFAULT 0;
Key Design Decisions Explained:
UUID primary keys: Globally unique for distributed system, safe for client exposure
Combined users table: Riders and drivers in one table with flags, allows same user to be both
PostGIS geography types: Native geospatial support for location queries, distance calculations
Trip status as enum-like VARCHAR: Readable, flexible for adding new states
Denormalized avg_rating: Expensive to compute in real-time, updated asynchronously
Trip_locations for time-series: Separate high-volume table, could migrate to TimescaleDB if volume warrants
Partial index on active trips: Most queries filter for active status, small index covers common case
Separate ratings table: Allows rider to rate driver AND driver to rate rider per trip
Schema design is where abstract system requirements become concrete data structures. Let's consolidate the key takeaways:
- Balance the three goals: data integrity, query performance, and evolvability
- Normalize by default; denormalize deliberately for specific access patterns, and budget for the consistency tax
- Choose primary keys for your deployment model: auto-increment for single-node simplicity, UUIDs or ULIDs for distributed systems
- In NoSQL, model around access patterns rather than entities
- Design indexes alongside the schema: ESR column ordering, covering indexes, and partial indexes for hot subsets
- Plan for evolution with backward-compatible changes and the expand-contract pattern for breaking ones
What's Next:
With schema design mastered, the next page dives into Scaling Strategies—how to evolve your database architecture as load increases. You'll learn replication, sharding, caching layers, and when to consider distributed databases.
You now have a comprehensive framework for schema design in system design interviews. Remember: great schemas aren't just correct—they're maintainable, performant, and evolvable.