Schema design is where abstract system requirements become concrete data structures. In system design interviews, your schema design reveals your understanding of data modeling principles, performance optimization, and long-term maintainability. A poor schema creates technical debt that compounds over years; a well-designed schema enables feature velocity and operational simplicity.
Unlike code refactoring, schema changes are expensive and risky. Adding a column to a billion-row table can take hours and require careful orchestration. Changing a primary key structure might require migrating the entire dataset. These constraints make upfront schema design disproportionately important—and interviewers know it.
By the end of this page, you will master schema design principles for system design interviews. You'll understand how to model entities and relationships, make informed normalization decisions, design for performance, plan for evolution, and adapt principles across different database types.
Regardless of database type, schema design follows universal principles rooted in clarity, efficiency, and adaptability. Let's establish the foundational concepts before diving into specific patterns.
The Three Schema Design Goals
Every schema should optimize for these three (often competing) goals:
Data Integrity: The schema prevents invalid states. Constraints, types, and relationships ensure data correctness at the database level rather than relying on application logic.
Query Performance: The schema supports efficient access patterns. Data is organized, indexed, and (where appropriate) denormalized to minimize I/O and computation for common queries.
Evolvability: The schema can change without catastrophic migrations. New features can be added without rewriting the data layer or requiring downtime.
These goals often conflict. Maximum integrity (full normalization) can hurt performance. Maximum performance (aggressive denormalization) can hurt integrity and evolvability. Great schema design finds the right balance for your specific use case.
Entity Identification
The first step in schema design is identifying the entities your system manages. An entity is a distinct 'thing' with its own identity, its own attributes, and a lifecycle independent of other entities; other parts of the system reference it by that identity.
In an E-commerce System: users, products, orders, order items, and payments are natural entities.
In a Social Media System: users, posts, comments, and follow relationships are natural entities.
The normalization-denormalization spectrum is the most critical schema design decision. Understanding when to normalize (eliminate redundancy) and when to denormalize (introduce controlled redundancy) separates junior from senior engineers.
Denormalization Patterns
Denormalization isn't random data duplication—it's strategic redundancy to optimize specific access patterns.
1. Embedding (for Documents)
// Instead of separate orders and items tables
{
"orderId": "ord_123",
"userId": "usr_456",
"items": [
{ "productId": "prd_789", "name": "Widget", "quantity": 2, "price": 29.99 },
{ "productId": "prd_012", "name": "Gadget", "quantity": 1, "price": 49.99 }
],
"totalAmount": 109.97
}
2. Computed Columns
-- Store computed values that would require expensive aggregation
ALTER TABLE users ADD COLUMN post_count INT DEFAULT 0;
ALTER TABLE users ADD COLUMN follower_count INT DEFAULT 0;
-- Updated asynchronously via triggers or application logic
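One way to keep a computed column consistent is a database trigger. A minimal runnable sketch using SQLite (the `users`/`posts` schema here is a simplified stand-in for the tables above):

```python
import sqlite3

# Keep the denormalized users.post_count in sync via a trigger,
# so reads never pay for a COUNT(*) aggregation.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, post_count INT DEFAULT 0);
CREATE TABLE posts (id INTEGER PRIMARY KEY, user_id INT REFERENCES users(id), body TEXT);

-- Trigger updates the counter on every insert into posts
CREATE TRIGGER posts_count_inc AFTER INSERT ON posts
BEGIN
    UPDATE users SET post_count = post_count + 1 WHERE id = NEW.user_id;
END;
""")
conn.execute("INSERT INTO users (id, name) VALUES (1, 'alice')")
conn.execute("INSERT INTO posts (user_id, body) VALUES (1, 'first post')")
conn.execute("INSERT INTO posts (user_id, body) VALUES (1, 'second post')")
count = conn.execute("SELECT post_count FROM users WHERE id = 1").fetchone()[0]
print(count)  # 2
```

Triggers pay the consistency tax synchronously on the write path; the async alternatives (queues, batch reconciliation) trade freshness for write latency.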
3. Materialized Views
-- Pre-join tables for read-heavy reporting
CREATE MATERIALIZED VIEW order_summary AS
SELECT
u.name as customer_name,
o.id as order_id,
SUM(i.quantity * i.price) as total,
COUNT(i.id) as item_count
FROM users u
JOIN orders o ON u.id = o.user_id
JOIN order_items i ON o.id = i.order_id
GROUP BY u.id, o.id;
-- Refresh periodically or on-change
REFRESH MATERIALIZED VIEW order_summary;
Every piece of denormalized data requires a mechanism to stay consistent with its source of truth. Evaluate this 'consistency tax' when deciding to denormalize: triggers, application logic, async workers, or eventual consistency acceptance.
Relational databases remain the most common choice for transactional systems. Mastering relational schema patterns is essential for system design interviews.
Primary Key Selection
The primary key choice has cascading effects on performance, scalability, and operations.
Auto-Increment Integers
CREATE TABLE users (
id SERIAL PRIMARY KEY, -- PostgreSQL
-- or
id INT AUTO_INCREMENT PRIMARY KEY -- MySQL
);
UUIDs
CREATE TABLE users (
id UUID PRIMARY KEY DEFAULT gen_random_uuid()
);
ULIDs / Snowflake IDs
-- ULID: Lexicographically sortable, time-ordered
CREATE TABLE events (
id VARCHAR(26) PRIMARY KEY -- ULID string format
);
-- Snowflake: 64-bit time-sortable
CREATE TABLE tweets (
id BIGINT PRIMARY KEY -- Snowflake ID
);
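To make the trade-off concrete, here is a minimal Snowflake-style generator sketch. The bit layout (41-bit millisecond timestamp, 10-bit worker ID, 12-bit sequence) matches Twitter's published scheme, but the custom epoch and worker ID here are illustrative assumptions:

```python
import threading
import time

EPOCH_MS = 1_600_000_000_000  # custom epoch (assumption)

class SnowflakeGen:
    """64-bit time-sortable IDs: 41b timestamp | 10b worker | 12b sequence."""

    def __init__(self, worker_id: int):
        assert 0 <= worker_id < 1024  # must fit in 10 bits
        self.worker_id = worker_id
        self.seq = 0
        self.last_ms = -1
        self.lock = threading.Lock()

    def next_id(self) -> int:
        with self.lock:
            now = int(time.time() * 1000) - EPOCH_MS
            if now == self.last_ms:
                self.seq = (self.seq + 1) & 0xFFF  # 12-bit sequence within one ms
            else:
                self.seq = 0
                self.last_ms = now
            return (now << 22) | (self.worker_id << 12) | self.seq

gen = SnowflakeGen(worker_id=7)
a, b = gen.next_id(), gen.next_id()
print(a < b)  # time-sortable: later IDs compare greater -> True
```

Because the timestamp occupies the high bits, these IDs sort by creation time, which keeps B-tree inserts append-friendly, unlike random UUIDv4s.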
In interviews, recommend UUIDs or ULIDs for distributed systems and auto-increment for simpler, single-node deployments. Always explain the trade-offs to demonstrate depth.
NoSQL databases require different schema thinking. Instead of normalizing, you design around access patterns. This section covers patterns for document, key-value, and wide-column stores.
Document Database Design Principles
// Blog post with embedded comments (access pattern: read post with comments)
{
"_id": "post_123",
"title": "Schema Design Best Practices",
"author": {
"_id": "user_456",
"name": "Jane Smith", // Denormalized at write time
"avatar": "https://..."
},
"comments": [
{ "userId": "user_789", "text": "Great post!", "createdAt": "..." },
{ "userId": "user_012", "text": "Very helpful", "createdAt": "..." }
],
"commentCount": 2 // Denormalized counter
}
// Product with reference to category (category changes independently)
{
"_id": "prod_123",
"name": "Widget Pro",
"categoryId": "cat_456", // Reference, not embedding
"price": 29.99
}
// Category document
{
"_id": "cat_456",
"name": "Electronics",
"description": "..."
}
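The reference pattern trades one read for two: fetch the product, then resolve its `categoryId` with a second lookup. A sketch of this application-side join, with plain dicts standing in for document collections (the data mirrors the example above):

```python
# In-memory stand-ins for two document collections
products = {"prod_123": {"name": "Widget Pro", "categoryId": "cat_456", "price": 29.99}}
categories = {"cat_456": {"name": "Electronics", "description": "..."}}

def get_product_with_category(product_id: str) -> dict:
    """Application-side 'join': one extra lookup per referenced document."""
    product = dict(products[product_id])          # copy so we don't mutate the store
    product["category"] = categories[product["categoryId"]]
    return product

result = get_product_with_category("prod_123")
print(result["category"]["name"])  # Electronics
```

The upside of the reference: renaming the category is a single-document update. The embedded-author pattern in the blog-post example above would instead require touching every post by that author.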
// Instead of one document per sensor reading
// Bucket readings by hour
{
"_id": "sensor_123_2024011514", // sensor + hour bucket
"sensorId": "sensor_123",
"bucketStart": "2024-01-15T14:00:00Z",
"readings": [
{ "t": "2024-01-15T14:00:12Z", "value": 23.5 },
{ "t": "2024-01-15T14:00:24Z", "value": 23.7 },
// ... up to 60 readings per document
],
"count": 60,
"sum": 1412.3, // Pre-aggregated
"min": 22.1,
"max": 24.8
}
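The write path for the bucket pattern can be sketched as follows: each reading is appended to its sensor-plus-hour document, and the pre-aggregates are maintained incrementally on write (field names follow the document above; the in-memory dict stands in for a collection with upserts):

```python
from datetime import datetime, timezone

buckets: dict[str, dict] = {}  # stands in for a document collection

def record(sensor_id: str, ts: datetime, value: float) -> None:
    key = f"{sensor_id}_{ts:%Y%m%d%H}"  # sensor + hour bucket
    doc = buckets.setdefault(key, {
        "_id": key, "sensorId": sensor_id,
        "bucketStart": ts.replace(minute=0, second=0, microsecond=0),
        "readings": [], "count": 0, "sum": 0.0,
        "min": float("inf"), "max": float("-inf"),
    })
    # Append the raw reading and maintain pre-aggregates incrementally
    doc["readings"].append({"t": ts, "value": value})
    doc["count"] += 1
    doc["sum"] += value
    doc["min"] = min(doc["min"], value)
    doc["max"] = max(doc["max"], value)

t = datetime(2024, 1, 15, 14, 0, 12, tzinfo=timezone.utc)
record("sensor_123", t, 23.5)
record("sensor_123", t.replace(second=24), 23.7)
doc = buckets["sensor_123_2024011514"]
print(doc["count"], doc["min"], doc["max"])  # 2 23.5 23.7
```

In MongoDB the same logic would be a single upsert with `$push` and `$inc`, turning thousands of tiny documents into a handful of hourly ones.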
Key-Value Access Pattern Modeling
Key-value stores require encoding access patterns into key structure:
# User profile lookup
Key: user:{userId}
Value: {entire user JSON}
# User's posts (sorted)
Key: user:{userId}:posts
Value: [sorted set of post IDs]
# Session storage
Key: session:{sessionId}
Value: {session data}
TTL: 3600
# Rate limiting
Key: ratelimit:{userId}:{minute}
Value: request count (integer)
TTL: 60
# Leaderboard
Key: leaderboard:daily:{date}
Value: sorted set (userId -> score)
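The `ratelimit:{userId}:{minute}` key above implements a fixed-window counter. A minimal in-process sketch of the same logic (a production deployment would use Redis `INCR` plus `EXPIRE`; the limit of 5 requests per minute is an illustrative assumption):

```python
import time
from collections import defaultdict

counters: dict[str, int] = defaultdict(int)  # stands in for Redis
LIMIT_PER_MINUTE = 5  # illustrative limit

def allow_request(user_id: str, now: float = None) -> bool:
    if now is None:
        now = time.time()
    minute = int(now // 60)
    key = f"ratelimit:{user_id}:{minute}"   # Redis: INCR key; EXPIRE key 60
    counters[key] += 1
    return counters[key] <= LIMIT_PER_MINUTE

fixed_now = 1_700_000_000.0  # pin the clock so the window can't roll over mid-run
results = [allow_request("usr_456", now=fixed_now) for _ in range(6)]
print(results)  # [True, True, True, True, True, False]
```

The TTL makes cleanup automatic: once the minute window passes, the key simply expires, so stale counters never accumulate.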
DynamoDB Single-Table Design
DynamoDB best practices often involve storing multiple entity types in one table:
| PK | SK | Type | Data |
|---|---|---|---|
| USER#123 | USER#123 | User | {name, email, ...} |
| USER#123 | ORDER#456 | Order | {total, status, ...} |
| USER#123 | ORDER#456#ITEM#1 | OrderItem | {productId, qty, ...} |
| ORDER#456 | ORDER#456 | Order | {duplicate for order lookup} |
| PRODUCT#789 | PRODUCT#789 | Product | {name, price, ...} |
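The table above supports "get a user and all their order data" with a single query on the partition key. A sketch of that access pattern over an in-memory item list, where `startswith` mimics DynamoDB's `begins_with` sort-key condition (the rows mirror the table above):

```python
# Items mirror the single-table rows above: one (PK, SK) pair per item
items = [
    {"PK": "USER#123", "SK": "USER#123", "Type": "User", "Data": {"name": "Ann"}},
    {"PK": "USER#123", "SK": "ORDER#456", "Type": "Order", "Data": {"total": 109}},
    {"PK": "USER#123", "SK": "ORDER#456#ITEM#1", "Type": "OrderItem", "Data": {"qty": 2}},
    {"PK": "PRODUCT#789", "SK": "PRODUCT#789", "Type": "Product", "Data": {"price": 29}},
]

def query(pk: str, sk_prefix: str = "") -> list[dict]:
    """Mimics Query(PK = :pk AND begins_with(SK, :prefix))."""
    return [i for i in items if i["PK"] == pk and i["SK"].startswith(sk_prefix)]

# One request fetches the user plus all order-related items in that partition
user_partition = query("USER#123")
orders_only = query("USER#123", sk_prefix="ORDER#")
print(len(user_partition), len(orders_only))  # 3 2
```

The hierarchical sort key (`ORDER#456#ITEM#1`) is what lets one prefix query return an order together with its items, replacing a relational join.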
When using NoSQL in interviews, demonstrate that you understand the trade-offs. Show that you're modeling for specific access patterns and acknowledge what becomes harder (ad-hoc queries, joins, consistency).
Indexes are the most impactful performance lever in database design. A schema without indexing strategy is incomplete. In interviews, proactively discussing indexes demonstrates production experience.
| Index Type | Use Case | Example |
|---|---|---|
| B-Tree (default) | Equality and range queries | WHERE created_at > '2024-01-01' |
| Hash | Exact equality only | WHERE id = 123 |
| GiST/SP-GiST | Geometric, full-text, range types | WHERE location && box |
| GIN | Array contains, full-text | WHERE tags @> '{postgresql}' |
| BRIN | Large, naturally ordered tables | Append-only log tables |
| Partial | Subset of rows matching condition | WHERE status = 'active' |
| Composite | Multi-column queries | WHERE user_id = ? AND created_at > ? |
| Covering | Index contains all query columns | SELECT id, name FROM users |
Composite Index Column Order
Column order in composite indexes is critical. Follow the ESR Rule (Equality, Sort, Range):
-- Query pattern
SELECT * FROM orders
WHERE status = 'shipped' -- Equality
AND region = 'US' -- Equality
AND created_at > '2024-01-01' -- Range
ORDER BY created_at DESC; -- Sort
-- Optimal index
CREATE INDEX idx_orders_status_region_created
ON orders(status, region, created_at DESC);
-- Why this order?
-- 1. status (E): Exact match, narrows to subset
-- 2. region (E): Exact match, further narrows
-- 3. created_at (S/R): Supports both sort and range
Covering Indexes (Index-Only Scans)
-- Query only needs these columns
SELECT user_id, created_at FROM posts WHERE user_id = 123;
-- Covering index avoids table access entirely
CREATE INDEX idx_posts_user_covering
ON posts(user_id) INCLUDE (created_at);
-- All data comes from index, no heap fetch
A common pitfall: WHERE LOWER(email) = ? won't use a plain index on email; create an expression index on LOWER(email) instead.
Production schemas must evolve without downtime. Designing for evolution from day one prevents painful migrations later. Interviewers appreciate candidates who think about operational realities.
Backward-Compatible Changes (Safe)
These changes can be applied without coordinated deployment:
-- Adding nullable column (safe)
ALTER TABLE users ADD COLUMN middle_name VARCHAR(100);
-- Adding column with default (safe, but may lock in some DBs)
ALTER TABLE users ADD COLUMN status VARCHAR(20) DEFAULT 'active';
-- Adding new index (safe, may be slow on large tables)
CREATE INDEX CONCURRENTLY idx_users_status ON users(status);
-- Adding new table (safe)
CREATE TABLE user_preferences (...);
Breaking Changes (Require Coordination)
-- Removing a column (breaks existing queries)
ALTER TABLE users DROP COLUMN legacy_field;
-- Renaming column (breaks existing queries)
ALTER TABLE users RENAME COLUMN name TO full_name;
-- Changing type (may fail or truncate)
ALTER TABLE users ALTER COLUMN age TYPE SMALLINT;
-- Adding NOT NULL (fails if NULLs exist)
ALTER TABLE users ALTER COLUMN email SET NOT NULL;
The Expand-Contract Pattern
For breaking changes, use the expand-contract (parallel change) pattern:
Phase 1: EXPAND
├── Add new column (nullable)
├── Deploy code that writes to BOTH columns
└── Backfill old data to new column
Phase 2: MIGRATE
├── Deploy code that reads from NEW column
├── Verify correctness
└── Monitor for issues
Phase 3: CONTRACT
├── Deploy code that only writes to NEW column
├── Drop old column (or defer indefinitely)
└── Clean up
Example: Renaming 'name' to 'full_name'
-- Phase 1: Expand
ALTER TABLE users ADD COLUMN full_name VARCHAR(255);
UPDATE users SET full_name = name WHERE full_name IS NULL;
-- Phase 2: Application writes to both columns
-- Application reads from full_name (with fallback to name)
-- Phase 3: Contract (after verification)
ALTER TABLE users DROP COLUMN name;
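The expand and migrate phases can be rehearsed end-to-end in code: add the nullable column, backfill, then dual-write while old readers keep working. A runnable SQLite sketch of phases 1 and 2 (the phase-3 drop is deferred; the row data is illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO users (name) VALUES ('Ada Lovelace')")

# Phase 1: Expand -- add nullable column, backfill existing rows
conn.execute("ALTER TABLE users ADD COLUMN full_name TEXT")
conn.execute("UPDATE users SET full_name = name WHERE full_name IS NULL")

# Phase 2: Migrate -- new code dual-writes both columns and
# reads full_name with a fallback to name
conn.execute("INSERT INTO users (name, full_name) VALUES ('Grace Hopper', 'Grace Hopper')")
rows = conn.execute(
    "SELECT COALESCE(full_name, name) FROM users ORDER BY id"
).fetchall()
names = [r[0] for r in rows]
print(names)  # ['Ada Lovelace', 'Grace Hopper']
```

The COALESCE fallback is what makes the migration safe: even if the backfill is still running, every reader sees a correct value.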
When designing schemas in interviews, mention: 'I'm using nullable columns with defaults to ensure we can add fields without breaking existing code, and I'd use the expand-contract pattern for any breaking changes.' This shows production maturity.
Let's apply schema design principles to a realistic interview question.
Design the database schema for a ride-sharing application like Uber. Consider drivers, riders, trips, locations, payments, and ratings.
Step 1: Identify Entities and Relationships
Relationships:
- A user can be a rider, a driver, or both (flags on the users table)
- A driver owns one or more vehicles
- A trip links one rider to one driver and one vehicle
- A trip has many location breadcrumbs and exactly one payment
- Each trip allows up to two ratings: rider rates driver, driver rates rider
-- Users table (both drivers and riders)
CREATE TABLE users (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    email VARCHAR(255) UNIQUE NOT NULL,
    phone VARCHAR(20) UNIQUE NOT NULL,
    name VARCHAR(255) NOT NULL,
    profile_photo_url TEXT,
    is_driver BOOLEAN DEFAULT FALSE,
    driver_license_verified BOOLEAN DEFAULT FALSE,
    created_at TIMESTAMP DEFAULT NOW(),
    updated_at TIMESTAMP DEFAULT NOW()
);

-- Vehicles (for drivers)
CREATE TABLE vehicles (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    driver_id UUID NOT NULL REFERENCES users(id),
    make VARCHAR(50),
    model VARCHAR(50),
    year INT,
    color VARCHAR(30),
    license_plate VARCHAR(20) UNIQUE,
    vehicle_type VARCHAR(20), -- 'economy', 'comfort', 'premium'
    is_active BOOLEAN DEFAULT TRUE,
    created_at TIMESTAMP DEFAULT NOW()
);
CREATE INDEX idx_vehicles_driver ON vehicles(driver_id);

-- Trips
CREATE TABLE trips (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    rider_id UUID NOT NULL REFERENCES users(id),
    driver_id UUID REFERENCES users(id), -- NULL until matched
    vehicle_id UUID REFERENCES vehicles(id),
    status VARCHAR(20) NOT NULL DEFAULT 'requested',
    -- 'requested', 'matched', 'driver_arriving', 'in_progress', 'completed', 'cancelled'
    pickup_location GEOGRAPHY(POINT, 4326),
    pickup_address TEXT,
    dropoff_location GEOGRAPHY(POINT, 4326),
    dropoff_address TEXT,
    requested_at TIMESTAMP DEFAULT NOW(),
    matched_at TIMESTAMP,
    started_at TIMESTAMP,
    completed_at TIMESTAMP,
    cancelled_at TIMESTAMP,
    cancelled_by UUID REFERENCES users(id),
    estimated_fare_cents INT,
    actual_fare_cents INT,
    distance_meters INT,
    duration_seconds INT
);
CREATE INDEX idx_trips_rider ON trips(rider_id, requested_at DESC);
CREATE INDEX idx_trips_driver ON trips(driver_id, requested_at DESC);
CREATE INDEX idx_trips_status ON trips(status)
    WHERE status IN ('requested', 'matched', 'in_progress');

-- Trip location breadcrumbs (time-series data)
CREATE TABLE trip_locations (
    id BIGSERIAL PRIMARY KEY,
    trip_id UUID NOT NULL REFERENCES trips(id),
    location GEOGRAPHY(POINT, 4326),
    recorded_at TIMESTAMP DEFAULT NOW(),
    speed_mph DECIMAL(5,2),
    heading INT
);
CREATE INDEX idx_trip_locations_trip_time ON trip_locations(trip_id, recorded_at);

-- Payments
CREATE TABLE payments (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    trip_id UUID UNIQUE NOT NULL REFERENCES trips(id),
    rider_id UUID NOT NULL REFERENCES users(id),
    amount_cents INT NOT NULL,
    currency VARCHAR(3) DEFAULT 'USD',
    payment_method_id UUID, -- References payment methods table
    status VARCHAR(20) DEFAULT 'pending',
    stripe_charge_id VARCHAR(255),
    created_at TIMESTAMP DEFAULT NOW(),
    completed_at TIMESTAMP
);

-- Ratings
CREATE TABLE ratings (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    trip_id UUID NOT NULL REFERENCES trips(id),
    rater_id UUID NOT NULL REFERENCES users(id),
    ratee_id UUID NOT NULL REFERENCES users(id),
    rating INT NOT NULL CHECK (rating >= 1 AND rating <= 5),
    comment TEXT,
    created_at TIMESTAMP DEFAULT NOW(),
    UNIQUE(trip_id, rater_id) -- One rating per trip per rater
);
CREATE INDEX idx_ratings_ratee ON ratings(ratee_id);

-- Denormalized: User aggregate ratings (updated async)
ALTER TABLE users ADD COLUMN avg_rating DECIMAL(3,2);
ALTER TABLE users ADD COLUMN total_ratings INT DEFAULT 0;
ALTER TABLE users ADD COLUMN total_trips INT DEFAULT 0;
Key Design Decisions Explained:
UUID primary keys: Globally unique for distributed system, safe for client exposure
Combined users table: Riders and drivers in one table with flags, allows same user to be both
PostGIS geography types: Native geospatial support for location queries, distance calculations
Trip status as enum-like VARCHAR: Readable, flexible for adding new states
Denormalized avg_rating: Expensive to compute in real-time, updated asynchronously
Trip_locations for time-series: Separate high-volume table, could migrate to TimescaleDB if volume warrants
Partial index on active trips: Most queries filter for active status, small index covers common case
Separate ratings table: Allows rider to rate driver AND driver to rate rider per trip
Schema design is where abstract system requirements become concrete data structures. Let's consolidate the key takeaways:
- Balance the three goals: data integrity, query performance, and evolvability
- Normalize by default; denormalize deliberately for specific access patterns, and budget for the consistency tax
- Choose primary keys for your deployment model: auto-increment for single-node simplicity, UUIDs or ULIDs for distributed systems
- In NoSQL, model around access patterns rather than entities
- Design indexes alongside the schema: ESR column ordering, covering indexes, and partial indexes for hot subsets
- Plan for evolution with backward-compatible changes and the expand-contract pattern for breaking ones
What's Next:
With schema design mastered, the next page dives into Scaling Strategies—how to evolve your database architecture as load increases. You'll learn replication, sharding, caching layers, and when to consider distributed databases.
You now have a comprehensive framework for schema design in system design interviews. Remember: great schemas aren't just correct—they're maintainable, performant, and evolvable.