System Design HLDSQL vs NoSQL

SQL vs NoSQL: Choosing the Right Database Paradigm

LevelIntermediate

Duration75 mins

TopicSQL vs NoSQL

1 / 5

Relational Model: Tables, Rows, Relationships

The Foundation of Structured Data

In 1970, Edgar F. Codd, a researcher at IBM, published a paper titled "A Relational Model of Data for Large Shared Data Banks" that would forever change how we store, organize, and retrieve information. Before this paper, databases were hierarchical tangles of pointers and navigational code—fragile, difficult to query, and tightly coupled to application logic.

Codd's insight was revolutionary: data should be organized into relations (tables), and the database management system should handle the complexity of storage and retrieval. This separation of concerns—between logical data organization and physical storage—unlocked decades of database innovation and remains the dominant paradigm for structured data to this day.

Understanding the relational model isn't just about learning SQL syntax. It's about grasping a way of thinking about data that enables predictable, scalable, and maintainable systems.

What You Will Learn

By the end of this page, you will understand the mathematical foundations of the relational model, how tables, rows, and relationships work together, the role of keys and constraints, and why these concepts matter profoundly for system design. This knowledge forms the bedrock upon which all SQL database decisions rest.

The Relational Model: Mathematical Foundations

The relational model is grounded in set theory and first-order predicate logic. While you don't need a mathematics degree to use databases effectively, understanding these foundations explains why relational databases behave as they do.

Core Principles of the Relational Model:

Foundational Concepts

•Relation — A relation is a set of tuples (rows) that share the same attributes (columns). In mathematical terms, a relation is a subset of the Cartesian product of domains. In practical terms, it's a table.
•Tuple — A single row in a relation, representing one instance of the entity being modeled. Each tuple is unique within its relation (enforced by keys).
•Attribute — A named column with a defined domain (data type). Attributes represent properties of the entity—like 'email' or 'created_at'.
•Domain — The set of permissible values for an attribute. A domain defines both the data type (integer, string, timestamp) and any constraints on valid values.
•Schema — The structure definition of a relation: its attributes, their domains, and constraints. The schema is metadata describing the data's shape.

Why Mathematical Foundations Matter:

The mathematical basis of the relational model isn't academic pedantry—it has profound practical implications:

Declarative Queries: Because relations are mathematical sets, we can describe what data we want using set operations (union, intersection, difference, projection, selection) rather than how to retrieve it. This is why SQL is declarative—you describe the result, and the database figures out the optimal retrieval path.
Query Optimization: The mathematical properties of relational algebra enable query optimizers to transform queries into equivalent but more efficient forms. A query optimizer can reorder joins, push down selections, and eliminate redundant operations because the mathematical rules guarantee equivalence.
Consistency Guarantees: The formal definition of relations enables precise reasoning about data integrity. Constraints like unique keys, foreign keys, and check constraints can be mathematically verified.
Independence: The relational model provides data independence—applications interact with logical relations without concern for physical storage. The database can reorganize data on disk, add indexes, or change storage engines without breaking applications.

Codd's Twelve Rules

Codd later defined 12 rules (actually 13, numbered 0-12) that a database management system must satisfy to be considered truly relational. Most databases don't fully satisfy all rules, but they serve as guiding principles. Key rules include: the information rule (all data must be represented as values in tables), the systematic null handling rule, and the view updating rule.

Anatomy of a Table: Structure and Purpose

A table (relation) is the fundamental organizing unit of the relational model. Let's dissect a table's components in depth:

Consider a users table:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
CREATE TABLE users (
    id              BIGSERIAL PRIMARY KEY,
    email           VARCHAR(255) NOT NULL UNIQUE,
    password_hash   CHAR(60) NOT NULL,
    full_name       VARCHAR(100) NOT NULL,
    status          VARCHAR(20) DEFAULT 'active'
                    CHECK (status IN ('active', 'suspended', 'deleted')),
    email_verified  BOOLEAN DEFAULT FALSE,
    created_at      TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    updated_at      TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP
);
 
-- This table definition encodes:
-- 1. The entity being modeled (users)
-- 2. Required attributes (email, password_hash, full_name)
-- 3. Optional attributes with defaults (status, email_verified, timestamps)
-- 4. Data types constraining each attribute's domain
-- 5. Constraints ensuring data integrity (NOT NULL, UNIQUE, CHECK)

Component Deep Dive:

1. Column Data Types (Domains)

Data types aren't just storage specifications—they're semantic declarations about the nature of data:

BIGSERIAL: Auto-incrementing 64-bit integer. Implies this is a surrogate key without business meaning.
VARCHAR(255): Variable-length string up to 255 characters. The limit is a constraint on valid values.
CHAR(60): Fixed-length string of exactly 60 characters. Common for bcrypt hashes which are always 60 characters.
BOOLEAN: True/false values. Semantically different from using 0/1 integers.
TIMESTAMP WITH TIME ZONE: A point in time with timezone awareness. Critical for global applications.

2. Constraints as Business Logic

Constraints encode invariants that must always hold true:

NOT NULL: This attribute is required. The entity cannot exist without this value.
UNIQUE: No two rows can share this value. Implies this could serve as a key.
CHECK: Arbitrary boolean conditions. Here, status must be one of three values.
DEFAULT: Provides a value when none is specified. Implicitly makes the column optional for INSERTs.

3. Primary Key

The PRIMARY KEY constraint combines NOT NULL and UNIQUE to guarantee that each row has a unique, non-null identifier. This is how the database knows which row you mean when you reference "that user."

Common Column Constraints and Their Meanings
Constraint	Enforcement	System Design Implication
NOT NULL	Column must always have a value	Mandatory field; cannot be omitted in application logic
UNIQUE	No duplicate values allowed	Can serve as alternate key; enables efficient lookups
PRIMARY KEY	NOT NULL + UNIQUE; identifies row	Central reference point for relationships
FOREIGN KEY	Value must exist in referenced table	Encodes relationships; database enforces referential integrity
CHECK	Value must satisfy boolean expression	Business rules enforced at database level
DEFAULT	Auto-assigns value if not provided	Simplifies inserts; documents expected defaults

Rows: Individual Facts and Data Integrity

Each row in a relational table represents a single instance of the entity being modeled—one user, one order, one product. But rows are more than just data storage; they're factual assertions about the world.

The Row as an Assertion:

When you insert a row, you're asserting that this combination of attribute values is true:

INSERT INTO users (email, password_hash, full_name) 
VALUES ('alice@example.com', '$2b$10$...', 'Alice Johnson');

This statement asserts: "There exists a user with email 'alice@example.com', with this password hash, named 'Alice Johnson'." The database will reject this assertion if it violates any constraints—if the email already exists (UNIQUE violation), if any NOT NULL column is omitted, or if the status violates the CHECK constraint.

Row Identity and Keys:

Every row must be uniquely identifiable. This is non-negotiable in the relational model—without unique identification, you cannot reliably update, delete, or reference specific rows. Keys provide this identity:

Types of Keys

•Candidate Key — Any column or combination of columns that uniquely identifies rows. A table may have multiple candidate keys (e.g., both id and email in our users table).
•Primary Key — The candidate key chosen as the official row identifier. By convention, often a single column. All other tables reference this when establishing relationships.
•Natural Key — A key derived from intrinsic data attributes (email, SSN, ISBN). Carries business meaning but may change over time.
•Surrogate Key — An artificial key with no business meaning (auto-incremented ID, UUID). Stable, compact, and independent of business logic changes.
•Composite Key — A primary key composed of multiple columns. Common in junction tables for many-to-many relationships.
•Foreign Key — A column referencing another table's primary key. Establishes and enforces relationships between tables.

The Surrogate Key vs Natural Key Debate:

This is one of the most debated topics in database design:

Surrogate Keys (Auto-increment IDs, UUIDs):

Pros: Never change, compact for joins, no business logic dependencies, simple to generate
Cons: Meaningless, require lookups to find records by business attributes, can still cause collisions at extreme scale (auto-increment across shards)

Natural Keys (Email, Product SKU):

Pros: Self-describing, no lookup needed when you have the business value, meaningful in debugging
Cons: May change (email changes, SKU renaming), often larger (slower joins), may not exist for all entities

Modern Best Practice:

Most systems use surrogate keys as primary keys for relationship integrity, while maintaining unique indexes on natural keys for business queries. This provides the stability of surrogates with the queryability of natural keys.

-- Surrogate primary key + natural key uniqueness
id    BIGSERIAL PRIMARY KEY,        -- Relationships use this
email VARCHAR(255) NOT NULL UNIQUE  -- Queries use this

UUIDs in Distributed Systems

Auto-incrementing IDs are problematic in distributed systems—each node would need coordination to avoid duplicates. UUIDs (particularly UUIDv7 which is time-ordered) have become the preferred surrogate key for systems that shard or distribute data across multiple databases.

Relationships: Connecting Tables with Foreign Keys

The power of the relational model lies in its ability to represent complex relationships between entities without data duplication. Instead of storing user information in every order record, we store it once in users and reference it from orders. This is achieved through foreign keys.

Understanding Relationship Cardinality:

Cardinality describes how many instances of one entity relate to instances of another. Understanding cardinality is fundamental to correct schema design:

Relationship Cardinality Types
Cardinality	Example	Implementation
One-to-One (1:1)	User ↔ Profile	Foreign key with UNIQUE constraint, or same table
One-to-Many (1:N)	User → Orders	Foreign key in the 'many' side referencing the 'one'
Many-to-Many (M:N)	Orders ↔ Products	Junction table with two foreign keys

One-to-Many: The Most Common Relationship

The one-to-many relationship is ubiquitous. A user has many orders. An author writes many posts. A category contains many products.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
-- Parent table (the 'one' side)
CREATE TABLE users (
    id          BIGSERIAL PRIMARY KEY,
    email       VARCHAR(255) NOT NULL UNIQUE
);
 
-- Child table (the 'many' side)
CREATE TABLE orders (
    id          BIGSERIAL PRIMARY KEY,
    user_id     BIGINT NOT NULL REFERENCES users(id)
                ON DELETE CASCADE
                ON UPDATE CASCADE,
    total       DECIMAL(10, 2) NOT NULL,
    status      VARCHAR(20) NOT NULL,
    created_at  TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
 
-- The REFERENCES clause creates a foreign key constraint
-- ON DELETE CASCADE: If user is deleted, their orders are deleted
-- ON UPDATE CASCADE: If user.id changes, orders.user_id updates
 
-- Index on foreign key for efficient joins
CREATE INDEX idx_orders_user_id ON orders(user_id);

Many-to-Many: Junction Tables (Association Tables)

When entities have a many-to-many relationship, neither side can hold a foreign key to the other. Instead, we create a junction table (also called a join table, association table, or pivot table) that holds pairs of foreign keys:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
-- First entity
CREATE TABLE products (
    id      BIGSERIAL PRIMARY KEY,
    name    VARCHAR(255) NOT NULL,
    price   DECIMAL(10, 2) NOT NULL
);
 
-- Second entity
CREATE TABLE orders (
    id          BIGSERIAL PRIMARY KEY,
    user_id     BIGINT NOT NULL REFERENCES users(id),
    created_at  TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
 
-- Junction table connecting products and orders
CREATE TABLE order_items (
    id          BIGSERIAL PRIMARY KEY,
    order_id    BIGINT NOT NULL REFERENCES orders(id) ON DELETE CASCADE,
    product_id  BIGINT NOT NULL REFERENCES products(id) ON DELETE RESTRICT,
    quantity    INTEGER NOT NULL CHECK (quantity > 0),
    unit_price  DECIMAL(10, 2) NOT NULL,
    
    -- Composite unique constraint prevents duplicate products in an order
    UNIQUE (order_id, product_id)
);
 
-- Each order can have many products (via order_items)
-- Each product can appear in many orders (via order_items)
-- The junction table can carry additional attributes (quantity, unit_price)

Referential Integrity Actions

ON DELETE and ON UPDATE clauses define what happens when referenced data changes:

• CASCADE: Propagate the change to referencing rows • SET NULL: Set foreign key to NULL (requires nullable FK) • SET DEFAULT: Set to default value • RESTRICT: Prevent the change if references exist • NO ACTION: Similar to RESTRICT but checked at transaction end

Choose based on business requirements—CASCADE for true ownership relationships, RESTRICT for loose associations.

Normalization: Eliminating Redundancy

Normalization is the process of organizing tables to minimize redundancy and dependency. It's a systematic approach to designing schemas that avoid data anomalies—situations where the same data has different values in different places, or where deleting one piece of data inadvertently deletes other information.

The Normal Forms:

Normalization proceeds through a series of "normal forms," each eliminating a specific type of redundancy:

Progressive Normal Forms

•First Normal Form (1NF) — Eliminate repeating groups; each cell contains atomic values. A 'phones' column shouldn't contain '555-1234, 555-5678'—instead, create a separate phones table.
•Second Normal Form (2NF) — Eliminate partial dependencies; non-key columns must depend on the entire primary key, not just part of it. Relevant for composite keys.
•Third Normal Form (3NF) — Eliminate transitive dependencies; non-key columns must depend only on the primary key, not on other non-key columns. If you store 'city' and 'state', and city determines state, extract to a cities table.
•Boyce-Codd Normal Form (BCNF) — A stricter version of 3NF; every determinant must be a candidate key. Handles edge cases 3NF misses.
•Fourth Normal Form (4NF) — Eliminate multi-valued dependencies. If a row has two independent multi-valued attributes, separate them into distinct tables.
•Fifth Normal Form (5NF) — Eliminate join dependencies; the table cannot be reconstructed from smaller tables unless they have overlapping keys.

Practical Normalization Example:

Consider this denormalized orders table:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
-- PROBLEMATIC: This table violates normalization
CREATE TABLE denormalized_orders (
    order_id        BIGINT,
    customer_name   VARCHAR(100),    -- Duplicated for every order
    customer_email  VARCHAR(255),    -- Duplicated for every order
    customer_city   VARCHAR(100),    -- Duplicated for every order
    product_name    VARCHAR(255),    -- Duplicated across orders
    product_price   DECIMAL(10,2),   -- May become inconsistent
    quantity        INTEGER,
    order_date      DATE
);
 
-- Problems:
-- 1. Customer info repeated in every order (storage waste, update anomalies)
-- 2. If customer changes email, must update all their orders
-- 3. If we delete customer's only order, we lose customer info
-- 4. Product price may differ across orders for same product

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
-- NORMALIZED: Separate concerns into proper tables
 
CREATE TABLE customers (
    id      BIGSERIAL PRIMARY KEY,
    name    VARCHAR(100) NOT NULL,
    email   VARCHAR(255) NOT NULL UNIQUE,
    city    VARCHAR(100)
);
 
CREATE TABLE products (
    id      BIGSERIAL PRIMARY KEY,
    name    VARCHAR(255) NOT NULL,
    price   DECIMAL(10, 2) NOT NULL  -- Current price
);
 
CREATE TABLE orders (
    id              BIGSERIAL PRIMARY KEY,
    customer_id     BIGINT NOT NULL REFERENCES customers(id),
    order_date      DATE NOT NULL DEFAULT CURRENT_DATE
);
 
CREATE TABLE order_items (
    id          BIGSERIAL PRIMARY KEY,
    order_id    BIGINT NOT NULL REFERENCES orders(id) ON DELETE CASCADE,
    product_id  BIGINT NOT NULL REFERENCES products(id),
    quantity    INTEGER NOT NULL,
    unit_price  DECIMAL(10, 2) NOT NULL  -- Price at time of order
);
 
-- Benefits:
-- 1. Customer/product info stored once (single source of truth)
-- 2. Updates to customer/product only require one row change
-- 3. Deleting an order doesn't lose customer info
-- 4. Historical price captured in order_items (important distinction!)

Over-Normalization Warning

Normalization taken to extremes can hurt performance. Joining 10 tables to display a user profile is slow. In practice, most systems normalize to 3NF or BCNF and deliberately denormalize specific cases for performance. The key is understanding the tradeoffs—we'll cover denormalization for performance in later modules.

Relational Algebra: The Foundation of SQL

SQL is a practical implementation of relational algebra—a formal language for manipulating relations. Understanding relational algebra explains why SQL works as it does and illuminates query optimization strategies.

Core Relational Algebra Operations:

Relational Algebra Operations and SQL Equivalents
Operation	Description	SQL Equivalent
Selection (σ)	Filter rows based on condition	WHERE clause
Projection (π)	Select specific columns	SELECT column_list
Union (∪)	Combine rows from two relations	UNION
Intersection (∩)	Rows common to both relations	INTERSECT
Difference (−)	Rows in first but not second relation	EXCEPT
Cartesian Product (×)	Every row paired with every row	CROSS JOIN
Join (⋈)	Combine related rows from two tables	JOIN ... ON
Rename (ρ)	Rename attributes or relations	AS alias

The Power of Composition:

Relational algebra operations are closed over relations—every operation takes relations as input and produces a relation as output. This closure property enables powerful composition:

-- This complex query is a composition of relational operations
SELECT u.name, COUNT(*) as order_count      -- Projection + Aggregation
FROM users u                                  -- Rename
INNER JOIN orders o ON u.id = o.user_id       -- Join
WHERE o.status = 'completed'                  -- Selection
GROUP BY u.id, u.name                         -- Grouping
HAVING COUNT(*) > 5                           -- Selection on groups
ORDER BY order_count DESC;                    -- Sorting (not purely relational)

Each intermediate result is a relation that feeds into the next operation. This compositional nature is why SQL can express complex queries concisely.

Join Types Deep Dive:

Joins are fundamental to querying normalized data. Understanding join semantics is essential:

SQL Join Types

•INNER JOIN — Returns only rows with matching values in both tables. If a user has no orders, that user doesn't appear in the result.
•LEFT (OUTER) JOIN — Returns all rows from the left table, with matching right-table rows or NULLs if no match. All users appear, even those without orders.
•RIGHT (OUTER) JOIN — Returns all rows from the right table, with matching left-table rows or NULLs. Rarely used; equivalent to swapping tables in LEFT JOIN.
•FULL (OUTER) JOIN — Returns all rows from both tables, with NULLs where there's no match. Useful for finding unmatched records in either table.
•CROSS JOIN — Cartesian product: every row from left combined with every row from right. Rarely intentional; usually indicates a missing join condition.

Query Optimization Insight

Query optimizers use relational algebra identities to rewrite queries. For example, σ(R ⋈ S) can be rewritten as σ(R) ⋈ σ(S) if the filter conditions are separable—applying filters before joins reduces the number of rows processed. Understanding these transformations helps you write optimizer-friendly queries.

Strengths of the Relational Model

After 50+ years, the relational model remains dominant for structured data. Understanding its enduring strengths explains why SQL databases are often the default choice for new systems:

Key Strengths

•Strong Consistency (ACID) — Transactions guarantee that concurrent operations maintain data integrity. When a bank transfer completes, both accounts are correctly updated or neither is.
•Schema Enforcement — The database enforces data types, constraints, and relationships. Invalid data is rejected, not stored. The schema serves as documentation and contract.
•Declarative Query Language — SQL describes what data you want. The optimizer decides how to retrieve it efficiently. This separation enables optimization without application changes.
•Ad-Hoc Query Flexibility — Any query can be written at any time. Unlike some NoSQL stores, you're not limited to pre-defined access patterns. Analytics, debugging, and exploration are straightforward.
•Mature Tooling Ecosystem — Decades of investment have produced robust backup, replication, monitoring, migration, and administration tools. The ecosystem is unmatched.
•Proven at Scale — Despite myths about SQL not scaling, systems like YouTube (MySQL), Facebook (MySQL), Shopify (MySQL), Stripe (PostgreSQL), and Instagram (PostgreSQL) handle massive scale with relational databases.

The Consistency Advantage:

Relational databases provide guarantees that simplify application development:

You can trust what you read. A balance of $1000 is actually $1000—not a stale value from 30 seconds ago.
You can trust that writes succeed or fail cleanly. A transaction either commits entirely or rolls back entirely. No partial updates.
You can trust that constraints hold. If email has a UNIQUE constraint, you know there are no duplicates—guaranteed.
You can trust relationships. If orders.user_id references users.id, every user_id value corresponds to an existing user.

These guarantees shift complexity from application code to the database, where it's been solved and optimized over decades.

The Cost of Consistency

Strong consistency has costs: coordination overhead for distributed writes, potential contention on hot rows, and the need for single-leader topologies in many cases. These costs drive some organizations toward eventual consistency (NoSQL)—but for many workloads, the simplicity of strong consistency is worth the trade-off.

System Design Implications

Understanding the relational model shapes how you approach system design. Here are key implications for architects:

Schema Design is Application Design:

The database schema is a critical part of your application architecture. Poor schema design leads to:

Query performance problems (joins on non-indexed columns, missing indexes)
Data integrity issues (no constraints means bugs persist as data corruption)
Scale bottlenecks (unnormalized data means updates touch many rows)

Invest time in schema design early. Changes are expensive once data exists.

Transaction Boundaries Matter:

Every database operation implicitly or explicitly occurs within a transaction. Understanding transaction boundaries helps prevent:

Lost updates (two concurrent edits, one overwrites the other)
Read skew (reading inconsistent data during another transaction)
Write skew (validation passes but concurrent writes cause violation)

Design your application with transaction boundaries in mind. What operations must be atomic?

Read vs Write Optimization:

Normalization optimizes for writes (updates happen in one place) but may slow reads (joins required). Denormalization optimizes for reads (data pre-joined) but complicates writes (updates must propagate).

For read-heavy workloads, consider:

Materialized views (precomputed join results)
Denormalized read tables (CQRS pattern)
Caching layer (Redis/Memcached)

When Relational Shines

•Complex queries with multiple filters, joins, aggregations
•Transaction-heavy workloads requiring consistency
•Data with natural relationships (e-commerce, SaaS, finance)
•Need for ad-hoc querying and analytics
•Schema enforcement is valuable (regulated industries)

Relational Struggles

•Highly variable/sparse data (many NULLs or extra columns)
•Massive horizontal scale beyond single node (though possible)
•Schema-less or rapidly evolving data structures
•Graph traversals (multi-hop relationships)
•Time-series or append-only workloads at extreme scale

Start with Relational

A common recommendation: start with a relational database until you have specific reasons not to. The flexibility, tooling, and consistency guarantees of SQL databases handle the vast majority of applications well. Premature adoption of NoSQL often creates complexity that wasn't needed.

Summary: The Relational Foundation

We've covered the foundational concepts of the relational model—the theoretical and practical basis for SQL databases. Let's consolidate the key takeaways:

Key Takeaways

•The relational model is mathematically grounded — Relations (tables), tuples (rows), and attributes (columns) derive from set theory, enabling powerful query optimization and consistency guarantees.
•Tables are structured containers with enforced schemas — Columns have types and constraints; the database rejects invalid data at the boundary.
•Rows are factual assertions uniquely identified by keys — Surrogate keys provide stability; natural keys provide meaning. Often, both are valuable.
•Relationships connect tables through foreign keys — One-to-many and many-to-many relationships enable normalized, non-redundant data storage.
•Normalization eliminates redundancy at the cost of joins — Normalize for write safety and consistency; denormalize selectively for read performance.
•SQL is relational algebra made practical — Understanding the underlying algebra explains SQL's power and guides optimization.
•Relational databases offer strong consistency, schema enforcement, and mature tooling — These strengths make them the default choice for most structured data.

What's Next:

Now that we understand the relational model deeply, we'll explore the alternative paradigm: NoSQL databases. In the next page, we'll examine how NoSQL systems embrace flexible schemas and specialized data models to address use cases where the relational model's strengths become constraints.

Page Complete

You now have a deep understanding of the relational model—tables, rows, relationships, normalization, and the theoretical foundations that make SQL databases work. This knowledge is essential for making informed decisions about when relational databases are the right choice and how to use them effectively.

1 / 5

Loading learning content...

System Design HLDSQL vs NoSQL

SQL vs NoSQL: Choosing the Right Database Paradigm

LevelIntermediate

Duration75 mins

TopicSQL vs NoSQL

1 / 5

Relational Model: Tables, Rows, Relationships

The Foundation of Structured Data

Understanding the relational model isn't just about learning SQL syntax. It's about grasping a way of thinking about data that enables predictable, scalable, and maintainable systems.

What You Will Learn

The Relational Model: Mathematical Foundations

Core Principles of the Relational Model:

Foundational Concepts

•Relation — A relation is a set of tuples (rows) that share the same attributes (columns). In mathematical terms, a relation is a subset of the Cartesian product of domains. In practical terms, it's a table.
•Tuple — A single row in a relation, representing one instance of the entity being modeled. Each tuple is unique within its relation (enforced by keys).
•Attribute — A named column with a defined domain (data type). Attributes represent properties of the entity—like 'email' or 'created_at'.
•Domain — The set of permissible values for an attribute. A domain defines both the data type (integer, string, timestamp) and any constraints on valid values.
•Schema — The structure definition of a relation: its attributes, their domains, and constraints. The schema is metadata describing the data's shape.

Why Mathematical Foundations Matter:

The mathematical basis of the relational model isn't academic pedantry—it has profound practical implications:

Declarative Queries: Because relations are mathematical sets, we can describe what data we want using set operations (union, intersection, difference, projection, selection) rather than how to retrieve it. This is why SQL is declarative—you describe the result, and the database figures out the optimal retrieval path.
Query Optimization: The mathematical properties of relational algebra enable query optimizers to transform queries into equivalent but more efficient forms. A query optimizer can reorder joins, push down selections, and eliminate redundant operations because the mathematical rules guarantee equivalence.
Consistency Guarantees: The formal definition of relations enables precise reasoning about data integrity. Constraints like unique keys, foreign keys, and check constraints can be mathematically verified.
Independence: The relational model provides data independence—applications interact with logical relations without concern for physical storage. The database can reorganize data on disk, add indexes, or change storage engines without breaking applications.

Codd's Twelve Rules

Anatomy of a Table: Structure and Purpose

A table (relation) is the fundamental organizing unit of the relational model. Let's dissect a table's components in depth:

Consider a users table:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
CREATE TABLE users (
    id              BIGSERIAL PRIMARY KEY,
    email           VARCHAR(255) NOT NULL UNIQUE,
    password_hash   CHAR(60) NOT NULL,
    full_name       VARCHAR(100) NOT NULL,
    status          VARCHAR(20) DEFAULT 'active'
                    CHECK (status IN ('active', 'suspended', 'deleted')),
    email_verified  BOOLEAN DEFAULT FALSE,
    created_at      TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    updated_at      TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP
);
 
-- This table definition encodes:
-- 1. The entity being modeled (users)
-- 2. Required attributes (email, password_hash, full_name)
-- 3. Optional attributes with defaults (status, email_verified, timestamps)
-- 4. Data types constraining each attribute's domain
-- 5. Constraints ensuring data integrity (NOT NULL, UNIQUE, CHECK)

Component Deep Dive:

1. Column Data Types (Domains)

Data types aren't just storage specifications—they're semantic declarations about the nature of data:

BIGSERIAL: Auto-incrementing 64-bit integer. Implies this is a surrogate key without business meaning.
VARCHAR(255): Variable-length string up to 255 characters. The limit is a constraint on valid values.
CHAR(60): Fixed-length string of exactly 60 characters. Common for bcrypt hashes which are always 60 characters.
BOOLEAN: True/false values. Semantically different from using 0/1 integers.
TIMESTAMP WITH TIME ZONE: A point in time with timezone awareness. Critical for global applications.

2. Constraints as Business Logic

Constraints encode invariants that must always hold true:

NOT NULL: This attribute is required. The entity cannot exist without this value.
UNIQUE: No two rows can share this value. Implies this could serve as a key.
CHECK: Arbitrary boolean conditions. Here, status must be one of three values.
DEFAULT: Provides a value when none is specified. Implicitly makes the column optional for INSERTs.

3. Primary Key

Common Column Constraints and Their Meanings
Constraint	Enforcement	System Design Implication
NOT NULL	Column must always have a value	Mandatory field; cannot be omitted in application logic
UNIQUE	No duplicate values allowed	Can serve as alternate key; enables efficient lookups
PRIMARY KEY	NOT NULL + UNIQUE; identifies row	Central reference point for relationships
FOREIGN KEY	Value must exist in referenced table	Encodes relationships; database enforces referential integrity
CHECK	Value must satisfy boolean expression	Business rules enforced at database level
DEFAULT	Auto-assigns value if not provided	Simplifies inserts; documents expected defaults

Rows: Individual Facts and Data Integrity

The Row as an Assertion:

When you insert a row, you're asserting that this combination of attribute values is true:

INSERT INTO users (email, password_hash, full_name) 
VALUES ('alice@example.com', '$2b$10$...', 'Alice Johnson');

Row Identity and Keys:

Types of Keys

•Candidate Key — Any column or combination of columns that uniquely identifies rows. A table may have multiple candidate keys (e.g., both id and email in our users table).
•Primary Key — The candidate key chosen as the official row identifier. By convention, often a single column. All other tables reference this when establishing relationships.
•Natural Key — A key derived from intrinsic data attributes (email, SSN, ISBN). Carries business meaning but may change over time.
•Surrogate Key — An artificial key with no business meaning (auto-incremented ID, UUID). Stable, compact, and independent of business logic changes.
•Composite Key — A primary key composed of multiple columns. Common in junction tables for many-to-many relationships.
•Foreign Key — A column referencing another table's primary key. Establishes and enforces relationships between tables.

The Surrogate Key vs Natural Key Debate:

This is one of the most debated topics in database design:

Surrogate Keys (Auto-increment IDs, UUIDs):

Pros: Never change, compact for joins, no business logic dependencies, simple to generate
Cons: Meaningless, require lookups to find records by business attributes, can still cause collisions at extreme scale (auto-increment across shards)

Natural Keys (Email, Product SKU):

Pros: Self-describing, no lookup needed when you have the business value, meaningful in debugging
Cons: May change (email changes, SKU renaming), often larger (slower joins), may not exist for all entities

Modern Best Practice:

-- Surrogate primary key + natural key uniqueness
id    BIGSERIAL PRIMARY KEY,        -- Relationships use this
email VARCHAR(255) NOT NULL UNIQUE  -- Queries use this

UUIDs in Distributed Systems

Relationships: Connecting Tables with Foreign Keys

Understanding Relationship Cardinality:

Cardinality describes how many instances of one entity relate to instances of another. Understanding cardinality is fundamental to correct schema design:

Relationship Cardinality Types
Cardinality	Example	Implementation
One-to-One (1:1)	User ↔ Profile	Foreign key with UNIQUE constraint, or same table
One-to-Many (1:N)	User → Orders	Foreign key in the 'many' side referencing the 'one'
Many-to-Many (M:N)	Orders ↔ Products	Junction table with two foreign keys

One-to-Many: The Most Common Relationship

The one-to-many relationship is ubiquitous. A user has many orders. An author writes many posts. A category contains many products.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
-- Parent table (the 'one' side)
CREATE TABLE users (
    id          BIGSERIAL PRIMARY KEY,
    email       VARCHAR(255) NOT NULL UNIQUE
);
 
-- Child table (the 'many' side)
CREATE TABLE orders (
    id          BIGSERIAL PRIMARY KEY,
    user_id     BIGINT NOT NULL REFERENCES users(id)
                ON DELETE CASCADE
                ON UPDATE CASCADE,
    total       DECIMAL(10, 2) NOT NULL,
    status      VARCHAR(20) NOT NULL,
    created_at  TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
 
-- The REFERENCES clause creates a foreign key constraint
-- ON DELETE CASCADE: If user is deleted, their orders are deleted
-- ON UPDATE CASCADE: If user.id changes, orders.user_id updates
 
-- Index on foreign key for efficient joins
CREATE INDEX idx_orders_user_id ON orders(user_id);

Many-to-Many: Junction Tables (Association Tables)

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
-- First entity
CREATE TABLE products (
    id      BIGSERIAL PRIMARY KEY,
    name    VARCHAR(255) NOT NULL,
    price   DECIMAL(10, 2) NOT NULL
);
 
-- Second entity
CREATE TABLE orders (
    id          BIGSERIAL PRIMARY KEY,
    user_id     BIGINT NOT NULL REFERENCES users(id),
    created_at  TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
 
-- Junction table connecting products and orders
CREATE TABLE order_items (
    id          BIGSERIAL PRIMARY KEY,
    order_id    BIGINT NOT NULL REFERENCES orders(id) ON DELETE CASCADE,
    product_id  BIGINT NOT NULL REFERENCES products(id) ON DELETE RESTRICT,
    quantity    INTEGER NOT NULL CHECK (quantity > 0),
    unit_price  DECIMAL(10, 2) NOT NULL,
    
    -- Composite unique constraint prevents duplicate products in an order
    UNIQUE (order_id, product_id)
);
 
-- Each order can have many products (via order_items)
-- Each product can appear in many orders (via order_items)
-- The junction table can carry additional attributes (quantity, unit_price)

Referential Integrity Actions

ON DELETE and ON UPDATE clauses define what happens when referenced data changes:

Choose based on business requirements—CASCADE for true ownership relationships, RESTRICT for loose associations.

Normalization: Eliminating Redundancy

The Normal Forms:

Normalization proceeds through a series of "normal forms," each eliminating a specific type of redundancy:

Progressive Normal Forms

•First Normal Form (1NF) — Eliminate repeating groups; each cell contains atomic values. A 'phones' column shouldn't contain '555-1234, 555-5678'—instead, create a separate phones table.
•Second Normal Form (2NF) — Eliminate partial dependencies; non-key columns must depend on the entire primary key, not just part of it. Relevant for composite keys.
•Third Normal Form (3NF) — Eliminate transitive dependencies; non-key columns must depend only on the primary key, not on other non-key columns. If you store 'city' and 'state', and city determines state, extract to a cities table.
•Boyce-Codd Normal Form (BCNF) — A stricter version of 3NF; every determinant must be a candidate key. Handles edge cases 3NF misses.
•Fourth Normal Form (4NF) — Eliminate multi-valued dependencies. If a row has two independent multi-valued attributes, separate them into distinct tables.
•Fifth Normal Form (5NF) — Eliminate join dependencies; the table cannot be reconstructed from smaller tables unless they have overlapping keys.

Practical Normalization Example:

Consider this denormalized orders table:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
-- PROBLEMATIC: This table violates normalization
CREATE TABLE denormalized_orders (
    order_id        BIGINT,
    customer_name   VARCHAR(100),    -- Duplicated for every order
    customer_email  VARCHAR(255),    -- Duplicated for every order
    customer_city   VARCHAR(100),    -- Duplicated for every order
    product_name    VARCHAR(255),    -- Duplicated across orders
    product_price   DECIMAL(10,2),   -- May become inconsistent
    quantity        INTEGER,
    order_date      DATE
);
 
-- Problems:
-- 1. Customer info repeated in every order (storage waste, update anomalies)
-- 2. If customer changes email, must update all their orders
-- 3. If we delete customer's only order, we lose customer info
-- 4. Product price may differ across orders for same product

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
-- NORMALIZED: Separate concerns into proper tables
 
CREATE TABLE customers (
    id      BIGSERIAL PRIMARY KEY,
    name    VARCHAR(100) NOT NULL,
    email   VARCHAR(255) NOT NULL UNIQUE,
    city    VARCHAR(100)
);
 
CREATE TABLE products (
    id      BIGSERIAL PRIMARY KEY,
    name    VARCHAR(255) NOT NULL,
    price   DECIMAL(10, 2) NOT NULL  -- Current price
);
 
CREATE TABLE orders (
    id              BIGSERIAL PRIMARY KEY,
    customer_id     BIGINT NOT NULL REFERENCES customers(id),
    order_date      DATE NOT NULL DEFAULT CURRENT_DATE
);
 
CREATE TABLE order_items (
    id          BIGSERIAL PRIMARY KEY,
    order_id    BIGINT NOT NULL REFERENCES orders(id) ON DELETE CASCADE,
    product_id  BIGINT NOT NULL REFERENCES products(id),
    quantity    INTEGER NOT NULL,
    unit_price  DECIMAL(10, 2) NOT NULL  -- Price at time of order
);
 
-- Benefits:
-- 1. Customer/product info stored once (single source of truth)
-- 2. Updates to customer/product only require one row change
-- 3. Deleting an order doesn't lose customer info
-- 4. Historical price captured in order_items (important distinction!)

Over-Normalization Warning

Relational Algebra: The Foundation of SQL

Core Relational Algebra Operations:

Relational Algebra Operations and SQL Equivalents
Operation	Description	SQL Equivalent
Selection (σ)	Filter rows based on condition	WHERE clause
Projection (π)	Select specific columns	SELECT column_list
Union (∪)	Combine rows from two relations	UNION
Intersection (∩)	Rows common to both relations	INTERSECT
Difference (−)	Rows in first but not second relation	EXCEPT
Cartesian Product (×)	Every row paired with every row	CROSS JOIN
Join (⋈)	Combine related rows from two tables	JOIN ... ON
Rename (ρ)	Rename attributes or relations	AS alias

The Power of Composition:

Relational algebra operations are closed over relations—every operation takes relations as input and produces a relation as output. This closure property enables powerful composition:

-- This complex query is a composition of relational operations
SELECT u.name, COUNT(*) as order_count      -- Projection + Aggregation
FROM users u                                  -- Rename
INNER JOIN orders o ON u.id = o.user_id       -- Join
WHERE o.status = 'completed'                  -- Selection
GROUP BY u.id, u.name                         -- Grouping
HAVING COUNT(*) > 5                           -- Selection on groups
ORDER BY order_count DESC;                    -- Sorting (not purely relational)

Each intermediate result is a relation that feeds into the next operation. This compositional nature is why SQL can express complex queries concisely.

Join Types Deep Dive:

Joins are fundamental to querying normalized data. Understanding join semantics is essential:

SQL Join Types

•INNER JOIN — Returns only rows with matching values in both tables. If a user has no orders, that user doesn't appear in the result.
•LEFT (OUTER) JOIN — Returns all rows from the left table, with matching right-table rows or NULLs if no match. All users appear, even those without orders.
•RIGHT (OUTER) JOIN — Returns all rows from the right table, with matching left-table rows or NULLs. Rarely used; equivalent to swapping tables in LEFT JOIN.
•FULL (OUTER) JOIN — Returns all rows from both tables, with NULLs where there's no match. Useful for finding unmatched records in either table.
•CROSS JOIN — Cartesian product: every row from left combined with every row from right. Rarely intentional; usually indicates a missing join condition.

Query Optimization Insight

Strengths of the Relational Model

After 50+ years, the relational model remains dominant for structured data. Understanding its enduring strengths explains why SQL databases are often the default choice for new systems:

Key Strengths

•Strong Consistency (ACID) — Transactions guarantee that concurrent operations maintain data integrity. When a bank transfer completes, both accounts are correctly updated or neither is.
•Schema Enforcement — The database enforces data types, constraints, and relationships. Invalid data is rejected, not stored. The schema serves as documentation and contract.
•Declarative Query Language — SQL describes what data you want. The optimizer decides how to retrieve it efficiently. This separation enables optimization without application changes.
•Ad-Hoc Query Flexibility — Any query can be written at any time. Unlike some NoSQL stores, you're not limited to pre-defined access patterns. Analytics, debugging, and exploration are straightforward.
•Mature Tooling Ecosystem — Decades of investment have produced robust backup, replication, monitoring, migration, and administration tools. The ecosystem is unmatched.
•Proven at Scale — Despite myths about SQL not scaling, systems like YouTube (MySQL), Facebook (MySQL), Shopify (MySQL), Stripe (PostgreSQL), and Instagram (PostgreSQL) handle massive scale with relational databases.

The Consistency Advantage:

Relational databases provide guarantees that simplify application development:

You can trust what you read. A balance of $1000 is actually $1000—not a stale value from 30 seconds ago.
You can trust that writes succeed or fail cleanly. A transaction either commits entirely or rolls back entirely. No partial updates.
You can trust that constraints hold. If email has a UNIQUE constraint, you know there are no duplicates—guaranteed.
You can trust relationships. If orders.user_id references users.id, every user_id value corresponds to an existing user.

These guarantees shift complexity from application code to the database, where it's been solved and optimized over decades.

The Cost of Consistency

System Design Implications

Understanding the relational model shapes how you approach system design. Here are key implications for architects:

Schema Design is Application Design:

The database schema is a critical part of your application architecture. Poor schema design leads to:

Query performance problems (joins on non-indexed columns, missing indexes)
Data integrity issues (no constraints means bugs persist as data corruption)
Scale bottlenecks (unnormalized data means updates touch many rows)

Invest time in schema design early. Changes are expensive once data exists.

Transaction Boundaries Matter:

Every database operation implicitly or explicitly occurs within a transaction. Understanding transaction boundaries helps prevent:

Lost updates (two concurrent edits, one overwrites the other)
Read skew (reading inconsistent data during another transaction)
Write skew (validation passes but concurrent writes cause violation)

Design your application with transaction boundaries in mind. What operations must be atomic?

Read vs Write Optimization:

For read-heavy workloads, consider:

Materialized views (precomputed join results)
Denormalized read tables (CQRS pattern)
Caching layer (Redis/Memcached)

When Relational Shines

•Complex queries with multiple filters, joins, aggregations
•Transaction-heavy workloads requiring consistency
•Data with natural relationships (e-commerce, SaaS, finance)
•Need for ad-hoc querying and analytics
•Schema enforcement is valuable (regulated industries)

Relational Struggles

•Highly variable/sparse data (many NULLs or extra columns)
•Massive horizontal scale beyond single node (though possible)
•Schema-less or rapidly evolving data structures
•Graph traversals (multi-hop relationships)
•Time-series or append-only workloads at extreme scale

Start with Relational

Summary: The Relational Foundation

We've covered the foundational concepts of the relational model—the theoretical and practical basis for SQL databases. Let's consolidate the key takeaways:

Key Takeaways

•The relational model is mathematically grounded — Relations (tables), tuples (rows), and attributes (columns) derive from set theory, enabling powerful query optimization and consistency guarantees.
•Tables are structured containers with enforced schemas — Columns have types and constraints; the database rejects invalid data at the boundary.
•Rows are factual assertions uniquely identified by keys — Surrogate keys provide stability; natural keys provide meaning. Often, both are valuable.
•Relationships connect tables through foreign keys — One-to-many and many-to-many relationships enable normalized, non-redundant data storage.
•Normalization eliminates redundancy at the cost of joins — Normalize for write safety and consistency; denormalize selectively for read performance.
•SQL is relational algebra made practical — Understanding the underlying algebra explains SQL's power and guides optimization.
•Relational databases offer strong consistency, schema enforcement, and mature tooling — These strengths make them the default choice for most structured data.

What's Next:

Page Complete

1 / 5