Loading learning content...
The relational model's greatest strength lies not in its intuitive table metaphor, but in its rigorous mathematical foundation. This foundation is what enables query optimization, formal correctness proofs, and the remarkable fact that a query written in 1980 still runs correctly on modern databases.
Edgar Codd deliberately chose set theory as the basis for his model because set theory provides:
This page explores the mathematical structures underlying relations. Understanding this foundation transforms database work from rote SQL writing to principled data engineering.
By the end of this page, you will: (1) Understand sets, tuples, and Cartesian products as they apply to relations, (2) Recognize how set theory operations correspond to relational operations, (3) Appreciate closure properties and their importance, (4) Grasp the algebraic structure that enables query optimization, and (5) Connect mathematical concepts to practical database behavior.
The relational model is built upon set theory, one of the most fundamental branches of mathematics. Before we can understand relations mathematically, we must establish the set-theoretic concepts upon which they rest.
Definition (Set): A set is an unordered collection of distinct elements. We denote sets using curly braces: {a, b, c}.
Key Properties of Sets:
Set Operations:
Sets support fundamental operations that correspond directly to relational operations:
| Operation | Symbol | Definition | Example |
|---|---|---|---|
| Union | A ∪ B | All elements in A or B (or both) | {1,2} ∪ {2,3} = {1,2,3} |
| Intersection | A ∩ B | Elements in both A and B | {1,2} ∩ {2,3} = {2} |
| Difference | A − B | Elements in A but not in B | {1,2} − {2,3} = {1} |
| Symmetric Diff. | A △ B | Elements in exactly one of A or B | {1,2} △ {2,3} = {1,3} |
| Subset | A ⊆ B | True if all elements of A are in B | {2} ⊆ {1,2,3} = True |
These operations are closed over sets: applying any operation to sets produces another set. This closure property is crucial—it means we can chain operations without leaving the domain of sets.
Because a relation is a set of tuples, all set operations apply directly to relations. The UNION of two relations is the union of their tuple sets. The INTERSECT of two relations is the intersection. This mathematical foundation is what makes SQL set operations possible and well-defined.
While sets are unordered, relations require ordered structures to associate attributes with values. This is where ordered pairs and tuples come in.
Definition (Ordered Pair): An ordered pair (a, b) is a pair of elements where order matters. Unlike sets, (a, b) ≠ (b, a) unless a = b.
Definition (n-Tuple): An n-tuple (or simply tuple) is an ordered sequence of n elements: (a₁, a₂, ..., aₙ). The value n is called the degree or arity of the tuple.
Crucial Property: Two tuples are equal if and only if they have the same degree and corresponding elements are equal:
(a₁, a₂, ..., aₙ) = (b₁, b₂, ..., bₙ) ⟺ (n = m) ∧ (a₁ = b₁) ∧ (a₂ = b₂) ∧ ... ∧ (aₙ = bₙ)
Compare: (1, 2, 3), (1, 2, 3), (3, 2, 1), (1, 2)(1, 2, 3) = (1, 2, 3) ✓ Equal
(1, 2, 3) ≠ (3, 2, 1) ✗ Same elements, different order
(1, 2, 3) ≠ (1, 2) ✗ Different degrees (3 vs 2)Order and degree both matter for tuple equality. This is why attribute ordering in formal relation definitions is significant, even though SQL abstracts this with named columns.
Component Access:
We reference individual components of a tuple using subscript notation:
In database terminology, if t is a tuple in a relation with attributes A₁, A₂, ..., Aₙ, we write:
Example: Given relation Employee with attributes (ID, Name, Salary) and tuple t = (101, 'Alice', 75000):
The mathematical definition uses positional access (t[i]), but practical systems and SQL use named access (t.Salary). Named access is more robust to schema changes—if columns are reordered, named references still work. The formal model supports both interpretations.
The Cartesian product is the operation that connects individual domains to form the space of all possible tuples. Understanding Cartesian products is essential for grasping what a relation really represents.
Definition (Cartesian Product of Two Sets): The Cartesian product of sets A and B, denoted A × B, is the set of all ordered pairs (a, b) where a ∈ A and b ∈ B:
A × B = {(a, b) | a ∈ A ∧ b ∈ B}
Definition (Generalized Cartesian Product): The Cartesian product of n sets D₁, D₂, ..., Dₙ is:
D₁ × D₂ × ... × Dₙ = {(d₁, d₂, ..., dₙ) | d₁ ∈ D₁ ∧ d₂ ∈ D₂ ∧ ... ∧ dₙ ∈ Dₙ}
Cardinality of Cartesian Products:
The size of a Cartesian product grows multiplicatively:
|A × B| = |A| · |B|
|D₁ × D₂ × ... × Dₙ| = |D₁| · |D₂| · ... · |Dₙ|
This explosive growth is why relations are subsets of Cartesian products—we only include tuples that represent true facts, not every possible combination.
A = {a, b}, B = {1, 2, 3}A × B = {(a,1), (a,2), (a,3), (b,1), (b,2), (b,3)}|A × B| = |A| · |B| = 2 · 3 = 6 tuples. Each element of A is paired with every element of B. This represents ALL possible combinations—a relation selects which of these actually exist.
Visualizing Cartesian Products:
Cartesian products can be visualized as a grid where rows represent elements of one set and columns represent elements of another:
| 1 | 2 | 3
-------|---------|---------|--------
a | (a,1) | (a,2) | (a,3)
b | (b,1) | (b,2) | (b,3)
For three sets A × B × C, we'd need a 3D grid—each point in the cube represents a possible 3-tuple.
Connection to Relations:
A relation R over domains D₁, D₂, ..., Dₙ is defined as:
R ⊆ D₁ × D₂ × ... × Dₙ
This means R is a selection of points from the n-dimensional space of all possible combinations. The relation asserts which combinations actually occur.
In SQL, the Cartesian product is performed by joining tables without a join condition (CROSS JOIN). If TableA has 1,000 rows and TableB has 10,000 rows, the result has 10,000,000 rows. This explosive growth is why accidental Cartesian products (forgot the WHERE clause) can crash queries and exhaust resources.
In the relational model, a domain is not merely a data type—it's a named pool of values with semantic meaning.
Definition (Domain): A domain D is a named set of atomic values. Each value in a domain represents an indivisible unit of information.
Key Properties of Domains:
Domain Compatibility:
Two attributes are domain compatible (or union compatible) if they draw values from the same domain. This concept is crucial for:
Example of Domain Distinction:
| Domain Name | Data Type | Sample Values | Semantic Meaning |
|---|---|---|---|
| EmployeeID | INTEGER | 1001, 1002, 1003 | Unique employee identifier |
| DepartmentID | INTEGER | 10, 20, 30, 40 | Unique department identifier |
| Salary | INTEGER | 50000, 75000, 100000 | Annual compensation in dollars |
| Age | INTEGER | 25, 30, 45, 62 | Years since birth |
All four domains use INTEGER, but:
Strong domain discipline prevents these errors. In formal relational theory, domains enforce semantic constraints that pure data types cannot.
SQL and Domains:
Standard SQL includes a CREATE DOMAIN statement for defining semantic domains, but many databases don't fully implement it. Instead, developers use CHECK constraints, foreign keys, and application logic to approximate domain semantics.
Even if your database doesn't enforce domains, thinking in terms of domains improves design. Use distinct column names that imply domains (employee_id vs department_id, not just id). Document domain constraints. Consider custom types in PostgreSQL (CREATE TYPE) for important semantic distinctions.
Having established sets, tuples, Cartesian products, and domains, we can now formally define relations with complete mathematical precision.
Definition (Relation Schema): A relation schema R(A₁:D₁, A₂:D₂, ..., Aₙ:Dₙ) consists of:
The degree of the schema is n (the number of attributes).
Definition (Relation Instance): A relation instance r of schema R is a finite subset of D₁ × D₂ × ... × Dₙ where each n-tuple t = (d₁, d₂, ..., dₙ) satisfies dᵢ ∈ Dᵢ for all i.
The cardinality of instance r is |r| (the number of tuples).
1234567891011121314151617181920212223
-- Formal Schema Notation:R(A₁:D₁, A₂:D₂, ..., Aₙ:Dₙ) -- Example: Employee Relation SchemaEmployee( EmployeeID: PositiveInteger, Name: String(50), Department: DepartmentName, Salary: PositiveDecimal, HireDate: Date) -- Relation Instance (a specific state of the relation):r(Employee) = { (101, 'Alice Chen', 'Engineering', 95000.00, 2020-03-15), (102, 'Bob Smith', 'Marketing', 72000.00, 2019-07-01), (103, 'Carol Davis', 'Engineering', 105000.00, 2018-01-20), (104, 'David Kim', 'Sales', 68000.00, 2021-09-10)} -- Properties:-- Schema degree: 5 (five attributes)-- Instance cardinality: 4 (four tuples)The Two-Level Structure:
Notice the critical distinction:
Schema Level (Intension): Defines what is possible—the structure, the attributes, the domains. Schemas are relatively stable; they change only when the database structure is altered.
Instance Level (Extension): Defines what is true now—the actual data. Instances change constantly as data is inserted, updated, and deleted.
This separation mirrors the distinction in logic between:
For the Employee relation, the schema says 'we can record employees with IDs, names, departments, salaries, and hire dates.' The instance says 'these four specific employees exist with these specific values.'
In database parlance, the schema is the INTENSION (it intensionally defines the structure), while the current data is the EXTENSION (it extends the schema with actual facts). Understanding this distinction is key to understanding constraints, views, and data independence.
Relations, being sets, inherit algebraic properties from set theory. Additional properties emerge from the specific operations of relational algebra. These properties are not mere mathematical curiosities—they form the foundation for query optimization.
Closure Property:
The most fundamental property is closure: every relational operator takes relations as input and produces a relation as output.
If R and S are relations, then:
Closure enables composition: we can chain operations arbitrarily. The result of one operation becomes input to the next, enabling complex queries from simple primitives.
Why These Properties Enable Optimization:
Consider a query: 'Find names of employees in Engineering earning over $100K.'
Naive Execution:
Optimized Execution (using algebraic properties):
The optimizer uses algebraic properties to prove these executions are equivalent (produce identical results), then chooses the more efficient one. Without mathematical foundations, such transformations would be guesswork.
Query optimizers work by finding algebraically equivalent query plans and choosing the cheapest. A single SQL query might have thousands of equivalent relational algebra expressions. The mathematical properties guarantee they all produce the same result, freeing the optimizer to choose based purely on performance.
Codd's original relational model assumed complete information—every attribute of every tuple has a value from its domain. Reality is messier: sometimes data is missing, unknown, or not applicable. SQL databases handle this with NULL values, introducing significant mathematical complexity.
What NULL Represents:
NULL is not a value—it's the absence of a value. It can represent:
SQL's NULL conflates these distinct cases, which can lead to logical surprises.
Three-Valued Logic (3VL):
With NULL, boolean expressions evaluate to three possible values: TRUE, FALSE, or UNKNOWN.
The truth tables extend standard boolean logic:
| P | Q | P AND Q |
|---|---|---|
| TRUE | TRUE | TRUE |
| TRUE | FALSE | FALSE |
| TRUE | UNKNOWN | UNKNOWN |
| FALSE | TRUE | FALSE |
| FALSE | FALSE | FALSE |
| FALSE | UNKNOWN | FALSE |
| UNKNOWN | TRUE | UNKNOWN |
| UNKNOWN | FALSE | FALSE |
| UNKNOWN | UNKNOWN | UNKNOWN |
| P | Q | P OR Q |
|---|---|---|
| TRUE | TRUE | TRUE |
| TRUE | FALSE | TRUE |
| TRUE | UNKNOWN | TRUE |
| FALSE | TRUE | TRUE |
| FALSE | FALSE | FALSE |
| FALSE | UNKNOWN | UNKNOWN |
| UNKNOWN | TRUE | TRUE |
| UNKNOWN | FALSE | UNKNOWN |
| UNKNOWN | UNKNOWN | UNKNOWN |
Practical Implications:
Comparisons with NULL are UNKNOWN:
WHERE Clauses Only Return TRUE Rows:
NOT UNKNOWN = UNKNOWN:
Use IS NULL / IS NOT NULL for NULL Testing:
Common mistake: WHERE column <> 'value' excludes NULLs. If you want rows where column is not 'value' INCLUDING NULLs, you need: WHERE column <> 'value' OR column IS NULL. Three-valued logic trips up even experienced SQL developers.
We've traversed the mathematical landscape underlying the relational model. Let's consolidate how these abstract concepts connect to practical database work:
Why This Matters in Practice:
| Mathematical Concept | Practical Application |
|---|---|
| Set properties | UNION, INTERSECT, EXCEPT queries work correctly |
| Tuple structure | Column ordering, SELECT *, explicit column lists |
| Cartesian products | CROSS JOIN, understanding join explosions |
| Domain semantics | Type safety, meaningful foreign keys |
| Algebraic properties | Query optimizer transforms queries automatically |
| Three-valued logic | Handling NULLs correctly in WHERE, CASE, JOINs |
What's Next:
With the mathematical foundation established, we'll now examine how these abstract relations materialize as tables—the concrete structures you interact with in every SQL database. The next page bridges the gap from mathematical purity to practical implementation.
You now understand the set-theoretic and algebraic basis of the relational model. This foundation enables you to reason about database behavior mathematically—understanding not just what SQL does, but why it works and how optimizers transform your queries. Next, we'll see how these abstractions become concrete tables.