Loading content...
When E.F. Codd introduced the relational model in 1970, he didn't just propose a new way to organize data—he grounded database management in rigorous mathematics. This wasn't an academic exercise; it was a strategic decision that would prove transformative.
Before the relational model, databases were engineering artifacts. Queries against hierarchical and network databases required intimate knowledge of physical storage structures. Each optimization was ad-hoc, each query language proprietary.
Codd's insight was that by basing data management on set theory and first-order predicate logic, query operations could be defined formally, optimized systematically, and proven correct mathematically. The database could reason about data without knowing how it was stored.
This mathematical foundation is why you can write SELECT * FROM employees WHERE salary > 50000 and trust that the database will find all matching records—regardless of how they're indexed, partitioned, or distributed across storage.
By the end of this page, you will understand how set theory provides the foundation for relations and operations, how predicate logic enables expressive query languages, the role of relational algebra and calculus in query processing, and why mathematical foundations enable query optimization.
The relational model is built directly on set theory—the mathematical study of collections of objects. Understanding set theory illuminates why relations behave as they do.
Basic Set Concepts
A set is an unordered collection of distinct elements. Key properties:
Sets can be defined by:
| Set Operation | Symbol | Relational Equivalent | Example |
|---|---|---|---|
| Union | A ∪ B | UNION | Combine employees from two departments |
| Intersection | A ∩ B | INTERSECT | Employees in both project A AND project B |
| Difference | A − B | EXCEPT / MINUS | Employees in A but not in B |
| Cartesian Product | A × B | CROSS JOIN | All possible employee-department pairs |
| Subset | A ⊆ B | (no direct SQL) | All managers are employees |
| Power Set | P(A) | (no direct SQL) | All possible subsets of attributes |
How Sets Map to Relations
Recall that a relation is formally a set of tuples. This set-theoretic definition immediately gives us:
1. No Duplicate Rows Since sets cannot contain duplicates, relations cannot contain duplicate tuples by definition. This is why SQL's DISTINCT keyword exists—SQL tables can violate pure set semantics.
2. Set Operations Work on Relations Because relations are sets, set operations (union, intersection, difference) apply directly. We can combine relations, find common elements, or compute differences—all with well-defined mathematical semantics.
3. Relations are Subsets of Cartesian Products Given domains D₁, D₂, ..., Dₙ for n attributes:
This is profound: the relation represents which facts are true out of all facts that could be true.
Domains:
Department = {'Engineering', 'Marketing', 'Finance'}
Level = {'Junior', 'Senior', 'Lead'}
Cartesian Product Department × Level (all 9 possible combinations):
{('Engineering', 'Junior'), ('Engineering', 'Senior'), ('Engineering', 'Lead'),
('Marketing', 'Junior'), ('Marketing', 'Senior'), ('Marketing', 'Lead'),
('Finance', 'Junior'), ('Finance', 'Senior'), ('Finance', 'Lead')}OpenPositions Relation (a subset representing actual open positions):
{('Engineering', 'Senior'), ('Marketing', 'Junior'), ('Finance', 'Lead')}
This relation asserts:
✓ Engineering needs a Senior
✓ Marketing needs a Junior
✓ Finance needs a Lead
✗ All other combinations are NOT open positionsBecause relations are sets, every query result is also a set (and therefore a relation). This closure property means operations can be composed arbitrarily: the output of any query can be the input to another. This enables query languages to express arbitrarily complex retrievals through operation composition.
While set theory provides the structural foundation, first-order predicate logic (also called first-order logic or predicate calculus) provides the foundation for expressing queries and constraints.
What is Predicate Logic?
Predicate logic extends propositional logic with:
Predicates in the Relational Model
In relational databases, each relation can be viewed as a predicate:
Employee(emp_id, name, department, salary)
This is a 4-place predicate. The tuple (1001, 'Alice', 'Engineering', 95000) makes the predicate true:
Employee(1001, 'Alice', 'Engineering', 95000) = TRUE
A tuple NOT in the relation makes the predicate false:
Employee(9999, 'Nobody', 'Nowhere', 0) = FALSE
Quantifiers
Predicate logic uses two quantifiers:
Universal Quantifier (∀) — "for all"
Existential Quantifier (∃) — "there exists"
Logical Connectives
Predicates can be combined using:
SQL Query:
SELECT name FROM Employee WHERE salary > 80000 AND department = 'Engineering'
Predicate Logic:
{name | ∃emp_id, dept, sal (Employee(emp_id, name, dept, sal)
∧ sal > 80000
∧ dept = 'Engineering')}Reading the formula:
"The set of all names such that there exists an employee with that name
who has a salary greater than 80000 and works in Engineering"
This is Tuple Relational Calculus notation—a formal query language
based directly on predicate logic.SQL's WHERE clause is essentially predicate logic with user-friendly syntax. WHERE salary > 50000 AND department = 'Engineering' is a predicate that each row must satisfy. The SELECT clause specifies which variables (attributes) to return. JOINs combine predicates from multiple relations. Understanding this connection demystifies SQL and reveals its expressive power.
Relational algebra is a procedural query language that operates on relations and produces relations. It consists of a set of operations that take one or two relations as input and produce a new relation as output.
Relational algebra is procedural: it specifies not just what to retrieve, but how to compute it—a sequence of operations.
The Fundamental Operations
Codd identified six fundamental operations from which all other operations can be derived:
| Operation | Symbol | Description | SQL Equivalent |
|---|---|---|---|
| Selection | σ (sigma) | Filter rows by a predicate | WHERE clause |
| Projection | π (pi) | Select specific columns | SELECT column list |
| Union | ∪ | Combine all rows from two compatible relations | UNION |
| Set Difference | − | Rows in first but not second relation | EXCEPT |
| Cartesian Product | × | All combinations of rows from two relations | CROSS JOIN |
| Rename | ρ (rho) | Rename relation or attributes | AS (alias) |
Derived Operations
From the six fundamentals, we can derive useful shortcut operations:
Intersection (∩) Rows common to both relations. Derived: R ∩ S = R − (R − S)
Join (⋈) Combine related rows from two relations.
Division (÷) Rows in R associated with ALL rows in S. Useful for "find X that has all Y" queries.
Outer Joins Preserve rows that don't match:
Goal: Find names of employees in Engineering earning over $80,000
Relations:
Employee(emp_id, name, department, salary)
Step-by-step Relational Algebra:
1. Select Engineering employees:
σ_department='Engineering' (Employee)
2. Select high earners from step 1:
σ_salary>80000 (σ_department='Engineering' (Employee))
3. Project just the names:
π_name (σ_salary>80000 (σ_department='Engineering' (Employee)))Alternative combined notation:
π_name (σ_department='Engineering' ∧ salary>80000 (Employee))
This algebraic expression is EQUIVALENT to:
SELECT name
FROM Employee
WHERE department = 'Engineering' AND salary > 80000
The algebra provides a formal basis for query execution.You rarely write relational algebra directly, but it's the internal language of database query engines. SQL queries are parsed into algebraic expressions, optimized by reordering operations, then executed. Understanding algebra helps you understand query plans, predict performance, and write more efficient SQL.
While relational algebra is procedural (specify HOW), relational calculus is declarative (specify WHAT). It describes the desired result without specifying how to compute it.
There are two equivalent forms of relational calculus:
1. Tuple Relational Calculus (TRC)
Variables range over tuples. Example:
{t.name | Employee(t) ∧ t.salary > 80000}
Reads: "The set of name values from tuples t where t is in Employee and t's salary exceeds 80000."
2. Domain Relational Calculus (DRC)
Variables range over domain values (attributes). Example:
{<n> | ∃e, d, s (Employee(e, n, d, s) ∧ s > 80000)}
Reads: "The set of name values n where there exist values e, d, s such that (e, n, d, s) is in Employee and s exceeds 80000."
Codd's Theorem: Equivalence
A fundamental result in database theory is Codd's Theorem, which states:
Relational algebra and the safe subset of relational calculus are expressively equivalent.
This means:
The "safe" qualifier is important. Unsafe calculus expressions can describe infinite results:
{t | ¬Employee(t)} -- All tuples that are NOT employees
This describes infinitely many tuples! Safe calculus restricts expressions to guarantee finite results.
Why This Equivalence Matters
You can write queries declaratively (tell the database what you want) and the database can transform them to a procedural execution plan (how to get it). The mathematical equivalence guarantees the transformation is correct.
SQL combines declarative and procedural elements. The FROM clause establishes relations, WHERE filters (like σ), SELECT projects (like π), and JOIN combines (like ×, ⋈). Subqueries can be viewed as calculus-style expressions. This hybrid nature makes SQL expressive yet sometimes confusing.
One of the greatest benefits of mathematical foundations is query optimization. Because relational operations obey algebraic laws, databases can transform queries into more efficient forms while guaranteeing identical results.
Algebraic Equivalences
Just as arithmetic obeys laws (a + b = b + a), relational algebra obeys laws:
Selection Laws:
Projection Laws:
Selection-Projection Commutativity:
Join Laws:
Original Query (inefficient):
π_name(Employee ⋈ Department) where department.budget > 1000000
This joins FIRST (expensive), then filters (cheap on small result).
Let Employee have 10,000 rows, Department have 100 rows.
Join produces all 10,000 employee-department combinations first.Optimized Query (equivalent but faster):
π_name(Employee ⋈ (σ_budget>1000000(Department)))
This filters Department FIRST (cheap), then joins with smaller set.
If only 10 departments have budget > 1M:
- Original: Join 10000 × 100, then filter → 1,000,000 comparisons
- Optimized: Filter to 10 depts, join 10000 × 10 → 100,000 comparisons
10x faster, mathematically guaranteed to give identical results!Common Optimization Strategies
1. Push Selections Down Apply filters as early as possible to reduce intermediate result sizes. This is almost always beneficial.
2. Push Projections Down Remove unneeded columns early to reduce memory and I/O. Be careful: some operations need columns that might be projected out prematurely.
3. Reorder Joins Since joins are commutative and associative, the optimizer can choose the most efficient join order. For n tables, there are (2n-2)! / (n-1)! possible orders.
4. Choose Join Algorithms Based on table sizes and indexes, choose between:
5. Use Indexes Replace table scans with index lookups when conditions use indexed columns.
Modern databases show query execution plans (EXPLAIN in PostgreSQL/MySQL). These reveal the algebraic operations the optimizer chose. Understanding relational algebra helps you read these plans, identify bottlenecks, and write queries that optimize better.
The mathematical foundation enables us to express and enforce integrity constraints—rules that define valid database states. Constraints can be expressed as predicate logic formulas that must be true for every valid database instance.
Types of Integrity Constraints
| Constraint Type | Predicate Logic Expression | SQL Equivalent |
|---|---|---|
| Domain Constraint | ∀t ∈ R: t.age ∈ {1..120} | CHECK (age BETWEEN 1 AND 120) |
| Not Null | ∀t ∈ R: t.name ≠ NULL | name NOT NULL |
| Unique Key | ∀t1,t2 ∈ R: t1.key = t2.key → t1 = t2 | UNIQUE (key) |
| Primary Key | Unique ∧ Not Null on key attributes | PRIMARY KEY (id) |
| Foreign Key | ∀t ∈ R: ∃s ∈ S: t.fk = s.pk | FOREIGN KEY REFERENCES S(pk) |
| Entity Integrity | ∀t ∈ R: t.pk ≠ NULL | PRIMARY KEY implies NOT NULL |
Referential Integrity in Depth
Referential integrity ensures that foreign key values match existing primary keys. Formally:
∀r ∈ R: r.foreign_key = NULL ∨ ∃s ∈ S: r.foreign_key = s.primary_key
This means: Every foreign key value is either NULL (if allowed) or refers to an existing primary key.
Enforcement Actions:
When a referenced primary key is deleted or updated:
Business Rule:
"A manager's salary must be greater than all their direct reports"
Predicate Logic:
∀m, e ∈ Employee:
(e.manager_id = m.emp_id) → (m.salary > e.salary)
SQL CHECK (complex, requires triggers in most DBMSs):
Only expressible via triggers or application logic in standard SQLAnother Rule:
"Every department must have at least one employee"
Predicate Logic:
∀d ∈ Department: ∃e ∈ Employee: e.department_id = d.id
This is an ASSERTION (not directly supported in most DBMSs).
Workaround: Use triggers or application-level enforcement.
The gap between theoretical expressiveness and practical SQL
is why understanding the theory helps—you know what's missing.While predicate logic can express almost any constraint, SQL cannot enforce all of them directly. Complex constraints involving multiple tables, aggregations, or temporal conditions often require triggers, stored procedures, or application-level enforcement. Knowing the theory helps you identify what the database CAN'T enforce automatically.
Functional dependencies (FDs) are a mathematical concept central to database design. They describe how attribute values depend on each other.
Definition
A functional dependency X → Y holds in relation R if, whenever two tuples have the same values for attributes X, they must also have the same values for attributes Y.
In predicate logic:
∀t1, t2 ∈ R: t1.X = t2.X → t1.Y = t2.Y
Example:
emp_id → name, department, salary
If two tuples have the same emp_id, they must have the same name, department, and salary. This makes emp_id a candidate key.
zip_code → city, state (in US)
A zip code determines the city and state.
Armstrong's Axioms
Functional dependencies obey a complete set of inference rules called Armstrong's Axioms:
1. Reflexivity: If Y ⊆ X, then X → Y
2. Augmentation: If X → Y, then XZ → YZ
3. Transitivity: If X → Y and Y → Z, then X → Z
Derived Rules:
These rules are sound (only produce valid implications) and complete (can derive all valid implications).
Relation: CourseSection(course_id, section, semester, year, instructor, room)
Functional Dependencies:
1. {course_id, section, semester, year} → instructor
2. {course_id, section, semester, year} → room
3. {room, semester, year} → course_id, section (room is used for one course at a time)
Question: What are the candidate keys?Analysis:
- From FDs 1 & 2 (union): {course_id, section, semester, year} → {instructor, room}
- This means {course_id, section, semester, year} determines ALL other attributes
- Therefore it's a superkey
- Is it minimal? Yes—no subset determines all attributes
- Therefore it's a CANDIDATE KEY
- From FD 3: {room, semester, year} → {course_id, section}
- Combined with FD 1: {room, semester, year} → entire relation
- Therefore {room, semester, year} is ANOTHER candidate key!
The relation has two candidate keys.Functional dependencies are the mathematical basis for normalization—the process of organizing relations to eliminate redundancy and anomalies. Each normal form (1NF, 2NF, 3NF, BCNF) is defined in terms of FDs. Without understanding FDs mathematically, normalization becomes a set of mysterious rules rather than logical derivations.
The mathematical foundations of the relational model aren't just theoretical elegance—they enable practical capabilities that would be impossible otherwise.
Relational Completeness
A query language is relationally complete if it can express any query that relational algebra/calculus can express. Codd defined this as the minimum bar for a relational query language.
SQL is relationally complete (and more—it includes aggregation, recursion in modern versions, etc.). But this baseline ensures you can always express the fundamental queries the relational model was designed to support.
Beyond Completeness: Extensions
Modern SQL extends beyond relational completeness with:
Each extension has its own theoretical foundations, but all build on the solid set-theoretic and logical base.
You don't need to work through predicate logic proofs to use a database. But when you need to understand why a query is slow, debug a constraint violation, or design a normalized schema, the mathematical foundations provide the framework for reasoning precisely about what's happening and what should happen.
The relational model's mathematical foundations are what elevate it from a data storage technique to a principled data management system. Let's consolidate the key concepts:
What's Next:
Having explored the mathematical foundations, we'll now examine Codd's 12 Rules—the definitive criteria that Dr. Codd established to determine whether a database system is truly relational. These rules translate the mathematical theory into practical requirements for database implementations.
You now understand the mathematical foundations that power the relational model—set theory, predicate logic, relational algebra, and relational calculus. This isn't abstract theory; it's the engine that enables query optimization, constraint enforcement, normalization, and every other principled aspect of relational databases. Next, we'll see how Codd codified these principles into practical rules for RDBMS implementations.