Loading learning content...
In 1970, Edgar F. Codd, a mathematician and computer scientist at IBM, published a paper that would fundamentally transform how humanity stores and retrieves information. Titled "A Relational Model of Data for Large Shared Data Banks", this paper introduced a deceptively simple yet profoundly powerful concept: the relation.
Before Codd's work, database systems were dominated by hierarchical and network models—complex, inflexible structures that required programmers to navigate intricate pointer chains and understand physical storage layouts. Codd's insight was revolutionary: separate the logical representation of data from its physical storage. The vehicle for this separation was the mathematical concept of a relation.
Understanding what a relation truly is—not just superficially as a table, but with mathematical precision—is essential for mastering relational databases. This precision is what gives the relational model its power: the ability to reason about data with the rigor of mathematics.
By the end of this page, you will: (1) Understand the formal, mathematical definition of a relation, (2) Distinguish between relations and informal notions of tables, (3) Appreciate why mathematical precision matters for database theory, (4) Recognize the key properties that define relations, and (5) Connect abstract definitions to practical database concepts.
At its core, a relation is a mathematical construct borrowed from set theory. To understand it properly, we must first establish the foundational concepts upon which it rests.
Definition (Informal): A relation is a two-dimensional structure composed of rows and columns, where each row represents a unique entity or fact, and each column represents a specific attribute of those entities.
Definition (Formal): Given domains D₁, D₂, ..., Dₙ (not necessarily distinct), a relation r is a subset of the Cartesian product D₁ × D₂ × ... × Dₙ. That is:
$$r \subseteq D_1 \times D_2 \times \cdots \times D_n$$
This formal definition may seem abstract, but it captures something profound: a relation is any subset of all possible combinations of values from its constituent domains. This means relations are fundamentally about sets of facts that we choose to assert as true.
Think of a relation as a collection of assertions. Each row (tuple) asserts that a particular combination of values exists together in the real world. The relation r(Employee, Department, Salary) containing (John, Engineering, 85000) asserts the fact: 'John works in Engineering and earns 85000.' The power of the relational model comes from treating these facts as mathematical sets that can be manipulated with algebraic precision.
Breaking Down the Definition:
Let's unpack this definition piece by piece:
Domains (D₁, D₂, ..., Dₙ): Each domain is a set of atomic (indivisible) values. For example, a domain for 'Age' might be the set of positive integers {1, 2, 3, ...}, while a domain for 'Name' might be the set of all character strings up to 100 characters.
Cartesian Product (×): The Cartesian product D₁ × D₂ × ... × Dₙ is the set of all possible n-tuples where the first element comes from D₁, the second from D₂, and so on. This represents every conceivable combination.
Subset (⊆): A relation is a subset of this Cartesian product. This is crucial—we don't include every possible combination, only those that represent true facts in our domain.
n-tuples: Each element in the relation is an ordered sequence of n values, called a tuple. If n = 3, we have 3-tuples (or triples).
Let's make this concrete with a carefully constructed example that demonstrates every aspect of the definition.
Scenario: We want to model student enrollment information for a university course.
Step 1: Define the Domains
| Domain Name | Symbol | Definition | Example Values |
|---|---|---|---|
| Student ID | D₁ | The set of 6-digit positive integers | 100001, 100002, 234567 |
| Student Name | D₂ | Strings of 1-50 alphabetic characters | Alice, Bob, Carol |
| Grade | D₃ | The set {A, B, C, D, F, I, W} | A, B, C, F |
| Credits | D₄ | The set {1, 2, 3, 4} | 3, 4 |
Step 2: Understand the Cartesian Product
The Cartesian product D₁ × D₂ × D₃ × D₄ contains every possible 4-tuple. Some examples:
The Cartesian product is enormous—potentially infinite if any domain is infinite. For a simplified example with only 100 students, 5 grades, and 4 credit options, we'd have 100 × ∞ × 5 × 4 possible combinations. This is why the subset aspect is crucial.
Step 3: Define the Relation
The relation Enrollment is the subset of this Cartesian product containing only the combinations that actually exist in our university's records:
| StudentID | StudentName | Grade | Credits |
|---|---|---|---|
| 100001 | Alice Chen | A | 4 |
| 100002 | Bob Smith | B | 3 |
| 100003 | Carol Davis | A | 4 |
| 100004 | David Kim | C | 3 |
| 100005 | Emma Wilson | B | 4 |
Key Observations:
The relation is finite even though some domains may be infinite. While the domain of StudentName theoretically contains infinitely many strings, our relation only contains 5 tuples.
Each tuple represents an asserted fact. The tuple (100001, "Alice Chen", A, 4) asserts: "Student 100001, named Alice Chen, received grade A in a 4-credit course."
Absence has meaning. If the tuple (100006, "Frank Jones", B, 3) is not in the relation, we assert that this fact is not true (or at least not recorded)—this is the Closed World Assumption.
All valid combinations from domains are structurally possible. Whether (100001, "Bob Smith", F, 4) is in our relation depends on what's true in the real world, not on structural constraints (though other constraints may apply).
In relational databases, we operate under the Closed World Assumption: if a fact is not explicitly present in the database, it is assumed to be false. This is a pragmatic choice—we can only query what we know. It contrasts with the Open World Assumption used in knowledge representation, where absence means 'unknown' rather than 'false.'
Relations in the relational model possess several fundamental properties that distinguish them from arbitrary data structures. Understanding these properties is essential for database design and query formulation.
In the mathematical definition, tuple and attribute ordering don't matter. However, practical SQL databases do maintain a physical column order, and some queries depend on column positions. Similarly, while relations mathematically cannot have duplicates, SQL tables can contain duplicate rows unless explicitly constrained. The theoretical model provides the ideal; implementations approximate it.
Why These Properties Matter:
No Duplicates → Meaningful Identity: Because each tuple is unique, we can always identify and refer to specific facts. This enables updates, deletions, and references without ambiguity.
Order Independence → Location Transparency: Since order doesn't matter, the database system can store and retrieve tuples in whatever physical order is most efficient. This enables query optimization.
Atomic Values → Query Simplicity: With atomic values, every query operates on simple, single values. You never need to 'look inside' a cell—enabling straightforward comparison, aggregation, and joining.
Named Attributes → Self-Describing Structures: Because attributes are named, not positional, relations are self-documenting. Queries can reference 'Employee.Salary' rather than 'the third column.'
The terms 'relation' and 'table' are often used interchangeably, but they represent fundamentally different concepts. A relation is a mathematical abstraction; a table is a concrete representation or implementation.
Understanding this distinction prevents conceptual errors that lead to poor database design.
SQL and the Relational Model:
SQL, the standard language for relational databases, is based on the relational model but departs from it in significant ways:
| Aspect | Pure Relational Model | SQL Implementation |
|---|---|---|
| Duplicates | Forbidden | Allowed (use DISTINCT to remove) |
| NULL values | Not in original model | Fully supported |
| Ordering | Undefined | Explicit via ORDER BY |
| Attribute position | Irrelevant | SELECT * returns fixed order |
| Bag vs. Set semantics | Set (unique tuples) | Bag (multiset, allows duplicates) |
These differences aren't flaws—they're pragmatic choices. Real-world data sometimes has duplicates, missing values (NULLs) are common, and users often want ordered results. But understanding the pure relational model helps you appreciate the theoretical foundation and recognize when SQL behavior might be surprising.
When designing databases, think in terms of relations (mathematical sets of facts), then implement as tables. This leads to cleaner designs because you focus on what information you're modeling, not how it will be stored. The implementation details can be adjusted; the conceptual model should be precise.
The relational model uses precise terminology borrowed from mathematics. Unfortunately, different communities use different terms for the same concepts, leading to confusion. Here's a comprehensive mapping:
| Formal Relational Model | Informal/File System | SQL | Description |
|---|---|---|---|
| Relation | Table, File | Table | The entire two-dimensional structure |
| Tuple | Row, Record | Row | One horizontal entry in the relation |
| Attribute | Column, Field | Column | One vertical component of the relation |
| Domain | Data type, Pool of values | Data type + constraints | Set of allowable values for an attribute |
| Degree (arity) | Number of columns | Column count | How many attributes the relation has |
| Cardinality | Number of rows | Row count | How many tuples the relation contains |
| Relation schema | Table structure | CREATE TABLE definition | The relation's name and attribute definitions |
| Relation instance | Table contents | Current table state | The set of tuples at a given time |
Why Terminology Matters:
Using precise terminology isn't pedantry—it enables clear communication and prevents errors:
Tuple vs. Row: A tuple is an ordered set of values. A row in SQL may have a row ID, may be ordered, and may contain NULLs. Conflating them leads to confusion about identity and equality.
Relation vs. Table: Saying 'relation' signals you're thinking mathematically about data as sets of facts. Saying 'table' might imply mutable state, physical storage, or SQL-specific behavior.
Attribute vs. Column: Attributes are named components of tuples. Columns in SQL have positions and specific data types with implementation-defined behaviors.
Domain vs. Data Type: A domain is a semantic concept (e.g., 'EmployeeID' as a meaningful identifier with constraints), while a data type is a syntactic one (e.g., INTEGER). A domain may be implemented as a data type, but carries additional meaning.
Throughout this curriculum, we will use formal terminology (relation, tuple, attribute) when discussing theory and informal terminology (table, row, column) when discussing SQL implementation. Being comfortable with both is essential for reading academic literature and writing production code.
Two fundamental metrics characterize every relation:
Degree (Arity): The degree of a relation is the number of attributes (columns) it contains. This is a property of the relation schema and remains constant throughout the relation's existence (unless altered by a schema change).
Cardinality: The cardinality of a relation is the number of tuples (rows) it contains. This is a property of a specific relation instance and changes as data is inserted, updated, or deleted.
The distinction is fundamental: degree is structural (part of the schema), while cardinality is dynamic (changes with content).
Department(DeptID, DeptName, Manager, Budget, Location)Degree: 5 (five attributes)
Cardinality: Changes based on how many departments existIf this relation currently contains 12 departments, its cardinality is 12. Adding a new department increases cardinality to 13. Adding a new attribute (like 'Founded_Year') would increase degree to 6—but this requires schema modification.
Why This Distinction Matters for Performance:
Database performance is dramatically affected by both degree and cardinality:
| Metric | Increase Impact | Performance Consideration |
|---|---|---|
| Degree | More data per tuple | Wider rows consume more memory; fewer rows fit in cache; index choices become more complex |
| Cardinality | More tuples total | Larger tables require more storage; scans take longer; index selectivity becomes crucial |
A relation with degree 50 (50 columns) and cardinality 1,000 behaves very differently from one with degree 3 and cardinality 10,000,000.
Database designers must consider both: narrow tables (low degree) can often be stored more efficiently, while proper indexing and partitioning strategies depend heavily on anticipated cardinality.
The relation is not merely one data structure among many—it represents a paradigm shift in how we conceptualize data. To fully appreciate its significance, we must understand what came before.
Pre-Relational Systems:
Before 1970, database systems were dominated by:
Hierarchical Model (IMS, 1966): Data organized as trees. Parent-child relationships hard-coded. Queries required navigating pointer chains. Changing structure meant rewriting programs.
Network Model (CODASYL, 1969): Data organized as graphs. More flexible than hierarchical, but still required explicit navigation. Programs were tightly coupled to physical data structures.
Both models had severe limitations:
Codd's insight was to separate the 'what' from the 'how.' Users should declare what data they want, not how to get it. This separation is provided by relations—abstract, logical structures that can be manipulated mathematically without knowing anything about physical storage.
Why Relations Won:
The relation succeeded because it provides:
Data Independence: Programs reference relations (logical structures), not files (physical structures). The DBMS can change physical organization without breaking applications.
Non-Procedural Queries: Users can ask 'Give me all employees in Engineering earning over $100K' without specifying how to search, which indexes to use, or in what order to access data.
Mathematical Foundation: Relations support a complete algebra of operations (select, project, join, etc.) with well-defined semantics. This enables query optimization, formal verification, and principled reasoning.
Simplicity: A relation is conceptually simple—just a table of facts. Users don't need to understand trees, graphs, or pointers.
Extensibility: New data can be added as new relations without modifying existing structures. New queries can be written without new programs.
These advantages were so compelling that relational databases became dominant by the 1990s and remain so today, despite the rise of NoSQL alternatives for specific use cases.
We have established the foundational concept upon which all relational database theory rests: the relation. Let's consolidate what we've learned:
What's Next:
With the definition of a relation firmly established, we're ready to explore its mathematical foundations in greater depth. The next page examines the set-theoretic and algebraic structures that give relations their power—enabling us to manipulate data with the precision of mathematics and the efficiency of optimized algorithms.
You now understand what a relation truly is—not just informally as a table, but as a rigorous mathematical structure. This precision will serve you throughout your study of databases, enabling you to reason about schemas, queries, and optimizations with clarity and confidence.