Loading content...
In the previous pages, we established that a relation may have multiple candidate keys—each capable of uniquely identifying tuples. But in practice, databases don't treat all candidate keys equally. One must be designated as the primary means of identification.
Consider an Employee table with three candidate keys: {EmployeeID}, {SSN}, and {Email}. Any of these could technically serve as the identifier, but the database system needs to know which one is primary—the one used for:
This designated candidate key is called the primary key—arguably the most important concept in relational database design.
By the end of this page, you will understand the formal definition and properties of primary keys, master the criteria for selecting the best primary key from available candidates, appreciate the distinction between natural and surrogate primary keys, and recognize primary key implementation patterns across different database systems.
A primary key is a candidate key that has been explicitly designated as the principal identifier for tuples in a relation. Each relation has exactly one primary key, though it may have multiple candidate keys.
Essential properties of a primary key:
It is a candidate key: This means it's a superkey (uniquely identifies tuples) and it's minimal (no proper subset is a superkey).
It is singular: A relation has exactly one primary key. Other candidate keys become alternate keys.
It enforces entity integrity: The primary key cannot contain NULL values in any component attribute.
It is the target of foreign keys: Other relations reference this relation through its primary key.
Mathematical formalization:
Let R be a relation schema with candidate keys CK = {K₁, K₂, ..., Kₙ}. The primary key is a designated element PK ∈ CK such that:
PK → R (superkey property)∀A ∈ PK: (PK - {A})⁺ ≠ R (minimality)∀t ∈ r(R), ∀A ∈ PK: t[A] ≠ NULL (entity integrity)The selection of PK from CK is a design decision, not a mathematical determination. The relational model tells us the candidates; the designer chooses the primary.
| Property | Candidate Key | Primary Key |
|---|---|---|
| Uniqueness | Yes | Yes |
| Minimality | Yes | Yes |
| Multiple per relation? | Yes (can be many) | No (exactly one) |
| NULL values allowed? | Depends on DBMS | Never (entity integrity) |
| Foreign key target? | Can be, but unusual | Primary target for FKs |
| Physical implications? | Usually none | Often affects clustering, indexing |
One of the fundamental constraints of the relational model is entity integrity, and it applies specifically to primary keys.
No attribute of a primary key can be NULL. If the primary key is composite (consists of multiple attributes), NONE of those attributes may contain NULL values for any tuple.
Why entity integrity matters:
The primary key serves as the identity of each tuple. If we allowed NULL values:
Identity becomes ambiguous: A tuple with a NULL primary key cannot be reliably identified or referenced.
Comparisons fail: In SQL, NULL = NULL evaluates to UNKNOWN (not TRUE). This means we couldn't reliably determine if two tuples were 'the same'.
Foreign key references break: Other tables referencing this tuple would reference... nothing? The semantics become undefined.
Uniqueness enforcement fails: We can't guarantee uniqueness when NULL is involved, since NULL ≠ NULL.
Example: The NULL Problem
Suppose we allowed NULL in primary keys:
| EmployeeID | Name | Department |
|---|---|---|
| E001 | John | Engineering |
| NULL | Jane | Marketing |
| NULL | Bob | Sales |
Problems:
Payroll.EmployeeID reference Jane? It can't reliably.SELECT * WHERE EmployeeID = NULL returns nothing (NULL ≠ NULL in SQL).Entity integrity prevents this chaos by requiring primary keys to always have meaningful, non-NULL values.
For composite primary keys, entity integrity applies to EVERY attribute in the key. If the primary key is {StudentID, CourseID, Semester}, then ALL THREE must be non-NULL for every tuple. A NULL in any component violates entity integrity.
When a relation has multiple candidate keys, choosing the primary key is a crucial design decision. Several criteria guide this choice:
Example: Applying Selection Criteria
For an Employee table with candidate keys {EmployeeID}, {SSN}, {Email}:
| Criterion | EmployeeID | SSN | |
|---|---|---|---|
| Stability | ⭐⭐⭐ Rarely changes | ⭐⭐ Can change | ⭐ Often changes |
| Simplicity | ⭐⭐⭐ Short integer | ⭐⭐ 9 digits | ⭐ Variable length |
| Privacy | ⭐⭐⭐ Not sensitive | ⭐ Highly sensitive | ⭐⭐ Semi-sensitive |
| Existence | ⭐⭐⭐ Always assigned | ⭐⭐ Not universal | ⭐⭐ Usually exists |
| Performance | ⭐⭐⭐ Integer comparison | ⭐⭐ String | ⭐ Variable string |
Winner: EmployeeID — Best overall score across criteria.
SSN becomes an alternate key (with unique constraint). Email becomes an alternate key (with unique constraint but may change).
Primary key selection involves tradeoffs. In some domains, a natural key (like ISBN for books) is ideal. In others, a synthetic/surrogate key is better. Context matters—understand your domain's specific requirements.
One of the most debated topics in database design is whether to use natural keys (derived from real-world attributes) or surrogate keys (artificially generated identifiers).
A natural key is a candidate key composed of attributes that exist in the real world and have business meaning. Examples: ISBN for books, SSN for US residents, VIN for vehicles, country code + passport number for travelers.
A surrogate key is an artificial attribute created solely to serve as the primary key. It has no business meaning outside the database. Examples: auto-increment integers (1, 2, 3...), UUIDs, system-generated codes.
When to use natural keys:
The natural key is truly stable: ISBN doesn't change for a book edition. Country codes are internationally standardized.
External integration is critical: APIs, data exports, and external systems use the natural identifier.
Lookup patterns match: Users search by the natural key (e.g., find book by ISBN).
Uniqueness is guaranteed externally: An authority ensures no duplicates (ISBN agency, passport authority).
When to use surrogate keys:
No good natural key exists: People change names. Addresses change. Email changes.
Natural keys are composite: A 5-attribute natural key makes joins expensive.
Privacy matters: Don't expose SSN in every foreign key column.
Mergers/acquisitions are possible: Different divisions may have overlapping natural key spaces.
Performance is critical: Integer auto-increment outperforms string comparisons in joins.
Many systems use BOTH: a surrogate primary key for relationships and indexing, PLUS natural keys as alternate keys (unique constraints) for validation and external integration. This combines the advantages of both approaches.
Modern databases provide various mechanisms for automatically generating primary key values. Understanding these options is essential for practical database design.
| Strategy | Format Example | Pros | Cons |
|---|---|---|---|
| Auto-Increment (IDENTITY/SERIAL) | 1, 2, 3, 4, 5... | Simple, compact, ordered, fast | Sequential (security risk), single-DB only |
| UUID/GUID | 550e8400-e29b-41d4-a716-446655440000 | Globally unique, distributed-safe | Large (16 bytes), random (index fragmentation) |
| ULID | 01ARZ3NDEKTSV4RRFFQ69G5FAV | Sortable, distributed-safe, compact | Newer standard, less support |
| Snowflake ID | 1382971839198277632 | Sortable, distributed, compact (64-bit) | Requires ID service, clock dependency |
| Sequence | Explicit NEXT VALUE | Controlled, predictable | Requires extra object, potential bottleneck |
SQL Implementation Examples:
1234567891011121314151617181920212223242526272829303132333435363738
-- Auto-Increment (MySQL)CREATE TABLE Customer ( CustomerID INT AUTO_INCREMENT PRIMARY KEY, Name VARCHAR(100) NOT NULL, Email VARCHAR(255) UNIQUE); -- IDENTITY (SQL Server)CREATE TABLE Customer ( CustomerID INT IDENTITY(1,1) PRIMARY KEY, Name NVARCHAR(100) NOT NULL, Email NVARCHAR(255) UNIQUE); -- SERIAL (PostgreSQL)CREATE TABLE Customer ( CustomerID SERIAL PRIMARY KEY, Name VARCHAR(100) NOT NULL, Email VARCHAR(255) UNIQUE); -- UUID (PostgreSQL with extension)CREATE EXTENSION IF NOT EXISTS "uuid-ossp"; CREATE TABLE Customer ( CustomerID UUID DEFAULT uuid_generate_v4() PRIMARY KEY, Name VARCHAR(100) NOT NULL, Email VARCHAR(255) UNIQUE); -- Sequence (Oracle/PostgreSQL)CREATE SEQUENCE customer_seq START WITH 1 INCREMENT BY 1; CREATE TABLE Customer ( CustomerID INT DEFAULT NEXT VALUE FOR customer_seq PRIMARY KEY, Name VARCHAR(100) NOT NULL, Email VARCHAR(255) UNIQUE);For single-database applications, auto-increment integers are usually best (simple, fast, compact). For distributed systems or when data merging is expected, UUIDs or Snowflake IDs prevent collisions. Consider your specific scalability and integration requirements.
When the natural candidate key consists of multiple attributes, we face a design decision: use the composite key as the primary key, or introduce a surrogate. Understanding the implications of composite primary keys is crucial.
A composite primary key (or compound primary key) consists of two or more columns that together uniquely identify each row. No single column suffices—the combination is required.
Example: Order Line Items
CREATE TABLE OrderLine (
OrderID INT NOT NULL,
LineNumber INT NOT NULL,
ProductID INT NOT NULL,
Quantity INT NOT NULL,
UnitPrice DECIMAL(10,2) NOT NULL,
PRIMARY KEY (OrderID, LineNumber),
FOREIGN KEY (OrderID) REFERENCES Orders(OrderID),
FOREIGN KEY (ProductID) REFERENCES Products(ProductID)
);
Here, {OrderID, LineNumber} is the composite primary key:
OrderID alone doesn't work (many lines per order)LineNumber alone doesn't work (same number in different orders)Implications of composite primary keys:
When to use composite vs surrogate:
Use composite primary key when:
Use surrogate primary key when:
Different database management systems implement primary keys with subtle variations. Understanding these differences is essential for practical database work.
12345678910111213141516171819202122232425262728293031323334353637383940
-- Column-level constraint (single-column key)CREATE TABLE Employee ( EmployeeID INT PRIMARY KEY, -- Inline constraint Name VARCHAR(100) NOT NULL); -- Table-level constraint (single or composite)CREATE TABLE Employee ( EmployeeID INT NOT NULL, Name VARCHAR(100) NOT NULL, PRIMARY KEY (EmployeeID) -- Separate declaration); -- Table-level with composite keyCREATE TABLE Enrollment ( StudentID INT NOT NULL, CourseID INT NOT NULL, Semester VARCHAR(20) NOT NULL, Grade CHAR(2), PRIMARY KEY (StudentID, CourseID, Semester)); -- Named constraint (recommended for maintenance)CREATE TABLE Employee ( EmployeeID INT NOT NULL, Name VARCHAR(100) NOT NULL, CONSTRAINT pk_employee PRIMARY KEY (EmployeeID)); -- Adding PK to existing table (ALTER statement)ALTER TABLE EmployeeADD CONSTRAINT pk_employee PRIMARY KEY (EmployeeID); -- PostgreSQL: Primary key with custom tablespaceCREATE TABLE Customer ( CustomerID SERIAL, Name VARCHAR(100) NOT NULL, CONSTRAINT pk_customer PRIMARY KEY (CustomerID) USING INDEX TABLESPACE fast_tablespace);| DBMS | Default Index Type | Clustering | Auto-Increment Syntax |
|---|---|---|---|
| MySQL (InnoDB) | B+Tree | PK is clustered index | AUTO_INCREMENT |
| PostgreSQL | B-Tree | Not clustered by default (use CLUSTER) | SERIAL / IDENTITY |
| SQL Server | B-Tree | Clustered by default (configurable) | IDENTITY(seed, increment) |
| Oracle | B-Tree | Not clustered (IOT optional) | SEQUENCE / IDENTITY (12c+) |
| SQLite | B-Tree | INTEGER PRIMARY KEY is rowid | AUTOINCREMENT (different from default) |
In MySQL InnoDB and SQL Server (by default), the primary key determines physical row ordering. This means: (1) PK determines I/O patterns, (2) Random PKs (like UUID) cause fragmentation, (3) Wide PKs increase secondary index sizes. Choose your primary key with storage implications in mind.
We've comprehensively covered primary keys—the designated identifier that serves as the foundational reference point for relational data. Let's consolidate the key insights:
What's next:
Now that we understand primary keys, we'll explore alternate keys—the candidate keys that were not selected as primary but still serve important roles in ensuring data integrity and providing alternate access paths.
You now understand primary keys as the designated tuple identifier. You can apply selection criteria to choose the best primary key, evaluate natural versus surrogate key tradeoffs, implement primary keys across different DBMS platforms, and appreciate the physical implications of primary key choices. Next, we'll explore alternate keys and their role in database integrity.