Loading learning content...
In 1970, E.F. Codd introduced the relational model and forever changed how we think about data storage and retrieval. But Codd understood something crucial: a data model without rules is merely a container for chaos. Entity integrity emerged as one of the two fundamental integrity rules that transform relational databases from passive storage into active guardians of data quality.
Entity integrity addresses a deceptively simple question: How do we ensure that every row in a table can be uniquely identified and meaningfully referenced? The answer to this question has profound implications for every database operation—from simple queries to complex transactions spanning multiple tables.
This page provides an exhaustive exploration of entity integrity, examining its theoretical foundations, practical manifestations, implementation strategies, and the subtle edge cases that can undermine even well-designed systems.
By the end of this page, you will understand: (1) The formal definition and theoretical basis of entity integrity, (2) Why NULL values are prohibited in primary keys, (3) How entity integrity relates to entity identity in the real world, (4) Implementation mechanisms across different database systems, (5) Common violations and how to prevent them, and (6) The relationship between entity integrity and other constraints.
Entity integrity is formally stated as follows:
Entity Integrity Rule: No attribute participating in the primary key of a base relation is allowed to accept NULL values.
This statement, while concise, encodes several critical concepts that deserve careful unpacking:
The mathematical perspective:
From a set-theoretic viewpoint, a relation is a subset of the Cartesian product of its attribute domains. Each tuple (row) in a relation represents a fact about some entity in the real world. The primary key serves as the unique identifier for this entity.
If we permitted NULL values in primary keys, we would face a fundamental logical paradox. Consider the comparison NULL = NULL. In SQL and relational logic, this evaluates to UNKNOWN, not TRUE. Therefore, two tuples with NULL primary keys cannot be definitively compared for equality, violating the fundamental requirement that primary keys uniquely identify tuples.
Relational databases operate with three-valued logic: TRUE, FALSE, and UNKNOWN. NULL comparisons always yield UNKNOWN. If primary keys could be NULL, the fundamental operation of determining tuple uniqueness would become logically undefined. Two rows with NULL keys might be the same entity, or they might not—we cannot know.
To truly understand entity integrity, we must examine what it means to identify an entity. In the relational model, each row in a table corresponds to an entity or fact in the real world. The primary key serves as the surrogate or natural identifier for this entity.
The Entity-Identity Principle:
Every entity in the real world has an identity—something that makes it distinguishable from all other entities of the same type. For a person, this might be a national ID number. For a product, it might be a manufacturer's serial number or a generated SKU. For an event, it might be a timestamp combined with a location.
When we model entities in a relational database, we must capture this real-world identity in the primary key. If the primary key could be NULL, we would essentially be saying: "This entity might exist, but we cannot identify it." Such a statement is a logical contradiction—how can something exist if we cannot distinguish it from anything else?
| Domain | Entity Type | Natural Identity | Why NULL Would Be Problematic |
|---|---|---|---|
| Healthcare | Patient | Medical Record Number | Cannot track treatments, allocate resources, or maintain history |
| Finance | Transaction | Transaction ID | Cannot audit, reverse, or reconcile financial movements |
| E-commerce | Order | Order Number | Cannot fulfill, track, or customer service the order |
| Government | Citizen | National ID | Cannot verify identity, grant benefits, or enforce obligations |
| Education | Student | Student ID | Cannot record grades, attendance, or degree progress |
| Manufacturing | Component | Serial Number | Cannot trace defects, manage recalls, or ensure quality |
The practical manifestation:
Consider an employee database. Each employee has a unique employee ID that identifies them within the organization. If we allowed NULL employee IDs:
NULL in a primary key essentially creates a phantom entity—something that occupies space in the database but cannot participate meaningfully in any operation that requires identification.
A useful mental model: if an entity is important enough to store in a database, it is important enough to identify. Entity integrity simply enforces this principle at the database level. The moment you decide to insert a row, you are asserting that this entity exists and matters—therefore, it must be identifiable.
The prohibition of NULL values in primary keys is not arbitrary—it emerges from several fundamental requirements of relational systems:
A detailed example of the problems:
Consider an Orders table where order_id is the primary key. Suppose we allowed NULL values:
12345678910111213141516171819202122232425262728293031323334353637
-- Hypothetical scenario: NULL allowed in primary key (this is NOT valid SQL)CREATE TABLE Orders ( order_id INT PRIMARY KEY, -- Hypothetically allows NULL customer_id INT, order_date DATE, total_amount DECIMAL(10,2)); -- Insert orders with NULL keysINSERT INTO Orders VALUES (NULL, 101, '2024-01-15', 250.00);INSERT INTO Orders VALUES (NULL, 102, '2024-01-16', 175.50);INSERT INTO Orders VALUES (1001, 101, '2024-01-17', 320.00); -- PROBLEM 1: Which order does the OrderItems reference?CREATE TABLE OrderItems ( item_id INT PRIMARY KEY, order_id INT REFERENCES Orders(order_id), -- Foreign key product_id INT, quantity INT); -- This item belongs to... which NULL order?INSERT INTO OrderItems VALUES (1, NULL, 5001, 2); -- PROBLEM 2: How do we update the NULL order?UPDATE Orders SET total_amount = 300.00 WHERE order_id = NULL; -- This matches NOTHING (NULL = NULL is UNKNOWN) -- PROBLEM 3: How do we join for reporting?SELECT o.order_id, c.customer_nameFROM Orders oJOIN Customers c ON o.customer_id = c.customer_idWHERE o.order_id = NULL; -- Returns no rows! -- PROBLEM 4: Are the two NULL orders the same order?-- We cannot know! They might represent the same entity or different entities.Each NULL in a primary key creates ripple effects throughout the database. Foreign keys become meaningless. Queries return unexpected results. Updates fail silently. Reports omit data. The relational model's elegant algebra breaks down completely. Entity integrity is not merely a best practice—it is a structural necessity.
Entity integrity becomes more nuanced when dealing with composite primary keys—primary keys consisting of multiple attributes. The rule is unambiguous: no attribute participating in the primary key may be NULL.
This means every component column of a composite key must have a non-NULL value. A single NULL in any component invalidates the entire key.
123456789101112131415161718192021222324
-- Enrollment table with composite primary keyCREATE TABLE Enrollment ( student_id INT NOT NULL, course_id VARCHAR(10) NOT NULL, semester CHAR(6) NOT NULL, enrollment_date DATE NOT NULL, grade CHAR(2), -- Can be NULL until graded PRIMARY KEY (student_id, course_id, semester)); -- VALID: All primary key components have valuesINSERT INTO Enrollment VALUES (1001, 'CS101', '2024FA', '2024-08-25', NULL); -- INVALID: student_id is NULL (violates entity integrity)INSERT INTO Enrollment VALUES (NULL, 'CS101', '2024FA', '2024-08-25', 'A'); -- INVALID: course_id is NULL (violates entity integrity) INSERT INTO Enrollment VALUES (1001, NULL, '2024FA', '2024-08-25', 'A'); -- INVALID: semester is NULL (violates entity integrity)INSERT INTO Enrollment VALUES (1001, 'CS101', NULL, '2024-08-25', 'A'); -- Note: grade CAN be NULL because it's not part of the primary keyWhy each component matters:
In the enrollment example above, the composite key (student_id, course_id, semester) represents a unique enrollment event. Each component carries essential identifying information:
student_id identifies who is enrolledcourse_id identifies what they are enrolled insemester identifies when they are enrolledIf any of these were NULL, the enrollment record would be incomplete:
None of these scenarios represents a meaningful fact worth recording.
Composite primary keys frequently appear in junction tables (bridge tables) that implement many-to-many relationships. In these tables, entity integrity is especially critical because the junction table's purpose is precisely to connect two entities. A NULL in either foreign key component means the connection is incomplete—connecting an entity to nothing.
| Table | Composite Key Components | Meaning of Each Component | Consequence of NULL |
|---|---|---|---|
| OrderItems | order_id + product_id | Which order contains which product | Item belongs to no order or is no product |
| FlightBooking | flight_id + passenger_id + seat_number | Who sits where on which flight | Unbookable reservation |
| MovieCast | movie_id + actor_id + role_name | Who plays what role in which movie | Unattributable performance |
| ExamScore | student_id + exam_id + question_number | Specific student response to specific question | Unscorable answer |
Modern relational database management systems automatically enforce entity integrity when you declare a primary key. However, the implementation details and additional options vary across systems. Understanding these mechanisms helps you leverage them effectively.
1234567891011121314151617181920212223242526272829303132333435
-- PostgreSQL automatically enforces NOT NULL on primary key columnsCREATE TABLE employees ( employee_id SERIAL PRIMARY KEY, first_name VARCHAR(50) NOT NULL, last_name VARCHAR(50) NOT NULL, email VARCHAR(100) UNIQUE, hire_date DATE DEFAULT CURRENT_DATE); -- The PRIMARY KEY constraint implicitly adds:-- 1. NOT NULL constraint on employee_id-- 2. UNIQUE constraint on employee_id-- 3. Creates a B-tree index on employee_id -- Attempting NULL insertion fails:INSERT INTO employees (employee_id, first_name, last_name)VALUES (NULL, 'John', 'Doe');-- ERROR: null value in column "employee_id" violates not-null constraint -- SERIAL auto-generates values, preventing NULL naturallyINSERT INTO employees (first_name, last_name, email)VALUES ('John', 'Doe', 'john.doe@example.com');-- employee_id automatically assigned (e.g., 1) -- Composite key exampleCREATE TABLE order_items ( order_id INT NOT NULL, product_id INT NOT NULL, quantity INT NOT NULL DEFAULT 1, unit_price DECIMAL(10,2) NOT NULL, PRIMARY KEY (order_id, product_id), FOREIGN KEY (order_id) REFERENCES orders(order_id), FOREIGN KEY (product_id) REFERENCES products(product_id));All major databases provide mechanisms to automatically generate primary key values (SERIAL, AUTO_INCREMENT, IDENTITY, Sequences). These features don't just prevent NULL—they eliminate the need for application code to generate and manage unique identifiers. Use them unless you have a specific reason for natural keys.
The choice between surrogate keys (artificially generated identifiers) and natural keys (real-world identifiers) has significant implications for entity integrity maintenance.
Surrogate keys are system-generated values with no inherent meaning outside the database. Examples include auto-incremented integers, UUIDs, or database-generated sequences.
Natural keys are values that have meaning in the real world and uniquely identify entities. Examples include Social Security Numbers, ISBN numbers, or email addresses.
The practical recommendation:
For operational databases, surrogate keys are generally preferred because they inherently satisfy entity integrity and avoid the pitfalls of natural key management. However, a hybrid approach often works best:
12345678910111213141516171819202122232425262728293031323334
-- Hybrid approach: Surrogate PK + Natural Key UniquenessCREATE TABLE customers ( -- Surrogate primary key: auto-generated, never NULL customer_id BIGSERIAL PRIMARY KEY, -- Natural key: enforced unique, used for business operations email VARCHAR(255) NOT NULL UNIQUE, -- Additional natural identifier (may be NULL until verified) ssn CHAR(11) UNIQUE, -- Can be NULL; not part of PK -- Business attributes first_name VARCHAR(50) NOT NULL, last_name VARCHAR(50) NOT NULL, created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP); -- Foreign keys reference the surrogate key (stable, efficient)CREATE TABLE orders ( order_id BIGSERIAL PRIMARY KEY, customer_id BIGINT NOT NULL REFERENCES customers(customer_id), order_date DATE NOT NULL DEFAULT CURRENT_DATE, total_amount DECIMAL(12,2) NOT NULL); -- Application lookups can use natural key (email)SELECT customer_id, first_name, last_name FROM customers WHERE email = 'user@example.com'; -- But foreign keys use stable surrogateSELECT o.order_id, o.order_date, c.emailFROM orders oJOIN customers c ON o.customer_id = c.customer_id;Natural keys are appropriate when: (1) The identifier is guaranteed to exist before record creation, (2) The identifier is immutable by policy (ISO country codes), (3) The table is a reference/lookup table that rarely changes, or (4) The natural key is already required for all business operations. Examples include ISO currency codes, airport codes, or standardized product codes like ISBNs.
While databases enforce entity integrity at the constraint level, violations can still occur through various pathways. Understanding these failure modes helps you design robust systems.
Prevention strategies:
| Layer | Strategy | Implementation |
|---|---|---|
| Database | Always-on constraints | Never disable PK constraints; use TRUNCATE instead of constraint-off bulk deletes |
| Database | Auto-generation defaults | Use SERIAL/IDENTITY/AUTO_INCREMENT to prevent NULL at the source |
| Application | Validation before persist | Validate entity has identifier before attempting database save |
| Application | ORM configuration | Configure ORM to reject NULL primary keys; use database-generated IDs |
| ETL | Pre-load validation | Validate source data for key completeness before loading |
| ETL | Staging tables | Load into staging with same constraints; fix violations before promotion |
| Operations | Constraint monitoring | Automated checks that all tables have enabled primary key constraints |
| Operations | Change control | Require approval for any DDL that modifies or disables constraints |
123456789101112131415161718192021222324252627282930313233343536
-- PostgreSQL: Query to find tables without primary keysSELECT t.table_schema, t.table_name FROM information_schema.tables tLEFT JOIN information_schema.table_constraints tc ON t.table_schema = tc.table_schema AND t.table_name = tc.table_name AND tc.constraint_type = 'PRIMARY KEY'WHERE t.table_type = 'BASE TABLE' AND t.table_schema NOT IN ('pg_catalog', 'information_schema') AND tc.constraint_name IS NULLORDER BY t.table_schema, t.table_name; -- PostgreSQL: Find disabled constraints (requires pg_catalog access)SELECT conrelid::regclass AS table_name, conname AS constraint_name, contype AS constraint_type, NOT convalidated AS is_not_validatedFROM pg_constraintWHERE contype = 'p' -- Primary key AND NOT convalidated; -- MySQL: Tables without primary keysSELECT t.TABLE_SCHEMA, t.TABLE_NAMEFROM information_schema.TABLES tLEFT JOIN information_schema.TABLE_CONSTRAINTS tc ON t.TABLE_SCHEMA = tc.TABLE_SCHEMA AND t.TABLE_NAME = tc.TABLE_NAME AND tc.CONSTRAINT_TYPE = 'PRIMARY KEY'WHERE t.TABLE_TYPE = 'BASE TABLE' AND t.TABLE_SCHEMA NOT IN ('mysql', 'information_schema', 'performance_schema') AND tc.CONSTRAINT_NAME IS NULL;Include entity integrity checks in your database health monitoring. Regular automated scans for tables missing primary keys, disabled constraints, or NULL values in key columns can catch problems before they cascade into application failures.
Entity integrity is the bedrock of reliable relational databases. Let's consolidate the essential knowledge:
What's next:
Entity integrity ensures that each row can be identified. But databases are networks of related tables, not isolated silos. Referential integrity—the subject of the next page—ensures that relationships between tables remain consistent. Together, entity and referential integrity form the twin pillars of relational data reliability.
You now possess a comprehensive understanding of entity integrity—from its theoretical foundations in set theory and logic, through its practical manifestations in SQL DDL, to the strategies for preventing violations across your entire data stack. Next, we explore referential integrity, the constraint that governs relationships between tables.