Database Management System / SQL Fundamentals

Constraints in DDL

Level: Intermediate

Duration: 90 mins

Topic: SQL Fundamentals

PRIMARY KEY Constraint

The Foundation of Entity Identity

In the relational model, every row in a table must be uniquely identifiable. This fundamental requirement isn't merely a database convention—it's the cornerstone upon which all relational operations, from simple lookups to complex multi-table joins, are built. The PRIMARY KEY constraint is the mechanism through which relational database management systems enforce this uniqueness guarantee.

Consider the philosophical underpinning: if two rows in a table are indistinguishable, then they represent the same real-world entity and should logically be the same row. The PRIMARY KEY formalizes this identity principle, transforming it from a logical ideal into a system-enforced invariant that can never be violated, regardless of application bugs, user errors, or concurrent modifications.

What You Will Master

By the end of this page, you will understand PRIMARY KEY constraints at a depth that goes far beyond syntax. You'll grasp why primary keys are fundamental to relational theory, how different key strategies affect performance, the subtle implications of composite keys, and how to make informed decisions about key design that will scale with your applications.

Conceptual Foundation

The concept of a primary key emerges directly from E.F. Codd's relational model, first articulated in his seminal 1970 paper. In relational theory, a relation (table) is defined as a set of tuples (rows), and by definition, a set cannot contain duplicate elements. This mathematical property necessitates a mechanism to distinguish each tuple.

Key Terminology:

•Superkey: Any set of attributes (columns) that uniquely identifies each tuple. A superkey may contain more attributes than necessary.
•Candidate Key: A minimal superkey, that is, a superkey of which no proper subset is also a superkey. Removing any attribute would destroy uniqueness.
•Primary Key: The candidate key chosen by the database designer to serve as the principal identifier for the relation.
•Alternate Keys: Candidate keys not chosen as the primary key (these often become UNIQUE constraints).

Understanding this hierarchy is essential. A table may have multiple candidate keys, but exactly one is designated as primary. This choice has implications for indexing, foreign key references, and even conceptual data modeling.

Candidate Key Analysis

Consider an Employee table with the following attributes: EmployeeID (company-assigned), SSN (Social Security Number), Email (company email). All three could serve as candidate keys, since each uniquely identifies an employee.

Input

Candidate Keys: {EmployeeID}, {SSN}, {Email}

Output

Primary Key Choice: EmployeeID
Alternate Keys: SSN → UNIQUE constraint, Email → UNIQUE constraint
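
As a minimal sketch, the choice above could be declared as follows. The table name, column names, and column sizes are assumptions made for illustration; only the split between PRIMARY KEY and UNIQUE comes from the analysis.

```sql
-- Hypothetical employee table: one candidate key is promoted to PRIMARY KEY,
-- the remaining candidate keys are kept as named UNIQUE constraints.
CREATE TABLE employee (
    employee_id INT          NOT NULL,
    ssn         CHAR(11)     NOT NULL,
    email       VARCHAR(100) NOT NULL,

    CONSTRAINT pk_employee       PRIMARY KEY (employee_id),
    CONSTRAINT uq_employee_ssn   UNIQUE (ssn),
    CONSTRAINT uq_employee_email UNIQUE (email)
);
```

Declaring the alternate keys NOT NULL as well as UNIQUE keeps them true candidate keys: every row can still be identified by any of the three.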

The Stability Principle

Primary keys should be immutable in practice. While SQL allows primary key updates, changing them cascades through foreign key relationships, invalidates external references (like URLs containing IDs), and creates audit trail complexity. Choose keys that will never need to change.

PRIMARY KEY Properties

A PRIMARY KEY constraint enforces two fundamental properties simultaneously, and this combination is what makes it special:

Property 1: Uniqueness

No two rows in the table may have the same value (or combination of values, for composite keys) in the primary key column(s). This is enforced at the database level—any INSERT or UPDATE that would create a duplicate is rejected with a constraint violation error.

Property 2: Non-Nullability

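Every row must have a value for the primary key column(s); NULL values are prohibited. This differs from UNIQUE constraints, which in most systems allow multiple NULL values by default: in SQL's three-valued logic, comparing NULL with NULL yields UNKNOWN rather than TRUE, so two NULLs are not treated as duplicates. (SQL Server is a notable exception and permits only one NULL per UNIQUE constraint.)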

These properties combine to guarantee that every row is positively and uniquely identifiable. There can be no ambiguity about which row is which.
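
The two properties can be observed directly. The sketch below uses a hypothetical table named pk_demo; the statements marked "rejected" should fail with a constraint violation, although the exact error text differs between database systems.

```sql
-- Hypothetical table used only to demonstrate the two properties.
CREATE TABLE pk_demo (
    id    INT NOT NULL,
    label VARCHAR(20),
    CONSTRAINT pk_pk_demo PRIMARY KEY (id)
);

INSERT INTO pk_demo (id, label) VALUES (1, 'first');    -- accepted
INSERT INTO pk_demo (id, label) VALUES (2, 'second');   -- accepted

-- Rejected: violates uniqueness (id 1 already exists)
INSERT INTO pk_demo (id, label) VALUES (1, 'again');

-- Rejected: violates non-nullability
INSERT INTO pk_demo (id, label) VALUES (NULL, 'no id');

-- Rejected: the UPDATE would turn id 2 into a duplicate of id 1
UPDATE pk_demo SET id = 1 WHERE id = 2;
```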

PRIMARY KEY vs Other Constraints
Property	PRIMARY KEY	UNIQUE	NOT NULL
Enforces Uniqueness	✓ Yes	✓ Yes	✗ No
Prevents NULL	✓ Yes (implicit)	✗ No (by default)	✓ Yes
Limit Per Table	At Most One	Multiple Allowed	Multiple Allowed
Creates Index	✓ Yes (clustered by default in SQL Server and InnoDB)	✓ Yes (typically non-clustered)	✗ No
Can Be Referenced by FK	✓ Yes (preferred)	✓ Yes	✗ No (alone)

The Index Implication:

When you declare a PRIMARY KEY, most database systems automatically create an index on the key column(s). In SQL Server and MySQL's InnoDB, this becomes the clustered index, meaning the physical row order on disk matches the primary key order. This has profound performance implications:

•Sequential primary key values (like auto-increment integers) insert efficiently, always adding to the end of the table
•Random primary key values (like UUIDs) cause page splits and fragmentation
•Range queries on primary key are exceptionally fast, as related rows are physically contiguous

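In PostgreSQL, tables are stored as unordered heaps, and the primary key index is an ordinary B-tree index kept separately from the rows. The CLUSTER command can physically reorder a table by any index, but it is a one-time operation: the order is not maintained as rows are later inserted or updated.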
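
For PostgreSQL, a one-time physical reorder might look like the sketch below. It assumes a table named employees whose primary key was declared without an explicit name, in which case PostgreSQL names the index employees_pkey; with a named constraint, the index carries that name instead.

```sql
-- PostgreSQL only: rewrite the table in primary key order.
-- CLUSTER takes an exclusive lock on the table while it runs.
CLUSTER employees USING employees_pkey;

-- Refresh planner statistics after the rewrite.
ANALYZE employees;
```

Rows inserted afterwards are placed wherever space is available, so the command has to be repeated if the physical order matters over time.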

UUID Primary Keys: A Double-Edged Sword

While UUIDs provide globally unique identifiers (valuable for distributed systems and API exposure), the randomness of version 4 UUIDs causes page splits and write amplification in clustered indexes. Consider UUID v7 (time-ordered) or ULID as alternatives that stay roughly sortable by creation time while maintaining global uniqueness.

Syntax and Declaration

SQL provides two syntactic forms for declaring PRIMARY KEY constraints: column-level (inline) and table-level (out-of-line). Understanding both is essential, as each has appropriate use cases.

Column-Level Declaration:

The constraint is specified inline with the column definition. Best for single-column primary keys.

column_level_primary_key.sql
-- Column-level PRIMARY KEY declaration
-- Constraint name is automatically generated by the DBMS
 
CREATE TABLE employees (
    employee_id     INT PRIMARY KEY,
    first_name      VARCHAR(50) NOT NULL,
    last_name       VARCHAR(50) NOT NULL,
    email           VARCHAR(100) UNIQUE NOT NULL,
    hire_date       DATE NOT NULL
);
 
-- With explicit constraint naming (recommended for maintainability)
CREATE TABLE departments (
    department_id   INT CONSTRAINT pk_departments PRIMARY KEY,
    department_name VARCHAR(100) NOT NULL,
    location        VARCHAR(100)
);

Table-Level Declaration:

The constraint is specified after all column definitions. Required for composite primary keys, but also useful for single-column keys when you want explicit naming.

table_level_primary_key.sql
-- Table-level PRIMARY KEY declaration
-- Preferred style for production databases due to explicit naming
 
CREATE TABLE customers (
    customer_id     INT NOT NULL,
    email           VARCHAR(100) NOT NULL,
    first_name      VARCHAR(50),
    last_name       VARCHAR(50),
    registration_date TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    
    -- Table-level constraint with explicit name
    CONSTRAINT pk_customers PRIMARY KEY (customer_id)
);
 
-- Alternative: Anonymous constraint (name auto-generated)
CREATE TABLE products (
    product_id      INT NOT NULL,
    product_name    VARCHAR(200) NOT NULL,
    unit_price      DECIMAL(10,2),
    
    PRIMARY KEY (product_id)
);

Always Name Your Constraints

Explicitly naming constraints (pk_tablename, fk_table_reference, uq_table_column) makes error messages meaningful, simplifies ALTER TABLE operations, and improves schema documentation. Auto-generated names like SYS_C007142 are impossible to reason about.
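
One way to see the names a table actually carries is to query the standard information schema, as sketched below for the customers table defined above. The view information_schema.table_constraints exists in PostgreSQL, MySQL, and SQL Server; other systems expose the same information through their own catalog views.

```sql
-- List every constraint declared on the customers table, with its type.
SELECT constraint_name, constraint_type
FROM information_schema.table_constraints
WHERE table_name = 'customers'
ORDER BY constraint_type, constraint_name;
```

Note that MySQL always reports a primary key constraint under the name PRIMARY, regardless of the name given in the DDL, so explicit naming pays off there mainly for UNIQUE, FOREIGN KEY, and CHECK constraints.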

Adding PRIMARY KEY to Existing Tables:

In real-world scenarios, you'll often need to add or modify constraints on existing tables. This requires understanding both the syntax and the preconditions.

alter_primary_key.sql
-- Adding a PRIMARY KEY to an existing table
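-- Preconditions: Column must exist, have no NULLs, and contain unique values
-- Verify both preconditions before running Step 3

-- Step 1: Ensure no NULL values exist
-- (NEXTVAL syntax shown is Oracle's; PostgreSQL uses nextval('sequence_generator').
--  The sequence must start above the highest existing order_id to avoid new duplicates.)
UPDATE legacy_orders
SET order_id = sequence_generator.NEXTVAL
WHERE order_id IS NULL;

-- Step 2: Ensure no duplicate values exist
-- (Duplicates require business logic to resolve before the constraint can be added)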
 
-- Step 3: Add the constraint
ALTER TABLE legacy_orders
ADD CONSTRAINT pk_legacy_orders PRIMARY KEY (order_id);
 
 
-- Dropping a PRIMARY KEY (rare, but sometimes necessary)
-- Note: This will fail if foreign keys reference this primary key
ALTER TABLE legacy_orders
DROP CONSTRAINT pk_legacy_orders;
 
-- In MySQL, alternate syntax:
ALTER TABLE legacy_orders DROP PRIMARY KEY;
 
 
-- Modifying a PRIMARY KEY (effectively drop and recreate)
ALTER TABLE orders DROP CONSTRAINT pk_orders;
ALTER TABLE orders ADD CONSTRAINT pk_orders PRIMARY KEY (new_order_id);
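
The preconditions listed at the top of the script can be checked with two plain queries before the constraint is added. This is a sketch against the same legacy_orders table; both queries should report nothing to fix before ALTER TABLE is attempted.

```sql
-- Precondition 1: count rows that would violate non-nullability.
SELECT COUNT(*) AS null_ids
FROM legacy_orders
WHERE order_id IS NULL;

-- Precondition 2: list values that would violate uniqueness.
SELECT order_id, COUNT(*) AS occurrences
FROM legacy_orders
WHERE order_id IS NOT NULL
GROUP BY order_id
HAVING COUNT(*) > 1
ORDER BY occurrences DESC;
```

If the first query returns a count above zero or the second returns any rows, adding the PRIMARY KEY will fail until the data is repaired.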

Composite Primary Keys

A composite primary key (also called a compound primary key) uses multiple columns together to uniquely identify each row. This is essential when no single column provides uniqueness, but a combination does.

When to Use Composite Primary Keys:

•Junction/Bridge Tables: In many-to-many relationships, the junction table's natural key is the combination of the two foreign keys it references.
•Time-Series Data: When uniqueness depends on both an entity and a timestamp (e.g., sensor readings per device per minute).
•Hierarchical Data: When child entities are only unique within their parent context (e.g., line items within an order).
•Multi-Tenant Systems: When rows are unique per tenant (tenant_id + entity_id).

composite_primary_keys.sql
-- Example 1: Junction Table (Many-to-Many)
-- Students can enroll in multiple courses; courses have multiple students
CREATE TABLE student_enrollments (
    student_id      INT NOT NULL,
    course_id       INT NOT NULL,
    enrollment_date DATE NOT NULL DEFAULT CURRENT_DATE,
    grade           CHAR(2),
    
    -- Composite primary key: combination must be unique
    CONSTRAINT pk_student_enrollments 
        PRIMARY KEY (student_id, course_id),
    
    -- Foreign key references (covered in next page)
    CONSTRAINT fk_enrollment_student 
        FOREIGN KEY (student_id) REFERENCES students(student_id),
    CONSTRAINT fk_enrollment_course 
        FOREIGN KEY (course_id) REFERENCES courses(course_id)
);
 
 
-- Example 2: Time-Series Data
-- IoT sensor readings, unique per device per timestamp
CREATE TABLE sensor_readings (
    sensor_id       VARCHAR(50) NOT NULL,
    reading_time    TIMESTAMP NOT NULL,
    temperature     DECIMAL(5,2),
    humidity        DECIMAL(5,2),
    pressure        DECIMAL(7,2),
    
    CONSTRAINT pk_sensor_readings 
        PRIMARY KEY (sensor_id, reading_time)
);
 
 
-- Example 3: Order Line Items (Weak Entity Pattern)
-- Line items are only unique within their order context
CREATE TABLE order_items (
    order_id        INT NOT NULL,
    line_number     INT NOT NULL,  -- Sequential within order
    product_id      INT NOT NULL,
    quantity        INT NOT NULL CHECK (quantity > 0),
    unit_price      DECIMAL(10,2) NOT NULL,
    
    CONSTRAINT pk_order_items 
        PRIMARY KEY (order_id, line_number),
    CONSTRAINT fk_order_items_order 
        FOREIGN KEY (order_id) REFERENCES orders(order_id)
);
 
 
-- Example 4: Multi-Tenant SaaS Application
CREATE TABLE tenant_users (
    tenant_id       INT NOT NULL,
    user_id         INT NOT NULL,  -- Unique only within tenant
    email           VARCHAR(100) NOT NULL,
    role            VARCHAR(50) NOT NULL,
    
    CONSTRAINT pk_tenant_users 
        PRIMARY KEY (tenant_id, user_id),
    
    -- Email unique within tenant, not globally
    CONSTRAINT uq_tenant_user_email 
        UNIQUE (tenant_id, email)
);

Column Order Matters in Composite Keys

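The order of columns in a composite primary key affects index efficiency. The index can be searched efficiently only by a leading prefix of its columns, so place first the column that your queries most often constrain with an equality filter. For example, (tenant_id, user_id) works well if queries almost always filter by tenant_id, because the index can narrow directly to that tenant's rows; a query that filters only by user_id generally cannot seek on this index.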
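
The queries below sketch the difference, assuming the tenant_users table above with PRIMARY KEY (tenant_id, user_id). The literal values and the index name ix_tenant_users_user_id are made up for illustration, and the actual plan always depends on the optimizer and the data.

```sql
-- Leading column constrained: the primary key index can be used.
SELECT user_id, email
FROM tenant_users
WHERE tenant_id = 42;

-- Both columns constrained: a direct lookup on the full key.
SELECT email
FROM tenant_users
WHERE tenant_id = 42 AND user_id = 7;

-- Leading column missing: the primary key index usually cannot be
-- searched directly, so the table or index is scanned instead.
SELECT tenant_id, email
FROM tenant_users
WHERE user_id = 7;

-- If the last query is common, a separate secondary index is one option.
CREATE INDEX ix_tenant_users_user_id ON tenant_users (user_id);
```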

Composite Keys vs. Surrogate Keys Debate:

A long-standing debate in database design concerns whether to use natural composite keys or introduce a surrogate (synthetic) primary key alongside a unique constraint.

Approach	Composite Natural Key	Surrogate Key + Unique Constraint
Definition	`PRIMARY KEY (student_id, course_id)`	`PRIMARY KEY (enrollment_id)` + `UNIQUE (student_id, course_id)`
Foreign Key Referencing	Child tables carry every key column	Child tables carry a single column
Join Complexity	Multi-column joins required	Single-column joins
Storage	No extra column	Extra column per row
ORM Compatibility	Often problematic	Generally simpler
Data Meaning	Key is meaningful	Key is opaque identifier

Both approaches are valid. Composite natural keys are theoretically pure and save storage. Surrogate keys simplify application code and foreign key relationships. Most modern applications favor surrogate keys for operational convenience, but composite keys remain important for certain patterns.
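
To make the second column of the comparison concrete, here is a sketch of the surrogate-key variant of the enrollment table. The names enrollments, enrollment_notes, and their columns are hypothetical; the point is that the natural key stays enforced while child tables reference one column.

```sql
-- Surrogate primary key plus a UNIQUE constraint on the natural key.
CREATE TABLE enrollments (
    enrollment_id   INT  NOT NULL,
    student_id      INT  NOT NULL,
    course_id       INT  NOT NULL,
    enrollment_date DATE NOT NULL,

    CONSTRAINT pk_enrollments PRIMARY KEY (enrollment_id),
    -- A student still cannot enroll in the same course twice.
    CONSTRAINT uq_enrollments_student_course UNIQUE (student_id, course_id)
);

-- A child table needs only the single surrogate column as its foreign key.
CREATE TABLE enrollment_notes (
    note_id       INT          NOT NULL,
    enrollment_id INT          NOT NULL,
    note_text     VARCHAR(500) NOT NULL,

    CONSTRAINT pk_enrollment_notes PRIMARY KEY (note_id),
    CONSTRAINT fk_enrollment_notes_enrollment
        FOREIGN KEY (enrollment_id) REFERENCES enrollments (enrollment_id)
);
```

With the composite-key design, enrollment_notes would instead carry both student_id and course_id and declare a two-column foreign key.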

Auto-Generated Primary Keys

In practice, most tables use auto-generated surrogate keys. These provide guaranteed uniqueness without requiring application logic to generate values. Different database systems implement this differently, and understanding the mechanisms is crucial for proper usage.

Key Generation Strategies:

•Auto-Increment/Identity: Database generates sequential integers automatically
•Sequence Objects: Explicit sequence generators (more control, cross-table usable)
•UUID/GUID Generation: Application or database generates universally unique identifiers
•Timestamp-Based IDs: Sortable IDs incorporating time (Snowflake, ULID)

mysql_auto_increment.sql
-- MySQL: AUTO_INCREMENT
CREATE TABLE orders (
    order_id        INT AUTO_INCREMENT,
    customer_id     INT NOT NULL,
    order_date      DATETIME DEFAULT CURRENT_TIMESTAMP,
    total_amount    DECIMAL(12,2),
    
    PRIMARY KEY (order_id)
);
 
-- Inserting without specifying the auto-increment column
INSERT INTO orders (customer_id, total_amount) 
VALUES (101, 250.00);
-- order_id automatically assigned: 1
 
-- Check last generated value
SELECT LAST_INSERT_ID();
 
-- Resetting the counter (use with caution!)
ALTER TABLE orders AUTO_INCREMENT = 1000;
 
-- Getting current auto-increment value
SELECT AUTO_INCREMENT 
FROM information_schema.TABLES 
WHERE TABLE_SCHEMA = 'mydb' AND TABLE_NAME = 'orders';
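
For comparison, the identity and sequence strategies from the list above might look like this in PostgreSQL (version 10 or later for identity columns). The table invoices and the sequence ticket_number_seq are hypothetical names used only for this sketch.

```sql
-- PostgreSQL 10+: SQL-standard identity column
CREATE TABLE invoices (
    invoice_id   BIGINT GENERATED ALWAYS AS IDENTITY,
    customer_id  INT NOT NULL,
    total_amount DECIMAL(12,2),

    CONSTRAINT pk_invoices PRIMARY KEY (invoice_id)
);

-- RETURNING hands back the generated key in the same statement
INSERT INTO invoices (customer_id, total_amount)
VALUES (101, 250.00)
RETURNING invoice_id;

-- Explicit sequence object, usable from more than one table
CREATE SEQUENCE ticket_number_seq START WITH 1000 INCREMENT BY 1;

-- Each call returns the next value; the first call returns the START WITH value
SELECT nextval('ticket_number_seq');
```

GENERATED ALWAYS rejects ordinary inserts that supply their own invoice_id, which helps keep application code from bypassing the generator.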

Auto-Increment Gaps Are Normal

Auto-increment values may have gaps due to rolled-back transactions, failed inserts, or server restarts. Never assume that values are consecutive, and never count rows by taking MAX(id). Gaps are a harmless side effect of handing out keys without blocking concurrent transactions: a primary key only has to be unique, not gap-free.
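
The effect is easy to reproduce. The sketch below targets MySQL with InnoDB and uses a hypothetical table gap_demo; the value consumed by the rolled-back insert is not handed out again.

```sql
-- MySQL/InnoDB sketch: a rolled-back insert still consumes a value.
CREATE TABLE gap_demo (
    id   INT AUTO_INCREMENT,
    note VARCHAR(20),
    PRIMARY KEY (id)
);

INSERT INTO gap_demo (note) VALUES ('kept');

START TRANSACTION;
INSERT INTO gap_demo (note) VALUES ('discarded');
ROLLBACK;   -- the row is gone, but its id is not reused

INSERT INTO gap_demo (note) VALUES ('kept too');

-- Inspect the ids: expect a gap where the rolled-back insert was.
SELECT id, note FROM gap_demo ORDER BY id;

-- Count rows with COUNT(*); MAX(id) reports the highest key, not the row count.
SELECT COUNT(*) AS row_count, MAX(id) AS highest_id FROM gap_demo;
```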

Primary Key Design Principles

Choosing the right primary key is a design decision with long-lasting consequences. A poorly chosen key can create performance problems, complicate applications, and even cause data integrity issues. Here are battle-tested principles from decades of database practice:

•Immutability: Once assigned, a primary key value should never change. Updates cascade through foreign keys, invalidate caches, and break external references (bookmarks, URLs, API consumers).
•Minimality: Use the smallest data type that accommodates expected growth. A signed INT (4 bytes, about 2.1 billion positive values) often suffices; use BIGINT (8 bytes) for massive scale. Avoid VARCHAR keys when numeric alternatives exist.
•Meaninglessness: Surrogate keys (arbitrary numbers) are generally preferred over natural keys (SSN, email). Natural keys can change, may not exist for all entities, and may be sensitive.
•Simplicity: Single-column keys simplify joins, foreign key definitions, and ORM mappings. Use composite keys only when the domain genuinely requires them.
•Sequentiality (for clustered indexes): Sequential keys (auto-increment) insert efficiently into B-tree indexes. Random keys (UUID v4) cause page splits and fragmentation.
•Invisibility to Users: Primary keys are internal identifiers. Expose them in URLs/APIs minimally, and never let users edit them. Consider separate 'public IDs' if human-readable references are needed.
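
The last principle can be sketched as a table that keeps an internal surrogate key and exposes a separate public reference. The table support_tickets, its columns, and the reference format 'TCK-2024-000123' are invented for this example.

```sql
-- Internal key for joins and foreign keys; public reference for URLs and APIs.
CREATE TABLE support_tickets (
    ticket_id  BIGINT       NOT NULL,   -- internal, never shown to users
    public_ref VARCHAR(20)  NOT NULL,   -- human-readable, safe to expose
    subject    VARCHAR(200) NOT NULL,

    CONSTRAINT pk_support_tickets PRIMARY KEY (ticket_id),
    CONSTRAINT uq_support_tickets_public_ref UNIQUE (public_ref)
);

-- External lookups go through the public reference, not the primary key.
SELECT ticket_id, subject
FROM support_tickets
WHERE public_ref = 'TCK-2024-000123';
```

Because foreign keys point at ticket_id, the public reference format can later change without touching any referencing table.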

Primary Key Anti-Patterns to Avoid
Anti-Pattern	Problem	Better Alternative
Email as PK	Emails change; case sensitivity; large string comparison	Surrogate INT + UNIQUE constraint on email
SSN/National ID as PK	Privacy concerns; not always available; can be corrected	Surrogate INT + encrypted storage of SSN
Composite key with 4+ columns	Complex joins; error-prone foreign keys	Surrogate INT + multi-column UNIQUE constraint
VARCHAR(255) as PK	Index bloat; slow comparisons; collation issues	Surrogate INT; VARCHAR only if truly necessary
FLOAT/DOUBLE as PK	Precision issues; comparison hazards	Never use floating-point as keys
Timestamps as sole PK	Clock skew; duplicates in same millisecond	Composite with sequence or use dedicated ID

The Pragmatic Default

When in doubt, use a single-column auto-incrementing integer primary key. It's efficient, simple, well-understood by ORMs, and covers the large majority of use cases. Override this default only when specific requirements demand alternatives (distributed systems → UUIDs, natural domain modeling → composite keys).

Advanced Considerations

Clustered Index Implications:

In SQL Server and MySQL/InnoDB, the PRIMARY KEY defines the clustered index by default. This means:

•Physical Row Order: Rows are stored in primary key order, so range scans on the PK are I/O efficient.
•Secondary Index Structure: Non-clustered indexes store the primary key as their row locator. Wide primary keys (e.g., UUIDs stored as 36-character strings) bloat all secondary indexes.
•Insert Patterns: Sequential PKs insert at the end of the table (fast). Random PKs cause page splits and fragmentation (slow, requires maintenance).
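
When a UUID key is required on a clustered table, storing it in binary form keeps the key, and therefore every secondary index entry, at 16 bytes. The following is a sketch for MySQL 8.0 or later; the table sessions and its columns are hypothetical.

```sql
-- MySQL 8.0+ sketch: compact binary storage for a UUID primary key.
CREATE TABLE sessions (
    session_id BINARY(16) NOT NULL,   -- 16 bytes instead of 36 characters
    user_id    INT NOT NULL,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP,

    PRIMARY KEY (session_id)
);

-- UUID() returns a version 1 UUID. The swap flag (second argument = 1)
-- moves the timestamp's high-order part to the front, so new values
-- sort roughly by creation time.
INSERT INTO sessions (session_id, user_id)
VALUES (UUID_TO_BIN(UUID(), 1), 42);

-- Convert back to text for display, using the same swap flag.
SELECT BIN_TO_UUID(session_id, 1) AS session_uuid, user_id
FROM sessions;
```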

Distributed System Considerations:

In distributed databases (sharded MySQL, Cassandra, CockroachDB), primary key design affects:

•Data Distribution: Keys should distribute evenly across shards. Under range-based sharding, sequential integers concentrate writes on one shard (hotspot).
•Cross-Shard Queries: Minimizing cross-shard operations depends on co-locating related data, often influenced by key design.
•Global Uniqueness: Auto-increment doesn't work across multiple database instances without coordination. Distributed ID schemes (Snowflake, ULID, UUID) avoid that coordination.

distributed_primary_keys.sql
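-- Pattern: Snowflake-style distributed ID
-- 64-bit ID: 1 unused sign bit + timestamp (41 bits) + node ID (10 bits) + sequence (12 bits)
-- Time-sortable and globally unique; no per-ID coordination, but every node needs its own node ID

-- PostgreSQL: Using UUID v7 (time-ordered UUID, standardized in RFC 9562)
-- PostgreSQL 18+ provides a built-in uuidv7(); earlier versions need an extension
-- such as pg_uuidv7, which supplies the uuid_generate_v7() function used below
CREATE EXTENSION IF NOT EXISTS pg_uuidv7;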
 
CREATE TABLE distributed_orders (
    order_id        UUID DEFAULT uuid_generate_v7() PRIMARY KEY,
    customer_id     INT NOT NULL,
    order_date      TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
 
 
-- Alternative: ULID (Universally Unique Lexicographically Sortable Identifier)
-- 128-bit: 48-bit timestamp + 80-bit randomness
-- Encoded as 26-character base32 string, sorts chronologically
 
CREATE TABLE events (
    event_id        CHAR(26) PRIMARY KEY,  -- ULID generated by application
    event_type      VARCHAR(50) NOT NULL,
    payload         JSONB,
    created_at      TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
 
 
-- Pattern: Composite sharding key
-- Ensures related data is co-located on same shard
CREATE TABLE tenant_transactions (
    tenant_id       INT NOT NULL,
    transaction_id  BIGINT NOT NULL,  -- Sequence per tenant
    amount          DECIMAL(15,2),
    transaction_date TIMESTAMP,
    
    -- Composite key with tenant first ensures tenant data locality
    PRIMARY KEY (tenant_id, transaction_id)
);
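
The Snowflake-style bit layout described at the top of this script can be worked through with integer arithmetic. The query below is a PostgreSQL sketch: the timestamp, custom epoch, node ID, and sequence number are arbitrary example values, and in a real system the ID would be produced by an application-side generator rather than by a query.

```sql
-- PostgreSQL sketch: pack and unpack a Snowflake-style 64-bit ID.
WITH parts AS (
    SELECT
        1700000000000::bigint AS ts_ms,            -- event time in ms (example)
        1600000000000::bigint AS custom_epoch_ms,  -- generator's epoch (example)
        5::bigint             AS node_id,          -- fits in 10 bits (0..1023)
        17::bigint            AS seq               -- fits in 12 bits (0..4095)
),
packed AS (
    SELECT
        -- timestamp in the high bits, then node ID, then sequence
        ((ts_ms - custom_epoch_ms) << 22) | (node_id << 12) | seq AS id,
        custom_epoch_ms
    FROM parts
)
SELECT
    id,
    (id >> 22) + custom_epoch_ms AS ts_ms,    -- recovers the timestamp
    (id >> 12) & 1023            AS node_id,  -- recovers the 10-bit node ID
    id & 4095                    AS seq       -- recovers the 12-bit sequence
FROM packed;
```

Because the timestamp occupies the most significant bits, comparing two IDs numerically orders them by creation time first, which is what makes the scheme friendly to B-tree inserts.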

Key Design Is Context-Dependent

There's no universally 'best' primary key strategy. Single-node OLTP systems thrive with auto-increment integers. Distributed systems require globally unique generators. Analytics workloads may benefit from time-partitioned keys. Understand your system's constraints before deciding.

Summary: PRIMARY KEY Mastery

We've explored PRIMARY KEY constraints from first principles to advanced distributed considerations. Let's consolidate the essential knowledge:

Key Takeaways

•PRIMARY KEY = UNIQUE + NOT NULL: It uniquely identifies each row and prohibits nulls, ensuring every row is positively identifiable.
•One Per Table: A table can have only one primary key, though it may span multiple columns (composite key).
•Automatic Indexing: Primary keys create indexes automatically—typically clustered in SQL Server/InnoDB, affecting physical storage order.
•Immutability Principle: Primary keys should never change after assignment. Design for stability from the start.
•Surrogate vs Natural: Surrogate keys (auto-increment integers) are generally preferred for operational simplicity; use natural/composite keys when domain modeling demands it.
•Distributed Considerations: Plain auto-increment does not guarantee uniqueness across independent database instances; use UUIDs, ULIDs, or Snowflake IDs when keys must be globally unique.
•Always Name Constraints: Explicit constraint names (pk_tablename) improve maintainability and debugging.

What's Next:

With PRIMARY KEY understood, we're ready to explore how tables reference each other's primary keys. The next page covers FOREIGN KEY constraints, the mechanism that enforces referential integrity between tables. (The word "relational" itself comes from the mathematical relation, that is, the table, rather than from relationships between tables.)

Page Complete

You now have a comprehensive understanding of PRIMARY KEY constraints—from theoretical foundations through syntax variations to advanced distributed system considerations. This knowledge forms the basis for all referential integrity enforcement covered in the following pages.

1 / 5

Loading learning content...

Database Management SystemSQL Fundamentals

Constraints in DDL

LevelIntermediate

Duration90 mins

TopicSQL Fundamentals

1 / 5

PRIMARY KEY Constraint

The Foundation of Entity Identity

What You Will Master

Conceptual Foundation

Key Terminology:

Superkey: Any set of attributes (columns) that uniquely identifies each tuple. A superkey may contain more attributes than necessary.
Candidate Key: A minimal superkey—a superkey where no proper subset is also a superkey. Removing any attribute would destroy uniqueness.
Primary Key: The candidate key chosen by the database designer to serve as the principal identifier for the relation.
Alternate Keys: Candidate keys not chosen as the primary key (these often become UNIQUE constraints).

Input

Candidate Keys: {EmployeeID}, {SSN}, {Email}

Output

Primary Key Choice: EmployeeID
Alternate Keys: SSN → UNIQUE constraint, Email → UNIQUE constraint

The Stability Principle

PRIMARY KEY Properties

A PRIMARY KEY constraint enforces two fundamental properties simultaneously, and this combination is what makes it special:

Property 1: Uniqueness

Property 2: Non-Nullability

These properties combine to guarantee that every row is positively and uniquely identifiable. There can be no ambiguity about which row is which.

PRIMARY KEY vs Other Constraints
Property	PRIMARY KEY	UNIQUE	NOT NULL
Enforces Uniqueness	✓ Yes	✓ Yes	✗ No
Prevents NULL	✓ Yes (implicit)	✗ No (by default)	✓ Yes
Limit Per Table	Exactly One	Multiple Allowed	Multiple Allowed
Creates Index	✓ Yes (clustered by default in many RDBMS)	✓ Yes (non-clustered typically)	✗ No
Can Be Referenced by FK	✓ Yes (preferred)	✓ Yes	✗ No (alone)

The Index Implication:

Sequential primary key values (like auto-increment integers) insert efficiently, always adding to the end of the table
Random primary key values (like UUIDs) cause page splits and fragmentation
Range queries on primary key are exceptionally fast, as related rows are physically contiguous

In PostgreSQL, the primary key index is a standard B-tree index (not exclusively clustered), but Postgres can optionally CLUSTER a table by any index.

UUID Primary Keys: A Double-Edged Sword

Syntax and Declaration

Column-Level Declaration:

The constraint is specified inline with the column definition. Best for single-column primary keys.

column_level_primary_key.sql
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
-- Column-level PRIMARY KEY declaration
-- Constraint name is automatically generated by the DBMS
 
CREATE TABLE employees (
    employee_id     INT PRIMARY KEY,
    first_name      VARCHAR(50) NOT NULL,
    last_name       VARCHAR(50) NOT NULL,
    email           VARCHAR(100) UNIQUE NOT NULL,
    hire_date       DATE NOT NULL
);
 
-- With explicit constraint naming (recommended for maintainability)
CREATE TABLE departments (
    department_id   INT CONSTRAINT pk_departments PRIMARY KEY,
    department_name VARCHAR(100) NOT NULL,
    location        VARCHAR(100)
);

Table-Level Declaration:

The constraint is specified after all column definitions. Required for composite primary keys, but also useful for single-column keys when you want explicit naming.

table_level_primary_key.sql
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
-- Table-level PRIMARY KEY declaration
-- Preferred style for production databases due to explicit naming
 
CREATE TABLE customers (
    customer_id     INT NOT NULL,
    email           VARCHAR(100) NOT NULL,
    first_name      VARCHAR(50),
    last_name       VARCHAR(50),
    registration_date TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    
    -- Table-level constraint with explicit name
    CONSTRAINT pk_customers PRIMARY KEY (customer_id)
);
 
-- Alternative: Anonymous constraint (name auto-generated)
CREATE TABLE products (
    product_id      INT NOT NULL,
    product_name    VARCHAR(200) NOT NULL,
    unit_price      DECIMAL(10,2),
    
    PRIMARY KEY (product_id)
);

Always Name Your Constraints

Adding PRIMARY KEY to Existing Tables:

In real-world scenarios, you'll often need to add or modify constraints on existing tables. This requires understanding both the syntax and the preconditions.

alter_primary_key.sql
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
-- Adding a PRIMARY KEY to an existing table
-- Preconditions: Column must exist, have no NULLs, and contain unique values
 
-- Step 1: Ensure no NULL values exist
UPDATE legacy_orders 
SET order_id = sequence_generator.NEXTVAL 
WHERE order_id IS NULL;
 
-- Step 2: Ensure no duplicate values exist
-- (This should be verified first; duplicates require business logic to resolve)
 
-- Step 3: Add the constraint
ALTER TABLE legacy_orders
ADD CONSTRAINT pk_legacy_orders PRIMARY KEY (order_id);
 
 
-- Dropping a PRIMARY KEY (rare, but sometimes necessary)
-- Note: This will fail if foreign keys reference this primary key
ALTER TABLE legacy_orders
DROP CONSTRAINT pk_legacy_orders;
 
-- In MySQL, alternate syntax:
ALTER TABLE legacy_orders DROP PRIMARY KEY;
 
 
-- Modifying a PRIMARY KEY (effectively drop and recreate)
ALTER TABLE orders DROP CONSTRAINT pk_orders;
ALTER TABLE orders ADD CONSTRAINT pk_orders PRIMARY KEY (new_order_id);

Composite Primary Keys

When to Use Composite Primary Keys:

Junction/Bridge Tables: In many-to-many relationships, the junction table's natural key is the combination of the two foreign keys it references.
Time-Series Data: When uniqueness depends on both an entity and a timestamp (e.g., sensor readings per device per minute).
Hierarchical Data: When child entities are only unique within their parent context (e.g., line items within an order).
Multi-Tenant Systems: When rows are unique per tenant (tenant_id + entity_id).

composite_primary_keys.sql
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
-- Example 1: Junction Table (Many-to-Many)
-- Students can enroll in multiple courses; courses have multiple students
CREATE TABLE student_enrollments (
    student_id      INT NOT NULL,
    course_id       INT NOT NULL,
    enrollment_date DATE NOT NULL DEFAULT CURRENT_DATE,
    grade           CHAR(2),
    
    -- Composite primary key: combination must be unique
    CONSTRAINT pk_student_enrollments 
        PRIMARY KEY (student_id, course_id),
    
    -- Foreign key references (covered in next page)
    CONSTRAINT fk_enrollment_student 
        FOREIGN KEY (student_id) REFERENCES students(student_id),
    CONSTRAINT fk_enrollment_course 
        FOREIGN KEY (course_id) REFERENCES courses(course_id)
);
 
 
-- Example 2: Time-Series Data
-- IoT sensor readings, unique per device per timestamp
CREATE TABLE sensor_readings (
    sensor_id       VARCHAR(50) NOT NULL,
    reading_time    TIMESTAMP NOT NULL,
    temperature     DECIMAL(5,2),
    humidity        DECIMAL(5,2),
    pressure        DECIMAL(7,2),
    
    CONSTRAINT pk_sensor_readings 
        PRIMARY KEY (sensor_id, reading_time)
);
 
 
-- Example 3: Order Line Items (Weak Entity Pattern)
-- Line items are only unique within their order context
CREATE TABLE order_items (
    order_id        INT NOT NULL,
    line_number     INT NOT NULL,  -- Sequential within order
    product_id      INT NOT NULL,
    quantity        INT NOT NULL CHECK (quantity > 0),
    unit_price      DECIMAL(10,2) NOT NULL,
    
    CONSTRAINT pk_order_items 
        PRIMARY KEY (order_id, line_number),
    CONSTRAINT fk_order_items_order 
        FOREIGN KEY (order_id) REFERENCES orders(order_id)
);
 
 
-- Example 4: Multi-Tenant SaaS Application
CREATE TABLE tenant_users (
    tenant_id       INT NOT NULL,
    user_id         INT NOT NULL,  -- Unique only within tenant
    email           VARCHAR(100) NOT NULL,
    role            VARCHAR(50) NOT NULL,
    
    CONSTRAINT pk_tenant_users 
        PRIMARY KEY (tenant_id, user_id),
    
    -- Email unique within tenant, not globally
    CONSTRAINT uq_tenant_user_email 
        UNIQUE (tenant_id, email)
);

Column Order Matters in Composite Keys

Composite Keys vs. Surrogate Keys Debate:

A long-standing debate in database design concerns whether to use natural composite keys or introduce a surrogate (synthetic) primary key alongside a unique constraint.

Approach	Composite Natural Key	Surrogate Key + Unique Constraint
Definition	`PRIMARY KEY (student_id, course_id)`	`PRIMARY KEY (enrollment_id)` + `UNIQUE (student_id, course_id)`
Foreign Key Referencing	Cascades multiple columns	Cascades single column
Join Complexity	Multi-column joins required	Single-column joins
Storage	No extra column	Extra column per row
ORM Compatibility	Often problematic	Generally simpler
Data Meaning	Key is meaningful	Key is opaque identifier

Auto-Generated Primary Keys

Key Generation Strategies:

Auto-Increment/Identity: Database generates sequential integers automatically
Sequence Objects: Explicit sequence generators (more control, cross-table usable)
UUID/GUID Generation: Application or database generates universally unique identifiers
Timestamp-Based IDs: Sortable IDs incorporating time (Snowflake, ULID)

mysql_auto_increment.sql
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
-- MySQL: AUTO_INCREMENT
CREATE TABLE orders (
    order_id        INT AUTO_INCREMENT,
    customer_id     INT NOT NULL,
    order_date      DATETIME DEFAULT CURRENT_TIMESTAMP,
    total_amount    DECIMAL(12,2),
    
    PRIMARY KEY (order_id)
);
 
-- Inserting without specifying the auto-increment column
INSERT INTO orders (customer_id, total_amount) 
VALUES (101, 250.00);
-- order_id automatically assigned: 1
 
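-- Check the last value generated on this connection
SELECT LAST_INSERT_ID();
 
-- Resetting the counter (use with caution!)
-- A value at or below the current maximum order_id is adjusted to maximum + 1
ALTER TABLE orders AUTO_INCREMENT = 1000;
 
-- Getting the next auto-increment value
-- (MySQL 8.0 caches this statistic; it can be stale until refreshed, e.g. by ANALYZE TABLE)
SELECT AUTO_INCREMENT
FROM information_schema.TABLES
WHERE TABLE_SCHEMA = 'mydb' AND TABLE_NAME = 'orders';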

Auto-Increment Gaps Are Normal
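
Gaps appear whenever a generated value has been handed out but no row with that value remains, for example after a delete (and, in MySQL/InnoDB, after a rolled-back insert). The sketch below reproduces the delete case with Python's built-in sqlite3 module, whose AUTOINCREMENT keyword is used here as a stand-in for MySQL's AUTO_INCREMENT; the details of gap behavior differ between engines.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY AUTOINCREMENT,
        customer_id INTEGER NOT NULL
    )
""")

for customer_id in (101, 102, 103):
    conn.execute("INSERT INTO orders (customer_id) VALUES (?)", (customer_id,))

# Remove the newest row, then insert another one.
conn.execute("DELETE FROM orders WHERE order_id = 3")
conn.execute("INSERT INTO orders (customer_id) VALUES (?)", (104,))

order_ids = [row[0] for row in conn.execute("SELECT order_id FROM orders ORDER BY order_id")]
print(order_ids)  # [1, 2, 4]: value 3 is never reused
```

Application code should therefore treat generated keys as identifiers only, never as row counts or as proof that no rows are missing.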

Primary Key Design Principles

• Immutability: Once assigned, a primary key value should never change. Updates cascade through foreign keys, invalidate caches, and break external references (bookmarks, URLs, API consumers).
• Minimality: Use the smallest data type that accommodates expected growth. A signed INT (4 bytes, about 2.1 billion positive values) often suffices; use BIGINT (8 bytes) for massive scale. Avoid VARCHAR keys when numeric alternatives exist.
• Meaninglessness: Surrogate keys (arbitrary numbers) are generally preferred over natural keys (SSN, email). Natural keys can change, may not exist for all entities, and may be sensitive.
• Simplicity: Single-column keys simplify joins, foreign key definitions, and ORM mappings. Use composite keys only when the domain genuinely requires them.
• Sequentiality (for clustered indexes): Sequential keys (auto-increment) insert efficiently into B-tree indexes. Random keys (UUID v4) cause page splits and fragmentation.
• Invisibility to Users: Primary keys are internal identifiers. Expose them in URLs/APIs minimally, and never let users edit them. Consider separate 'public IDs' if human-readable references are needed.
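
The immutability and invisibility principles can be combined by keeping the integer key internal and exposing a separate opaque identifier. A minimal sketch, assuming Python's built-in sqlite3 module; the customers table, the public_id column, and both helper functions are hypothetical names chosen for illustration.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,   -- internal key for joins and foreign keys
        public_id   TEXT NOT NULL UNIQUE,  -- opaque identifier exposed in URLs and APIs
        email       TEXT NOT NULL UNIQUE   -- natural attribute that is free to change
    )
""")

def create_customer(email: str) -> str:
    """Insert a customer and return the identifier that may be shown to the outside world."""
    public_id = str(uuid.uuid4())
    conn.execute("INSERT INTO customers (public_id, email) VALUES (?, ?)", (public_id, email))
    return public_id

def internal_id(public_id: str) -> int:
    """Resolve a public identifier to the internal primary key."""
    row = conn.execute(
        "SELECT customer_id FROM customers WHERE public_id = ?", (public_id,)
    ).fetchone()
    return row[0]

pid = create_customer("ada@example.com")
id_before = internal_id(pid)

# The email changes; neither the primary key nor the public identifier is touched.
conn.execute("UPDATE customers SET email = ? WHERE public_id = ?", ("ada@new.example.com", pid))
id_after = internal_id(pid)

print(id_before == id_after)
```

Had email been the primary key, the same update would have had to cascade to every referencing table.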

Primary Key Anti-Patterns to Avoid
Anti-Pattern	Problem	Better Alternative
Email as PK	Emails change; case sensitivity; large string comparison	Surrogate INT + UNIQUE constraint on email
SSN/National ID as PK	Privacy concerns; not always available; can be corrected	Surrogate INT + encrypted storage of SSN
Composite key with 4+ columns	Complex joins; error-prone foreign keys	Surrogate INT + multi-column UNIQUE constraint
VARCHAR(255) as PK	Index bloat; slow comparisons; collation issues	Surrogate INT; VARCHAR only if truly necessary
FLOAT/DOUBLE as PK	Precision issues; comparison hazards	Never use floating-point as keys
Timestamps as sole PK	Clock skew; duplicates in same millisecond	Composite with sequence or use dedicated ID
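
The floating-point row is easy to reproduce. In this sketch (Python's built-in sqlite3 module as the engine; the readings table and find helper are hypothetical), a key computed as 0.1 + 0.2 is stored, and a later lookup with the literal 0.3 fails to find it because the two binary values differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (reading_key REAL PRIMARY KEY, label TEXT)")

computed_key = 0.1 + 0.2  # stored as 0.30000000000000004, not 0.3
conn.execute("INSERT INTO readings VALUES (?, ?)", (computed_key, "sensor A"))

def find(key: float):
    """Look a row up by exact key equality, as a primary key lookup would."""
    return conn.execute(
        "SELECT label FROM readings WHERE reading_key = ?", (key,)
    ).fetchone()

found_by_literal = find(0.3)
found_by_computed = find(computed_key)

print(computed_key == 0.3)   # False
print(found_by_literal)      # None: the row exists but cannot be found with 0.3
print(found_by_computed)
```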

The Pragmatic Default

Advanced Considerations

Clustered Index Implications:

In SQL Server and MySQL/InnoDB, the PRIMARY KEY defines the clustered index by default. This means:

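Physical Row Order: Rows are stored in primary key order within the clustered B-tree, so range scans on the PK are I/O efficient. (The order is logical; pages are not guaranteed to be contiguous on disk.)
Secondary Index Structure: Non-clustered indexes store the primary key as their row locator. Wide primary keys (e.g., UUIDs stored as 36-character strings rather than 16 bytes of binary) bloat all secondary indexes.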
Insert Patterns: Sequential PKs insert at the end of the table (fast). Random PKs cause page splits and fragmentation (slow, requires maintenance).
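
To get a feel for the secondary index cost, the arithmetic can be sketched as below. This is a back-of-the-envelope estimate under stated assumptions: the row count and index count are hypothetical, CHAR(36) is assumed to use a single-byte character set, and page and record overhead are ignored.

```python
# Bytes needed to store one primary key value.
KEY_SIZES_BYTES = {
    "INT": 4,
    "BIGINT": 8,
    "UUID as BINARY(16)": 16,
    "UUID as CHAR(36), single-byte charset": 36,
}

def pk_overhead_bytes(key_bytes: int, rows: int, secondary_indexes: int) -> int:
    """Bytes spent repeating the primary key inside secondary index entries."""
    return key_bytes * rows * secondary_indexes

ROWS = 100_000_000      # hypothetical table size
SECONDARY_INDEXES = 5   # hypothetical number of secondary indexes

overhead = {
    name: pk_overhead_bytes(size, ROWS, SECONDARY_INDEXES)
    for name, size in KEY_SIZES_BYTES.items()
}

for name, total in overhead.items():
    print(f"{name:40s} {total / 1e9:6.1f} GB")
```

Under these assumptions a BIGINT key adds about 4 GB across the secondary indexes, while a 36-character UUID adds about 18 GB.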

Distributed System Considerations:

In distributed databases (sharded MySQL, Cassandra, CockroachDB), primary key design affects:

Data Distribution: Keys should distribute evenly across shards. Under range-based sharding, sequential integers concentrate writes on the shard that owns the highest range (hotspot).
Cross-Shard Queries: Minimizing cross-shard operations depends on co-locating related data, often influenced by key design.
Global Uniqueness: Auto-increment doesn't work across multiple database instances without coordination. Distributed IDs (Snowflake, ULID, UUID) are the usual alternative.
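
A Snowflake-style ID packs a millisecond timestamp, a node ID, and a per-millisecond sequence into one 64-bit integer. The sketch below assumes the common 41/10/12 bit split and is not any vendor's implementation: the class name, the custom epoch, and the clock-rollback handling are illustrative choices, and it relies on each node being given a distinct node_id.

```python
import threading
import time

NODE_BITS = 10
SEQUENCE_BITS = 12
MAX_NODE_ID = (1 << NODE_BITS) - 1        # 1023
MAX_SEQUENCE = (1 << SEQUENCE_BITS) - 1   # 4095
CUSTOM_EPOCH_MS = 1_577_836_800_000       # 2020-01-01T00:00:00Z, an arbitrary choice

def _now_ms() -> int:
    return time.time_ns() // 1_000_000

class SnowflakeGenerator:
    def __init__(self, node_id: int):
        if not 0 <= node_id <= MAX_NODE_ID:
            raise ValueError(f"node_id must be between 0 and {MAX_NODE_ID}")
        self.node_id = node_id
        self._lock = threading.Lock()
        self._last_ms = -1
        self._sequence = 0

    def next_id(self) -> int:
        with self._lock:
            now_ms = _now_ms()
            if now_ms < self._last_ms:
                # Clock moved backwards: keep using the last timestamp to avoid duplicates.
                now_ms = self._last_ms
            if now_ms == self._last_ms:
                self._sequence = (self._sequence + 1) & MAX_SEQUENCE
                if self._sequence == 0:
                    # Sequence exhausted for this millisecond: wait for the next one.
                    while now_ms <= self._last_ms:
                        now_ms = _now_ms()
            else:
                self._sequence = 0
            self._last_ms = now_ms
            timestamp = now_ms - CUSTOM_EPOCH_MS
            return (
                (timestamp << (NODE_BITS + SEQUENCE_BITS))
                | (self.node_id << SEQUENCE_BITS)
                | self._sequence
            )

def decode(snowflake_id: int) -> tuple:
    """Split an ID back into (timestamp_ms, node_id, sequence)."""
    sequence = snowflake_id & MAX_SEQUENCE
    node_id = (snowflake_id >> SEQUENCE_BITS) & MAX_NODE_ID
    timestamp_ms = (snowflake_id >> (NODE_BITS + SEQUENCE_BITS)) + CUSTOM_EPOCH_MS
    return timestamp_ms, node_id, sequence

generator = SnowflakeGenerator(node_id=7)
sample_id = generator.next_id()
print(sample_id, decode(sample_id))
```

Because the timestamp occupies the high bits, IDs from one generator increase over time, and the value fits a signed BIGINT column.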

distributed_primary_keys.sql
-- Pattern: Snowflake-style distributed ID
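-- 64-bit ID: unused sign bit (1 bit) + timestamp (41 bits) + node ID (10 bits) + sequence (12 bits)
-- Time-sortable and globally unique; needs no per-ID coordination once each node has a unique node ID
 
-- PostgreSQL: Using UUID v7 (time-ordered UUID, standardized in RFC 9562)
-- PostgreSQL 18+ ships a built-in uuidv7(); earlier versions need an extension such as pg_uuidv7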
CREATE EXTENSION IF NOT EXISTS pg_uuidv7;
 
CREATE TABLE distributed_orders (
    order_id        UUID DEFAULT uuid_generate_v7() PRIMARY KEY,
    customer_id     INT NOT NULL,
    order_date      TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
 
 
-- Alternative: ULID (Universally Unique Lexicographically Sortable Identifier)
-- 128-bit: 48-bit timestamp + 80-bit randomness
-- Encoded as a 26-character Crockford base32 string, sorts chronologically
 
CREATE TABLE events (
    event_id        CHAR(26) PRIMARY KEY,  -- ULID generated by application
    event_type      VARCHAR(50) NOT NULL,
    payload         JSONB,
    created_at      TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
 
 
-- Pattern: Composite sharding key
-- Ensures related data is co-located on same shard
CREATE TABLE tenant_transactions (
    tenant_id       INT NOT NULL,
    transaction_id  BIGINT NOT NULL,  -- Sequence per tenant
    amount          DECIMAL(15,2),
    transaction_date TIMESTAMP,
    
    -- Composite key with tenant first ensures tenant data locality
    PRIMARY KEY (tenant_id, transaction_id)
);
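
Since the events table expects the application to supply the ULID, here is a minimal sketch of one way to produce it: a 48-bit millisecond timestamp followed by 80 random bits, encoded with Crockford's base32 alphabet. The function name is hypothetical, and the sketch omits the optional monotonic mode, so two ULIDs created in the same millisecond are not ordered relative to each other.

```python
import secrets
import time

# Crockford base32: digits and uppercase letters without I, L, O, U.
CROCKFORD_ALPHABET = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"

def new_ulid(timestamp_ms=None) -> str:
    """Return a 26-character ULID; timestamp_ms defaults to the current time."""
    if timestamp_ms is None:
        timestamp_ms = time.time_ns() // 1_000_000
    if not 0 <= timestamp_ms < (1 << 48):
        raise ValueError("timestamp must fit in 48 bits")

    # High 48 bits: timestamp. Low 80 bits: cryptographically strong randomness.
    value = (timestamp_ms << 80) | secrets.randbits(80)

    # Emit 26 base32 digits, least significant first, then reverse.
    chars = []
    for _ in range(26):
        chars.append(CROCKFORD_ALPHABET[value & 0x1F])
        value >>= 5
    return "".join(reversed(chars))

event_id = new_ulid()
print(event_id, len(event_id))
```

The alphabet is in ascending ASCII order, so plain string comparison on the CHAR(26) column orders rows by creation time at millisecond granularity.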

Key Design Is Context-Dependent

Summary: PRIMARY KEY Mastery

We've explored PRIMARY KEY constraints from first principles to advanced distributed considerations. Let's consolidate the essential knowledge:

Key Takeaways

• PRIMARY KEY = UNIQUE + NOT NULL: It uniquely identifies each row and prohibits nulls, ensuring every row is positively identifiable.
• One Per Table: A table can have only one primary key, though it may span multiple columns (composite key).
• Automatic Indexing: Primary keys create indexes automatically, typically clustered in SQL Server/InnoDB, which affects physical storage order.
• Immutability Principle: Primary keys should never change after assignment. Design for stability from the start.
• Surrogate vs Natural: Surrogate keys (auto-increment integers) are generally preferred for operational simplicity; use natural/composite keys when domain modeling demands it.
• Distributed Considerations: A per-instance auto-increment counter cannot guarantee global uniqueness across database instances without coordination; use UUIDs, ULIDs, or Snowflake IDs instead.
• Always Name Constraints: Explicit constraint names (pk_tablename) improve maintainability and debugging.

