Loading content...
In our exploration of superkeys, we discovered a paradox of abundance: a single relation can have hundreds or even thousands of superkeys. While mathematically interesting, this abundance creates a practical problem—which superkey should we use for identification?
Consider a relation with attributes {A, B, C, D, E} where {A} is a superkey. By the superset property, all of the following are also superkeys:
{A, B}, {A, C}, {A, D}, {A, E}{A, B, C}, {A, B, D}, {A, B, E}, {A, C, D}, {A, C, E}, {A, D, E}This explosion raises critical questions:
{A, B, C, D, E} to identify a tuple when {A} alone suffices?The candidate key concept answers these questions by identifying the minimal superkeys—those with no redundant attributes.
By the end of this page, you will understand the formal definition of candidate keys and their relationship to superkeys, master algorithms for finding candidate keys from functional dependencies, appreciate why relations may have multiple candidate keys, and develop intuition for identifying candidate keys in practical database design scenarios.
Before defining candidate keys formally, we must understand what 'minimal' means in this context. Minimality isn't about the number of attributes—it's about irreducibility.
A set is minimal with respect to a property if removing any element causes the property to be lost. For superkeys, this means:
A superkey K is minimal if no proper subset of K is also a superkey. Equivalently, removing any single attribute from K destroys its ability to uniquely identify tuples.
Example: Examining Minimality
Consider Student(StudentID, Email, Name, Major) where both {StudentID} and {Email} uniquely identify tuples:
Is {StudentID} minimal?
Is {StudentID, Email} minimal?
Is {Name, Major} minimal? (Assuming it's a superkey)
Key insight: Minimality is about removing single attributes one at a time, not about having the smallest possible count.
With the minimality concept clarified, we can precisely define candidate keys:
A candidate key of a relation R is a minimal superkey of R. That is, K is a candidate key if: (1) K is a superkey — K → R (K functionally determines all attributes of R), and (2) K is minimal — for no proper subset K' ⊂ K is K' also a superkey.
Mathematical formalization:
Let R be a relation schema with attribute set U. Let F be the set of functional dependencies on R.
A set K ⊆ U is a candidate key of R if:
K⁺ = U (the closure of K under F equals the complete attribute set)∀A ∈ K: (K - {A})⁺ ≠ U (removing any single attribute causes the closure to shrink below U)Equivalently, using functional dependencies:
K is a candidate key if:
K → U (K determines everything)∀A ∈ K: K - {A} ↛ U (K minus any attribute doesn't determine everything)The two conditions are essential:
| Condition | What it ensures | Consequence if violated |
|---|---|---|
| Superkey property (K → R) | K can identify tuples | K is useless for identification |
| Minimality (no proper subset works) | K has no redundancy | K wastes storage and complicates constraints |
The superkey property ensures correctness—K actually works for unique identification. The minimality property ensures efficiency—K uses no more attributes than necessary. Together, they define the 'tightest' possible unique identifier.
Finding all candidate keys of a relation given its functional dependencies is a fundamental database task. We present a systematic algorithm and then work through detailed examples.
Step 1: Identify attributes that appear only on the LEFT side of FDs (must be in every key). Step 2: Identify attributes that appear only on the RIGHT side of FDs (never in any key). Step 3: Identify attributes on both sides (may or may not be in keys). Step 4: Start with left-only attributes, compute closure. If complete, this is a candidate key. Step 5: If incomplete, systematically add combinations of 'both sides' attributes until closure is complete. Step 6: For each candidate found, verify minimality.
Detailed Example:
Consider R(A, B, C, D, E) with functional dependencies:
F = {AB → C, C → D, D → E, B → E}Step 1: Classify attributes
| Attribute | Appears on Left | Appears on Right | Classification |
|---|---|---|---|
| A | Yes (AB→C) | No | Left-only |
| B | Yes (AB→C, B→E) | No | Left-only |
| C | Yes (C→D) | Yes (from AB→C) | Both |
| D | Yes (D→E) | Yes (from C→D) | Both |
| E | No | Yes (from D→E, B→E) | Right-only |
Step 2: Start with left-only attributes
Left-only: {A, B} — these MUST be in every candidate key.
Step 3: Compute closure of {A, B}
{A, B}⁺ = {A, B}AB → C: {A, B}⁺ = {A, B, C}C → D: {A, B}⁺ = {A, B, C, D}D → E: {A, B}⁺ = {A, B, C, D, E}B → E: Already have EClosure = {A, B, C, D, E} = R ✓
Step 4: Verify minimality
{A}⁺ = {A} ≠ R — A alone doesn't work{B}⁺ = {B, E} ≠ R — B alone doesn't workBoth attributes are necessary, so {A, B} is minimal.
Conclusion: The only candidate key is {A, B}
Attributes that appear only on the right side of functional dependencies can NEVER be part of any candidate key. They are always determined by other attributes and never determine anything themselves. In the example, E is right-only and cannot appear in any candidate key.
A critical principle in relational theory is that relations can have multiple candidate keys. Each represents a valid, minimal way to uniquely identify tuples. The database designer must choose one as the primary key.
Example: Multiple Candidate Keys
Consider Employee(EmployeeID, SSN, Email, Name, Department)
Functional dependencies (from real-world semantics):
EmployeeID → SSN, Email, Name, DepartmentSSN → EmployeeID, Email, Name, DepartmentEmail → EmployeeID, SSN, Name, DepartmentLet's verify each single-attribute candidate:
| Attribute | Closure | Is Superkey? | Is Minimal? | Candidate Key? |
|---|---|---|---|---|
| {EmployeeID} | {All attributes} | Yes | Yes (single attribute) | Yes ✓ |
| {SSN} | {All attributes} | Yes | Yes (single attribute) | Yes ✓ |
| {Email} | {All attributes} | Yes | Yes (single attribute) | Yes ✓ |
| {Name} | {Name} | No | N/A | No |
| {Department} | {Department} | No | N/A | No |
This relation has three candidate keys: {EmployeeID}, {SSN}, and {Email}.
Each is:
Why multiple candidate keys exist:
Real-world identity has multiple facets: An employee is uniquely identified by their company ID, government SSN, and work email—each from a different domain of identification.
Historical accumulation: Systems often acquire multiple identification mechanisms over time as requirements evolve.
Integration requirements: When systems merge, each may bring its own identification scheme.
Design implications:
Superkey ⊇ Candidate Key ⊇ Primary Key. Every primary key is a candidate key. Every candidate key is a superkey. But the reverse is not true—most superkeys are not candidate keys (they're not minimal), and most candidate keys are not the primary key (only one is chosen).
While single-attribute candidate keys are common, many relations require composite candidate keys—keys consisting of multiple attributes. Understanding composite keys is essential for proper database design.
A composite key (or compound key) is a candidate key consisting of two or more attributes. The individual attributes are called key attributes or prime attributes. None of them alone is sufficient for unique identification—they must work together.
Example: Course Enrollment
Consider Enrollment(StudentID, CourseID, Semester, Grade, EnrollDate)
Business rules:
Functional dependencies:
{StudentID, CourseID, Semester} → Grade, EnrollDateAnalysis:
| Attribute Combination | Uniquely Identifies? | Why? |
|---|---|---|
| {StudentID} | No | Same student in many courses |
| {CourseID} | No | Many students per course |
| {Semester} | No | Many enrollments per semester |
| {StudentID, CourseID} | No | Same student, same course in different semesters |
| {StudentID, Semester} | No | Student takes multiple courses per semester |
| {CourseID, Semester} | No | Many students per course per semester |
| {StudentID, CourseID, Semester} | Yes | At most one enrollment per combination |
The candidate key is {StudentID, CourseID, Semester} — a 3-attribute composite key.
Minimality check:
All three attributes are necessary — the key is minimal.
The distinction between prime and non-prime attributes is fundamental to normalization theory and database design. These categories determine how functional dependencies should be structured.
An attribute is prime (or key attribute) if it is a member of ANY candidate key of the relation. Prime attributes participate in tuple identification.
An attribute is non-prime (or non-key attribute) if it is NOT a member of any candidate key. Non-prime attributes are determined by keys but don't participate in identification.
Example: Identifying Prime and Non-Prime Attributes
Consider R(A, B, C, D, E) with candidate keys {A, B} and {A, C}:
| Attribute | In Candidate Key {A,B}? | In Candidate Key {A,C}? | Classification |
|---|---|---|---|
| A | Yes | Yes | Prime |
| B | Yes | No | Prime |
| C | No | Yes | Prime |
| D | No | No | Non-Prime |
| E | No | No | Non-Prime |
Notice:
Why this classification matters:
Second Normal Form (2NF): Requires no partial dependencies on candidate keys. If a non-prime attribute depends on only part of a composite candidate key, 2NF is violated.
Third Normal Form (3NF): Requires no transitive dependencies through non-prime attributes. Non-prime → non-prime dependencies create update anomalies.
Boyce-Codd Normal Form (BCNF): Requires every determinant to be a superkey. The distinction helps identify violations.
Index design: Prime attributes often need indexes for efficient joins and lookups.
An attribute doesn't need to be in EVERY candidate key to be prime—membership in ANY candidate key suffices. In the example above, B is prime even though it's not in candidate key {A, C}.
For complex schemas with many functional dependencies, we need a more sophisticated approach to find all candidate keys. Here's a comprehensive algorithm:
1234567891011121314151617181920212223242526272829303132333435363738394041424344454647484950515253545556575859606162636465666768
def find_all_candidate_keys(attributes, fds): """ Find all candidate keys of a relation. Args: attributes: Set of all attributes in the relation fds: List of functional dependencies as (lhs, rhs) tuples Returns: List of candidate keys (each key is a frozenset) """ def closure(attr_set, fds): """Compute attribute closure under FDs""" result = set(attr_set) changed = True while changed: changed = False for lhs, rhs in fds: if lhs.issubset(result) and not rhs.issubset(result): result.update(rhs) changed = True return result def is_superkey(attr_set, attributes, fds): """Check if attr_set is a superkey""" return closure(attr_set, fds) == attributes def is_minimal(attr_set, attributes, fds): """Check if superkey is minimal (no proper subset is superkey)""" for attr in attr_set: subset = attr_set - {attr} if is_superkey(subset, attributes, fds): return False return True # Classify attributes left_only = set() right_only = set() both_sides = set() for lhs, rhs in fds: left_only.update(lhs) right_only.update(rhs) neither = attributes - left_only - right_only both_sides = left_only & right_only left_only = left_only - both_sides right_only = right_only - both_sides # Start with left-only and neither (must be in all keys) must_have = left_only | neither # Candidates to try: add subsets of 'both_sides' from itertools import combinations candidate_keys = [] maybe_attrs = list(both_sides) for size in range(len(maybe_attrs) + 1): for combo in combinations(maybe_attrs, size): candidate = must_have | set(combo) if is_superkey(candidate, attributes, fds): if is_minimal(candidate, attributes, fds): # Check not superset of existing key if not any(ck < candidate for ck in candidate_keys): candidate_keys.append(frozenset(candidate)) return candidate_keysAlgorithm walkthrough with complex example:
Consider R(A, B, C, D, E, F) with FDs:
AB → CC → DD → ECF → BE → AStep 1: Classify attributes
{F} (only on left in CF→B)Step 2: Try combinations with F
{F}⁺ = {F} — not a superkey
{A, F}⁺ = {A, F} — not a superkey
{B, F}⁺ = {B, F} — not a superkey
{C, F}⁺ = {C, F, D, E, A, B} = all! — superkey! ✓
{C}⁺ = {C, D, E, A} ≠ R, {F}⁺ = {F} ≠ R{C, F} is a candidate key ✓{A, B, F}⁺ = {A, B, F, C, D, E} = all — superkey
{A, B}⁺ = {A, B, C, D, E} ≠ R (no F){A, F}⁺ = {A, F} ≠ R{B, F}⁺ = {B, F} ≠ R{A, B, F} is a candidate key ✓Candidate keys: {C, F} and {A, B, F}
We've explored candidate keys in depth—the minimal superkeys that form the foundation for primary key selection. Let's consolidate the key insights:
What's next:
Now that we understand candidate keys, we'll explore primary keys—the specific candidate key chosen to serve as the main identification mechanism for a relation. We'll examine selection criteria, implementation considerations, and the critical properties that primary keys must maintain.
You now understand candidate keys as minimal superkeys. You can formally define and identify candidate keys, apply algorithms to find all candidate keys from functional dependencies, recognize composite keys and their applications, and classify attributes as prime or non-prime. Next, we'll explore primary key selection and its implications for database design.