Database Management SystemsFunctional Dependencies

Functional Dependency Concept

LevelIntermediate

Duration60 mins

TopicFunctional Dependencies

4 / 5

FD from Data vs Schema

The Origin of Functional Dependencies

Where do functional dependencies come from? This seemingly simple question reveals a fundamental divide in database design philosophy—and getting the answer wrong leads to brittle, error-prone schemas.

Can we examine existing data and discover functional dependencies? Or must we understand the real-world domain and declare them? The difference between these two approaches is the difference between designing for the data you have and designing for all valid data your system might ever hold.

This page explores both approaches, their strengths and limitations, and how professional database designers combine them to create robust schemas.

What You Will Learn

By the end of this page, you will understand the distinction between syntactic FDs (from data) and semantic FDs (from schema/domain), recognize why observed data can suggest but never prove FDs, and learn how to properly analyze domain requirements to establish true functional dependencies.

Two Sources of Functional Dependencies

Functional dependencies can be identified through two fundamentally different approaches:

From Data (Syntactic/Empirical)

•Examine existing data instances
•Look for patterns: same X → same Y
•Use algorithms to detect dependencies
•Also called: data-driven, bottom-up, discovered FDs
•Risk: patterns might be coincidental
•Useful for legacy system analysis

From Schema (Semantic/Domain)

•Understand real-world domain rules
•Derive FDs from business requirements
•Consult domain experts and documentation
•Also called: design-driven, top-down, declared FDs
•Source of truth for constraints
•Required for correct schema design

The Fundamental Asymmetry:

These two approaches have a crucial asymmetric relationship:

Data can SUGGEST but never PROVE an FD
- Even if current data shows SSN → Name perfectly, future data might violate it
- We might not have observed all possible scenarios yet
Schema can DECLARE and ENFORCE an FD
- If we know SSN must uniquely identify a person (by law/policy), we declare SSN → Name
- The DBMS enforces it; violations are rejected

Think of it like this: observing 1000 white swans doesn't prove "all swans are white." But if we define swan to require whiteness, then non-white swans are excluded by definition.

The Data-Only Trap

A common mistake is designing schemas purely from observed data patterns. 'The data shows one manager per department, so DeptID → ManagerID.' But what if the business allows co-managers? The data happened to show one-to-one because that's all that was entered so far. Design from RULES, not observations.

FDs from Data: Empirical Discovery

Sometimes we must work with existing data and attempt to discover FDs. This is common when reverse-engineering legacy systems or analyzing datasets without documentation. Let's understand how this works and its limitations.

The Discovery Process:

Examine the data for patterns where same X values always have same Y values
Candidate FDs emerge from consistent patterns
Validation checks if any counter-examples exist
Verification with domain experts confirms whether the pattern is a true constraint

Algorithms for FD Discovery:

Data mining algorithms (TANE, FUN, FastFDs) can automatically discover FDs from datasets. They systematically check attribute combinations to find deterministic relationships.

fd_discovery_from_data.sql
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
-- Example: Discovering FDs from Data
 
-- Given this SALES table with sample data:
-- TransID  StoreID  StoreName     Region    Product   Quantity
-- T001     S001     Downtown      North     Widget    5
-- T002     S001     Downtown      North     Gadget    3
-- T003     S002     Uptown        South     Widget    2
-- T004     S002     Uptown        South     Gadget    7
-- T005     S003     Mall          North     Widget    4
 
-- Observation 1: Same StoreID always has same StoreName
-- S001 → Downtown (twice)
-- S002 → Uptown (twice)
-- S003 → Mall (once)
-- CANDIDATE FD: StoreID → StoreName
 
-- Observation 2: Same StoreID always has same Region
-- S001 → North (twice)
-- S002 → South (twice)
-- S003 → North (once)
-- CANDIDATE FD: StoreID → Region
 
-- Observation 3: Same Region does NOT always have same StoreName
-- North → Downtown AND North → Mall
-- NOT A VALID FD: Region → StoreName
 
-- SQL to check for FD violations:
-- Check if StoreID → StoreName holds
SELECT StoreID, COUNT(DISTINCT StoreName) as name_count
FROM SALES
GROUP BY StoreID
HAVING COUNT(DISTINCT StoreName) > 1;
 
-- If zero rows returned, the FD MIGHT hold
-- If rows returned, the FD definitely does NOT hold
 
-- WARNING: Zero violations in current data does NOT prove 
-- the FD is a true constraint! It only shows no counter-examples YET.

Limitations of Data-Driven Discovery

Data-driven FD discovery has fundamental limits: • Small samples may miss violations • Data entry errors may hide true constraints • Business rules may differ from practice • Future data may violate current patterns

Use data discovery for HYPOTHESIS GENERATION, not final design decisions.

FDs from Schema: Semantic Derivation

The proper source of functional dependencies is semantic understanding of the domain—the real-world rules and constraints that govern the data.

Sources of Semantic FDs:

Business Rules
- "Each employee has exactly one department"
- Translates to: EmpID → DeptID
Legal/Regulatory Requirements
- "Social Security Numbers uniquely identify individuals"
- Translates to: SSN → PersonID (and other attributes)
Physical Reality
- "A book has one ISBN"
- Translates to: ISBN → BookTitle, Author, etc.
Domain Expert Knowledge
- "A flight on a date has one assigned gate"
- Translates to: {FlightNo, Date} → Gate
System Design Decisions
- "We will only store one phone number per customer"
- Translates to: CustomerID → PhoneNumber (by design choice)

Semantic vs. Data-Observed FDs: Examples
Scenario	Data Shows	Semantic Truth	Correct Approach
Employee-Manager	Each dept shows one manager	Departments may have co-managers	DeptID does NOT → ManagerID
Course-Instructor	Each course has one instructor	Courses can be team-taught	{CourseID, Semester} → Instructors (set)
Customer-Address	Each customer has one address	Business requires one address	CustomerID → Address (enforced)
Product-Price	Same product, same price	Prices can vary by region	{ProductID, Region} → Price

semantic_fd_derivation.sql
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
-- Semantic FD Derivation Process
 
-- STEP 1: Gather Business Requirements
-- Interview: "Tell me about employees and departments"
-- Answer: "Each employee works in exactly one department at any time.
--          An employee can transfer, but is only in one dept at a moment."
 
-- STEP 2: Derive FD from Requirement
-- Requirement implies: EmpID → DeptID (at any point in time)
-- If tracking history: {EmpID, Date} → DeptID
 
-- STEP 3: Validate FD is Correct
-- Q: "Can an employee be in two departments simultaneously?"
-- A: "No, never. Even dotted-line reporting, they have a 'home' dept."
-- FD Confirmed: EmpID → DeptID
 
-- STEP 4: Consider Future Requirements
-- Q: "Might this change? Part-time in multiple depts?"
-- A: "No plans for that. If it changes, we'd redesign."
-- FD Remains Valid: EmpID → DeptID
 
-- STEP 5: Implement in Schema
CREATE TABLE EMPLOYEE (
    EmpID INT PRIMARY KEY,
    Name VARCHAR(100),
    DeptID INT NOT NULL,  -- FD enforced: EmpID → DeptID
    FOREIGN KEY (DeptID) REFERENCES DEPARTMENT(DeptID)
);
 
-- The PRIMARY KEY constraint enforces: EmpID determines everything
-- The NOT NULL + FK enforces: every employee has exactly one dept

The Requirements Interview

Professional database designers spend significant time interviewing stakeholders. The questions 'Can X have multiple Y?' and 'Must X always have the same Y?' directly determine FDs. Never assume—always ask.

The Critical Difference: Constraint vs. Coincidence

The distinction between semantic and data-derived FDs reduces to one question: Is this a constraint (must hold) or a coincidence (happens to hold)?

Constraint vs. Coincidence Analysis
Observation	Constraint?	Coincidence?	How to Determine
SSN → Name	Yes - legally mandated uniqueness	No	SSNs are assigned uniquely by law
Name → BirthDate	No	Yes - appears in small sample	Many people share names
ZipCode → City	Usually yes	Edge cases exist	Some ZIP codes span cities
EmpID → Salary	Depends on design	Might be historical only	Ask: 'Can an employee have multiple salaries?'
OrderID → CustomerID	Yes - by design	No	Orders are placed by customers - fundamental

The Danger of Coincidental FDs:

Building a schema on coincidental FDs creates time bombs:

Works until it doesn't - The system operates fine until someone enters data that violates the assumed (but not enforced) FD
Silent corruption - Without enforcement, violations go unnoticed until reports produce wrong results
Migration failures - Moving to a new system, the "assumed" FD isn't enforced, and data quality degrades

Example Disaster:

Assumption: CourseID → Instructor (each course has one instructor)
Reality: Courses can be team-taught
Data entered: CS101 → Dr. Smith in one record, CS101 → Dr. Jones in another
Result: Reports show wrong instructor counts, schedules conflict, students confused

The error wasn't in the data entry—it was in assuming an FD that wasn't truly a constraint.

Design for Exceptions

Real-world data is messier than initial samples suggest. Design for the exceptions: 'What if two instructors teach together?' 'What if an employee has two roles?' 'What if a product is sold in multiple regions at different prices?' If exceptions are possible, don't assert the FD.

Combining Data Analysis and Semantic Understanding

In practice, professional database designers use BOTH approaches—data analysis suggests hypotheses, semantic understanding confirms or rejects them.

The Hybrid Methodology:

Phase 1: Data Exploration

Run FD discovery algorithms on existing data
Generate candidate FDs from observed patterns
Flag attributes with potential dependencies

Phase 2: Semantic Validation

For each candidate FD, consult domain experts
Ask: "Does X ALWAYS determine Y, or just in current data?"
Ask: "Are there scenarios where X could have multiple Y values?"

Phase 3: Design Decision

Confirmed constraints become enforced FDs in the schema
Coincidental patterns are documented but NOT enforced
Ambiguous cases are designed flexibly (allow multiple values)

Phase 4: Implementation

Enforced FDs become primary keys, unique constraints, or triggers
Schema structure reflects true semantic constraints

hybrid_fd_analysis.sql
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
-- Hybrid FD Analysis Example
 
-- PHASE 1: Data Exploration
-- Analyze an existing ORDERS table
 
-- Find potential determinants for CustomerName
SELECT 
    'CustomerID' as potential_determinant,
    COUNT(DISTINCT CustomerName) as distinct_values,
    COUNT(*) as total_rows
FROM ORDERS
GROUP BY CustomerID
HAVING COUNT(DISTINCT CustomerName) > 1;
 
-- Result: Zero rows (no violations found)
-- Hypothesis: CustomerID → CustomerName
 
-- PHASE 2: Semantic Validation
-- Question to business: "Does each customer ID always have the same name?"
-- Answer: "Yes - CustomerID is our unique identifier for customers"
-- Confirmed: CustomerID → CustomerName is a true constraint
 
-- Another analysis
SELECT 
    'OrderDate' as potential_determinant,
    COUNT(DISTINCT Warehouse) as distinct_values
FROM ORDERS
GROUP BY OrderDate
HAVING COUNT(DISTINCT Warehouse) > 1;
 
-- Result: Multiple rows (violations exist)
-- Observation: Same order date, multiple warehouses
-- No FD: OrderDate does NOT → Warehouse
 
-- But wait - check per-order
SELECT OrderID, COUNT(DISTINCT Warehouse) 
FROM ORDERS
GROUP BY OrderID
HAVING COUNT(DISTINCT Warehouse) > 1;
 
-- Result: Zero rows
-- Hypothesis: OrderID → Warehouse
-- Question to business: "Is each order fulfilled from one warehouse?"
-- Answer: "Usually, but split shipments happen for large orders"
-- REJECTED: OrderID does NOT → Warehouse (exceptions exist)
 
-- PHASE 3: Design Decision
-- CustomerID → CustomerName: ENFORCE
-- OrderID → Warehouse: DO NOT ENFORCE (allow multiples per order)
 
-- PHASE 4: Implementation
CREATE TABLE CUSTOMER (
    CustomerID INT PRIMARY KEY,  -- Enforces: CustomerID is determinant
    CustomerName VARCHAR(100) NOT NULL
);
 
CREATE TABLE ORDER_SHIPMENT (
    OrderID INT,
    ShipmentID INT,
    Warehouse VARCHAR(50),
    PRIMARY KEY (OrderID, ShipmentID),  -- Allows multiple per order
    FOREIGN KEY (OrderID) REFERENCES ORDERS(OrderID)
);

Document Your Analysis

Always document WHY you concluded an FD exists or doesn't exist. 'CustomerID → CustomerName because CustomerID is the primary identifier per business policy' is valuable for future maintainers. It distinguishes constraints from coincidences.

FD Discovery Algorithms (Overview)

For completeness, let's briefly survey automated FD discovery—useful for legacy system analysis and data profiling.

Notable FD Discovery Algorithms
Algorithm	Year	Approach	Characteristics
TANE	1999	Level-wise, partition-based	Pioneering algorithm, handles large schemas
FUN	2001	Bottom-up, attribute-pair focused	Efficient for sparse dependencies
FastFDs	2002	Difference-set based	Fast for medium datasets
DFD	2014	Sampling + validation	Handles very large datasets
HyFD	2017	Hybrid (sampling + validation)	State-of-the-art performance

How These Algorithms Work (Conceptually):

Partition-Based (TANE, FUN)
- Create partitions of rows based on attribute values
- If partition on X refines perfectly to partition on Y, then X → Y
- Explore candidate FDs level by level (single attrs, then pairs, etc.)
Difference-Set Based (FastFDs)
- For each pair of rows, compute which attributes differ
- Use set algebra to prune impossible FDs quickly
Hybrid/Sampling (HyFD, DFD)
- Sample data to generate candidate FDs
- Validate candidates against full dataset
- Prune search space dynamically

Complexity Challenge:

For n attributes, there are potentially O(2^n) possible left-hand sides and O(2^n) possible right-hand sides. Smart algorithms use pruning rules (e.g., if A → B, then AC → B without checking) to avoid exponential blowup.

When to Use FD Discovery Tools

Use FD discovery tools when: • Reverse-engineering undocumented legacy systems • Profiling data quality before migration • Generating hypotheses for domain expert validation • Checking if declared FDs actually hold in production data

Never use them as the SOLE source of schema FDs.

Practical Guidelines for FD Identification

Based on everything we've covered, here are actionable guidelines for identifying FDs in real projects.

FD Identification Guidelines

•Start with semantics — Begin by understanding business rules, not by analyzing data. Ask stakeholders: 'What must always be true about this data?'
•Document sources — For each FD, note whether it came from business rule, regulation, physical reality, or design decision. This helps future maintenance.
•Question coincidences — When data shows a pattern, ask: 'Is this enforced or accidental?' 'What would change if the pattern broke?'
•Consider time dimensions — Does the FD hold over time? EmpID → DeptID might hold NOW but not historically. Consider if you need temporal versioning.
•Beware of sample bias — Small datasets may not show exceptions. 'We've never had co-instructors' doesn't mean the FD holds—maybe it just hasn't happened yet.
•Use data to verify, not define — After declaring semantic FDs, check if current data complies. Violations indicate either data errors or incorrect assumptions.
•Plan for exceptions — If an exception might ever occur, don't enforce the FD. Flexible schemas handle change better than rigid ones.
•Document exceptions explicitly — If you know 'usually X → Y but sometimes not,' document the exception cases for future reference.

The 'What If' Test

For any proposed FD X → Y, ask: 'What if tomorrow we need X to have multiple Y values?' If the answer is 'that's impossible by definition' (e.g., ISBN → title), enforce the FD. If the answer is 'that's inconceivable but technically possible,' don't enforce—design flexibly.

Summary: Sources of Functional Dependencies

We've explored the crucial distinction between data-driven and semantically-driven FD identification. Let's consolidate the key insights:

Key Takeaways

•Two Sources — FDs can be discovered from data (empirical) or derived from domain semantics (declarative). The sources are fundamentally different.
•Data Suggests, Semantics Confirms — Data patterns suggest hypotheses; domain understanding confirms or refutes them. Never rely solely on data patterns.
•Constraint vs. Coincidence — The key question is whether a pattern MUST hold (constraint) or just HAPPENS to hold (coincidence). Only constraints become schema FDs.
•Semantic FDs Are Primary — Business rules, regulations, and domain expert knowledge are the authoritative sources for FDs in schema design.
•Hybrid Approach Works Best — Use data analysis to generate hypotheses, then validate semantically. Document the reasoning for future maintainers.
•Design for Exceptions — If exceptions might ever occur, design flexibility into the schema rather than enforcing a potentially-violable FD.
•FD Discovery Tools — Algorithms like TANE and HyFD automate pattern discovery but cannot replace semantic analysis.

What's Next:

With FD sources clarified, we'll now explore the Semantic Meaning of functional dependencies—what FDs truly represent about the real world, how they encode business rules, and how to reason about their implications for data integrity. This completes our foundational understanding of FD concepts.

Page Complete

You now understand the critical distinction between FDs derived from data observation versus semantic domain knowledge. This understanding protects you from building schemas on coincidental patterns—a common cause of database design failures.

4 / 5

Loading learning content...

Database Management SystemsFunctional Dependencies

Functional Dependency Concept

LevelIntermediate

Duration60 mins

TopicFunctional Dependencies

4 / 5

FD from Data vs Schema

The Origin of Functional Dependencies

This page explores both approaches, their strengths and limitations, and how professional database designers combine them to create robust schemas.

What You Will Learn

Two Sources of Functional Dependencies

Functional dependencies can be identified through two fundamentally different approaches:

From Data (Syntactic/Empirical)

•Examine existing data instances
•Look for patterns: same X → same Y
•Use algorithms to detect dependencies
•Also called: data-driven, bottom-up, discovered FDs
•Risk: patterns might be coincidental
•Useful for legacy system analysis

From Schema (Semantic/Domain)

•Understand real-world domain rules
•Derive FDs from business requirements
•Consult domain experts and documentation
•Also called: design-driven, top-down, declared FDs
•Source of truth for constraints
•Required for correct schema design

The Fundamental Asymmetry:

These two approaches have a crucial asymmetric relationship:

Data can SUGGEST but never PROVE an FD
- Even if current data shows SSN → Name perfectly, future data might violate it
- We might not have observed all possible scenarios yet
Schema can DECLARE and ENFORCE an FD
- If we know SSN must uniquely identify a person (by law/policy), we declare SSN → Name
- The DBMS enforces it; violations are rejected

Think of it like this: observing 1000 white swans doesn't prove "all swans are white." But if we define swan to require whiteness, then non-white swans are excluded by definition.

The Data-Only Trap

FDs from Data: Empirical Discovery

The Discovery Process:

Examine the data for patterns where same X values always have same Y values
Candidate FDs emerge from consistent patterns
Validation checks if any counter-examples exist
Verification with domain experts confirms whether the pattern is a true constraint

Algorithms for FD Discovery:

Data mining algorithms (TANE, FUN, FastFDs) can automatically discover FDs from datasets. They systematically check attribute combinations to find deterministic relationships.

fd_discovery_from_data.sql
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
-- Example: Discovering FDs from Data
 
-- Given this SALES table with sample data:
-- TransID  StoreID  StoreName     Region    Product   Quantity
-- T001     S001     Downtown      North     Widget    5
-- T002     S001     Downtown      North     Gadget    3
-- T003     S002     Uptown        South     Widget    2
-- T004     S002     Uptown        South     Gadget    7
-- T005     S003     Mall          North     Widget    4
 
-- Observation 1: Same StoreID always has same StoreName
-- S001 → Downtown (twice)
-- S002 → Uptown (twice)
-- S003 → Mall (once)
-- CANDIDATE FD: StoreID → StoreName
 
-- Observation 2: Same StoreID always has same Region
-- S001 → North (twice)
-- S002 → South (twice)
-- S003 → North (once)
-- CANDIDATE FD: StoreID → Region
 
-- Observation 3: Same Region does NOT always have same StoreName
-- North → Downtown AND North → Mall
-- NOT A VALID FD: Region → StoreName
 
-- SQL to check for FD violations:
-- Check if StoreID → StoreName holds
SELECT StoreID, COUNT(DISTINCT StoreName) as name_count
FROM SALES
GROUP BY StoreID
HAVING COUNT(DISTINCT StoreName) > 1;
 
-- If zero rows returned, the FD MIGHT hold
-- If rows returned, the FD definitely does NOT hold
 
-- WARNING: Zero violations in current data does NOT prove 
-- the FD is a true constraint! It only shows no counter-examples YET.

Limitations of Data-Driven Discovery

Use data discovery for HYPOTHESIS GENERATION, not final design decisions.

FDs from Schema: Semantic Derivation

The proper source of functional dependencies is semantic understanding of the domain—the real-world rules and constraints that govern the data.

Sources of Semantic FDs:

Business Rules
- "Each employee has exactly one department"
- Translates to: EmpID → DeptID
Legal/Regulatory Requirements
- "Social Security Numbers uniquely identify individuals"
- Translates to: SSN → PersonID (and other attributes)
Physical Reality
- "A book has one ISBN"
- Translates to: ISBN → BookTitle, Author, etc.
Domain Expert Knowledge
- "A flight on a date has one assigned gate"
- Translates to: {FlightNo, Date} → Gate
System Design Decisions
- "We will only store one phone number per customer"
- Translates to: CustomerID → PhoneNumber (by design choice)

Semantic vs. Data-Observed FDs: Examples
Scenario	Data Shows	Semantic Truth	Correct Approach
Employee-Manager	Each dept shows one manager	Departments may have co-managers	DeptID does NOT → ManagerID
Course-Instructor	Each course has one instructor	Courses can be team-taught	{CourseID, Semester} → Instructors (set)
Customer-Address	Each customer has one address	Business requires one address	CustomerID → Address (enforced)
Product-Price	Same product, same price	Prices can vary by region	{ProductID, Region} → Price

semantic_fd_derivation.sql
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
-- Semantic FD Derivation Process
 
-- STEP 1: Gather Business Requirements
-- Interview: "Tell me about employees and departments"
-- Answer: "Each employee works in exactly one department at any time.
--          An employee can transfer, but is only in one dept at a moment."
 
-- STEP 2: Derive FD from Requirement
-- Requirement implies: EmpID → DeptID (at any point in time)
-- If tracking history: {EmpID, Date} → DeptID
 
-- STEP 3: Validate FD is Correct
-- Q: "Can an employee be in two departments simultaneously?"
-- A: "No, never. Even dotted-line reporting, they have a 'home' dept."
-- FD Confirmed: EmpID → DeptID
 
-- STEP 4: Consider Future Requirements
-- Q: "Might this change? Part-time in multiple depts?"
-- A: "No plans for that. If it changes, we'd redesign."
-- FD Remains Valid: EmpID → DeptID
 
-- STEP 5: Implement in Schema
CREATE TABLE EMPLOYEE (
    EmpID INT PRIMARY KEY,
    Name VARCHAR(100),
    DeptID INT NOT NULL,  -- FD enforced: EmpID → DeptID
    FOREIGN KEY (DeptID) REFERENCES DEPARTMENT(DeptID)
);
 
-- The PRIMARY KEY constraint enforces: EmpID determines everything
-- The NOT NULL + FK enforces: every employee has exactly one dept

The Requirements Interview

The Critical Difference: Constraint vs. Coincidence

The distinction between semantic and data-derived FDs reduces to one question: Is this a constraint (must hold) or a coincidence (happens to hold)?

Constraint vs. Coincidence Analysis
Observation	Constraint?	Coincidence?	How to Determine
SSN → Name	Yes - legally mandated uniqueness	No	SSNs are assigned uniquely by law
Name → BirthDate	No	Yes - appears in small sample	Many people share names
ZipCode → City	Usually yes	Edge cases exist	Some ZIP codes span cities
EmpID → Salary	Depends on design	Might be historical only	Ask: 'Can an employee have multiple salaries?'
OrderID → CustomerID	Yes - by design	No	Orders are placed by customers - fundamental

The Danger of Coincidental FDs:

Building a schema on coincidental FDs creates time bombs:

Works until it doesn't - The system operates fine until someone enters data that violates the assumed (but not enforced) FD
Silent corruption - Without enforcement, violations go unnoticed until reports produce wrong results
Migration failures - Moving to a new system, the "assumed" FD isn't enforced, and data quality degrades

Example Disaster:

Assumption: CourseID → Instructor (each course has one instructor)
Reality: Courses can be team-taught
Data entered: CS101 → Dr. Smith in one record, CS101 → Dr. Jones in another
Result: Reports show wrong instructor counts, schedules conflict, students confused

The error wasn't in the data entry—it was in assuming an FD that wasn't truly a constraint.

Design for Exceptions

Combining Data Analysis and Semantic Understanding

In practice, professional database designers use BOTH approaches—data analysis suggests hypotheses, semantic understanding confirms or rejects them.

The Hybrid Methodology:

Phase 1: Data Exploration

Run FD discovery algorithms on existing data
Generate candidate FDs from observed patterns
Flag attributes with potential dependencies

Phase 2: Semantic Validation

For each candidate FD, consult domain experts
Ask: "Does X ALWAYS determine Y, or just in current data?"
Ask: "Are there scenarios where X could have multiple Y values?"

Phase 3: Design Decision

Confirmed constraints become enforced FDs in the schema
Coincidental patterns are documented but NOT enforced
Ambiguous cases are designed flexibly (allow multiple values)

Phase 4: Implementation

Enforced FDs become primary keys, unique constraints, or triggers
Schema structure reflects true semantic constraints

hybrid_fd_analysis.sql
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
-- Hybrid FD Analysis Example
 
-- PHASE 1: Data Exploration
-- Analyze an existing ORDERS table
 
-- Find potential determinants for CustomerName
SELECT 
    'CustomerID' as potential_determinant,
    COUNT(DISTINCT CustomerName) as distinct_values,
    COUNT(*) as total_rows
FROM ORDERS
GROUP BY CustomerID
HAVING COUNT(DISTINCT CustomerName) > 1;
 
-- Result: Zero rows (no violations found)
-- Hypothesis: CustomerID → CustomerName
 
-- PHASE 2: Semantic Validation
-- Question to business: "Does each customer ID always have the same name?"
-- Answer: "Yes - CustomerID is our unique identifier for customers"
-- Confirmed: CustomerID → CustomerName is a true constraint
 
-- Another analysis
SELECT 
    'OrderDate' as potential_determinant,
    COUNT(DISTINCT Warehouse) as distinct_values
FROM ORDERS
GROUP BY OrderDate
HAVING COUNT(DISTINCT Warehouse) > 1;
 
-- Result: Multiple rows (violations exist)
-- Observation: Same order date, multiple warehouses
-- No FD: OrderDate does NOT → Warehouse
 
-- But wait - check per-order
SELECT OrderID, COUNT(DISTINCT Warehouse) 
FROM ORDERS
GROUP BY OrderID
HAVING COUNT(DISTINCT Warehouse) > 1;
 
-- Result: Zero rows
-- Hypothesis: OrderID → Warehouse
-- Question to business: "Is each order fulfilled from one warehouse?"
-- Answer: "Usually, but split shipments happen for large orders"
-- REJECTED: OrderID does NOT → Warehouse (exceptions exist)
 
-- PHASE 3: Design Decision
-- CustomerID → CustomerName: ENFORCE
-- OrderID → Warehouse: DO NOT ENFORCE (allow multiples per order)
 
-- PHASE 4: Implementation
CREATE TABLE CUSTOMER (
    CustomerID INT PRIMARY KEY,  -- Enforces: CustomerID is determinant
    CustomerName VARCHAR(100) NOT NULL
);
 
CREATE TABLE ORDER_SHIPMENT (
    OrderID INT,
    ShipmentID INT,
    Warehouse VARCHAR(50),
    PRIMARY KEY (OrderID, ShipmentID),  -- Allows multiple per order
    FOREIGN KEY (OrderID) REFERENCES ORDERS(OrderID)
);

Document Your Analysis

FD Discovery Algorithms (Overview)

For completeness, let's briefly survey automated FD discovery—useful for legacy system analysis and data profiling.

Notable FD Discovery Algorithms
Algorithm	Year	Approach	Characteristics
TANE	1999	Level-wise, partition-based	Pioneering algorithm, handles large schemas
FUN	2001	Bottom-up, attribute-pair focused	Efficient for sparse dependencies
FastFDs	2002	Difference-set based	Fast for medium datasets
DFD	2014	Sampling + validation	Handles very large datasets
HyFD	2017	Hybrid (sampling + validation)	State-of-the-art performance

How These Algorithms Work (Conceptually):

Partition-Based (TANE, FUN)
- Create partitions of rows based on attribute values
- If partition on X refines perfectly to partition on Y, then X → Y
- Explore candidate FDs level by level (single attrs, then pairs, etc.)
Difference-Set Based (FastFDs)
- For each pair of rows, compute which attributes differ
- Use set algebra to prune impossible FDs quickly
Hybrid/Sampling (HyFD, DFD)
- Sample data to generate candidate FDs
- Validate candidates against full dataset
- Prune search space dynamically

Complexity Challenge:

When to Use FD Discovery Tools

Never use them as the SOLE source of schema FDs.

Practical Guidelines for FD Identification

Based on everything we've covered, here are actionable guidelines for identifying FDs in real projects.

FD Identification Guidelines

•Start with semantics — Begin by understanding business rules, not by analyzing data. Ask stakeholders: 'What must always be true about this data?'
•Document sources — For each FD, note whether it came from business rule, regulation, physical reality, or design decision. This helps future maintenance.
•Question coincidences — When data shows a pattern, ask: 'Is this enforced or accidental?' 'What would change if the pattern broke?'
•Consider time dimensions — Does the FD hold over time? EmpID → DeptID might hold NOW but not historically. Consider if you need temporal versioning.
•Beware of sample bias — Small datasets may not show exceptions. 'We've never had co-instructors' doesn't mean the FD holds—maybe it just hasn't happened yet.
•Use data to verify, not define — After declaring semantic FDs, check if current data complies. Violations indicate either data errors or incorrect assumptions.
•Plan for exceptions — If an exception might ever occur, don't enforce the FD. Flexible schemas handle change better than rigid ones.
•Document exceptions explicitly — If you know 'usually X → Y but sometimes not,' document the exception cases for future reference.

The 'What If' Test

Summary: Sources of Functional Dependencies

We've explored the crucial distinction between data-driven and semantically-driven FD identification. Let's consolidate the key insights:

Key Takeaways

•Two Sources — FDs can be discovered from data (empirical) or derived from domain semantics (declarative). The sources are fundamentally different.
•Data Suggests, Semantics Confirms — Data patterns suggest hypotheses; domain understanding confirms or refutes them. Never rely solely on data patterns.
•Constraint vs. Coincidence — The key question is whether a pattern MUST hold (constraint) or just HAPPENS to hold (coincidence). Only constraints become schema FDs.
•Semantic FDs Are Primary — Business rules, regulations, and domain expert knowledge are the authoritative sources for FDs in schema design.
•Hybrid Approach Works Best — Use data analysis to generate hypotheses, then validate semantically. Document the reasoning for future maintainers.
•Design for Exceptions — If exceptions might ever occur, design flexibility into the schema rather than enforcing a potentially-violable FD.
•FD Discovery Tools — Algorithms like TANE and HyFD automate pattern discovery but cannot replace semantic analysis.

What's Next:

Page Complete

4 / 5