In the previous page, we explored how file-based data management systems worked. Now we must confront a critical question: If file systems worked, why did we need something else?
The answer lies in understanding that file systems didn't merely have inconveniences—they had fundamental structural limitations that became increasingly problematic as organizations grew and their data needs became more sophisticated.
These limitations weren't bugs to be fixed; they were inherent to the file-based approach itself.
By the end of this page, you will understand the complete taxonomy of file system limitations—from data redundancy and integrity problems to security weaknesses and concurrent access failures. You'll see how these limitations aren't independent issues but interconnected consequences of the file-based architecture.
To systematically understand file system limitations, we can organize them into categories based on the type of problem they create. This taxonomy helps us see not just individual issues but the patterns that connect them:
| Category | Core Problem | Business Impact |
|---|---|---|
| Data Redundancy | Same data stored multiple times | Wasted storage, update complexity, inconsistency |
| Data Inconsistency | Different versions of same data | Conflicting information, unreliable reporting, customer confusion |
| Data Isolation | Data trapped in application silos | Inability to answer cross-functional questions, integration nightmares |
| Integrity Problems | No central enforcement of rules | Invalid data, broken relationships, corruption |
| Security Limitations | Coarse-grained access control | All-or-nothing access, difficulty meeting compliance |
| Concurrency Issues | No coordination of simultaneous access | Lost updates, phantom reads, corrupted files |
| Atomicity Failures | No transaction guarantees | Partial updates, inconsistent state after failures |
| Program-Data Dependence | Logic and data coupled | Maintenance burden, change resistance, high costs |
Let's examine each of these limitations in depth, understanding not just what the problem is but why the file-based architecture makes it inevitable.
Data redundancy occurs when the same piece of information is stored in multiple locations within an organization's data files. In file-based systems, redundancy isn't an accident—it's a structural inevitability.
Why Redundancy Is Unavoidable in File Systems:
A Quantitative Example:
Consider a mid-sized insurance company with 500,000 policyholders. The same customer information might appear in:
| Application | Customer Data Stored | Records |
|---|---|---|
| Policy Administration | Name, Address, Phone, DOB, SSN | 500,000 |
| Billing | Name, Address, Phone, Payment Info | 500,000 |
| Claims | Name, Address, Phone, Claim History | 200,000 (active) |
| Underwriting | Name, Address, DOB, SSN, Risk Info | 100,000 (recent) |
| Marketing | Name, Address, Demographics | 500,000 |
| Agent Portal | Name, Address, Phone, Agent Info | 500,000 |
If core customer data (name, address, phone, identifiers) consumes 500 bytes per record, and this data is duplicated across 6 systems: 500 bytes × 500,000 customers × 6 copies ≈ 1.5 GB of storage, of which roughly 1.25 GB is pure redundancy (everything beyond a single canonical copy). In 1980 terms, that represented significant disk cost. But storage waste was the least of the problems.
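The arithmetic above can be checked with a short sketch. The figures are the illustrative ones from the example, not real measurements:

```python
# Illustrative figures from the insurance-company example (hypothetical).
BYTES_PER_RECORD = 500
CUSTOMERS = 500_000
COPIES = 6  # policy admin, billing, claims, underwriting, marketing, agent portal

total_bytes = BYTES_PER_RECORD * CUSTOMERS * COPIES
# Everything beyond one canonical copy is pure redundancy.
redundant_bytes = BYTES_PER_RECORD * CUSTOMERS * (COPIES - 1)

print(f"total stored:    {total_bytes / 1e9:.2f} GB")      # 1.50 GB
print(f"pure redundancy: {redundant_bytes / 1e9:.2f} GB")  # 1.25 GB
```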
The True Cost of Redundancy:
Storage was merely the visible symptom. The real costs were operational:
Data inconsistency is the inevitable consequence of data redundancy. When the same information exists in multiple places, those copies will eventually contain different values. This isn't a matter of 'if'—it's 'when' and 'how badly'.
Inconsistency Patterns:
Temporal Inconsistency occurs when updates don't propagate to all copies at the same time.
Scenario: A customer calls at 2:00 PM to change their address. The customer service representative updates the billing system immediately. But:
Result: For hours, days, or weeks, different systems show different addresses for the same customer.
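The scenario above can be modeled as each application holding its own copy of the record, so an address change only "takes" where someone remembers to apply it. This is a hypothetical sketch; the system names and data are invented:

```python
# Each application keeps its own copy of the customer record.
systems = {
    "billing": {"C123": {"address": "12 Old Rd"}},
    "policy":  {"C123": {"address": "12 Old Rd"}},
    "claims":  {"C123": {"address": "12 Old Rd"}},
}

def update_address(system_name, cust_id, new_address):
    # Updates exactly one copy; nothing propagates to the others.
    systems[system_name][cust_id]["address"] = new_address

# The representative updates only the billing copy at 2:00 PM.
update_address("billing", "C123", "99 New Ave")

addresses = {name: copy["C123"]["address"] for name, copy in systems.items()}
print(addresses)  # billing now disagrees with policy and claims
```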
Data isolation refers to the problem of data being trapped within the boundaries of individual applications or departments, inaccessible to other parts of the organization that need it. In file-based systems, each application was a silo, and breaking down those silos was technically and organizationally difficult.
Why Data Becomes Isolated:
The Business Consequence: Unanswerable Questions
Data isolation meant that many basic business questions—questions that seem trivial today—were nearly impossible to answer:
The Ad-Hoc Integration Problem:
When cross-system queries were truly needed, organizations resorted to ad-hoc integration projects:
Studies in the 1970s found that organizations spent 60-70% of their programming resources on integration and data access tasks rather than building new functionality. Every cross-system report was a custom project. 'Can you give me a report?' meant 'Can you fund a 3-week development effort?'
Data integrity refers to the accuracy, consistency, and validity of data according to business rules. In file-based systems, integrity enforcement was fragmented, incomplete, and unreliable.
Types of Integrity Constraints:
| Constraint Type | Example | File System Support |
|---|---|---|
| Domain Constraint | Age must be between 0 and 150 | Application code only; no central enforcement |
| Entity Integrity | Every record must have a unique identifier | Not enforced; duplicates can be inserted |
| Referential Integrity | OrderID must refer to existing Customer | Not enforced; orphan records common |
| Business Rules | Discount cannot exceed 50% | Scattered across multiple programs |
| Format Constraints | Phone numbers must match pattern | Each application validates differently |
The Scattered Validation Problem:
In file-based systems, validation logic was duplicated across every application that accessed the data:
```cobol
      * Every program that writes to CUSTOMER file must include:
       VALIDATE-CUSTOMER-DATA.
           IF CUSTOMER-NAME = SPACES
               MOVE "ERROR: NAME REQUIRED" TO ERROR-MSG
               PERFORM ERROR-ROUTINE.
           IF CUSTOMER-ZIP NOT NUMERIC
               MOVE "ERROR: ZIP MUST BE NUMERIC" TO ERROR-MSG
               PERFORM ERROR-ROUTINE.
           IF CUSTOMER-STATE NOT IN VALID-STATES-TABLE
               MOVE "ERROR: INVALID STATE CODE" TO ERROR-MSG
               PERFORM ERROR-ROUTINE.
           IF CUSTOMER-BALANCE < 0
               MOVE "ERROR: BALANCE CANNOT BE NEGATIVE" TO ERROR-MSG
               PERFORM ERROR-ROUTINE.
      * This same logic appears in:
      *   - Customer entry program
      *   - Customer update program
      *   - Order entry program (for new customers)
      *   - Batch conversion program
      *   - Data fix utility
      *   - Each with slight variations and bugs...
```

When validation is scattered across programs, it's never consistent. One program allows 5-digit and 9-digit ZIP codes; another only allows 5. One accepts 'NY', 'N.Y.', and 'New York'; another only accepts 'NY'. A utility program bypasses validation entirely 'for performance'. The result: data that passes some programs' checks but fails others.
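The ZIP-code drift described above can be made concrete with a small modern sketch (Python for brevity; the function names and the specific rules are hypothetical):

```python
import re

# Two programs that each carry their own copy of the "validate customer"
# logic, drifted apart over time.
def entry_program_valid(record):
    # This copy accepts 5-digit ZIP codes only.
    return bool(re.fullmatch(r"\d{5}", record["zip"]))

def batch_program_valid(record):
    # A later copy: accepts 5-digit or 9-digit (ZIP+4) codes.
    return bool(re.fullmatch(r"\d{5}(-\d{4})?", record["zip"]))

record = {"name": "Acme Corp", "zip": "10001-4321"}
print(entry_program_valid(record))  # False -- rejected by one program
print(batch_program_valid(record))  # True  -- accepted by another
```

The same record passes one program's check and fails the other's, which is exactly how inconsistent data accumulates.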
The Referential Integrity Problem:
Perhaps the most severe integrity issue in file-based systems was the inability to maintain referential integrity—ensuring that references between related data remain valid.
Example: Order and Customer Files
```
CUSTOMER FILE:
CustID | Name      | Address
-------+-----------+-----------------
C001   | Acme Corp | 123 Main St, NY
C002   | Beta Inc  | 456 Oak Ave, CA
C003   | Gamma LLC | 789 Elm St, TX

ORDER FILE:
OrderID | CustID | OrderDate  | Amount
--------+--------+------------+---------
O1001   | C001   | 2024-01-15 | 1500.00
O1002   | C002   | 2024-01-16 | 2300.00
O1003   | C001   | 2024-01-17 |  890.00
O1004   | C004   | 2024-01-18 | 1200.00   <-- ORPHAN! C004 doesn't exist
O1005   | C002   | 2024-01-19 | 3100.00
```

What happens if Gamma LLC (C003) is deleted? What happens when we try to find the customer for O1004? File systems provide no automatic protection against either.

Consequences of Integrity Failures:
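Because the file system never enforces these references, every application had to hunt for orphans itself. A minimal sketch of such a check, using the example data above:

```python
# Data mirrors the CUSTOMER and ORDER file example.
customers = {"C001": "Acme Corp", "C002": "Beta Inc", "C003": "Gamma LLC"}
orders = [
    ("O1001", "C001"), ("O1002", "C002"), ("O1003", "C001"),
    ("O1004", "C004"),  # refers to a customer that doesn't exist
    ("O1005", "C002"),
]

# The orphan check the file system never performs itself.
orphans = [order_id for order_id, cust_id in orders if cust_id not in customers]
print(orphans)  # ['O1004']
```

A DBMS performs this check on every insert and delete; in a file-based system it runs only when someone remembers to run it.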
File-based systems offered only primitive security mechanisms, typically limited to what the operating system provided at the file level. This created significant vulnerabilities and compliance challenges.
Operating System File Permissions:
Typical file-level security provided:
These permissions applied to the entire file. You could not:
The All-or-Nothing Problem:
Consider an HR file containing:
Security Requirements:
These requirements are impossible with file-level security. Solutions involved maintaining multiple copies of data with different fields, complex application-level security code, or simply giving up and granting broad access. Each approach introduced its own problems: redundancy, inconsistent enforcement, or security violations.
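The "complex application-level security code" option looked something like the following sketch. The field names and roles are hypothetical; the point is that every program touching the file had to reimplement this filtering, because the file permissions themselves were all-or-nothing:

```python
# Hypothetical HR record; file permissions grant or deny the whole thing.
HR_RECORD = {"name": "J. Smith", "dept": "Claims",
             "salary": 72_000, "ssn": "xxx-xx-1234"}

# Field-level visibility each application had to enforce on its own.
VISIBLE_FIELDS = {
    "payroll":   {"name", "dept", "salary", "ssn"},
    "manager":   {"name", "dept", "salary"},
    "directory": {"name", "dept"},
}

def view(record, role):
    # Project only the fields this role may see.
    allowed = VISIBLE_FIELDS[role]
    return {k: v for k, v in record.items() if k in allowed}

print(view(HR_RECORD, "directory"))  # {'name': 'J. Smith', 'dept': 'Claims'}
```

One program with a bug, or one utility that reads the file directly, bypasses all of it.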
No Audit Trail:
File systems typically provided no auditing of data access. Organizations couldn't answer:
Without audit trails, security breaches couldn't be detected, investigated, or prevented. Insider threats went unnoticed until external consequences surfaced.
As organizations moved toward interactive processing and multiple users needed to access the same data simultaneously, file-based systems faced a fundamental challenge: they had no built-in mechanisms for coordinating concurrent access.
The Lost Update Problem:
Consider two users updating the same customer record:
What Happened:
Both users read the same initial balance ($1000). User 1 calculated the new balance after a deposit ($1200) and wrote it. User 2, still working with the original $1000, calculated the withdrawal ($700) and wrote it. User 1's deposit was completely lost. The correct final balance should be $900 ($1000 + $200 - $300).
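The interleaving above can be reproduced in a few lines: two uncoordinated read-modify-write sequences against the same record, with both reads happening before either write.

```python
# The shared record, standing in for the account file.
balance_file = {"ACCT1": 1000}

# Both users read before either writes.
user1_copy = balance_file["ACCT1"]   # reads 1000
user2_copy = balance_file["ACCT1"]   # reads 1000

balance_file["ACCT1"] = user1_copy + 200   # User 1 deposits $200 -> writes 1200
balance_file["ACCT1"] = user2_copy - 300   # User 2 withdraws $300 -> writes 700

print(balance_file["ACCT1"])  # 700 -- the $200 deposit is silently lost
# With coordination the balance would be 1000 + 200 - 300 = 900.
```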
Other Concurrency Problems:
Workaround Attempts:
Organizations tried various approaches to manage concurrency:
| Approach | Mechanism | Problems |
|---|---|---|
| Exclusive File Locking | Lock entire file during update | Only one user can access at a time; severe bottleneck |
| Record Reservation | Application marks records as 'in use' | Requires custom code; orphaned locks when programs crash |
| Batch-Only Updates | No interactive updates; collect changes for batch | Defeats purpose of interactive systems |
| Optimistic Checking | Verify record unchanged before write | Still has race condition window; complex to implement |
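The optimistic-checking row deserves a closer look. A sketch of the idea, using a version number as the "unchanged?" test (the record layout is hypothetical): the stale writer is rejected, but note the gap between the check and the write, which is the race-condition window the table mentions.

```python
record = {"balance": 1000, "version": 1}

def optimistic_write(rec, expected_version, new_balance):
    if rec["version"] != expected_version:
        return False  # someone else wrote first; caller must retry
    # <-- race window: another write can still land right here
    rec["balance"] = new_balance
    rec["version"] += 1
    return True

v = record["version"]
first = optimistic_write(record, v, 1200)   # True: first writer wins
second = optimistic_write(record, v, 700)   # False: stale writer rejected
print(record)  # {'balance': 1200, 'version': 2}
```

Without an atomic compare-and-write primitive from the storage layer, this check-then-write pair is itself two steps, so the approach shrinks the window but never closes it.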
As organizations moved from batch to interactive processing in the 1970s and 1980s, concurrency became critical. A bank teller couldn't wait for other tellers to finish; an airline reservation couldn't lock out all other agents. File-based systems simply weren't designed for this world.
Atomicity is the property that a group of operations either all succeed together or all fail together—there's no partial completion. File systems provided no atomicity guarantees, making recovery from failures extremely difficult.
The Partial Update Problem:
Consider a funds transfer that must update two files:
```
FUNDS-TRANSFER PROGRAM:
  1. Read source account from ACCOUNTS file
  2. Verify sufficient balance
  3. Subtract transfer amount from source balance
  4. Write updated source account            <-- SUCCESS
  5. Read destination account from ACCOUNTS file

     *** SYSTEM CRASH OCCURS HERE ***

  6. Add transfer amount to destination balance
  7. Write updated destination account       <-- NEVER EXECUTED

RESULT AFTER CRASH:
  - Source account:      $500 DEDUCTED
  - Destination account: $0 CREDITED
  - Money has DISAPPEARED from the system
```

When the system restarts, there's no way to automatically detect what was in progress or undo the partial changes. Manual investigation, reconciliation, and correction were required, often discovered days later when accounts wouldn't balance.
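The failure mode is easy to simulate: model each write as a separate step and inject a crash between them. Nothing in the sketch (or in a real file system) rolls the first write back.

```python
# Two account balances standing in for the ACCOUNTS file.
accounts = {"source": 500, "destination": 0}

def transfer(amount, crash_midway=False):
    accounts["source"] -= amount            # write #1 reaches the file
    if crash_midway:
        raise RuntimeError("system crash")  # simulated crash between writes
    accounts["destination"] += amount       # write #2 never executes

try:
    transfer(500, crash_midway=True)
except RuntimeError:
    pass  # the "restart": no log, no undo, no record that a transfer was mid-flight

print(accounts)  # {'source': 0, 'destination': 0} -- $500 has vanished
```

A DBMS solves this with a write-ahead log and transaction boundaries; here, only a later reconciliation can even detect the loss.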
Recovery Challenges:
File systems provided no systematic recovery mechanisms for interrupted operations:
The Backup Problem:
Recovery typically relied on restoring from backup and reprocessing transactions. But:
We've now explored the comprehensive set of limitations that made file-based data management increasingly untenable as organizations grew and their data needs became more sophisticated. Let's consolidate these insights:
What's Next:
Now that we understand the comprehensive limitations of file-based systems, we'll examine two of the most critical problems in greater depth: data redundancy and inconsistency (the subject of our next page) and data isolation (the following page). These deep dives will complete our understanding of why Database Management Systems became essential.
You now have a comprehensive taxonomy of file system limitations and understand why each is inherent to the file-based architecture rather than an implementation flaw. This understanding is essential for appreciating the design goals and features of Database Management Systems.