What if we removed the acyclic constraint entirely? What if directories could point to their ancestors, creating true cycles? This isn't merely a theoretical question—some systems have explored general graph directories, and understanding why they remain rare illuminates fundamental trade-offs in file system design.
A general graph directory allows arbitrary connections between nodes, including cycles. This provides maximum flexibility but introduces significant complexity: deletion becomes nuanced, traversal requires cycle detection, and the entire system must implement garbage collection rather than simple reference counting.
In this concluding page of the module, we explore the general graph directory model: its theoretical expressiveness, the garbage collection problem, historical systems that attempted it, reasons for the dominance of acyclic designs, and how these concepts influence modern distributed and version control systems.
By completing this page, you will understand why general graphs are more expressive than DAGs, analyze why reference counting fails with cycles, master garbage collection approaches for graph directories, examine historical and modern systems using graph concepts, and synthesize understanding of the directory structure evolution.
A general graph directory removes the acyclic constraint, allowing arbitrary directed edges between nodes. This creates a more general structure than the DAG (Directed Acyclic Graph) of standard file systems.
Formal Definition:
G = (V, E)
where:
V = nodes (files and directories)
E = directed edges (links between nodes)
Cycles ALLOWED: ∃ paths v₁ → v₂ → ... → vₙ → v₁
Comparison of Directory Models:
| Model | Edges | Multiple Parents | Cycles | Reachability |
|---|---|---|---|---|
| Tree | n-1 | No | No | Unique path to each node |
| DAG (Acyclic) | ≥n-1 | Yes | No | Multiple paths, finite |
| General Graph | Any | Yes | Yes | May have infinite paths |
What Cycles Enable:
Example Cyclic Structure:
/shared/
├── project_a/
│ ├── lib → /shared/common/
│ └── code/
├── project_b/
│ ├── lib → /shared/common/
│ └── code/
└── common/
├── utils/
└── projects → /shared/ ← CYCLE!
(common links back to shared)
Observations from the Diagram:
Multiple Paths: utils/ is reachable via /shared/common/utils AND /shared/project_a/lib/utils (since lib links to common)
Cycles Present: /shared → common → projects → /shared forms a cycle.
Infinite Paths: The path /shared/common/projects/common/projects/... continues forever!
Shared Access: common/ is reachable from project_a, project_b, and /shared itself.
Theoretical Expressiveness:
General graphs can model relationships that DAGs cannot, such as mutual references between directories and structures that effectively contain their own ancestors.
While acyclic structures dominate production file systems, general graph concepts appear in version control (Git's commit history is a DAG in which merge commits have multiple parents), distributed file systems, and object storage systems. Understanding graph directories prepares you for these advanced systems.
Acyclic directories use reference counting for deletion: each inode tracks how many directory entries point to it. When the count reaches zero, the file is deleted. This elegant approach fails completely with cycles.
Why Reference Counting Fails with Cycles:
Consider a simple cycle:
dir_a (link_count: 2) ← referenced by root and dir_b
└── link_to_b → dir_b
dir_b (link_count: 1) ← referenced by dir_a
└── link_to_a → dir_a
Now, remove dir_a from the root directory:
Before: root → dir_a
After: root ⊗ (dir_a removed from root)
dir_a.link_count: 2 → 1 (still > 0!)
dir_b.link_count: 1 (unchanged)
Both directories still exist because they reference each other!
But they're unreachable from root!
This is the cyclic garbage problem—nodes with non-zero reference counts that can never be reached from the root.
| Scenario | DAG Behavior | Cyclic Graph Behavior |
|---|---|---|
| Delete last link to file | link_count=0, file deleted | link_count=0, file deleted |
| Delete link, others exist | link_count--, file remains | link_count--, file remains |
| Cycle with no external refs | Impossible (no cycles) | All nodes have count>0, LEAKED! |
| Storage reclamation | Automatic via count=0 | Requires garbage collection |
```c
/*
 * Demonstration of Reference Counting Failure with Cycles
 * Shows why garbage collection is needed for general graph directories.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdbool.h>

#define MAX_LINKS 8
#define MAX_NAME 32

/* Simulated file/directory node */
typedef struct Node {
    char name[MAX_NAME];
    int link_count;                 /* Reference count */
    struct Node *links[MAX_LINKS];  /* Outgoing links (children) */
    int num_links;
    bool marked;                    /* For garbage collection */
} Node;

/* All nodes in the system */
#define MAX_NODES 100
Node *all_nodes[MAX_NODES];
int node_count = 0;

/* Root of file system */
Node *root = NULL;

/* Create a new node */
Node* create_node(const char *name) {
    Node *n = malloc(sizeof(Node));
    strncpy(n->name, name, MAX_NAME - 1);
    n->name[MAX_NAME - 1] = '\0';   /* strncpy does not guarantee termination */
    n->link_count = 0;
    n->num_links = 0;
    n->marked = false;
    all_nodes[node_count++] = n;
    return n;
}

/* Add a link from parent to child */
void add_link(Node *parent, Node *child) {
    if (parent->num_links >= MAX_LINKS) return;
    parent->links[parent->num_links++] = child;
    child->link_count++;            /* Increment reference count */
    printf("Added link: %s → %s (link_count now %d)\n",
           parent->name, child->name, child->link_count);
}

/* Remove a link from parent to child */
void remove_link(Node *parent, Node *child) {
    for (int i = 0; i < parent->num_links; i++) {
        if (parent->links[i] == child) {
            /* Shift remaining links */
            for (int j = i; j < parent->num_links - 1; j++) {
                parent->links[j] = parent->links[j + 1];
            }
            parent->num_links--;
            child->link_count--;
            printf("Removed link: %s → %s (link_count now %d)\n",
                   parent->name, child->name, child->link_count);
            /* Traditional ref counting: delete if count = 0 */
            if (child->link_count == 0) {
                printf("  → %s would be deleted (ref count = 0)\n",
                       child->name);
            }
            return;
        }
    }
}

/* Print current state */
void print_state(void) {
    printf("--- Current State ---\n");
    for (int i = 0; i < node_count; i++) {
        printf("  %s: link_count=%d, links_out=%d\n",
               all_nodes[i]->name,
               all_nodes[i]->link_count,
               all_nodes[i]->num_links);
    }
    printf("---------------------\n");
}

/* Demonstrate the cyclic garbage problem */
void demonstrate_cyclic_garbage(void) {
    printf("=== Cyclic Garbage Demonstration ===\n");

    /* Create structure */
    root = create_node("root");
    Node *dir_a = create_node("dir_a");
    Node *dir_b = create_node("dir_b");

    /* Build initial tree */
    printf("\nStep 1: Build structure\n");
    add_link(root, dir_a);   /* root → dir_a */
    add_link(root, dir_b);   /* root → dir_b */
    add_link(dir_a, dir_b);  /* dir_a → dir_b (sharing) */
    add_link(dir_b, dir_a);  /* dir_b → dir_a (CYCLE!) */
    print_state();
    printf("Note: dir_a and dir_b form a cycle.\n");

    /* Remove from root */
    printf("\nStep 2: Remove dir_a from root\n");
    remove_link(root, dir_a);
    print_state();
    printf("dir_a still has link_count=1 (from dir_b)\n");
    printf("dir_a is NOT deleted; dir_b still references it\n");

    printf("\nStep 3: Remove dir_b from root\n");
    remove_link(root, dir_b);
    print_state();
    printf("PROBLEM: Both nodes have link_count > 0\n");
    printf("         But neither is reachable from root!\n");
    printf("         This is CYCLIC GARBAGE.\n");
    printf("         Reference counting CANNOT reclaim this memory.\n");
}

/* Mark phase: mark all reachable nodes from root */
void mark_reachable(Node *node) {
    if (node == NULL || node->marked) return;
    node->marked = true;
    for (int i = 0; i < node->num_links; i++) {
        mark_reachable(node->links[i]);
    }
}

/* Sweep phase: report all unmarked nodes as garbage */
void sweep_unreachable(void) {
    printf("\nGarbage Collection Results:\n");
    for (int i = 0; i < node_count; i++) {
        if (!all_nodes[i]->marked) {
            printf("  GARBAGE: %s (unreachable, will be deleted)\n",
                   all_nodes[i]->name);
        } else {
            printf("  LIVE: %s (reachable from root)\n",
                   all_nodes[i]->name);
        }
    }
}

/* Demonstrate the garbage collection solution */
void demonstrate_garbage_collection(void) {
    printf("\n=== Garbage Collection Solution ===\n");
    printf("Running Mark-and-Sweep from root:\n");

    /* Reset marks */
    for (int i = 0; i < node_count; i++) {
        all_nodes[i]->marked = false;
    }

    /* Mark phase */
    printf("Mark phase: traversing from root...\n");
    mark_reachable(root);

    /* Sweep phase */
    sweep_unreachable();
    printf("\nGarbage collection correctly identified the cycle!\n");
}

int main(void) {
    demonstrate_cyclic_garbage();
    demonstrate_garbage_collection();
    return 0;
}
```

Cyclic garbage in a file system means disk space is permanently lost—files exist on disk, consuming space, but can never be accessed or freed through normal operations. This is a catastrophic failure mode, which is why production file systems enforce acyclicity.
To handle cycles, general graph directories require garbage collection (GC)—algorithms that determine which nodes are reachable and reclaim unreachable ones.
1. Mark-and-Sweep
The classic garbage collection algorithm:
Mark Phase:
Start from root
Recursively mark all reachable nodes
Sweep Phase:
Scan all nodes
Delete unmarked nodes (garbage)
Reset marks for next collection
Complexity: O(|V| + |E|) - visits all nodes and edges
Challenges for File Systems:
2. Tri-Color Marking (Incremental GC)
For systems that can't stop for full GC:
Colors:
White: unvisited (potentially garbage)
Gray: visited but children not processed
Black: visited and all children processed
Algorithm:
1. All nodes start white
2. Root becomes gray
3. Pick gray node, process children (gray them), make current black
4. Repeat until no gray nodes
5. White nodes are garbage
Can be interrupted and resumed!
3. Backup Pointer Chain (for specific cycle patterns)
If cycles have predictable structure:
For each cycle-capable link, maintain a "back-pointer"
When checking deletion, follow back-pointers to detect cycles
Cycles can be explicitly broken when detected
Limitation: Only works for specific, anticipated cycle patterns
| Approach | Stop-the-World? | Complexity | Space Overhead | File System Suitability |
|---|---|---|---|---|
| Mark-and-Sweep | Yes | O(V+E) | O(V) for marks | Poor (long pauses) |
| Tri-Color | No (incremental) | O(V+E) amortized | O(V) for colors | Better (no pauses) |
| Reference Counting + Cycle Detection | Periodic | O(V+E) for detection | Minimal | Common hybrid |
| Epoch-based | No | O(modified) | Per-epoch tracking | Good for versioning |
```c
/*
 * Mark-and-Sweep Garbage Collection for Graph Directories
 * Demonstrates how GC would work in a general graph file system.
 */
#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>
#include <string.h>

#define MAX_CHILDREN 10
#define MAX_NODES 1000

/* Node states for tri-color marking */
typedef enum {
    WHITE,  /* Unvisited - potentially garbage */
    GRAY,   /* Visited but children not processed */
    BLACK   /* Fully processed - definitely reachable */
} NodeColor;

typedef struct FSNode {
    int id;
    char name[32];
    struct FSNode *children[MAX_CHILDREN];
    int child_count;
    NodeColor color;
    bool deleted;
} FSNode;

/* Global node table (simulating disk) */
FSNode *node_table[MAX_NODES];
int next_id = 0;

/* Root of file system */
FSNode *fs_root = NULL;

/* Create a node */
FSNode* create_fs_node(const char *name) {
    FSNode *node = malloc(sizeof(FSNode));
    node->id = next_id;
    strncpy(node->name, name, 31);
    node->name[31] = '\0';          /* ensure termination */
    node->child_count = 0;
    node->color = WHITE;
    node->deleted = false;
    node_table[next_id++] = node;
    return node;
}

/* Link nodes */
void link_nodes(FSNode *parent, FSNode *child) {
    if (parent->child_count < MAX_CHILDREN) {
        parent->children[parent->child_count++] = child;
    }
}

/* Phase 1: Reset all nodes to white */
void gc_reset(void) {
    for (int i = 0; i < next_id; i++) {
        if (node_table[i] && !node_table[i]->deleted) {
            node_table[i]->color = WHITE;
        }
    }
}

/*
 * Phase 2: Mark - traverse from root, coloring reachable nodes.
 * Uses an iterative approach with an explicit gray set.
 */
void gc_mark(void) {
    FSNode *gray_set[MAX_NODES];
    int gray_count = 0;

    if (!fs_root || fs_root->deleted) return;

    /* Start with root */
    fs_root->color = GRAY;
    gray_set[gray_count++] = fs_root;

    while (gray_count > 0) {
        /* Pop a gray node */
        FSNode *current = gray_set[--gray_count];
        printf("  Visiting: %s (id=%d)\n", current->name, current->id);

        /* Process children */
        for (int i = 0; i < current->child_count; i++) {
            FSNode *child = current->children[i];
            if (child && !child->deleted && child->color == WHITE) {
                child->color = GRAY;
                gray_set[gray_count++] = child;
            }
        }

        /* Current is now fully processed */
        current->color = BLACK;
    }
}

/* Phase 3: Sweep - reclaim white (unreachable) nodes */
int gc_sweep(void) {
    int collected = 0;
    for (int i = 0; i < next_id; i++) {
        FSNode *node = node_table[i];
        if (node && !node->deleted && node->color == WHITE) {
            printf("  Collecting: %s (id=%d) - GARBAGE\n",
                   node->name, node->id);
            node->deleted = true;   /* Mark as deleted */
            collected++;
        }
    }
    return collected;
}

/* Full garbage collection cycle */
int garbage_collect(void) {
    printf("===== Running Garbage Collection =====\n");
    printf("Phase 1: Reset colors\n");
    gc_reset();
    printf("Phase 2: Mark reachable nodes\n");
    gc_mark();
    printf("Phase 3: Sweep unreachable nodes\n");
    int collected = gc_sweep();
    printf("===== GC Complete: %d nodes collected =====\n\n", collected);
    return collected;
}

/* Print file system state */
void print_filesystem(void) {
    printf("File System State:\n");
    printf("------------------\n");
    for (int i = 0; i < next_id; i++) {
        FSNode *n = node_table[i];
        if (n && !n->deleted) {
            printf("  [%d] %s: %d children\n", n->id, n->name, n->child_count);
        }
    }
    printf("\n");
}

/* Demonstrate GC with cycles */
void demonstrate_gc(void) {
    printf("=== General Graph Directory with Garbage Collection ===\n\n");

    /* Create file system structure */
    fs_root = create_fs_node("root");
    FSNode *home = create_fs_node("home");
    FSNode *alice = create_fs_node("alice");
    FSNode *bob = create_fs_node("bob");
    FSNode *shared = create_fs_node("shared");  /* Will become garbage */
    FSNode *cycle1 = create_fs_node("cycle1");  /* Part of cycle */
    FSNode *cycle2 = create_fs_node("cycle2");  /* Part of cycle */

    /* Build tree portion */
    link_nodes(fs_root, home);
    link_nodes(home, alice);
    link_nodes(home, bob);

    /* Build shared structure (reachable) */
    link_nodes(alice, shared);
    link_nodes(bob, shared);

    /* Build cycle not connected to root */
    link_nodes(cycle1, cycle2);
    link_nodes(cycle2, cycle1);  /* Creates cycle! */

    printf("Initial structure:\n");
    printf("  root → home → alice → shared\n");
    printf("              ↘ bob → shared\n");
    printf("\n");
    printf("  cycle1 ⟷ cycle2 (disconnected cycle - GARBAGE!)\n\n");
    print_filesystem();

    /* First GC - should collect the cycle */
    printf("Running GC to find the disconnected cycle...\n");
    garbage_collect();
    print_filesystem();

    /* Now simulate removing shared from the tree */
    printf("Simulating removal: alice no longer links to shared\n");
    printf("                    bob no longer links to shared\n");
    alice->child_count = 0;  /* Remove alice → shared */
    bob->child_count = 0;    /* Remove bob → shared */

    printf("Now 'shared' is isolated (no incoming links)...\n");
    garbage_collect();
    print_filesystem();
}

int main(void) {
    demonstrate_gc();
    return 0;
}
```

While traditional file systems avoid GC, modern storage systems like log-structured file systems, SSDs, and distributed storage all use garbage collection for space reclamation. The principles learned here apply directly to these systems.
While strict general graph directories are rare in production file systems, the concepts appear in various historical and modern systems.
Historical Systems:
1. Multics (1965-2000)
Multics allowed symbolic links that could create apparent cycles. It handled this by treating links as stored pathnames resolved at lookup time rather than as structural graph edges, so the underlying storage hierarchy remained a tree.
2. TENEX/TOPS-20 (1969-1990s)
PDP-10 systems allowed more flexible linking. The OS tracked links and performed periodic cleanup operations.
3. Research Systems
Academic research also explored general graph file systems, typically pairing them with garbage collection, but the prototypes remained experimental and never displaced acyclic designs in production.
Modern Systems with Graph Characteristics:
| System | Graph Feature | Cycle Handling | Use Case |
|---|---|---|---|
| Git | Commit DAG (merges) | DAG enforced (acyclic) | Version control |
| ZFS Snapshots | Dataset relationships | Controlled hierarchy | Copy-on-write storage |
| Btrfs Subvolumes | Multiple mount points | Namespace isolation | Flexible storage |
| Docker Layers | Layer inheritance | DAG of layers | Container images |
| IPFS | Content-addressed DAG | Hash-based (acyclic) | Distributed storage |
| Google File System | Chunk replication | Metadata separation | Distributed storage |
Git as a DAG Example:
Git's commit history is a DAG, not a tree:
A ← B ← D
 ↖     ↙
   C
Commit D has two parents (B and C) from a merge. This is a multi-parent relationship (DAG), but there are no cycles—you can never arrive at a commit by following its parent chain.
Git uses garbage collection to clean up unreachable commits:
$ git gc # Runs mark-and-sweep style collection
Object Storage Systems:
Amazon S3, Google Cloud Storage, and similar systems use flat namespaces (essentially single-level with prefixes) because flat key spaces shard cleanly across many servers, avoid cross-directory consistency operations such as atomic renames, and reduce every lookup to a single key resolution.
This is a return to simplicity—choosing flat structure for distributed scale.
Modern cloud storage often resembles single-level directories! S3 buckets contain objects with key names like 'folder/subfolder/file.txt'—but those '/' separators are just characters in a flat namespace. We've gone from single-level → two-level → tree → DAG → back to flat for massive scale.
Given the expressiveness of general graphs, why do acyclic structures dominate production file systems? The answer involves engineering trade-offs, user expectations, and fundamental complexity management.
1. Simplicity of Deletion
DAG/Tree: unlink() → decrement count → maybe delete
Graph: unlink() → decrement count → ???
(can't delete until GC runs)
(GC is expensive and complex)
Reference counting is vastly simpler and more predictable than garbage collection.
2. Predictable Traversal
find /home -name '*.txt'
DAG: Visits each file once (track visited inodes)
Graph: Must detect cycles, may visit nodes multiple times on different paths
Every file system utility would need cycle detection, complicating implementation and slowing operation.
3. User Mental Model
Users understand hierarchies intuitively: folders contain things, and going 'up' always moves toward the root.
Cycles break this mental model: following 'down' links can return you to where you started, and a directory can effectively contain one of its own ancestors.
4. Implementation Complexity
| Operation | Tree/DAG Complexity | General Graph Complexity |
|---|---|---|
| Create file | O(path length) | O(path length) |
| Delete file | O(1) after path resolution | O(1) + periodic O(V+E) GC |
| List directory | O(entries in dir) | O(entries in dir) |
| Recursive delete | O(subtree size) | O(subtree) + cycle detection |
| Free space calculation | Simple: sum free blocks | Complex: GC may reclaim more |
| Backup | Tree walk | Complex: shared node handling |
| fsck (repair) | Tree consistency check | Full graph analysis required |
5. Recovery and Reliability
File system recovery after crashes depends on consistent structures:
Tree: If parent pointer is consistent, subtree is recoverable
Graph: Broken links might be cycles or legitimate sharing
Recovery tool can't easily distinguish
The fsck tool assumes tree/DAG structure. Graph structures would require fundamentally different recovery approaches.
6. Quotas and Accounting
With cycles, accounting questions become complex: who is charged for the space in a cycle shared between users, and how do you total a user's usage when their directories can contain each other?
The Engineering Verdict:
The marginal expressiveness gained from allowing cycles doesn't justify the complexity cost. DAGs provide enough flexibility (hard links, symlinks) for real use cases while maintaining manageable complexity.
This is a recurring pattern in systems design: simpler abstractions with slightly less expressiveness often win over theoretically more powerful but complex alternatives. Trees and DAGs hit a sweet spot for file system organization.
We've completed our journey through directory structure evolution—from the simplest single-level directories to the theoretical extreme of general graphs. Let's synthesize the key insights from this entire module.
The Evolution Pattern:
Single-Level → Two-Level → Tree → DAG

Single-Level: all files jumbled together
Two-Level: user isolation solves inter-user naming
Tree: arbitrary hierarchy solves intra-user organization
DAG: sharing via links
Each step solved specific limitations of the previous model while adding just enough complexity to be worthwhile.
The General Graph Lesson:
DAG → General Graph
│
Adds: True cycles possible
Costs: GC required, complexity explodes
Benefit: Marginal additional expressiveness
Verdict: Not worth it for file systems
This teaches a broader engineering lesson: more expressive isn't always better. The 'best' design is often one that provides just enough capability for real needs while minimizing complexity.
You now have a comprehensive understanding of directory structure design—from historical evolution to modern implementations to theoretical limits. This knowledge applies whenever you work with file systems, design storage hierarchies, or architect systems requiring hierarchical organization. You can now appreciate why your file manager looks the way it does and why 'mkdir -p' works while 'cyclic directory links' don't exist in your daily computing.