Consider an Employee table with 50 attributes: personal information (name, address, phone), employment details (hire date, department, salary), performance metrics (review scores, promotion history), and rarely-accessed archival data (original application, interview notes, previous employer references).
A typical HR dashboard query fetches only 5-7 attributes for display. An analytics job processes salary and performance data exclusively. The archival data is accessed perhaps once per year during audits. If all 50 attributes are stored together, every query reads vastly more data than needed.
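The read amplification described above is easy to quantify. The following back-of-envelope sketch uses illustrative, assumed attribute widths and row counts (none of these numbers come from the text):

```python
# Back-of-envelope read amplification for the 50-attribute Employee table.
# Attribute sizes and row counts below are illustrative assumptions.
TOTAL_ATTRS = 50        # all attributes stored together
DASHBOARD_ATTRS = 7     # attributes the dashboard actually needs
AVG_ATTR_BYTES = 40     # assumed average attribute width
ROWS = 100_000

full_row_bytes = TOTAL_ATTRS * AVG_ATTR_BYTES
needed_bytes = DASHBOARD_ATTRS * AVG_ATTR_BYTES

# A scan over the unfragmented table reads every full row,
# even though only 7 of the 50 attributes are used.
scan_read = ROWS * full_row_bytes       # 200,000,000 bytes
useful_read = ROWS * needed_bytes       # 28,000,000 bytes

print(f"Bytes scanned : {scan_read:,}")
print(f"Bytes needed  : {useful_read:,}")
print(f"Amplification : {scan_read / useful_read:.1f}x")  # ~7.1x
```

Under these assumptions the dashboard workload reads roughly 7x more data than it uses, which is the gap vertical fragmentation closes.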
Vertical fragmentation addresses this by partitioning a relation by columns rather than rows. Each fragment contains all rows but only a subset of attributes. This technique optimizes for access patterns where different applications or queries consistently use different attribute groups.
This page explores vertical fragmentation in rigorous depth—from the theoretical foundations of attribute affinity analysis through practical algorithms for optimal attribute grouping, addressing the unique challenges of maintaining tuple identity and reconstruction guarantees across column-oriented fragments.
By the end of this page, you will understand: (1) The formal definition and correctness properties of vertical fragmentation, (2) Attribute affinity analysis and the use matrix methodology, (3) Clustering algorithms for grouping attributes into fragments, (4) The critical role of the tuple identifier in maintaining reconstruction, (5) Trade-offs between vertical fragmentation and column-store architectures, and (6) When to choose vertical over horizontal fragmentation.
Vertical fragmentation decomposes a relation's schema rather than its population. Understanding the formal properties is essential for correct design.
Definition:
Given a relation R with attributes A = {a₁, a₂, ..., aₘ} and primary key K, a vertical fragmentation of R produces fragments R₁, R₂, ..., Rₙ such that:
1. Completeness: every attribute of A appears in at least one fragment, i.e., A = A₁ ∪ A₂ ∪ ... ∪ Aₙ.
2. Reconstruction: the original relation is recoverable by natural join, i.e., R = R₁ ⋈ R₂ ⋈ ... ⋈ Rₙ.
3. Disjointness: each non-key attribute appears in exactly one fragment; only the key K is replicated across all fragments.
Key Insight: Unlike horizontal fragmentation where fragments are disjoint, vertical fragments overlap in the key attributes. This overlap is what enables reconstruction through joins.
Vertical fragmentation fundamentally requires a stable tuple identifier replicated across all fragments. In many databases, the natural primary key suffices. However, for tables with composite keys or mutable primary keys, a system-generated surrogate key (tuple-id) may be necessary. This tuple-id becomes the 'glue' binding columns from different fragments back to their original rows.
Attribute Projections:
Each vertical fragment Rᵢ is defined as a projection:
Rᵢ = π(K ∪ Aᵢ)(R)
Where Aᵢ is the set of non-key attributes assigned to fragment i, and π denotes the projection operation.
Example: Employee Table Fragmentation
Original relation:
Employee(emp_id, name, address, phone, department, salary, hire_date, review_score, notes)
Vertical fragments:
R₁ (personal): emp_id, name, address, phone
R₂ (employment): emp_id, department, salary, hire_date
R₃ (performance): emp_id, review_score, notes
Reconstruction:
Employee = R₁ ⋈ R₂ ⋈ R₃ (natural join on emp_id)
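The projection-and-join round trip above can be sketched in a few lines of plain Python. This is a minimal illustration using dicts (the sample row data is invented; the schema and fragment groupings follow the example):

```python
# Vertical fragmentation as projection, reconstruction as natural join.
def project(rows, attrs):
    """Vertical fragment: every row, but only the listed attributes."""
    return [{a: row[a] for a in attrs} for row in rows]

def natural_join(left, right, key):
    """Reconstruct by joining two fragments on the shared key attribute."""
    index = {row[key]: row for row in right}
    return [{**l, **index[l[key]]} for l in left if l[key] in index]

employees = [
    {"emp_id": 1, "name": "Ada", "address": "1 Main St", "phone": "555-0100",
     "department": "Eng", "salary": 120000, "hire_date": "2020-01-15",
     "review_score": 4.5, "notes": "strong"},
]

# Fragments R1, R2, R3 -- the key emp_id is replicated in each
r1 = project(employees, ["emp_id", "name", "address", "phone"])
r2 = project(employees, ["emp_id", "department", "salary", "hire_date"])
r3 = project(employees, ["emp_id", "review_score", "notes"])

# Reconstruction: R1 join R2 join R3 on emp_id
rebuilt = natural_join(natural_join(r1, r2, "emp_id"), r3, "emp_id")
assert rebuilt == employees  # lossless reconstruction
```

The final assertion is the reconstruction property in miniature: because every fragment carries emp_id, the joins recover exactly the original rows.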
| Property | Horizontal Fragmentation | Vertical Fragmentation |
|---|---|---|
| Partition Unit | Rows (tuples) | Columns (attributes) |
| Fragment Overlap | None—fragments are disjoint | Key attributes replicated in all fragments |
| Reconstruction Operator | Union (∪) | Natural Join (⋈) |
| Storage Overhead | None inherent | Key replication in each fragment |
| Query Optimization | Fragment elimination by row predicates | Fragment elimination by attribute access |
| Primary Driver | Data locality, load distribution | Access pattern optimization, column affinity |
Optimal vertical fragmentation requires understanding which attributes are accessed together. Attribute affinity analysis quantifies the co-access patterns to guide grouping decisions.
The Attribute Usage Matrix:
The first step is constructing an attribute usage matrix U in which rows correspond to queries and columns to attributes, with U[i][j] = 1 if query qᵢ accesses attribute aⱼ, and 0 otherwise.
Query Frequency Weighting:
Not all queries are equally important. Weight each query qᵢ by its execution frequency f(qᵢ), measured over a representative workload period.
Attribute Affinity Matrix:
From the usage matrix, derive the attribute affinity matrix AA where:
AA[j][k] = Σᵢ (U[i][j] × U[i][k] × f(qᵢ))
This measures how often attributes aⱼ and aₖ are accessed together, weighted by query frequency. High affinity suggests grouping; low affinity suggests separation.
Affinity Properties:
Symmetry: AA[j][k] = AA[k][j], since co-access is order-independent.
Diagonal: AA[j][j] = Σᵢ U[i][j] × f(qᵢ), the total access frequency of attribute aⱼ.
Scale: absolute values depend on workload size, so compare affinities relatively rather than against fixed thresholds.
```python
import numpy as np
from typing import List, Dict, Tuple

def compute_affinity_matrix(
    attributes: List[str],
    queries: List[Dict],  # [{"name": str, "frequency": int, "attributes": List[str]}]
) -> Tuple[np.ndarray, np.ndarray]:
    """
    Compute the attribute usage and affinity matrices.

    Args:
        attributes: List of attribute names
        queries: List of query specifications with name, frequency,
                 and accessed attributes

    Returns:
        Tuple of (usage_matrix, affinity_matrix)
    """
    n_attrs = len(attributes)
    n_queries = len(queries)
    attr_index = {attr: i for i, attr in enumerate(attributes)}

    # Build usage matrix U where U[q][a] = 1 if query q uses attribute a
    usage_matrix = np.zeros((n_queries, n_attrs), dtype=int)
    frequencies = np.zeros(n_queries)

    for q_idx, query in enumerate(queries):
        frequencies[q_idx] = query["frequency"]
        for attr in query["attributes"]:
            if attr in attr_index:
                usage_matrix[q_idx][attr_index[attr]] = 1

    # Build affinity matrix AA where AA[a1][a2] = sum of frequencies
    # for queries accessing both a1 and a2
    affinity_matrix = np.zeros((n_attrs, n_attrs))
    for q_idx in range(n_queries):
        freq = frequencies[q_idx]
        accessed = np.where(usage_matrix[q_idx] == 1)[0]
        # For each pair of attributes accessed by this query
        for i in accessed:
            for j in accessed:
                affinity_matrix[i][j] += freq

    return usage_matrix, affinity_matrix

# Example: Employee table fragmentation analysis
attributes = [
    "emp_id", "name", "address", "phone", "email",   # Personal
    "department", "title", "salary", "hire_date",    # Employment
    "review_score", "last_review", "manager_notes"   # Performance
]

queries = [
    {
        "name": "Employee Directory Lookup",
        "frequency": 10000,  # Very frequent
        "attributes": ["emp_id", "name", "email", "department", "title"]
    },
    {
        "name": "Contact Information",
        "frequency": 5000,
        "attributes": ["emp_id", "name", "phone", "address", "email"]
    },
    {
        "name": "Salary Report",
        "frequency": 100,  # Monthly
        "attributes": ["emp_id", "name", "department", "salary", "hire_date"]
    },
    {
        "name": "Performance Review",
        "frequency": 50,  # Quarterly
        "attributes": ["emp_id", "name", "review_score", "last_review", "manager_notes"]
    },
    {
        "name": "HR Full Profile",
        "frequency": 10,  # Rare
        "attributes": attributes  # All attributes
    }
]

usage, affinity = compute_affinity_matrix(attributes, queries)

print("Attribute Affinity Matrix (showing high-affinity pairs):")
print("-" * 60)
for i, attr_i in enumerate(attributes):
    for j, attr_j in enumerate(attributes):
        if i < j and affinity[i][j] > 5000:  # High affinity threshold
            print(f"  {attr_i} <-> {attr_j}: {affinity[i][j]:,.0f}")
```

Interpreting the Affinity Matrix:
The affinity matrix reveals natural attribute groupings:
High Affinity Clusters: Attributes with high mutual affinity should be in the same fragment
Low Affinity Separation: Attributes rarely accessed together can be in different fragments
Universal Attributes: Some attributes (like the primary key or name) appear in many queries; these are candidates for replication across fragments
Singleton Attributes: Rarely accessed attributes might form their own small fragment
In distributed systems, extend affinity analysis per site. If queries at Site A use attributes {a, b, c} while Site B uses {c, d, e}, create fragments aligned with each site's access patterns. The attribute 'c' would be replicated in both fragments, trading storage for locality.
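The per-site extension can be sketched directly. In this minimal example the site workloads are invented to match the {a, b, c} / {c, d, e} scenario above; the attribute used at both sites falls out as the replication candidate:

```python
# Per-site attribute grouping sketch. Site workloads are invented
# to mirror the Site A / Site B example in the text.
site_usage = {
    "SiteA": [{"frequency": 100, "attributes": ["a", "b", "c"]}],
    "SiteB": [{"frequency": 80,  "attributes": ["c", "d", "e"]}],
}

# Attributes touched by each site's workload
site_attrs = {
    site: set().union(*(q["attributes"] for q in queries))
    for site, queries in site_usage.items()
}

# Attributes used at every site are replication candidates
shared = set.intersection(*site_attrs.values())
# shared == {'c'}: replicate 'c' in both site-local fragments
print(site_attrs)
print("replicate:", shared)
```

A fuller version would weight by frequency (as in the affinity matrix) rather than taking a plain intersection, but the decision structure is the same: site-local attribute sets define the fragments, and the overlap defines what gets replicated.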
Given an affinity matrix, we need algorithms to partition attributes into fragments. The Bond Energy Algorithm (BEA) is the classical approach, reorganizing the matrix to cluster high-affinity attributes together.
Bond Energy Algorithm:
BEA reorders the columns (and, by symmetry, the rows) of the affinity matrix so that high-affinity attributes become adjacent. It seeds the ordering with two columns, then greedily inserts each remaining column at the position that maximizes the global affinity measure.
Bond Energy Measure:
The global affinity measure (AM) quantifies clustering quality:
AM = Σᵢ Σⱼ AA[i][j] × (AA[i-1][j] + AA[i+1][j] + AA[i][j-1] + AA[i][j+1])
Higher AM indicates better clustering—high-affinity attributes are adjacent.
Fragment Identification:
After BEA reordering, clusters appear as dense blocks along the diagonal. Fragment boundaries are placed between low-affinity attribute groups, typically using a threshold or optimization.
```python
import numpy as np
from typing import List

def bond_energy_contribution(matrix: np.ndarray,
                             col_order: List[int],
                             new_col: int,
                             position: int) -> float:
    """
    Calculate the bond energy contribution of inserting new_col at position.
    """
    n = matrix.shape[0]
    contribution = 0

    # Contribution from left neighbor
    if position > 0:
        left_col = col_order[position - 1]
        for i in range(n):
            contribution += 2 * matrix[i][new_col] * matrix[i][left_col]

    # Contribution from right neighbor
    if position < len(col_order):
        right_col = col_order[position]
        for i in range(n):
            contribution += 2 * matrix[i][new_col] * matrix[i][right_col]

    return contribution

def bond_energy_algorithm(affinity_matrix: np.ndarray) -> List[int]:
    """
    Apply the Bond Energy Algorithm to reorder columns for clustering.

    Args:
        affinity_matrix: Square symmetric affinity matrix

    Returns:
        Optimal column ordering as list of indices
    """
    n = affinity_matrix.shape[0]

    # Start with first two columns in arbitrary order
    ordered = [0, 1]
    remaining = list(range(2, n))

    # Greedily insert each remaining column at optimal position
    for col in remaining:
        best_position = 0
        best_contribution = float('-inf')

        # Try inserting at each position (0 to len(ordered))
        for pos in range(len(ordered) + 1):
            contrib = bond_energy_contribution(
                affinity_matrix, ordered, col, pos
            )
            if contrib > best_contribution:
                best_contribution = contrib
                best_position = pos

        ordered.insert(best_position, col)

    return ordered

def identify_fragments(affinity_matrix: np.ndarray,
                       col_order: List[int],
                       threshold_percentile: float = 25) -> List[List[int]]:
    """
    Identify fragment boundaries in reordered affinity matrix.
    Uses threshold on between-cluster affinity.
    """
    n = len(col_order)

    # Compute affinity between adjacent columns
    adjacent_affinities = []
    for i in range(n - 1):
        col_i = col_order[i]
        col_j = col_order[i + 1]
        adjacent_affinities.append(affinity_matrix[col_i][col_j])

    # Threshold below which we split fragments
    threshold = np.percentile(adjacent_affinities, threshold_percentile)

    # Identify split points
    fragments = []
    current_fragment = [col_order[0]]
    for i in range(len(adjacent_affinities)):
        if adjacent_affinities[i] < threshold:
            # Low affinity - start new fragment
            fragments.append(current_fragment)
            current_fragment = [col_order[i + 1]]
        else:
            current_fragment.append(col_order[i + 1])
    fragments.append(current_fragment)

    return fragments

# Example usage
attributes = ["emp_id", "name", "address", "phone", "email",
              "department", "title", "salary", "hire_date",
              "review_score", "last_review", "manager_notes"]

# Simplified affinity matrix (symmetric)
affinity = np.array([
    # emp_id, name, addr, phone, email, dept, title, salary, hire, review, last_rev, notes
    [15160, 15160, 5010, 5010, 15010, 10110, 10110, 110, 110, 60, 60, 60],  # emp_id
    [15160, 15160, 5010, 5010, 15010, 10110, 10110, 110, 110, 60, 60, 60],  # name
    [5010, 5010, 5010, 5010, 5010, 10, 10, 10, 10, 10, 10, 10],             # address
    [5010, 5010, 5010, 5010, 5010, 10, 10, 10, 10, 10, 10, 10],             # phone
    [15010, 15010, 5010, 5010, 15010, 10010, 10010, 10, 10, 10, 10, 10],    # email
    [10110, 10110, 10, 10, 10010, 10110, 10110, 110, 110, 10, 10, 10],      # department
    [10110, 10110, 10, 10, 10010, 10110, 10110, 10, 10, 10, 10, 10],        # title
    [110, 110, 10, 10, 10, 110, 10, 110, 110, 10, 10, 10],                  # salary
    [110, 110, 10, 10, 10, 110, 10, 110, 110, 10, 10, 10],                  # hire_date
    [60, 60, 10, 10, 10, 10, 10, 10, 10, 60, 60, 60],                       # review_score
    [60, 60, 10, 10, 10, 10, 10, 10, 10, 60, 60, 60],                       # last_review
    [60, 60, 10, 10, 10, 10, 10, 10, 10, 60, 60, 60],                       # manager_notes
])

# Apply BEA
optimal_order = bond_energy_algorithm(affinity)
print("Optimal attribute order:", [attributes[i] for i in optimal_order])

# Identify fragments
fragments = identify_fragments(affinity, optimal_order)
print("Identified fragments:")
for i, frag in enumerate(fragments):
    print(f"  Fragment {i + 1}: {[attributes[idx] for idx in frag]}")
```

The tuple identifier (TID) is the critical mechanism enabling vertical fragment reconstruction. Every fragment must contain this identifier to support correct joins.
TID Requirements:
Uniqueness: each tuple has exactly one TID.
Stability: the TID never changes for the lifetime of the tuple.
Ubiquity: every fragment stores the TID for every tuple.
TID Options:
Natural Primary Key: appropriate when the key is simple and immutable (e.g., emp_id, order_id).
Surrogate Key: a system-generated, immutable identifier; necessary when the natural key is composite or mutable.
Physical Tuple ID: the row's physical address (e.g., PostgreSQL's ctid, Oracle's ROWID); fragile as a fragmentation TID because physical addresses can change when rows are moved or compacted.
If the tuple identifier changes, vertical fragments become inconsistent. Consider an Employee whose emp_id changes due to corporate restructuring. Fragment R₁ has (old_id, name, address) while R₂ has (new_id, salary). A join produces no result—the employee appears to have no salary data. This is catastrophic data corruption disguised as missing data.
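The failure mode is easy to reproduce. This tiny sketch (invented data, dict-based join) shows how a key change applied to only one fragment makes the tuple silently vanish from the reconstruction:

```python
# Demonstration: mutating the join key in only one fragment
# makes the tuple disappear from the reconstructed relation.
r1 = [{"emp_id": 42, "name": "Ada", "address": "1 Main St"}]
r2 = [{"emp_id": 42, "salary": 120000}]

def join_on(left, right, key):
    index = {row[key]: row for row in right}
    return [{**l, **index[l[key]]} for l in left if l[key] in index]

assert len(join_on(r1, r2, "emp_id")) == 1  # consistent: one employee

# Corporate restructuring renames the id in fragment R1 only
r1[0]["emp_id"] = 9042

assert join_on(r1, r2, "emp_id") == []  # employee vanishes: silent data loss
```

Nothing errors and no constraint fires; the corruption surfaces only as a missing row, which is exactly why an immutable TID is non-negotiable.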
```sql
-- Vertical fragmentation with surrogate TID

-- Original table with composite natural key
CREATE TABLE Products (
    category_id INTEGER NOT NULL,
    product_code VARCHAR(50) NOT NULL,
    -- Many attributes...
    name VARCHAR(200),
    description TEXT,
    price DECIMAL(10, 2),
    cost DECIMAL(10, 2),
    weight_kg DECIMAL(8, 3),
    dimensions VARCHAR(100),
    supplier_id INTEGER,
    reorder_level INTEGER,
    discontinued BOOLEAN,
    image_url TEXT,
    technical_specs JSONB,
    PRIMARY KEY (category_id, product_code)
);

-- Add surrogate TID for fragmentation
ALTER TABLE Products ADD COLUMN tid BIGINT GENERATED ALWAYS AS IDENTITY;
CREATE UNIQUE INDEX idx_products_tid ON Products(tid);

-- Now create vertical fragments using TID

-- Fragment 1: Catalog information (frequently accessed)
CREATE TABLE Products_Catalog AS
SELECT tid, category_id, product_code, name, description, price, image_url
FROM Products;

ALTER TABLE Products_Catalog ADD PRIMARY KEY (tid);
CREATE INDEX idx_catalog_natural_key ON Products_Catalog(category_id, product_code);

-- Fragment 2: Inventory/supply chain (operations team)
CREATE TABLE Products_Inventory AS
SELECT tid, weight_kg, dimensions, supplier_id, reorder_level, discontinued
FROM Products;

ALTER TABLE Products_Inventory ADD PRIMARY KEY (tid);

-- Fragment 3: Financial data (restricted access)
CREATE TABLE Products_Financial AS
SELECT tid, cost
FROM Products;

ALTER TABLE Products_Financial ADD PRIMARY KEY (tid);

-- Fragment 4: Technical specifications (engineering)
CREATE TABLE Products_Technical AS
SELECT tid, technical_specs
FROM Products;

ALTER TABLE Products_Technical ADD PRIMARY KEY (tid);

-- Reconstruction via TID join
CREATE VIEW Products_Reconstructed AS
SELECT c.category_id, c.product_code, c.name, c.description, c.price, c.image_url,
       i.weight_kg, i.dimensions, i.supplier_id, i.reorder_level, i.discontinued,
       f.cost,
       t.technical_specs
FROM Products_Catalog c
JOIN Products_Inventory i ON c.tid = i.tid
JOIN Products_Financial f ON c.tid = f.tid
JOIN Products_Technical t ON c.tid = t.tid;
```

TID Synchronization Challenges:
In a distributed setting, maintaining TID consistency across fragments requires careful coordination:
Insert Operations: a new tuple's TID must be generated exactly once, and the corresponding row must be inserted into every fragment, ideally within a single atomic (distributed) transaction.
Delete Operations: a delete must remove the TID's row from every fragment; a partial delete leaves orphan rows that silently drop attributes from reconstructed tuples.
TID Exhaustion: use a wide type (e.g., BIGINT) for the TID sequence so long-lived, high-insert systems do not run out of identifiers.
TID Reuse: never reuse a deleted TID while any fragment might still hold rows under it; a reused TID could join unrelated tuples together.
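The coordination rules above can be sketched as a single-process toy (not a real distributed protocol): a coordinator assigns each TID once from a monotonic sequence and applies inserts and deletes to every fragment together, so no fragment can hold a TID the others lack.

```python
# Illustrative sketch of coordinated TID management across fragments.
# Single-process stand-in for what would be a distributed transaction.
import itertools

class FragmentSet:
    def __init__(self, fragment_schemas):
        self.schemas = fragment_schemas               # {name: [attrs]}
        self.fragments = {name: {} for name in fragment_schemas}
        self._tid_seq = itertools.count(1)            # monotonic, never reused

    def insert(self, row):
        tid = next(self._tid_seq)                     # TID assigned exactly once
        # Write the projection of the row into EVERY fragment
        for name, attrs in self.schemas.items():
            self.fragments[name][tid] = {a: row[a] for a in attrs}
        return tid

    def delete(self, tid):
        # Remove from every fragment; a partial delete would orphan rows
        for frag in self.fragments.values():
            frag.pop(tid, None)

fs = FragmentSet({"personal": ["name"], "payroll": ["salary"]})
tid = fs.insert({"name": "Ada", "salary": 120000})
assert all(tid in frag for frag in fs.fragments.values())

fs.delete(tid)
assert all(tid not in frag for frag in fs.fragments.values())
```

In a real deployment the insert loop would be a distributed transaction (or a carefully ordered, retryable sequence), and the sequence would live in the coordinator or the primary fragment's database.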
Strict vertical fragmentation partitions attributes disjointly (except TID). However, practical systems often replicate certain attributes across fragments to avoid joins for common queries.
Motivation for Replication:
Consider a query accessing name (Fragment 1) and salary (Fragment 2):
SELECT name, salary FROM Employee WHERE emp_id = 12345;
With strict fragmentation, this requires joining two fragments. If name frequently accompanies other fragments' data, replicating it avoids the join.
Replication Candidates:
Good candidates are small, rarely updated attributes that frequently accompany queries against other fragments (e.g., name, status).
Storage vs. Performance Trade-off:
| Approach | Storage Cost | Read Performance | Write Complexity |
|---|---|---|---|
| No replication | Minimal | Requires joins | Simple |
| Selective replication | Moderate | Fewer joins | Update multiple fragments |
| Full replication | Maximal | No joins needed | Update all fragments |
Update Anomaly Prevention:
Replicated attributes must be updated consistently across all fragments. An inconsistent update creates the same anomalies as overlapping horizontal fragments: different fragments report different values for the same logical attribute.
Implement replication updates as atomic distributed operations or accept eventual consistency for read-only replicas.
```sql
-- Vertical fragmentation with selective attribute replication

-- Fragment 1: Personal information (primary for name, email)
CREATE TABLE Employee_Personal (
    tid BIGINT PRIMARY KEY,
    emp_id INTEGER UNIQUE NOT NULL,
    name VARCHAR(100) NOT NULL,  -- Primary source
    address TEXT,
    phone VARCHAR(20),
    email VARCHAR(100)
);

-- Fragment 2: Employment details (replicates name for display)
CREATE TABLE Employee_Employment (
    tid BIGINT PRIMARY KEY,
    emp_id INTEGER NOT NULL,     -- Replicated for filtering
    name VARCHAR(100) NOT NULL,  -- Replicated for display
    department VARCHAR(50),
    title VARCHAR(100),
    salary DECIMAL(10, 2),
    hire_date DATE,
    FOREIGN KEY (tid) REFERENCES Employee_Personal(tid)
);

-- Fragment 3: Performance data (replicates name, department for context)
CREATE TABLE Employee_Performance (
    tid BIGINT PRIMARY KEY,
    emp_id INTEGER NOT NULL,     -- Replicated
    name VARCHAR(100) NOT NULL,  -- Replicated
    department VARCHAR(50),      -- Replicated
    review_score DECIMAL(3, 2),
    last_review DATE,
    manager_notes TEXT,
    FOREIGN KEY (tid) REFERENCES Employee_Personal(tid)
);

-- Trigger to maintain replication consistency on name update
CREATE OR REPLACE FUNCTION sync_name_replication()
RETURNS TRIGGER AS $$
BEGIN
    IF NEW.name <> OLD.name THEN
        UPDATE Employee_Employment SET name = NEW.name WHERE tid = NEW.tid;
        UPDATE Employee_Performance SET name = NEW.name WHERE tid = NEW.tid;
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER tr_sync_name
AFTER UPDATE OF name ON Employee_Personal
FOR EACH ROW EXECUTE FUNCTION sync_name_replication();

-- Now queries can avoid joins for common patterns:

-- Salary query with name - no join needed
SELECT name, salary, title
FROM Employee_Employment
WHERE department = 'Engineering';

-- Performance query with context - no join needed
SELECT name, department, review_score
FROM Employee_Performance
WHERE last_review >= '2024-01-01';
```

Every replicated attribute adds write amplification. A name change in our example requires updating 3 fragments. For high-update-frequency attributes, the synchronization overhead may outweigh read benefits. Profile actual workloads before replicating mutable attributes.
Vertical fragmentation shares conceptual similarities with column-oriented storage (columnar databases). Understanding their relationship clarifies when each approach applies.
Column Stores (Columnar Databases):
Column-oriented databases store each column in a separate physical file or segment, which lets a query read only the columns it touches, compresses runs of similar adjacent values aggressively, and enables vectorized column-at-a-time execution.
Examples: Vertica, ClickHouse, Amazon Redshift, Apache Parquet
Key Differences:
| Aspect | Vertical Fragmentation | Column Store |
|---|---|---|
| Scope | Distributed database design | Single-node storage format |
| Granularity | Groups of related attributes | Individual columns |
| Driver | Network/site optimization | I/O efficiency, compression |
| Reconstruction | Join across network | Local tuple assembly |
| Tuple Identity | Explicit TID management | Implicit positional alignment |
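The "Tuple Identity" row of the table is worth making concrete. A minimal sketch (invented data) of the two alignment mechanisms:

```python
# Column store: implicit positional alignment -- same index = same tuple
names = ["Ada", "Bob"]
salaries = [120000, 95000]
row_at_index_1 = (names[1], salaries[1])

# Vertical fragments: explicit TID carried in each fragment
frag_names = {101: "Ada", 102: "Bob"}
frag_salaries = {101: 120000, 102: 95000}
row_for_tid_102 = (frag_names[102], frag_salaries[102])

# Both mechanisms recover the same tuple; the column store relies on
# physical ordering, the fragments on a logical key that survives
# reordering, redistribution, and independent storage layouts.
assert row_at_index_1 == row_for_tid_102 == ("Bob", 95000)
```

Positional alignment is cheaper but only works when all columns live together and stay in the same order; the explicit TID is what lets vertical fragments live on different sites.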
When They Complement:
The approaches can combine: each vertical fragment, placed at the site whose workload uses it, can itself be stored locally in a columnar format.
Example Architecture:
Site 1 (Sales Team): Site 2 (Finance Team):
+-------------------------+ +-------------------------+
| Employee_Personal | | Employee_Financial |
| (Column-store format) | | (Column-store format) |
| - tid.parquet | | - tid.parquet |
| - name.parquet | | - salary.parquet |
| - email.parquet | | - bonus.parquet |
| - phone.parquet | | - equity.parquet |
+-------------------------+ +-------------------------+
Modern distributed analytical databases (Snowflake, BigQuery, Databricks) combine both concepts: data is horizontally partitioned across nodes (cloud storage), vertically within each partition (columnar format), and dynamically cached based on access patterns. This layered approach addresses scale, performance, and cost simultaneously.
Vertical fragmentation partitions tables by columns, optimizing for attribute-level access patterns in distributed systems. Let's consolidate the key concepts:
Correctness requires completeness, reconstruction via natural join, and disjointness of non-key attributes.
Attribute affinity analysis derives co-access patterns from a frequency-weighted query workload.
The Bond Energy Algorithm clusters high-affinity attributes to suggest fragment boundaries.
A stable tuple identifier, replicated in every fragment, is what makes reconstruction possible.
Selective attribute replication trades storage and write amplification for fewer joins.
Vertical fragmentation (distribution design) and column stores (storage format) are complementary, not competing.
What's Next:
Real-world scenarios often require combining horizontal and vertical strategies. The next page explores Hybrid Fragmentation, where tables are partitioned both by rows AND columns to address complex access patterns spanning multiple dimensions.
You now understand vertical fragmentation at production depth—from affinity analysis through clustering algorithms to tuple identifier management. This knowledge enables designing column-aware distributed schemas that minimize unnecessary data access while maintaining reconstruction guarantees.