Consider an Employee table with 50 attributes: personal information (name, address, phone), employment details (hire date, department, salary), performance metrics (review scores, promotion history), and rarely-accessed archival data (original application, interview notes, previous employer references).
A typical HR dashboard query fetches only 5-7 attributes for display. An analytics job processes salary and performance data exclusively. The archival data is accessed perhaps once per year during audits. If all 50 attributes are stored together, every query reads vastly more data than needed.
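The read amplification described above is easy to quantify. The following back-of-envelope sketch uses illustrative, assumed attribute widths and row counts (none of these numbers come from the text):

```python
# Back-of-envelope read amplification for the 50-attribute Employee table.
# Attribute sizes and row counts below are illustrative assumptions.
TOTAL_ATTRS = 50        # all attributes stored together
DASHBOARD_ATTRS = 7     # attributes the dashboard actually needs
AVG_ATTR_BYTES = 40     # assumed average attribute width
ROWS = 100_000

full_row_bytes = TOTAL_ATTRS * AVG_ATTR_BYTES
needed_bytes = DASHBOARD_ATTRS * AVG_ATTR_BYTES

# A scan over the unfragmented table reads every full row,
# even though only 7 of the 50 attributes are used.
scan_read = ROWS * full_row_bytes       # 200,000,000 bytes
useful_read = ROWS * needed_bytes       # 28,000,000 bytes

print(f"Bytes scanned : {scan_read:,}")
print(f"Bytes needed  : {useful_read:,}")
print(f"Amplification : {scan_read / useful_read:.1f}x")  # ~7.1x
```

Under these assumptions the dashboard workload reads roughly 7x more data than it uses, which is the gap vertical fragmentation closes.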
Vertical fragmentation addresses this by partitioning a relation by columns rather than rows. Each fragment contains all rows but only a subset of attributes. This technique optimizes for access patterns where different applications or queries consistently use different attribute groups.
This page explores vertical fragmentation in rigorous depth—from the theoretical foundations of attribute affinity analysis through practical algorithms for optimal attribute grouping, addressing the unique challenges of maintaining tuple identity and reconstruction guarantees across column-oriented fragments.
By the end of this page, you will understand: (1) The formal definition and correctness properties of vertical fragmentation, (2) Attribute affinity analysis and the use matrix methodology, (3) Clustering algorithms for grouping attributes into fragments, (4) The critical role of the tuple identifier in maintaining reconstruction, (5) Trade-offs between vertical fragmentation and column-store architectures, and (6) When to choose vertical over horizontal fragmentation.
Vertical fragmentation decomposes a relation's schema rather than its population. Understanding the formal properties is essential for correct design.
Definition:
Given a relation R with attributes A = {a₁, a₂, ..., aₘ} and primary key K, a vertical fragmentation of R produces fragments R₁, R₂, ..., Rₙ such that:
1. Completeness: every attribute of A appears in at least one fragment, i.e., A = A₁ ∪ A₂ ∪ ... ∪ Aₙ.
2. Reconstruction: the original relation is recoverable by natural join, i.e., R = R₁ ⋈ R₂ ⋈ ... ⋈ Rₙ.
3. Disjointness: each non-key attribute appears in exactly one fragment; only the key K is replicated across all fragments.
Key Insight: Unlike horizontal fragmentation where fragments are disjoint, vertical fragments overlap in the key attributes. This overlap is what enables reconstruction through joins.
Vertical fragmentation fundamentally requires a stable tuple identifier replicated across all fragments. In many databases, the natural primary key suffices. However, for tables with composite keys or mutable primary keys, a system-generated surrogate key (tuple-id) may be necessary. This tuple-id becomes the 'glue' binding columns from different fragments back to their original rows.
Attribute Projections:
Each vertical fragment Rᵢ is defined as a projection:
Rᵢ = π(K ∪ Aᵢ)(R)
Where Aᵢ is the set of non-key attributes assigned to fragment i, and π denotes the projection operation.
Example: Employee Table Fragmentation
Original relation:
Employee(emp_id, name, address, phone, department, salary, hire_date, review_score, notes)
Vertical fragments:
R₁ (personal): emp_id, name, address, phone
R₂ (employment): emp_id, department, salary, hire_date
R₃ (performance): emp_id, review_score, notes
Reconstruction:
Employee = R₁ ⋈ R₂ ⋈ R₃ (natural join on emp_id)
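The projection-and-join round trip above can be sketched in a few lines of plain Python. This is a minimal illustration using dicts (the sample row data is invented; the schema and fragment groupings follow the example):

```python
# Vertical fragmentation as projection, reconstruction as natural join.
def project(rows, attrs):
    """Vertical fragment: every row, but only the listed attributes."""
    return [{a: row[a] for a in attrs} for row in rows]

def natural_join(left, right, key):
    """Reconstruct by joining two fragments on the shared key attribute."""
    index = {row[key]: row for row in right}
    return [{**l, **index[l[key]]} for l in left if l[key] in index]

employees = [
    {"emp_id": 1, "name": "Ada", "address": "1 Main St", "phone": "555-0100",
     "department": "Eng", "salary": 120000, "hire_date": "2020-01-15",
     "review_score": 4.5, "notes": "strong"},
]

# Fragments R1, R2, R3 -- the key emp_id is replicated in each
r1 = project(employees, ["emp_id", "name", "address", "phone"])
r2 = project(employees, ["emp_id", "department", "salary", "hire_date"])
r3 = project(employees, ["emp_id", "review_score", "notes"])

# Reconstruction: R1 join R2 join R3 on emp_id
rebuilt = natural_join(natural_join(r1, r2, "emp_id"), r3, "emp_id")
assert rebuilt == employees  # lossless reconstruction
```

The final assertion is the reconstruction property in miniature: because every fragment carries emp_id, the joins recover exactly the original rows.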
| Property | Horizontal Fragmentation | Vertical Fragmentation |
|---|---|---|
| Partition Unit | Rows (tuples) | Columns (attributes) |
| Fragment Overlap | None—fragments are disjoint | Key attributes replicated in all fragments |
| Reconstruction Operator | Union (∪) | Natural Join (⋈) |
| Storage Overhead | None inherent | Key replication in each fragment |
| Query Optimization | Fragment elimination by row predicates | Fragment elimination by attribute access |
| Primary Driver | Data locality, load distribution | Access pattern optimization, column affinity |
Optimal vertical fragmentation requires understanding which attributes are accessed together. Attribute affinity analysis quantifies the co-access patterns to guide grouping decisions.
The Attribute Usage Matrix:
The first step is constructing an attribute usage matrix U in which rows correspond to queries and columns to attributes, with U[i][j] = 1 if query qᵢ accesses attribute aⱼ, and 0 otherwise.
Query Frequency Weighting:
Not all queries are equally important. Weight each query qᵢ by its execution frequency f(qᵢ), measured over a representative workload period.
Attribute Affinity Matrix:
From the usage matrix, derive the attribute affinity matrix AA where:
AA[j][k] = Σᵢ (U[i][j] × U[i][k] × f(qᵢ))
This measures how often attributes aⱼ and aₖ are accessed together, weighted by query frequency. High affinity suggests grouping; low affinity suggests separation.
Affinity Properties:
Symmetry: AA[j][k] = AA[k][j], since co-access is order-independent.
Diagonal: AA[j][j] = Σᵢ U[i][j] × f(qᵢ), the total access frequency of attribute aⱼ.
Scale: absolute values depend on workload size, so compare affinities relatively rather than against fixed thresholds.
```python
import numpy as np
from typing import List, Dict, Tuple

def compute_affinity_matrix(
    attributes: List[str],
    queries: List[Dict],  # [{"name": str, "frequency": int, "attributes": List[str]}]
) -> Tuple[np.ndarray, np.ndarray]:
    """
    Compute the attribute usage and affinity matrices.

    Args:
        attributes: List of attribute names
        queries: List of query specifications with name, frequency,
                 and accessed attributes

    Returns:
        Tuple of (usage_matrix, affinity_matrix)
    """
    n_attrs = len(attributes)
    n_queries = len(queries)
    attr_index = {attr: i for i, attr in enumerate(attributes)}

    # Build usage matrix U where U[q][a] = 1 if query q uses attribute a
    usage_matrix = np.zeros((n_queries, n_attrs), dtype=int)
    frequencies = np.zeros(n_queries)

    for q_idx, query in enumerate(queries):
        frequencies[q_idx] = query["frequency"]
        for attr in query["attributes"]:
            if attr in attr_index:
                usage_matrix[q_idx][attr_index[attr]] = 1

    # Build affinity matrix AA where AA[a1][a2] = sum of frequencies
    # for queries accessing both a1 and a2
    affinity_matrix = np.zeros((n_attrs, n_attrs))
    for q_idx in range(n_queries):
        freq = frequencies[q_idx]
        accessed = np.where(usage_matrix[q_idx] == 1)[0]
        # For each pair of attributes accessed by this query
        for i in accessed:
            for j in accessed:
                affinity_matrix[i][j] += freq

    return usage_matrix, affinity_matrix

# Example: Employee table fragmentation analysis
attributes = [
    "emp_id", "name", "address", "phone", "email",   # Personal
    "department", "title", "salary", "hire_date",    # Employment
    "review_score", "last_review", "manager_notes"   # Performance
]

queries = [
    {
        "name": "Employee Directory Lookup",
        "frequency": 10000,  # Very frequent
        "attributes": ["emp_id", "name", "email", "department", "title"]
    },
    {
        "name": "Contact Information",
        "frequency": 5000,
        "attributes": ["emp_id", "name", "phone", "address", "email"]
    },
    {
        "name": "Salary Report",
        "frequency": 100,  # Monthly
        "attributes": ["emp_id", "name", "department", "salary", "hire_date"]
    },
    {
        "name": "Performance Review",
        "frequency": 50,  # Quarterly
        "attributes": ["emp_id", "name", "review_score", "last_review", "manager_notes"]
    },
    {
        "name": "HR Full Profile",
        "frequency": 10,  # Rare
        "attributes": attributes  # All attributes
    }
]

usage, affinity = compute_affinity_matrix(attributes, queries)

print("Attribute Affinity Matrix (showing high-affinity pairs):")
print("-" * 60)
for i, attr_i in enumerate(attributes):
    for j, attr_j in enumerate(attributes):
        if i < j and affinity[i][j] > 5000:  # High affinity threshold
            print(f"  {attr_i} <-> {attr_j}: {affinity[i][j]:,.0f}")
```

Interpreting the Affinity Matrix:
The affinity matrix reveals natural attribute groupings:
High Affinity Clusters: Attributes with high mutual affinity should be in the same fragment
Low Affinity Separation: Attributes rarely accessed together can be in different fragments
Universal Attributes: Some attributes (like the primary key or name) appear in many queries; these are candidates for replication across fragments
Singleton Attributes: Rarely accessed attributes might form their own small fragment
In distributed systems, extend affinity analysis per site. If queries at Site A use attributes {a, b, c} while Site B uses {c, d, e}, create fragments aligned with each site's access patterns. The attribute 'c' would be replicated in both fragments, trading storage for locality.
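The per-site extension can be sketched directly. In this minimal example the site workloads are invented to match the {a, b, c} / {c, d, e} scenario above; the attribute used at both sites falls out as the replication candidate:

```python
# Per-site attribute grouping sketch. Site workloads are invented
# to mirror the Site A / Site B example in the text.
site_usage = {
    "SiteA": [{"frequency": 100, "attributes": ["a", "b", "c"]}],
    "SiteB": [{"frequency": 80,  "attributes": ["c", "d", "e"]}],
}

# Attributes touched by each site's workload
site_attrs = {
    site: set().union(*(q["attributes"] for q in queries))
    for site, queries in site_usage.items()
}

# Attributes used at every site are replication candidates
shared = set.intersection(*site_attrs.values())
# shared == {'c'}: replicate 'c' in both site-local fragments
print(site_attrs)
print("replicate:", shared)
```

A fuller version would weight by frequency (as in the affinity matrix) rather than taking a plain intersection, but the decision structure is the same: site-local attribute sets define the fragments, and the overlap defines what gets replicated.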
Given an affinity matrix, we need algorithms to partition attributes into fragments. The Bond Energy Algorithm (BEA) is the classical approach, reorganizing the matrix to cluster high-affinity attributes together.
Bond Energy Algorithm:
BEA reorders the columns (and, by symmetry, the rows) of the affinity matrix so that high-affinity attributes become adjacent. It seeds the ordering with two columns, then greedily inserts each remaining column at the position that maximizes the global affinity measure.
Bond Energy Measure:
The global affinity measure (AM) quantifies clustering quality:
AM = Σᵢ Σⱼ AA[i][j] × (AA[i-1][j] + AA[i+1][j] + AA[i][j-1] + AA[i][j+1])
Higher AM indicates better clustering—high-affinity attributes are adjacent.
Fragment Identification:
After BEA reordering, clusters appear as dense blocks along the diagonal. Fragment boundaries are placed between low-affinity attribute groups, typically using a threshold or optimization.
```python
import numpy as np
from typing import List

def bond_energy_contribution(matrix: np.ndarray,
                             col_order: List[int],
                             new_col: int,
                             position: int) -> float:
    """
    Calculate the bond energy contribution of inserting new_col at position.
    """
    n = matrix.shape[0]
    contribution = 0

    # Contribution from left neighbor
    if position > 0:
        left_col = col_order[position - 1]
        for i in range(n):
            contribution += 2 * matrix[i][new_col] * matrix[i][left_col]

    # Contribution from right neighbor
    if position < len(col_order):
        right_col = col_order[position]
        for i in range(n):
            contribution += 2 * matrix[i][new_col] * matrix[i][right_col]

    return contribution

def bond_energy_algorithm(affinity_matrix: np.ndarray) -> List[int]:
    """
    Apply the Bond Energy Algorithm to reorder columns for clustering.

    Args:
        affinity_matrix: Square symmetric affinity matrix

    Returns:
        Optimal column ordering as list of indices
    """
    n = affinity_matrix.shape[0]

    # Start with first two columns in arbitrary order
    ordered = [0, 1]
    remaining = list(range(2, n))

    # Greedily insert each remaining column at optimal position
    for col in remaining:
        best_position = 0
        best_contribution = float('-inf')

        # Try inserting at each position (0 to len(ordered))
        for pos in range(len(ordered) + 1):
            contrib = bond_energy_contribution(
                affinity_matrix, ordered, col, pos
            )
            if contrib > best_contribution:
                best_contribution = contrib
                best_position = pos

        ordered.insert(best_position, col)

    return ordered

def identify_fragments(affinity_matrix: np.ndarray,
                       col_order: List[int],
                       threshold_percentile: float = 25) -> List[List[int]]:
    """
    Identify fragment boundaries in reordered affinity matrix.
    Uses threshold on between-cluster affinity.
    """
    n = len(col_order)

    # Compute affinity between adjacent columns
    adjacent_affinities = []
    for i in range(n - 1):
        col_i = col_order[i]
        col_j = col_order[i + 1]
        adjacent_affinities.append(affinity_matrix[col_i][col_j])

    # Threshold below which we split fragments
    threshold = np.percentile(adjacent_affinities, threshold_percentile)

    # Identify split points
    fragments = []
    current_fragment = [col_order[0]]
    for i in range(len(adjacent_affinities)):
        if adjacent_affinities[i] < threshold:
            # Low affinity - start new fragment
            fragments.append(current_fragment)
            current_fragment = [col_order[i + 1]]
        else:
            current_fragment.append(col_order[i + 1])
    fragments.append(current_fragment)

    return fragments

# Example usage
attributes = ["emp_id", "name", "address", "phone", "email",
              "department", "title", "salary", "hire_date",
              "review_score", "last_review", "manager_notes"]

# Simplified affinity matrix (symmetric)
affinity = np.array([
    # emp_id, name, addr, phone, email, dept, title, salary, hire, review, last_rev, notes
    [15160, 15160, 5010, 5010, 15010, 10110, 10110, 110, 110, 60, 60, 60],  # emp_id
    [15160, 15160, 5010, 5010, 15010, 10110, 10110, 110, 110, 60, 60, 60],  # name
    [5010, 5010, 5010, 5010, 5010, 10, 10, 10, 10, 10, 10, 10],             # address
    [5010, 5010, 5010, 5010, 5010, 10, 10, 10, 10, 10, 10, 10],             # phone
    [15010, 15010, 5010, 5010, 15010, 10010, 10010, 10, 10, 10, 10, 10],    # email
    [10110, 10110, 10, 10, 10010, 10110, 10110, 110, 110, 10, 10, 10],      # department
    [10110, 10110, 10, 10, 10010, 10110, 10110, 10, 10, 10, 10, 10],        # title
    [110, 110, 10, 10, 10, 110, 10, 110, 110, 10, 10, 10],                  # salary
    [110, 110, 10, 10, 10, 110, 10, 110, 110, 10, 10, 10],                  # hire_date
    [60, 60, 10, 10, 10, 10, 10, 10, 10, 60, 60, 60],                       # review_score
    [60, 60, 10, 10, 10, 10, 10, 10, 10, 60, 60, 60],                       # last_review
    [60, 60, 10, 10, 10, 10, 10, 10, 10, 60, 60, 60],                       # manager_notes
])

# Apply BEA
optimal_order = bond_energy_algorithm(affinity)
print("Optimal attribute order:", [attributes[i] for i in optimal_order])

# Identify fragments
fragments = identify_fragments(affinity, optimal_order)
print("Identified fragments:")
for i, frag in enumerate(fragments):
    print(f"  Fragment {i + 1}: {[attributes[idx] for idx in frag]}")
```

The tuple identifier (TID) is the critical mechanism enabling vertical fragment reconstruction. Every fragment must contain this identifier to support correct joins.
TID Requirements:
Uniqueness: each tuple has exactly one TID.
Stability: the TID never changes for the lifetime of the tuple.
Ubiquity: every fragment stores the TID for every tuple.
TID Options:
Natural Primary Key: appropriate when the key is simple and immutable (e.g., emp_id, order_id).
Surrogate Key: a system-generated, immutable identifier; necessary when the natural key is composite or mutable.
Physical Tuple ID: the row's physical address (e.g., PostgreSQL's ctid, Oracle's ROWID); fragile as a fragmentation TID because physical addresses can change when rows are moved or compacted.
If the tuple identifier changes, vertical fragments become inconsistent. Consider an Employee whose emp_id changes due to corporate restructuring. Fragment R₁ has (old_id, name, address) while R₂ has (new_id, salary). A join produces no result—the employee appears to have no salary data. This is catastrophic data corruption disguised as missing data.
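The failure mode is easy to reproduce. This tiny sketch (invented data, dict-based join) shows how a key change applied to only one fragment makes the tuple silently vanish from the reconstruction:

```python
# Demonstration: mutating the join key in only one fragment
# makes the tuple disappear from the reconstructed relation.
r1 = [{"emp_id": 42, "name": "Ada", "address": "1 Main St"}]
r2 = [{"emp_id": 42, "salary": 120000}]

def join_on(left, right, key):
    index = {row[key]: row for row in right}
    return [{**l, **index[l[key]]} for l in left if l[key] in index]

assert len(join_on(r1, r2, "emp_id")) == 1  # consistent: one employee

# Corporate restructuring renames the id in fragment R1 only
r1[0]["emp_id"] = 9042

assert join_on(r1, r2, "emp_id") == []  # employee vanishes: silent data loss
```

Nothing errors and no constraint fires; the corruption surfaces only as a missing row, which is exactly why an immutable TID is non-negotiable.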
```sql
-- Vertical fragmentation with surrogate TID

-- Original table with composite natural key
CREATE TABLE Products (
    category_id INTEGER NOT NULL,
    product_code VARCHAR(50) NOT NULL,
    -- Many attributes...
    name VARCHAR(200),
    description TEXT,
    price DECIMAL(10, 2),
    cost DECIMAL(10, 2),
    weight_kg DECIMAL(8, 3),
    dimensions VARCHAR(100),
    supplier_id INTEGER,
    reorder_level INTEGER,
    discontinued BOOLEAN,
    image_url TEXT,
    technical_specs JSONB,
    PRIMARY KEY (category_id, product_code)
);

-- Add surrogate TID for fragmentation
ALTER TABLE Products ADD COLUMN tid BIGINT GENERATED ALWAYS AS IDENTITY;
CREATE UNIQUE INDEX idx_products_tid ON Products(tid);

-- Now create vertical fragments using TID

-- Fragment 1: Catalog information (frequently accessed)
CREATE TABLE Products_Catalog AS
SELECT tid, category_id, product_code, name, description, price, image_url
FROM Products;

ALTER TABLE Products_Catalog ADD PRIMARY KEY (tid);
CREATE INDEX idx_catalog_natural_key ON Products_Catalog(category_id, product_code);

-- Fragment 2: Inventory/supply chain (operations team)
CREATE TABLE Products_Inventory AS
SELECT tid, weight_kg, dimensions, supplier_id, reorder_level, discontinued
FROM Products;

ALTER TABLE Products_Inventory ADD PRIMARY KEY (tid);

-- Fragment 3: Financial data (restricted access)
CREATE TABLE Products_Financial AS
SELECT tid, cost
FROM Products;

ALTER TABLE Products_Financial ADD PRIMARY KEY (tid);

-- Fragment 4: Technical specifications (engineering)
CREATE TABLE Products_Technical AS
SELECT tid, technical_specs
FROM Products;

ALTER TABLE Products_Technical ADD PRIMARY KEY (tid);

-- Reconstruction via TID join
CREATE VIEW Products_Reconstructed AS
SELECT c.category_id, c.product_code, c.name, c.description, c.price, c.image_url,
       i.weight_kg, i.dimensions, i.supplier_id, i.reorder_level, i.discontinued,
       f.cost,
       t.technical_specs
FROM Products_Catalog c
JOIN Products_Inventory i ON c.tid = i.tid
JOIN Products_Financial f ON c.tid = f.tid
JOIN Products_Technical t ON c.tid = t.tid;
```

TID Synchronization Challenges:
In a distributed setting, maintaining TID consistency across fragments requires careful coordination:
Insert Operations: a new tuple's TID must be generated exactly once, and the corresponding row must be inserted into every fragment, ideally within a single atomic (distributed) transaction.
Delete Operations: a delete must remove the TID's row from every fragment; a partial delete leaves orphan rows that silently drop attributes from reconstructed tuples.
TID Exhaustion: use a wide type (e.g., BIGINT) for the TID sequence so long-lived, high-insert systems do not run out of identifiers.
TID Reuse: never reuse a deleted TID while any fragment might still hold rows under it; a reused TID could join unrelated tuples together.
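The coordination rules above can be sketched as a single-process toy (not a real distributed protocol): a coordinator assigns each TID once from a monotonic sequence and applies inserts and deletes to every fragment together, so no fragment can hold a TID the others lack.

```python
# Illustrative sketch of coordinated TID management across fragments.
# Single-process stand-in for what would be a distributed transaction.
import itertools

class FragmentSet:
    def __init__(self, fragment_schemas):
        self.schemas = fragment_schemas               # {name: [attrs]}
        self.fragments = {name: {} for name in fragment_schemas}
        self._tid_seq = itertools.count(1)            # monotonic, never reused

    def insert(self, row):
        tid = next(self._tid_seq)                     # TID assigned exactly once
        # Write the projection of the row into EVERY fragment
        for name, attrs in self.schemas.items():
            self.fragments[name][tid] = {a: row[a] for a in attrs}
        return tid

    def delete(self, tid):
        # Remove from every fragment; a partial delete would orphan rows
        for frag in self.fragments.values():
            frag.pop(tid, None)

fs = FragmentSet({"personal": ["name"], "payroll": ["salary"]})
tid = fs.insert({"name": "Ada", "salary": 120000})
assert all(tid in frag for frag in fs.fragments.values())

fs.delete(tid)
assert all(tid not in frag for frag in fs.fragments.values())
```

In a real deployment the insert loop would be a distributed transaction (or a carefully ordered, retryable sequence), and the sequence would live in the coordinator or the primary fragment's database.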
Strict vertical fragmentation partitions attributes disjointly (except TID). However, practical systems often replicate certain attributes across fragments to avoid joins for common queries.
Motivation for Replication:
Consider a query accessing name (Fragment 1) and salary (Fragment 2):
SELECT name, salary FROM Employee WHERE emp_id = 12345;
With strict fragmentation, this requires joining two fragments. If name frequently accompanies other fragments' data, replicating it avoids the join.
Replication Candidates:
Good candidates are small, rarely updated attributes that frequently accompany queries against other fragments (e.g., name, status).
Storage vs. Performance Trade-off:
| Approach | Storage Cost | Read Performance | Write Complexity |
|---|---|---|---|
| No replication | Minimal | Requires joins | Simple |
| Selective replication | Moderate | Fewer joins | Update multiple fragments |
| Full replication | Maximal | No joins needed | Update all fragments |
Update Anomaly Prevention:
Replicated attributes must be updated consistently across all fragments. An inconsistent update creates the same anomalies as overlapping horizontal fragments: different fragments report different values for the same logical attribute.
Implement replication updates as atomic distributed operations or accept eventual consistency for read-only replicas.
```sql
-- Vertical fragmentation with selective attribute replication

-- Fragment 1: Personal information (primary for name, email)
CREATE TABLE Employee_Personal (
    tid BIGINT PRIMARY KEY,
    emp_id INTEGER UNIQUE NOT NULL,
    name VARCHAR(100) NOT NULL,  -- Primary source
    address TEXT,
    phone VARCHAR(20),
    email VARCHAR(100)
);

-- Fragment 2: Employment details (replicates name for display)
CREATE TABLE Employee_Employment (
    tid BIGINT PRIMARY KEY,
    emp_id INTEGER NOT NULL,     -- Replicated for filtering
    name VARCHAR(100) NOT NULL,  -- Replicated for display
    department VARCHAR(50),
    title VARCHAR(100),
    salary DECIMAL(10, 2),
    hire_date DATE,
    FOREIGN KEY (tid) REFERENCES Employee_Personal(tid)
);

-- Fragment 3: Performance data (replicates name, department for context)
CREATE TABLE Employee_Performance (
    tid BIGINT PRIMARY KEY,
    emp_id INTEGER NOT NULL,     -- Replicated
    name VARCHAR(100) NOT NULL,  -- Replicated
    department VARCHAR(50),      -- Replicated
    review_score DECIMAL(3, 2),
    last_review DATE,
    manager_notes TEXT,
    FOREIGN KEY (tid) REFERENCES Employee_Personal(tid)
);

-- Trigger to maintain replication consistency on name update
CREATE OR REPLACE FUNCTION sync_name_replication()
RETURNS TRIGGER AS $$
BEGIN
    IF NEW.name <> OLD.name THEN
        UPDATE Employee_Employment SET name = NEW.name WHERE tid = NEW.tid;
        UPDATE Employee_Performance SET name = NEW.name WHERE tid = NEW.tid;
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER tr_sync_name
AFTER UPDATE OF name ON Employee_Personal
FOR EACH ROW EXECUTE FUNCTION sync_name_replication();

-- Now queries can avoid joins for common patterns:

-- Salary query with name - no join needed
SELECT name, salary, title
FROM Employee_Employment
WHERE department = 'Engineering';

-- Performance query with context - no join needed
SELECT name, department, review_score
FROM Employee_Performance
WHERE last_review >= '2024-01-01';
```

Every replicated attribute adds write amplification. A name change in our example requires updating 3 fragments. For high-update-frequency attributes, the synchronization overhead may outweigh read benefits. Profile actual workloads before replicating mutable attributes.
Vertical fragmentation shares conceptual similarities with column-oriented storage (columnar databases). Understanding their relationship clarifies when each approach applies.
Column Stores (Columnar Databases):
Column-oriented databases store each column in a separate physical file or segment, which lets a query read only the columns it touches, compresses runs of similar adjacent values aggressively, and enables vectorized column-at-a-time execution.
Examples: Vertica, ClickHouse, Amazon Redshift, Apache Parquet
Key Differences:
| Aspect | Vertical Fragmentation | Column Store |
|---|---|---|
| Scope | Distributed database design | Single-node storage format |
| Granularity | Groups of related attributes | Individual columns |
| Driver | Network/site optimization | I/O efficiency, compression |
| Reconstruction | Join across network | Local tuple assembly |
| Tuple Identity | Explicit TID management | Implicit positional alignment |
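The "Tuple Identity" row of the table is worth making concrete. A minimal sketch (invented data) of the two alignment mechanisms:

```python
# Column store: implicit positional alignment -- same index = same tuple
names = ["Ada", "Bob"]
salaries = [120000, 95000]
row_at_index_1 = (names[1], salaries[1])

# Vertical fragments: explicit TID carried in each fragment
frag_names = {101: "Ada", 102: "Bob"}
frag_salaries = {101: 120000, 102: 95000}
row_for_tid_102 = (frag_names[102], frag_salaries[102])

# Both mechanisms recover the same tuple; the column store relies on
# physical ordering, the fragments on a logical key that survives
# reordering, redistribution, and independent storage layouts.
assert row_at_index_1 == row_for_tid_102 == ("Bob", 95000)
```

Positional alignment is cheaper but only works when all columns live together and stay in the same order; the explicit TID is what lets vertical fragments live on different sites.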
When They Complement:
The approaches can combine: each vertical fragment, placed at the site whose workload uses it, can itself be stored locally in a columnar format.
Example Architecture:
Site 1 (Sales Team): Site 2 (Finance Team):
+-------------------------+ +-------------------------+
| Employee_Personal | | Employee_Financial |
| (Column-store format) | | (Column-store format) |
| - tid.parquet | | - tid.parquet |
| - name.parquet | | - salary.parquet |
| - email.parquet | | - bonus.parquet |
| - phone.parquet | | - equity.parquet |
+-------------------------+ +-------------------------+
Modern distributed analytical databases (Snowflake, BigQuery, Databricks) combine both concepts: data is horizontally partitioned across nodes (cloud storage), vertically within each partition (columnar format), and dynamically cached based on access patterns. This layered approach addresses scale, performance, and cost simultaneously.
Vertical fragmentation partitions tables by columns, optimizing for attribute-level access patterns in distributed systems. Let's consolidate the key concepts:
Correctness requires completeness, reconstruction via natural join, and disjointness of non-key attributes.
Attribute affinity analysis derives co-access patterns from a frequency-weighted query workload.
The Bond Energy Algorithm clusters high-affinity attributes to suggest fragment boundaries.
A stable tuple identifier, replicated in every fragment, is what makes reconstruction possible.
Selective attribute replication trades storage and write amplification for fewer joins.
Vertical fragmentation (distribution design) and column stores (storage format) are complementary, not competing.
What's Next:
Real-world scenarios often require combining horizontal and vertical strategies. The next page explores Hybrid Fragmentation, where tables are partitioned both by rows AND columns to address complex access patterns spanning multiple dimensions.
You now understand vertical fragmentation at production depth—from affinity analysis through clustering algorithms to tuple identifier management. This knowledge enables designing column-aware distributed schemas that minimize unnecessary data access while maintaining reconstruction guarantees.