Once you decide to distribute a database, the first fundamental question is: How do you divide the data?
The answer is fragmentation—the process of partitioning a database's relations (tables) into smaller pieces called fragments, each stored at one or more sites. The fragment becomes the unit of distribution: transactions access fragments, replication copies fragments, and queries are translated into operations on fragments.
Fragmentation is the foundation upon which all distributed database operations are built. A well-designed fragmentation scheme enables parallel query execution, reduces network traffic, and aligns data placement with access patterns. A poorly-designed scheme creates hotspots, cross-site joins, and performance bottlenecks.
By the end of this page, you will understand the three fragmentation strategies—horizontal, vertical, and hybrid—along with their correctness requirements, design algorithms, and practical trade-offs. You'll be able to evaluate fragmentation schemes for real-world distributed database deployments.
What is a Fragment?
A fragment is a subset of a relation that can be stored and managed independently. Formally, if R is a relation, we can decompose R into fragments R₁, R₂, ..., Rₙ where each Rᵢ contains a subset of R's data.
Fragmentation vs. Partitioning
The terms fragmentation and partitioning are often used interchangeably, though fragmentation is the classical distributed database term while partitioning is more common in modern systems. Both refer to dividing data into pieces distributed across nodes. We'll use "fragmentation" here to align with theoretical foundations.
Why Fragment?
Fragmentation serves multiple purposes:
- Locality: data is stored at the sites that access it most, so common queries run without crossing the network
- Parallelism: independent fragments can be scanned and updated concurrently on different nodes
- Reduced network traffic: queries ship only the relevant fragments, not entire relations
- Manageability: smaller units are faster to back up, rebuild, and migrate
Correctness Requirements
A valid fragmentation must satisfy three properties:
1. Completeness
If relation R is decomposed into fragments R₁, R₂, ..., Rₙ, every data item in R must appear in at least one fragment Rᵢ. No data is lost during fragmentation.
Mathematically: R = R₁ ∪ R₂ ∪ ... ∪ Rₙ
2. Reconstruction
It must be possible to reconstruct the original relation R from its fragments. This ensures the fragmentation is reversible—you can always recover the complete relation when needed.
For horizontal fragmentation: R = R₁ ∪ R₂ ∪ ... ∪ Rₙ (union)

For vertical fragmentation: R = R₁ ⋈ R₂ ⋈ ... ⋈ Rₙ (natural join on key)
3. Disjointness (often desirable, not always required)
For storage efficiency, fragments should ideally be disjoint—each data item appears in exactly one fragment. This prevents redundant storage at the fragmentation level (replication is handled separately).
Mathematically: Rᵢ ∩ Rⱼ = ∅ for all i ≠ j
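These three properties can be checked mechanically for any concrete horizontal fragmentation. Here is a minimal Python sketch using an invented customers relation and fragment names (illustrative only, not from any real system):

```python
# Check completeness, reconstruction, and disjointness for a
# horizontal fragmentation of a small illustrative relation.

customers = [
    {"customer_id": 1, "region": "NA"},
    {"customer_id": 2, "region": "EU"},
    {"customer_id": 3, "region": "APAC"},
]

# Selection predicates, one per fragment (hypothetical names)
predicates = {
    "Fragment_NA":  lambda r: r["region"] == "NA",
    "Fragment_EU":  lambda r: r["region"] == "EU",
    "Fragment_ROW": lambda r: r["region"] not in ("NA", "EU"),
}

# Fragmentation: apply each selection predicate to the relation
fragments = {name: [r for r in customers if p(r)]
             for name, p in predicates.items()}

# 1. Completeness: every tuple appears in at least one fragment
covered = {r["customer_id"] for frag in fragments.values() for r in frag}
assert covered == {r["customer_id"] for r in customers}

# 2. Reconstruction: the union of the fragments recovers the relation
union = [r for frag in fragments.values() for r in frag]
assert sorted(union, key=lambda r: r["customer_id"]) == customers

# 3. Disjointness: total fragment size equals the relation size,
#    so no tuple is stored twice
assert sum(len(frag) for frag in fragments.values()) == len(customers)
```

If a predicate set failed to cover some region value, the completeness assertion would fire; overlapping predicates would trip the disjointness check.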
Fragmentation divides data into non-overlapping (or minimally overlapping) pieces. Replication copies pieces to multiple locations. These are orthogonal concerns: you can fragment without replicating, replicate without fragmenting, or both. Most production systems use both—fragment for parallelism, replicate for availability.
Horizontal fragmentation divides a relation by rows. Each fragment contains a subset of tuples, but all fragments have the same schema (all columns). Think of slicing a table horizontally—each slice is a valid table with the same column structure.
Formal Definition
Given relation R and selection predicates p₁, p₂, ..., pₙ, horizontal fragments are:

Rᵢ = σ(pᵢ)(R) for i = 1, 2, ..., n

Where σ denotes the selection operation. For completeness, the predicates must cover all tuples:
p₁ ∨ p₂ ∨ ... ∨ pₙ ≡ true (for all tuples in R)
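One way to guarantee this coverage (and disjointness) by construction is to use minterm predicates: every conjunction of each simple predicate or its negation. A Python sketch with invented predicates:

```python
# Minterm predicates: for n simple predicates there are 2^n conjunctions
# of "predicate or its negation". Every tuple satisfies exactly one
# minterm, so minterm fragments are complete and disjoint by construction.
from itertools import product

# Illustrative simple predicates over an orders-like row
simple_predicates = [
    lambda r: r["region"] == "NA",
    lambda r: r["order_date"] >= "2024-01-01",
    lambda r: r["status"] == "pending",
]

def minterms(predicates):
    """Yield one composite predicate per truth assignment."""
    for signs in product([True, False], repeat=len(predicates)):
        yield lambda row, signs=signs: all(
            p(row) == s for p, s in zip(predicates, signs)
        )

rows = [
    {"region": "NA", "order_date": "2024-03-01", "status": "pending"},
    {"region": "EU", "order_date": "2023-06-01", "status": "shipped"},
]

# Each row satisfies exactly one of the 2^3 = 8 minterms
for row in rows:
    assert sum(m(row) for m in minterms(simple_predicates)) == 1
```

In practice most of the 2ⁿ minterms are empty or rarely accessed, so design algorithms merge them back into a handful of useful fragments.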
Example: Customer Table by Region
Consider a customers table for a global e-commerce platform:
```sql
-- Original table
CREATE TABLE customers (
    customer_id INT PRIMARY KEY,
    name VARCHAR(100),
    email VARCHAR(100),
    region VARCHAR(50),
    country VARCHAR(50)
);

-- Horizontal fragments by region

-- Fragment 1: North America (stored in US data center)
Fragment_NA = σ(region = 'North America')(customers)

-- Fragment 2: Europe (stored in EU data center)
Fragment_EU = σ(region = 'Europe')(customers)

-- Fragment 3: Asia-Pacific (stored in Singapore data center)
Fragment_APAC = σ(region = 'Asia-Pacific')(customers)

-- Fragment 4: Rest of World (stored in primary data center)
Fragment_ROW = σ(region NOT IN ('North America', 'Europe', 'Asia-Pacific'))(customers)

-- Reconstruction: UNION of all fragments returns original table
customers = Fragment_NA ∪ Fragment_EU ∪ Fragment_APAC ∪ Fragment_ROW
```

Types of Horizontal Fragmentation
Primary Horizontal Fragmentation
Fragments are defined using predicates on attributes of the relation itself. The example above uses region—an attribute of customers—as the fragmentation key.
Derived Horizontal Fragmentation
Fragments are defined based on relationships with other relations. For example, fragment orders based on the region of the associated customer:

Orders_NA = orders ⋉ Fragment_NA (semijoin: keep the orders whose customer_id appears in Fragment_NA)
This ensures related data is co-located: queries joining customers and orders can execute locally if both tables are fragmented consistently.
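Under the hood, derived fragmentation is a semijoin: each orders fragment keeps exactly the orders whose customer falls in the corresponding customers fragment. A Python sketch with invented data:

```python
# Derived horizontal fragmentation via semijoin: orders inherit the
# fragmentation of customers through the customer_id foreign key.

customers_na = [{"customer_id": 1}, {"customer_id": 3}]  # Fragment_NA
customers_eu = [{"customer_id": 2}]                      # Fragment_EU

orders = [
    {"order_id": 10, "customer_id": 1},
    {"order_id": 11, "customer_id": 2},
    {"order_id": 12, "customer_id": 3},
]

def semijoin(rel, frag, key="customer_id"):
    """rel ⋉ frag: keep rows of rel whose key appears in frag."""
    keys = {row[key] for row in frag}
    return [row for row in rel if row[key] in keys]

orders_na = semijoin(orders, customers_na)  # co-located with Fragment_NA
orders_eu = semijoin(orders, customers_eu)  # co-located with Fragment_EU

assert [o["order_id"] for o in orders_na] == [10, 12]
assert [o["order_id"] for o in orders_eu] == [11]
```

Note that derived fragments are complete only if every order's customer exists in some customer fragment, which is exactly the referential-integrity constraint.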
A poorly chosen fragmentation key creates hotspots (one fragment receives most traffic), cross-fragment queries (common queries span fragments), or skewed distribution (fragments of vastly different sizes). Key selection requires workload analysis—understanding which queries are common and how data is accessed.
Vertical fragmentation divides a relation by columns. Each fragment contains a subset of attributes but all tuples. Think of slicing a table vertically—each slice has fewer columns but the same number of rows.
Critical Requirement: Include Primary Key
Every vertical fragment must include the primary key (or a tuple identifier). Without the key, you cannot reconstruct the original relation—there's no way to know which attribute values belong together.
Formal Definition
Given relation R with attributes A = {a₁, a₂, ..., aₘ} and primary key K, vertical fragments are:

Rᵢ = π(K ∪ Aᵢ)(R) for attribute groups A₁, A₂, ..., Aₙ

Where π denotes projection, and A₁ ∪ A₂ ∪ ... ∪ Aₙ = A (all attributes covered).
Reconstruction: R = R₁ ⋈ R₂ ⋈ ... ⋈ Rₙ (natural join on primary key)
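Because every fragment carries the primary key, the join is lossless. A minimal Python sketch of projection and key-based reconstruction (data invented for the example):

```python
# Vertical fragmentation and lossless reconstruction via join on the key.

employees = [
    {"employee_id": 1, "name": "Ada", "salary": 120000},
    {"employee_id": 2, "name": "Lin", "salary": 95000},
]

def project(rows, attrs):
    """π(attrs)(rows): keep only the listed attributes."""
    return [{a: r[a] for a in attrs} for r in rows]

# Every vertical fragment includes the primary key
frag_hr  = project(employees, ["employee_id", "name"])
frag_pay = project(employees, ["employee_id", "salary"])

def join_on_key(left, right, key="employee_id"):
    """Natural join on the primary key."""
    index = {r[key]: r for r in right}
    return [{**l, **index[l[key]]} for l in left if l[key] in index]

# Reconstruction recovers the original relation exactly
assert join_on_key(frag_hr, frag_pay) == employees
```

Drop employee_id from either fragment and the join becomes impossible: there is no longer any way to tell which name goes with which salary.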
```sql
-- Original table with wide schema
CREATE TABLE employees (
    employee_id INT PRIMARY KEY,
    -- Core HR data
    name VARCHAR(100),
    department VARCHAR(50),
    hire_date DATE,
    -- Sensitive payroll data
    salary DECIMAL(10,2),
    bank_account VARCHAR(50),
    ssn VARCHAR(11),
    -- Large document
    resume_pdf BLOB,
    -- Audit fields
    created_at TIMESTAMP,
    updated_at TIMESTAMP
);

-- Vertical Fragment 1: Core HR (frequently accessed, HR systems)
Fragment_HR = π(employee_id, name, department, hire_date)(employees)

-- Vertical Fragment 2: Payroll (restricted access, payroll systems)
Fragment_Payroll = π(employee_id, salary, bank_account, ssn)(employees)

-- Vertical Fragment 3: Documents (archive storage, rarely accessed)
Fragment_Docs = π(employee_id, resume_pdf)(employees)

-- Vertical Fragment 4: Audit (compliance system)
Fragment_Audit = π(employee_id, created_at, updated_at)(employees)

-- Reconstruction: Natural join on employee_id
employees = Fragment_HR ⋈ Fragment_Payroll ⋈ Fragment_Docs ⋈ Fragment_Audit
```

Why Vertical Fragmentation?
Access Pattern Optimization: Most queries access only a subset of columns. Vertical fragmentation places frequently co-accessed columns together, reducing I/O.
Security Isolation: Sensitive columns (SSN, salary, medical data) can be isolated in fragments with restricted access.
Storage Tiering: Large, infrequently-accessed columns (BLOBs, CLOBs) can live on cheaper storage while frequently-accessed columns stay on fast SSDs.
Cache Efficiency: Smaller fragments fit better in memory caches, improving hit rates.
Column-oriented databases (Vertica, ClickHouse, Amazon Redshift) essentially apply aggressive vertical fragmentation, storing each column separately. This maximizes compression, cache efficiency, and scan performance for analytical queries that aggregate few columns across many rows.
Hybrid fragmentation (also called mixed fragmentation) combines horizontal and vertical fragmentation, exploiting the benefits of both strategies: row-based distribution for locality and column-based splitting for access optimization.
Two Approaches
1. Horizontal-then-Vertical (HV)
First apply horizontal fragmentation to partition rows, then apply vertical fragmentation to each horizontal fragment:
2. Vertical-then-Horizontal (VH)
First apply vertical fragmentation to partition columns, then apply horizontal fragmentation to each vertical fragment:
```sql
-- Original: Large transaction history table
CREATE TABLE transactions (
    txn_id BIGINT PRIMARY KEY,
    customer_id INT,
    region VARCHAR(20),
    txn_date TIMESTAMP,
    amount DECIMAL(12,2),
    status VARCHAR(20),
    description TEXT,     -- Large text field
    receipt_image BLOB    -- Large binary
);

-- Step 1: Horizontal fragmentation by region
-- Creates regional subsets

-- Step 2: Vertical fragmentation of each regional subset

-- Fragment A: Frequently queried operational data
Fragment_NA_Ops = π(txn_id, customer_id, txn_date, amount, status)(
    σ(region = 'NA')(transactions))

-- Fragment B: Large, rarely-accessed documents
Fragment_NA_Docs = π(txn_id, description, receipt_image)(
    σ(region = 'NA')(transactions))

-- Same pattern for other regions...
Fragment_EU_Ops = π(txn_id, customer_id, txn_date, amount, status)(
    σ(region = 'EU')(transactions))

Fragment_EU_Docs = π(txn_id, description, receipt_image)(
    σ(region = 'EU')(transactions))

-- Result: 2 regional × 2 column groups = 4 fragments per table
-- Each fragment optimally placed:
-- - Ops fragments on fast SSD in each region
-- - Docs fragments on cheap archive storage
```

Reconstruction of Hybrid Fragments
Reconstruction combines the operations for both fragmentation types:
For horizontal-then-vertical fragmentation: R = ⋃ᵢ (Rᵢ₁ ⋈ Rᵢ₂ ⋈ ... ⋈ Rᵢₘ). Within each horizontal fragment i, join its vertical pieces Rᵢₖ on the key; then take the union over all horizontal fragments.
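The join-then-union order can be demonstrated in a few lines of Python, mirroring the transactions example with invented data (two regions, two column groups):

```python
# Hybrid (horizontal-then-vertical) reconstruction: join the vertical
# pieces within each region, then union the regional results.

txns = [
    {"txn_id": 1, "region": "NA", "amount": 10, "description": "a"},
    {"txn_id": 2, "region": "EU", "amount": 20, "description": "b"},
]

def project(rows, attrs):
    return [{a: r[a] for a in attrs} for r in rows]

def join_on_key(left, right, key="txn_id"):
    idx = {r[key]: r for r in right}
    return [{**l, **idx[l[key]]} for l in left if l[key] in idx]

# Fragment: horizontal split by region, then vertical split per region
fragments = {}
for region in ("NA", "EU"):
    subset = [r for r in txns if r["region"] == region]
    fragments[region] = (
        project(subset, ["txn_id", "region", "amount"]),  # ops columns
        project(subset, ["txn_id", "description"]),       # doc columns
    )

# Reconstruct: R = ⋃ᵢ (Rᵢ_ops ⋈ Rᵢ_docs)
reconstructed = []
for ops, docs in fragments.values():
    reconstructed.extend(join_on_key(ops, docs))

assert sorted(reconstructed, key=lambda r: r["txn_id"]) == txns
```

Doing the union before the joins would be wrong here: joining across regions would try to match txn_ids that live in different horizontal fragments.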
When to Use Hybrid Fragmentation
Hybrid fragmentation offers maximum flexibility but also maximum complexity. Each additional fragmentation dimension multiplies the number of fragments, complicating placement, replication, and query routing. Use hybrid fragmentation when clear access patterns justify it, not as a default strategy.
Designing an optimal fragmentation scheme requires analyzing query workloads, access patterns, and data characteristics. Several algorithms guide this process:
Horizontal Fragmentation Design: Predicate-Based Partitioning
The goal is to define predicates that:
- Cover every tuple (completeness)
- Do not overlap (disjointness)
- Match the workload, so that common queries touch as few fragments as possible
Algorithm: Simple Predicate Analysis
```sql
-- Example: Simple Predicate Analysis for orders table

-- Collected predicates from query workload:
p1: region = 'NA'
p2: region = 'EU'
p3: order_date >= '2024-01-01'
p4: status = 'pending'

-- Minterm predicates (all combinations):
m1: region = 'NA' ∧ order_date >= '2024-01-01' ∧ status = 'pending'
m2: region = 'NA' ∧ order_date >= '2024-01-01' ∧ status ≠ 'pending'
m3: region = 'NA' ∧ order_date < '2024-01-01' ∧ status = 'pending'
... (16 combinations for 4 boolean predicates)

-- After workload analysis, group into fragments:
Fragment_1: region = 'NA' ∧ status = 'pending'   -- Active NA orders
Fragment_2: region = 'EU' ∧ status = 'pending'   -- Active EU orders
Fragment_3: status ≠ 'pending'                   -- Historical (all regions)
```

Vertical Fragmentation Design: Attribute Affinity
Vertical fragmentation groups attributes that are frequently accessed together. The attribute affinity measures how often two attributes appear in the same query.
Algorithm: Bond Energy Algorithm (BEA)

The Bond Energy Algorithm starts from an attribute affinity matrix and permutes its rows and columns so that attributes with high mutual affinity end up adjacent, forming clusters along the diagonal. Each cluster becomes a candidate vertical fragment.

Example Affinity Matrix:
| Attribute | name | email | salary | ssn | resume |
|---|---|---|---|---|---|
| name | — | 82 | 15 | 3 | 12 |
| email | 82 | — | 8 | 2 | 10 |
| salary | 15 | 8 | — | 78 | 0 |
| ssn | 3 | 2 | 78 | — | 0 |
| resume | 12 | 10 | 0 | 0 | — |
Interpreting the Matrix:
- name and email have high affinity (82 queries access both) → Group together
- salary and ssn have high affinity (78 queries access both) → Group together
- resume has low affinity with all others → Separate fragment

Resulting Fragments:
- Fragment 1: {employee_id, name, email}
- Fragment 2: {employee_id, salary, ssn}
- Fragment 3: {employee_id, resume}
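An affinity matrix like this is built by scanning the query workload: for every query, add its frequency to each pair of attributes it touches. A Python sketch with illustrative workload numbers:

```python
# Build an attribute affinity matrix from a query workload.
from collections import defaultdict

# (set of attributes the query accesses, queries per period) -- illustrative
workload = [
    ({"name", "email"}, 82),
    ({"salary", "ssn"}, 78),
    ({"name", "salary"}, 15),
]

affinity = defaultdict(int)
for attrs, freq in workload:
    for a in attrs:
        for b in attrs:
            if a != b:
                affinity[(a, b)] += freq  # symmetric: both orders recorded

# High-affinity pairs become candidates for the same vertical fragment
assert affinity[("name", "email")] == 82
assert affinity[("salary", "ssn")] == 78
assert affinity[("name", "salary")] == 15
```

A clustering step such as BEA would then group {name, email} and {salary, ssn} into separate fragments, matching the interpretation above.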
Production systems increasingly automate fragmentation design. Query logs are analyzed to extract access patterns, and optimization algorithms suggest fragmentation schemes. Cloud data warehouses (BigQuery, Snowflake) even auto-cluster data based on observed query patterns. However, understanding the principles helps you evaluate and override automated decisions when needed.
Every fragmentation decision involves trade-offs. Understanding these helps you navigate design choices:
Fragment Size: Too Big vs. Too Small
| Aspect | Large Fragments | Small Fragments |
|---|---|---|
| Parallelism | Limited | High |
| Management overhead | Low | High |
| Load balancing | Coarse-grained | Fine-grained |
| Skew tolerance | Low | High |
| Query routing complexity | Low | High |
| Rebalancing cost | High | Low |
Fragmentation Granularity Guidelines
Rule of Thumb: Fragment size should be:
- Small enough that a single fragment can be rebuilt or migrated quickly during rebalancing and recovery
- Large enough that per-fragment metadata and management overhead remains negligible
- Numerous enough to keep every node busy, without so many fragments that query routing dominates

Typical Ranges: exact numbers vary by system, but production deployments commonly keep fragments in the hundreds of megabytes to low tens of gigabytes, sized so a fragment can move between nodes in minutes rather than hours.
Most distributed databases support online refragmentation—changing the fragmentation scheme while the system remains available. This is expensive but necessary as data volumes and access patterns evolve. Design with the expectation that you'll refragment at least once as your system matures.
After defining fragments, the next question is: Where do you place each fragment?
This is the allocation problem—assigning fragments to sites in a way that optimizes performance, availability, and resource utilization.
Allocation Considerations

Key inputs to the allocation decision include:
- Access frequency: which sites read and write each fragment, and how often
- Network cost: latency and bandwidth between sites
- Storage capacity and cost at each candidate site
- Availability requirements: how many copies must survive a site failure
Allocation Strategies
1. Full Replication
Every fragment at every site. Maximizes read availability and locality but:
- Storage cost grows linearly with the number of sites (O(N × Data))
- Every write must be propagated to every site, making updates slow and expensive
2. No Replication (Partitioning Only)
Each fragment at exactly one site. Minimizes storage but:
- Any site failure makes its fragments unavailable (no redundancy)
- Queries from other sites must always fetch data over the network
3. Selective Replication
Each fragment replicated to K sites (where 1 < K < N). Balances:
- Storage cost: K copies instead of N
- Read availability: reads survive up to K − 1 replica failures
- Write cost: K replicas to update instead of N
| Strategy | Storage Cost | Read Availability | Write Complexity | Use Case |
|---|---|---|---|---|
| Full Replication | O(N × Data) | Maximum | O(N) per write | Small reference data, config |
| No Replication | O(Data) | Minimum | O(1) per write | Cost-sensitive, stateless apps |
| Selective (K=3) | O(3 × Data) | High | O(3) per write | Most production systems |
| Region-Aware (K=2 per region) | O(2R × Data/R) | High within region | O(2) per write | Multi-region deployments |
Place fragments that are frequently joined together on the same node. If customers and orders are always queried together, ensure each customer's orders are on the same node as that customer's data. This is 'co-location' or 'affinity-based placement' and eliminates distributed joins for common queries.
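A simple way to implement co-location is to route dependent rows by the parent's key rather than their own. The sketch below uses hash placement over hypothetical node names to show the idea:

```python
# Affinity-based placement: orders are routed by customer_id, not by
# order_id, so a customer's orders always land on the customer's node.
import hashlib

NODES = ["node-0", "node-1", "node-2"]  # hypothetical cluster

def node_for(key: str) -> str:
    """Deterministic hash placement: the same key always maps to the same node."""
    digest = int(hashlib.sha256(key.encode()).hexdigest(), 16)
    return NODES[digest % len(NODES)]

# Customer row placed by its own key...
customer_node = node_for("customer:42")

# ...and the customer's orders placed by the SAME key (the parent's id),
# so the customers ⋈ orders join for customer 42 executes on one node.
order_node = node_for("customer:42")

assert customer_node == order_node
```

If orders were hashed by order_id instead, each customer's orders would scatter across the cluster and every customer-orders join would become a distributed join.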
Data fragmentation is the foundation of distributed database design. Let's consolidate the key concepts:
- A fragment is the unit of distribution: a subset of a relation that can be stored and managed independently
- Horizontal fragmentation splits rows with selection predicates; reconstruction is union
- Vertical fragmentation splits columns, always keeping the primary key; reconstruction is natural join
- Hybrid fragmentation combines both, gaining flexibility at the cost of more fragments to manage
- A valid fragmentation satisfies completeness, reconstruction, and (usually) disjointness
- Allocation decides where fragments live; selective replication is the common production choice
What's Next
Fragmentation divides data, but for availability, you also need replication—maintaining copies of fragments across sites. The next page explores replication strategies: synchronous vs. asynchronous, primary-replica vs. multi-primary, and the consistency trade-offs each entails.
You now understand how distributed databases divide data through fragmentation. You can distinguish horizontal from vertical fragmentation, apply design algorithms to analyze workloads, and reason about allocation trade-offs. Next, we'll explore data replication for fault tolerance and performance.