You've fragmented your data across continents—customer records split by region, order history partitioned by year, product catalogs divided by category. The fragmentation serves its purpose: locality, performance, compliance. But now a quarterly business review requires a comprehensive report spanning all regions, all years, all categories.
Reconstruction is the process of reassembling the original relation (or a query result) from distributed fragments. It's the fundamental guarantee that fragmentation doesn't lose information—that despite physical distribution, the logical view remains coherent.
This page explores reconstruction in rigorous depth: the formal operators for different fragmentation types, optimization techniques to minimize reconstruction cost, query processing strategies that avoid unnecessary reconstruction, and practical implementation patterns for production systems. Understanding reconstruction completes your mastery of fragmentation—you'll know not just how to split data, but how to seamlessly reunite it when needed.
By the end of this page, you will understand: (1) Formal reconstruction operators for horizontal, vertical, and hybrid fragmentation, (2) The role of localization in distributed query processing, (3) Optimization techniques to minimize reconstruction overhead, (4) Semi-join and bloom filter strategies for reducing data transfer, (5) Materialized views for pre-computed reconstruction, and (6) Consistency considerations during reconstruction.
Reconstruction operators mirror fragmentation operators. Understanding the formal relationship ensures correct and complete data retrieval.
Horizontal Fragmentation Reconstruction:
Given horizontal fragments R₁, R₂, ..., Rₙ where each Rᵢ = σ(pᵢ)(R):
Reconstruction: R = R₁ ∪ R₂ ∪ ... ∪ Rₙ
The union operation combines all tuples from all fragments. Because fragments are disjoint (pᵢ ∧ pⱼ = FALSE for i≠j), there are no duplicates to eliminate—UNION ALL is sufficient and more efficient than UNION DISTINCT.
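As a toy illustration (hypothetical in-memory data, not the chapter's SQL schema), horizontal reconstruction is plain concatenation because the fragment predicates are disjoint:

```python
# Hypothetical toy relation; fragments are defined by disjoint region predicates.
orders = [
    {"id": 1, "region": "NA", "amount": 500},
    {"id": 2, "region": "EU", "amount": 900},
    {"id": 3, "region": "APAC", "amount": 1200},
]

# Fragmentation: R_i = sigma(region = r)(R)
fragments = {r: [t for t in orders if t["region"] == r]
             for r in ("NA", "EU", "APAC")}

# Reconstruction: R = R1 ∪ R2 ∪ R3. UNION ALL is just concatenation,
# since disjointness guarantees there are no duplicates to eliminate.
reconstructed = [t for frag in fragments.values() for t in frag]

assert sorted(t["id"] for t in reconstructed) == [1, 2, 3]
```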
Vertical Fragmentation Reconstruction:
Given vertical fragments R₁, R₂, ..., Rₙ where each Rᵢ = π(K ∪ Aᵢ)(R) and K is the tuple identifier:
Reconstruction: R = R₁ ⋈ R₂ ⋈ ... ⋈ Rₙ (natural join on K)
The join operation matches tuples across fragments using the common tuple identifier. Because each tuple has exactly one entry in each fragment (assuming correct fragmentation), this is a lossless join with no spurious tuples.
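The same idea in miniature (hypothetical data, fragments modeled as dicts keyed by TID): each fragment carries the key plus its own attribute subset, and reconstruction merges them per TID.

```python
# Hypothetical vertical fragments keyed by TID.
contact = {1: {"name": "Ada", "email": "ada@example.com"},
           2: {"name": "Bo", "email": "bo@example.com"}}
demographics = {1: {"age": 38}, 2: {"age": 51}}

# Reconstruction: R = R1 ⋈ R2 (natural join on the TID). Lossless because
# every TID appears exactly once in each fragment.
reconstructed = {tid: {**contact[tid], **demographics[tid]} for tid in contact}

assert reconstructed[1] == {"name": "Ada", "email": "ada@example.com", "age": 38}
```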
Hybrid Fragmentation Reconstruction:
For HV (horizontal-then-vertical) fragmentation with fragments Fᵢⱼ:
Reconstruction: R = ⋈ⱼ (∪ᵢ Fᵢⱼ)
First union all horizontal fragments within each vertical group, then join the vertical results.
For VH (vertical-then-horizontal), the order reverses:
Reconstruction: R = ∪ᵢ (⋈ⱼ Fⱼᵢ)
First join all vertical fragments within each horizontal partition, then union the horizontal results.
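A small HV sketch (hypothetical data) makes the operator ordering concrete: union within each vertical group first, then join the groups on TID.

```python
# Hypothetical HV fragments F[h][v]: horizontal partition h, vertical group v,
# rows keyed by TID.
F = {
    "NA": {"billing": {1: {"amount": 100}},
           "shipping": {1: {"carrier": "DHL"}}},
    "EU": {"billing": {2: {"amount": 200}},
           "shipping": {2: {"carrier": "UPS"}}},
}

# Step 1: union the horizontal pieces within each vertical group (∪ᵢ Fᵢⱼ)
vertical = {}
for h in F:
    for v, rows in F[h].items():
        vertical.setdefault(v, {}).update(rows)

# Step 2: join the vertical groups on TID (⋈ⱼ)
reconstructed = {tid: {**vertical["billing"][tid], **vertical["shipping"][tid]}
                 for tid in vertical["billing"]}

assert reconstructed == {1: {"amount": 100, "carrier": "DHL"},
                         2: {"amount": 200, "carrier": "UPS"}}
```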
| Fragmentation Type | Reconstruction Operator | Complexity Factor | Key Requirement |
|---|---|---|---|
| Horizontal | Union (∪) | O(n) fragments | Disjoint predicates (no duplicates) |
| Vertical | Natural Join (⋈ on TID) | O(n) joins | TID present in all fragments |
| Hybrid HV | Join(Union per vertical) | O(h × v) | Both properties above |
| Hybrid VH | Union(Join per horizontal) | O(h × v) | Both properties above |
The correct reconstruction formula guarantees that the reconstructed relation equals the original: R_reconstructed ≡ R_original. This equivalence is the formal statement of the fragmentation correctness property. Any violation indicates a bug in fragmentation definition, a data corruption event, or a consistency anomaly in the distributed system.
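The equivalence can be checked mechanically with a round-trip test (a toy sketch with hypothetical data): fragment, reconstruct, and compare with the original.

```python
# Round-trip correctness check: R_reconstructed must equal R_original.
original = {1: {"amount": 50}, 2: {"amount": 500}, 3: {"amount": 5000}}

# Horizontal fragmentation with a disjoint, complete pair of predicates
small = {k: v for k, v in original.items() if v["amount"] < 1000}
large = {k: v for k, v in original.items() if v["amount"] >= 1000}

# Reconstruction by union; any mismatch would signal a fragmentation bug,
# corruption, or a consistency anomaly.
reconstructed = {**small, **large}
assert reconstructed == original
```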
The goal of query processing in distributed databases is to avoid full reconstruction whenever possible. Query localization transforms global queries into fragment-specific subqueries, accessing only necessary fragments.
The Localization Program:
Transforming a query on global relation R into queries on fragments:
Example: Horizontal Fragment Localization
Global query:
SELECT * FROM Orders WHERE region = 'US' AND amount > 1000
Fragments:
Orders_NA = σ(region IN ('US','CA'))(Orders)
Orders_EU = σ(region IN ('UK','DE'))(Orders)
Orders_APAC = σ(region IN ('JP','AU'))(Orders)
Localization steps:
1. Substitute the reconstruction expression for the global relation: Orders → Orders_NA ∪ Orders_EU ∪ Orders_APAC
2. Distribute the selection over the union: σ(region='US' ∧ amount>1000) is applied to each fragment.
3. Eliminate fragments whose predicate contradicts the query: region='US' rules out Orders_EU and Orders_APAC, leaving σ(region='US' ∧ amount>1000)(Orders_NA).
The query accesses only one fragment instead of three—significant savings in distributed systems.
```sql
-- Query Localization Examples

-- ================================================================
-- HORIZONTAL FRAGMENT LOCALIZATION
-- ================================================================

-- Original global query
-- SELECT customer_name, total_amount
-- FROM Orders
-- WHERE region = 'DE' AND order_date >= '2024-01-01';

-- After localization (only EU fragment accessed):
SELECT customer_name, total_amount
FROM Orders_EU
WHERE region = 'DE' AND order_date >= '2024-01-01';
-- Orders_NA and Orders_APAC eliminated

-- ================================================================
-- VERTICAL FRAGMENT LOCALIZATION
-- ================================================================

-- Given vertical fragments:
-- Customers_Contact: tid, customer_id, name, email, phone
-- Customers_Demographics: tid, age, gender, income_bracket
-- Customers_Preferences: tid, newsletter, language, timezone

-- Original query only needs contact info:
-- SELECT name, email FROM Customers WHERE customer_id = 12345;

-- After localization (only Contact fragment accessed):
SELECT name, email
FROM Customers_Contact
WHERE customer_id = 12345;
-- No join needed! Demographics and Preferences fragments not accessed

-- Query needing multiple vertical fragments:
-- SELECT name, age, newsletter FROM Customers WHERE customer_id = 12345;

-- After localization (three fragments, minimal attributes):
SELECT c.name, d.age, p.newsletter
FROM Customers_Contact c
JOIN Customers_Demographics d ON c.tid = d.tid
JOIN Customers_Preferences p ON c.tid = p.tid
WHERE c.customer_id = 12345;
-- Only necessary fragments joined

-- ================================================================
-- HYBRID FRAGMENT LOCALIZATION
-- ================================================================

-- Given HV fragmentation:
-- Orders_NA_Billing: region IN (NA), billing attributes
-- Orders_NA_Shipping: region IN (NA), shipping attributes
-- Orders_EU_Billing: region IN (EU), billing attributes
-- Orders_EU_Shipping: region IN (EU), shipping attributes

-- Query: Get billing info for US orders
-- SELECT order_id, amount, tax FROM Orders WHERE region = 'US';

-- After localization:
-- 1. Eliminate EU fragments (region mismatch)
-- 2. Eliminate Shipping fragments (attributes not needed)

SELECT order_id, amount, tax
FROM Orders_NA_Billing
WHERE region = 'US';
-- Only one fragment accessed out of four!

-- ================================================================
-- LOCALIZATION OPTIMIZATION VIEW
-- ================================================================

-- Create a view that the optimizer can localize
CREATE VIEW global_orders AS
SELECT * FROM Orders_NA
UNION ALL
SELECT * FROM Orders_EU
UNION ALL
SELECT * FROM Orders_APAC;

-- PostgreSQL optimizer will perform constraint exclusion
-- when querying this view with partition-compatible predicates
EXPLAIN (COSTS OFF)
SELECT * FROM global_orders WHERE region = 'JP';

-- OUTPUT (ideal):
-- Append
--   ->  Seq Scan on orders_apac
--         Filter: (region = 'JP')
-- (NA and EU partitions excluded)
```

For automatic localization: (1) Use declarative partitioning with CHECK constraints that match fragment definitions, (2) Enable constraint_exclusion or equivalent optimizer settings, (3) Use partition-compatible data types (avoid functions on partition keys), (4) Keep statistics updated with regular ANALYZE. The optimizer can only eliminate fragments when it can prove predicate contradiction.
Vertical fragment reconstruction requires joins. In distributed systems, naive join execution transfers massive data volumes. Optimization techniques minimize this overhead.
The Problem:
Vertical fragments V₁ (at Site A) and V₂ (at Site B) must be joined. Suppose V₂ holds 10M rows of about 200 bytes each, roughly 2 GB in total:
Naive approach: Ship V₂ to Site A (2 GB transfer), perform local join.
But if the query selects only 1% of rows, we transferred 99% unnecessary data!
Semi-Join Reduction:
The semi-join technique reduces transfer by sending only the relevant join keys:
1. At Site A, evaluate the local predicate and collect the TIDs of qualifying rows.
2. Ship only this key set to Site B (small: keys, not full rows).
3. At Site B, restrict V₂ to rows whose TID appears in the received set.
4. Ship the filtered rows back to Site A and complete the join locally.
Cost Comparison:
Let selectivity s = 0.01 (1% of rows qualify)
| Approach | Data Transfer | Calculation |
|---|---|---|
| Naive | 2 GB | Full V₂ |
| Semi-join | ~0.8 MB + 20 MB | 10M × 0.01 × 8 B keys + 10M × 0.01 × 200 B rows |
Semi-join reduces transfer by ~100x for selective queries!
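The figures in the table can be checked with a quick back-of-envelope calculation (sizes taken from the scenario above):

```python
# Cost model for the semi-join example:
# |V2| = 10M rows, 200 bytes/row, 8-byte join keys, selectivity s = 0.01.
rows, row_bytes, key_bytes, s = 10_000_000, 200, 8, 0.01

naive_transfer = rows * row_bytes                 # ship all of V2
semi_join_transfer = (rows * s * key_bytes        # qualifying keys out
                      + rows * s * row_bytes)     # matching rows back

assert naive_transfer == 2_000_000_000            # 2 GB
assert semi_join_transfer == 800_000 + 20_000_000 # ~0.8 MB + 20 MB
assert naive_transfer / semi_join_transfer > 90   # roughly two orders of magnitude
```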
```sql
-- Semi-Join Reduction Strategy

-- Scenario: Reconstruct customer data from vertical fragments
-- Customers_Contact (Site A): 10M rows, contains email for filtering
-- Customers_Financial (Site B): 10M rows, contains salary data

-- Query: Get names and salaries for customers with @gmail.com email
-- SELECT c.name, f.salary
-- FROM Customers_Contact c
-- JOIN Customers_Financial f ON c.tid = f.tid
-- WHERE c.email LIKE '%@gmail.com';

-- ================================================================
-- NAIVE APPROACH (what NOT to do)
-- ================================================================

-- Step 1: Ship all of Customers_Financial (10M rows × 200B = 2GB) to Site A
-- Step 2: Join locally
-- Step 3: Filter by email
-- Result: Most transferred data is discarded!

-- ================================================================
-- SEMI-JOIN APPROACH (optimized)
-- ================================================================

-- Step 1: At Site A, find qualifying TIDs (site-local query)
CREATE TEMP TABLE qualifying_tids AS
SELECT tid
FROM Customers_Contact
WHERE email LIKE '%@gmail.com';
-- Result: ~100K TIDs (1% selectivity, 8 bytes each = 800KB)

-- Step 2: Send TID set to Site B (800KB transfer)
-- (In practice, this might be a federated query or explicit data transfer)

-- Step 3: At Site B, filter Customers_Financial using received TIDs
CREATE TEMP TABLE filtered_financial AS
SELECT f.tid, f.salary
FROM Customers_Financial f
WHERE f.tid IN (SELECT tid FROM qualifying_tids);
-- Result: ~100K rows (1% of table, 100K × 200B = 20MB)

-- Step 4: Send filtered result back to Site A (20MB transfer)

-- Step 5: At Site A, complete the join locally
SELECT c.name, f.salary
FROM Customers_Contact c
JOIN filtered_financial f ON c.tid = f.tid
WHERE c.email LIKE '%@gmail.com';

-- Total transfer: 800KB + 20MB ≈ 21MB
-- vs. Naive: 2GB
-- Savings: ~99%!

-- ================================================================
-- BLOOM FILTER OPTIMIZATION
-- ================================================================

-- For very large TID sets, sending the actual set is expensive
-- Use a Bloom filter: probabilistic set membership with false positives

-- Build Bloom filter for qualifying TIDs (fixed size, e.g., 1MB)
-- Send Bloom filter instead of TID list

-- At Site B, filter using Bloom filter
-- Some false positives will be included, but join will filter them out

-- PostgreSQL extension for Bloom filters
CREATE EXTENSION bloom;

-- Create bloom index on TID
CREATE INDEX idx_financial_tid_bloom ON Customers_Financial
    USING bloom (tid) WITH (length=1024);

-- With proper integration, the optimizer can use Bloom filters
-- for semi-join reduction in distributed queries
```

For frequently-executed reconstruction queries, materialized views pre-compute and store the reconstructed data, trading storage for query performance.
Materialized View Strategy:
Full Reconstruction Materialization: pre-join (or pre-union) all fragments into a single stored relation, refreshed on a schedule.
Partial Reconstruction Materialization: materialize only the hot subset (specific fragments, attributes, or rows) that frequent queries actually need.
Incremental Materialization: keep the materialized result continuously up to date as fragments change (via triggers or change capture) instead of periodic full refreshes.
Trade-off Analysis:
| Factor | On-Demand Reconstruction | Materialized View |
|---|---|---|
| Query Latency | Higher (compute at query time) | Lower (pre-computed) |
| Storage | None (use fragments) | Full copy of reconstructed data |
| Freshness | Always current | Depends on refresh strategy |
| Maintenance | None | Refresh costs (incremental or full) |
| Flexibility | Any query pattern | Only pre-defined patterns |
```sql
-- Materialized View Reconstruction Strategies

-- ================================================================
-- FULL RECONSTRUCTION MATERIALIZATION
-- ================================================================

-- Vertical fragments:
-- Customers_Contact: tid, customer_id, name, email
-- Customers_Demographics: tid, age, income_bracket
-- Customers_Activity: tid, last_login, purchase_count

-- Materialized view for complete customer record
CREATE MATERIALIZED VIEW customers_full AS
SELECT
    c.tid, c.customer_id, c.name, c.email,
    d.age, d.income_bracket,
    a.last_login, a.purchase_count
FROM Customers_Contact c
JOIN Customers_Demographics d ON c.tid = d.tid
JOIN Customers_Activity a ON c.tid = a.tid
WITH DATA;

-- Create indexes for common query patterns
CREATE INDEX idx_customers_full_email ON customers_full(email);
CREATE INDEX idx_customers_full_income ON customers_full(income_bracket);
CREATE INDEX idx_customers_full_active ON customers_full(last_login)
    WHERE last_login > CURRENT_DATE - INTERVAL '30 days';

-- Refresh strategy: full refresh nightly
REFRESH MATERIALIZED VIEW customers_full;

-- ================================================================
-- PARTIAL RECONSTRUCTION FOR HOT QUERIES
-- ================================================================

-- Horizontal fragments by region (NA, EU, APAC)
-- Common report needs only NA data with specific attributes

CREATE MATERIALIZED VIEW customers_na_report AS
SELECT customer_id, name, email, state, revenue_ytd, last_order_date
FROM Customers_NA
WHERE status = 'active'
WITH DATA;

-- Refresh more frequently for operational data
-- Using CONCURRENTLY for zero-downtime refresh
CREATE UNIQUE INDEX idx_customers_na_report_pk
    ON customers_na_report(customer_id);
REFRESH MATERIALIZED VIEW CONCURRENTLY customers_na_report;

-- ================================================================
-- INCREMENTAL MATERIALIZATION
-- ================================================================

-- For real-time or near-real-time requirements
-- Use triggers to maintain materialized view

-- Create the materialized table manually
CREATE TABLE customers_realtime (
    tid BIGINT PRIMARY KEY,
    customer_id BIGINT UNIQUE,
    name VARCHAR(100),
    email VARCHAR(255),
    age INTEGER,
    last_login TIMESTAMP,
    updated_at TIMESTAMP DEFAULT NOW()
);

-- Trigger function to propagate changes
CREATE OR REPLACE FUNCTION sync_customers_realtime()
RETURNS TRIGGER AS $$
BEGIN
    IF TG_TABLE_NAME = 'customers_contact' THEN
        INSERT INTO customers_realtime (tid, customer_id, name, email)
        VALUES (NEW.tid, NEW.customer_id, NEW.name, NEW.email)
        ON CONFLICT (tid) DO UPDATE SET
            customer_id = EXCLUDED.customer_id,
            name = EXCLUDED.name,
            email = EXCLUDED.email,
            updated_at = NOW();
    ELSIF TG_TABLE_NAME = 'customers_demographics' THEN
        UPDATE customers_realtime
        SET age = NEW.age, updated_at = NOW()
        WHERE tid = NEW.tid;
    ELSIF TG_TABLE_NAME = 'customers_activity' THEN
        UPDATE customers_realtime
        SET last_login = NEW.last_login, updated_at = NOW()
        WHERE tid = NEW.tid;
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

-- Attach triggers to vertical fragments
CREATE TRIGGER tr_sync_contact
AFTER INSERT OR UPDATE ON customers_contact
FOR EACH ROW EXECUTE FUNCTION sync_customers_realtime();

CREATE TRIGGER tr_sync_demographics
AFTER INSERT OR UPDATE ON customers_demographics
FOR EACH ROW EXECUTE FUNCTION sync_customers_realtime();

CREATE TRIGGER tr_sync_activity
AFTER INSERT OR UPDATE ON customers_activity
FOR EACH ROW EXECUTE FUNCTION sync_customers_realtime();

-- ================================================================
-- AGGREGATED RECONSTRUCTION
-- ================================================================

-- Sometimes full reconstruction isn't needed—only aggregates

CREATE MATERIALIZED VIEW regional_sales_summary AS
SELECT
    region,
    DATE_TRUNC('month', order_date) AS month,
    COUNT(*) AS order_count,
    SUM(amount) AS total_revenue,
    AVG(amount) AS avg_order_value
FROM (
    SELECT * FROM Orders_NA
    UNION ALL
    SELECT * FROM Orders_EU
    UNION ALL
    SELECT * FROM Orders_APAC
) all_orders
GROUP BY region, DATE_TRUNC('month', order_date)
WITH DATA;

-- Much smaller than full reconstruction!
-- Refresh daily for reporting freshness
REFRESH MATERIALIZED VIEW regional_sales_summary;
```

Materialized views can become stale between refreshes. For decision-critical queries, verify freshness or use real-time reconstruction. Document the expected staleness for each materialized view. Consider 'last refreshed' metadata visible to users. For financial or compliance data, staleness may be unacceptable.
In distributed systems, fragments may be updated independently. Reconstruction must address potential inconsistencies between fragments.
Consistency Challenges:
Temporal Inconsistency: fragments read at different moments may reflect different database states, so a reconstructed result can mix old and new values.
Partial Updates: a logical update spanning multiple fragments may become visible in one fragment before another, yielding a reconstructed tuple that never existed.
Replication Lag: when fragments are served from replicas, each replica may trail its primary by a different amount, skewing the joined result.
Consistency Levels for Reconstruction:
| Level | Definition | Implementation |
|---|---|---|
| Strong | Reconstruction reflects a single consistent snapshot | Distributed transactions, snapshot isolation |
| Session | Each session sees its own writes consistently | Read-your-writes guarantee |
| Eventual | Reconstruction eventually converges | Tolerate temporary inconsistency |
| Causal | Causally related writes seen in order | Vector clocks, causal+ consistency |
Achieving Strong Consistency:
Distributed Snapshot Isolation: read every fragment within a single distributed snapshot (a shared read timestamp) so reconstruction observes one consistent state.
Two-Phase Locking: acquire read locks on all participating fragments before reading and release them only after the result is assembled.
Coordination Services: use a transaction coordinator or consensus service to agree on the snapshot or lock order across sites.
```sql
-- Consistent Reconstruction Strategies

-- ================================================================
-- SNAPSHOT ISOLATION
-- ================================================================

-- Use a single transaction with REPEATABLE READ for consistent reconstruction
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;

-- All these queries see the same snapshot
SELECT c.name, c.email
FROM Customers_Contact c
WHERE c.tid IN (SELECT tid FROM active_customers);

SELECT d.age, d.income_bracket
FROM Customers_Demographics d
WHERE d.tid IN (SELECT tid FROM active_customers);

SELECT a.last_login, a.purchase_count
FROM Customers_Activity a
WHERE a.tid IN (SELECT tid FROM active_customers);

COMMIT;
-- All three queries see database as of transaction start

-- ================================================================
-- VERSIONED RECONSTRUCTION
-- ================================================================

-- Add version/timestamp metadata to fragments
ALTER TABLE Customers_Contact ADD COLUMN version_ts TIMESTAMP DEFAULT NOW();
ALTER TABLE Customers_Demographics ADD COLUMN version_ts TIMESTAMP DEFAULT NOW();
ALTER TABLE Customers_Activity ADD COLUMN version_ts TIMESTAMP DEFAULT NOW();

-- Reconstruct as of a specific point in time
-- (Requires temporal tables or versioning extension)

-- PostgreSQL temporal tables approach
CREATE EXTENSION temporal_tables;

-- Create history tables for versioning
CREATE TABLE Customers_Contact_History (LIKE Customers_Contact);
ALTER TABLE Customers_Contact_History ADD COLUMN valid_from TIMESTAMP;
ALTER TABLE Customers_Contact_History ADD COLUMN valid_until TIMESTAMP;

-- Query as of specific timestamp
CREATE OR REPLACE FUNCTION reconstruct_customer_at(
    p_tid BIGINT,
    p_as_of TIMESTAMP
) RETURNS TABLE (
    name VARCHAR,
    email VARCHAR,
    age INTEGER,
    last_login TIMESTAMP
) AS $$
BEGIN
    RETURN QUERY
    SELECT c.name, c.email, d.age, a.last_login
    FROM (SELECT * FROM Customers_Contact_History
          WHERE tid = p_tid
            AND valid_from <= p_as_of
            AND (valid_until IS NULL OR valid_until > p_as_of)) c
    JOIN (SELECT * FROM Customers_Demographics_History
          WHERE tid = p_tid
            AND valid_from <= p_as_of
            AND (valid_until IS NULL OR valid_until > p_as_of)) d
      ON c.tid = d.tid
    JOIN (SELECT * FROM Customers_Activity_History
          WHERE tid = p_tid
            AND valid_from <= p_as_of
            AND (valid_until IS NULL OR valid_until > p_as_of)) a
      ON c.tid = a.tid;
END;
$$ LANGUAGE plpgsql;

-- ================================================================
-- LAG DETECTION
-- ================================================================

-- Monitor reconstruction consistency by tracking fragment timestamps

CREATE TABLE fragment_metadata (
    fragment_name VARCHAR PRIMARY KEY,
    last_write_ts TIMESTAMP,
    replica_lag_ms INTEGER
);

-- Update metadata on each write (via triggers or application)

-- Before reconstruction, check for unacceptable lag
CREATE OR REPLACE FUNCTION check_reconstruction_consistency(
    max_lag_ms INTEGER DEFAULT 1000
) RETURNS BOOLEAN AS $$
DECLARE
    max_observed_lag INTEGER;
BEGIN
    SELECT MAX(replica_lag_ms) INTO max_observed_lag
    FROM fragment_metadata
    WHERE fragment_name IN
        ('Customers_Contact', 'Customers_Demographics', 'Customers_Activity');

    IF max_observed_lag > max_lag_ms THEN
        RAISE WARNING 'Reconstruction may be inconsistent. Max lag: % ms',
            max_observed_lag;
        RETURN FALSE;
    END IF;
    RETURN TRUE;
END;
$$ LANGUAGE plpgsql;

-- Usage before reconstruction queries
DO $$
BEGIN
    IF NOT check_reconstruction_consistency(500) THEN
        RAISE EXCEPTION 'Fragments too inconsistent for reconstruction';
    END IF;
END $$;
```

Per the CAP theorem, strong consistency during reconstruction may reduce availability. If a fragment's primary is unreachable, you can either: (1) Wait until available (sacrifice availability), (2) Read from a stale replica (sacrifice consistency), or (3) Return partial results with warnings (application-level trade-off). The right choice depends on your application's requirements.
Production systems employ various patterns to optimize reconstruction performance.
Pattern 1: Parallel Fragment Access
For horizontal reconstruction, access all fragments concurrently:
Fork:
Thread 1: Query Fragment_NA
Thread 2: Query Fragment_EU
Thread 3: Query Fragment_APAC
Join:
Union results as they complete
Latency = max(fragment latencies) not sum(fragment latencies)
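A minimal asyncio sketch (simulated latencies and hypothetical fragment names, not real database calls) shows the fork/join shape and the max-not-sum latency property:

```python
import asyncio
import time

async def query_fragment(name: str, latency: float) -> list:
    await asyncio.sleep(latency)  # stand-in for a remote fragment query
    return [f"{name}-row"]

async def parallel_union() -> list:
    # Fork: all fragment queries start concurrently
    results = await asyncio.gather(
        query_fragment("NA", 0.3),
        query_fragment("EU", 0.5),
        query_fragment("APAC", 0.2),
    )
    # Join: union the per-fragment results
    return [row for rows in results for row in rows]

start = time.monotonic()
rows = asyncio.run(parallel_union())
elapsed = time.monotonic() - start

assert rows == ["NA-row", "EU-row", "APAC-row"]
assert elapsed < 0.3 + 0.5 + 0.2  # total ≈ max(latencies), not the sum
```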
Pattern 2: Pipelined Reconstruction
Begin processing results as they arrive instead of waiting for complete reconstruction:
1. Start Fragment_1 query
2. As rows arrive, begin filtering/projecting
3. Stream results to client
4. Fragment_2, _3 results processed similarly
5. No full materialization of reconstructed table
Reduces memory usage and time-to-first-result.
Pattern 3: Predicate Distribution
For queries with multiple predicates, distribute predicates to eliminate fragments early:
Query: WHERE region = 'US' AND income > 100000
Region predicate → eliminates non-NA fragments
Income predicate → applied only to NA fragment
No cross-fragment communication for eliminated fragments
Pattern 4: Batched Reconstruction
For bulk reconstruction (reports, exports), batch processing reduces per-record overhead:
1. Collect TIDs in batches of 10,000
2. Reconstruct batch via single join query
3. Process batch to output
4. Repeat for next batch
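The batching loop above can be sketched as follows; `fetch_batch` is a hypothetical stand-in for one join query over a batch of TIDs, not a real database API.

```python
from typing import Callable, Iterable, Iterator, List, Tuple

def batches(tids: List[int], size: int) -> Iterator[List[int]]:
    """Split the TID list into fixed-size chunks."""
    for i in range(0, len(tids), size):
        yield tids[i:i + size]

def reconstruct_batched(
    tids: List[int],
    fetch_batch: Callable[[List[int]], Iterable[Tuple[int, str]]],
    size: int = 10_000,
) -> Iterator[Tuple[int, str]]:
    # One reconstruction query per batch instead of one per record
    for batch in batches(tids, size):
        yield from fetch_batch(batch)

# Toy usage: a fake fetch that returns (tid, payload) pairs
rows = list(reconstruct_batched(
    list(range(25)),
    lambda batch: [(t, f"row-{t}") for t in batch],
    size=10,
))
assert len(rows) == 25 and rows[0] == (0, "row-0")
```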
```python
import asyncio
import asyncpg
from typing import List, AsyncIterator
from dataclasses import dataclass


@dataclass
class FragmentConfig:
    name: str
    connection_string: str
    predicate: str


class DistributedReconstructor:
    """Parallel and pipelined reconstruction from distributed fragments."""

    def __init__(self, fragments: List[FragmentConfig]):
        self.fragments = fragments
        self.pools = {}

    async def initialize(self):
        """Create connection pools for each fragment site."""
        for frag in self.fragments:
            self.pools[frag.name] = await asyncpg.create_pool(
                frag.connection_string, min_size=2, max_size=10
            )

    async def close(self):
        """Close all connection pools."""
        for pool in self.pools.values():
            await pool.close()

    async def reconstruct_horizontal(
        self,
        query_predicates: dict,
        attributes: List[str]
    ) -> AsyncIterator[dict]:
        """
        Reconstruct from horizontal fragments with parallel access.
        Yields results as each fragment completes (pipelined union).
        """
        async def query_fragment(frag: FragmentConfig) -> List:
            # Check if fragment can be eliminated
            if not self._predicates_compatible(query_predicates, frag.predicate):
                return []  # Fragment eliminated

            pool = self.pools[frag.name]
            async with pool.acquire() as conn:
                sql = self._build_query(frag.name, attributes, query_predicates)
                # asyncpg cursors must run inside a transaction
                async with conn.transaction():
                    return [r async for r in
                            conn.cursor(sql, *query_predicates.values())]

        # Execute all fragment queries in parallel;
        # union results in completion order
        tasks = [asyncio.create_task(query_fragment(f)) for f in self.fragments]
        for task in asyncio.as_completed(tasks):
            for record in await task:
                yield record

    async def reconstruct_vertical(
        self,
        tid_filter: str,
        fragment_attributes: dict  # {fragment_name: [attributes]}
    ) -> AsyncIterator[dict]:
        """Reconstruct from vertical fragments using parallel access and merge."""
        async def query_vertical_fragment(frag_name: str, attrs: List[str]):
            pool = self.pools[frag_name]
            async with pool.acquire() as conn:
                sql = f"""
                    SELECT tid, {', '.join(attrs)}
                    FROM {frag_name}
                    WHERE {tid_filter}
                """
                return await conn.fetch(sql)

        # Query all vertical fragments in parallel
        tasks = {
            frag: asyncio.create_task(query_vertical_fragment(frag, attrs))
            for frag, attrs in fragment_attributes.items()
        }

        # Wait for all results
        results = {}
        for frag, task in tasks.items():
            results[frag] = {r['tid']: dict(r) for r in await task}

        # Merge on TID (natural join)
        primary_frag = list(fragment_attributes.keys())[0]
        for tid, primary_record in results[primary_frag].items():
            merged = primary_record.copy()
            for frag in list(fragment_attributes.keys())[1:]:
                if tid in results[frag]:
                    merged.update(results[frag][tid])
            yield merged

    def _predicates_compatible(self, query_pred: dict, frag_pred: str) -> bool:
        """Check if query predicates are compatible with fragment definition."""
        # Simplified - real implementation needs SQL predicate analysis
        return True  # Placeholder

    def _build_query(self, table: str, attrs: List[str], preds: dict) -> str:
        """Build a parameterized SQL query for one fragment."""
        where_clause = ' AND '.join(
            f"{k} = ${i + 1}" for i, k in enumerate(preds)
        )
        return f"SELECT {', '.join(attrs)} FROM {table} WHERE {where_clause}"


# Usage example
async def main():
    fragments = [
        FragmentConfig("orders_na", "postgresql://db1/orders",
                       "region IN ('US', 'CA')"),
        FragmentConfig("orders_eu", "postgresql://db2/orders",
                       "region IN ('UK', 'DE')"),
        FragmentConfig("orders_apac", "postgresql://db3/orders",
                       "region IN ('JP', 'AU')"),
    ]

    reconstructor = DistributedReconstructor(fragments)
    await reconstructor.initialize()
    try:
        # Parallel reconstruction with predicates
        async for order in reconstructor.reconstruct_horizontal(
            {"region": "US", "status": "pending"},
            ["order_id", "customer_id", "amount"]
        ):
            print(f"Order: {order}")
    finally:
        await reconstructor.close()

# asyncio.run(main())
```

Reconstruction completes the fragmentation story—ensuring that distributed data can be seamlessly reunited when needed.
Module Complete:
You've now completed the Fragmentation module. You understand horizontal, vertical, and hybrid fragmentation strategies; formal correctness properties; design methodologies; allocation optimization; and reconstruction techniques. This comprehensive knowledge enables you to design, implement, and operate fragmented distributed database systems that scale while maintaining correctness and performance.
Congratulations! You've mastered fragmentation—the foundational technique for distributed database design. From formal properties through practical implementation, you can now partition data strategically across distributed nodes while ensuring complete, consistent reconstruction. Next, explore Replication to understand how to maintain multiple copies of fragments for availability and performance.