For decades, the database industry accepted an implicit truth: SQL and horizontal scalability were fundamentally incompatible. The rich semantics of SQL—joins, transactions, foreign keys, complex queries—seemed to require centralized coordination that contradicted the distributed nature of scale-out architectures.
This assumption was wrong.
NewSQL databases prove that SQL's expressiveness and ACID's correctness guarantees can coexist with elastic, horizontal scalability. This page examines how—exploring the architectural patterns, distributed algorithms, and engineering innovations that make SQL at scale a reality.
By the end of this page, you will understand the specific techniques NewSQL systems use to distribute SQL workloads across multiple nodes, maintain consistency during distributed operations, and provide linear scalability without sacrificing query capabilities. You'll see how query planning, transaction coordination, and data distribution work together to deliver SQL at any scale.
Before examining solutions, let's understand precisely why SQL databases traditionally struggled with horizontal scaling. This isn't a limitation of SQL the language—it's a challenge of implementing SQL's semantics across distributed nodes.
Challenge 1: Distributed Transactions
Consider a simple bank transfer in SQL:
BEGIN TRANSACTION;
UPDATE accounts SET balance = balance - 100 WHERE id = 'alice';
UPDATE accounts SET balance = balance + 100 WHERE id = 'bob';
COMMIT;
In a single-node database, this is straightforward. But what if Alice's account data lives on Node A and Bob's account lives on Node B? The system must:

- Ensure both updates commit atomically, or neither does, even if a node or the network fails mid-transaction
- Hold locks on both nodes for the duration of the coordination
- Prevent other transactions from observing the intermediate state where the money has left Alice's account but not yet arrived in Bob's
Challenge 2: Distributed Joins
SQL's power comes largely from joins—combining data from multiple tables based on relationships:
SELECT orders.id, customers.name, products.title
FROM orders
JOIN customers ON orders.customer_id = customers.id
JOIN products ON orders.product_id = products.id
WHERE orders.date > '2024-01-01';
If these three tables are partitioned across different nodes, the database must:

- Locate the matching rows of each table, which may live on different nodes
- Move intermediate results across the network to wherever each join is performed
- Combine the results while keeping network transfer, and therefore latency, to a minimum
Challenge 3: Global Secondary Indexes
Secondary indexes (indexes on non-primary-key columns) enable efficient queries like:
SELECT * FROM users WHERE email = 'alice@example.com';
In a distributed environment, the user table is partitioned by primary key (user_id). How does the email index work when users are spread across all nodes? Options include:

- A global index, itself partitioned by email, mapping each email to its user_id (fast lookups, but every indexed write becomes a cross-node transaction)
- Local indexes on each node, which keep writes local but force lookups to fan out to every node
NewSQL systems must balance these trade-offs for every index.
Manual sharding of traditional databases addresses some scaling issues but creates significant limitations. Cross-shard queries become application responsibilities, transactions across shards require custom coordination code, and schema changes become operational nightmares. NewSQL solves these problems at the database level.
NewSQL databases transform SQL queries into distributed execution plans that parallelize work across nodes while minimizing data movement. This process involves several sophisticated components.
Query Planning and Optimization
The query optimizer in a NewSQL database must consider factors that single-node optimizers never face:
-- Example Query
SELECT c.name, SUM(o.amount) as total
FROM customers c
JOIN orders o ON c.id = o.customer_id
WHERE c.region = 'EMEA'
GROUP BY c.name
HAVING SUM(o.amount) > 10000;

-- Distributed Execution Plan (simplified)

STEP 1: PARALLEL SCAN (on all nodes holding 'customers' ranges)
  - Filter: region = 'EMEA'
  - Each node scans local customer data
  - Output: matching (customer_id, name) tuples

STEP 2: PARALLEL LOOKUP (on order data nodes)
  - For each customer_id from Step 1
  - Scan orders where orders.customer_id matches
  - Executed in parallel across all order nodes

STEP 3: SHUFFLE (redistribute by customer name)
  - Hash-partition intermediate results
  - Send tuples to aggregation nodes based on hash(name)

STEP 4: LOCAL AGGREGATE (on each aggregation node)
  - GROUP BY name
  - SUM(amount)
  - Filter: total > 10000

STEP 5: GATHER (at coordinator)
  - Collect final results from all aggregation nodes
  - Return to client

Push-Down Optimization
A critical optimization in NewSQL is 'pushing down' operations as close to the data as possible:

- Filter push-down: apply WHERE region = 'EMEA' at each storage node before returning any data to the coordinator.
- Projection push-down: return only the id, name columns from customers, not all columns.
- Aggregation push-down: compute partial SUM(amount) on each node before shuffling; final aggregation merges the partial sums.

These optimizations can reduce network traffic by 100x compared to naive approaches.

The heart of NewSQL's ACID guarantees is the distributed transaction protocol. NewSQL systems typically use an enhanced form of Two-Phase Commit (2PC) combined with consensus-based replication.
Two-Phase Commit Overview
The classic 2PC protocol coordinates transactions across multiple participants:
Phase 1 - Prepare:
1. Coordinator sends PREPARE to all participants (nodes holding affected data)
2. Each participant durably records its decision and replies VOTE-COMMIT or VOTE-ABORT

Phase 2 - Commit/Abort:
3. If all votes are COMMIT, coordinator sends COMMIT to all participants
4. If any vote is ABORT, coordinator sends ABORT to all participants
5. Participants apply or rollback changes, release locks
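The steps above can be sketched in a few lines of Python. This is an illustrative toy, with participants as in-process objects rather than networked nodes, but the vote-then-decide structure is the same:

```python
# Toy two-phase commit: the coordinator collects votes, then broadcasts the outcome.

class Participant:
    def __init__(self, name, will_commit=True):
        self.name = name
        self.will_commit = will_commit
        self.state = "IDLE"

    def prepare(self):               # Phase 1: durably stage changes, then vote
        self.state = "PREPARED"
        return "VOTE-COMMIT" if self.will_commit else "VOTE-ABORT"

    def finish(self, decision):      # Phase 2: apply or roll back, release locks
        self.state = decision

def two_phase_commit(participants):
    votes = [p.prepare() for p in participants]              # Phase 1
    decision = "COMMITTED" if all(v == "VOTE-COMMIT" for v in votes) else "ABORTED"
    for p in participants:                                   # Phase 2
        p.finish(decision)
    return decision

print(two_phase_commit([Participant("node-a"), Participant("node-b")]))
# COMMITTED
print(two_phase_commit([Participant("node-a"), Participant("node-b", will_commit=False)]))
# ABORTED
```

Note what this toy deliberately omits: if the coordinator crashed between the two phases, prepared participants would be stuck holding locks, which is exactly the failure mode the enhancements below address.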
NewSQL Enhancements to 2PC
Classic 2PC has known limitations—particularly, coordinator failure after sending PREPARE can leave participants in an uncertain state. NewSQL databases address these with several enhancements:
1. Consensus-Based Coordinator
The coordinator isn't a single node but a Raft group. If the coordinator leader fails, another replica takes over with full knowledge of the transaction state.
2. Parallel Commits
Modern NewSQL systems (like CockroachDB's parallel commits) optimize the commit path:
// Traditional 2PC: 2 consensus rounds
Round 1: Write intents to all participants
Round 2: Write transaction record as COMMITTED

// Parallel Commits: 1 consensus round
Round 1 (parallel):
  - Write intents to all participants
  - Write transaction record as STAGING

// Transaction is committed when:
// - Transaction record shows STAGING (or later COMMITTED)
// - AND all intents are successfully written

// Background: Eventually flip STAGING -> COMMITTED
// This is async and doesn't block the client

NewSQL systems typically default to serializable isolation—the strongest level. This prevents anomalies like write skew that weaker isolation allows. While serializable traditionally implied significant overhead, NewSQL implementations like serializable snapshot isolation (SSI) make it practical for most workloads.
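Returning to parallel commits: the implicit-commit rule can be captured as a tiny predicate. This is a sketch of the recovery-time status check with hypothetical structure names, not CockroachDB's actual code:

```python
# Sketch: under parallel commits, a transaction is implicitly committed once
# its record is STAGING and every intent write has succeeded.

def txn_status(record_state, intents):
    """record_state: 'STAGING' | 'COMMITTED' | 'ABORTED'
    intents: dict mapping key -> whether its intent write succeeded."""
    if record_state == "COMMITTED":
        return "COMMITTED"
    if record_state == "STAGING" and all(intents.values()):
        return "COMMITTED"       # implicitly committed; flip the record asynchronously
    if record_state == "STAGING":
        return "IN-PROGRESS"     # an intent is missing: recovery must decide
    return "ABORTED"

print(txn_status("STAGING", {"alice": True, "bob": True}))   # COMMITTED
print(txn_status("STAGING", {"alice": True, "bob": False}))  # IN-PROGRESS
```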
How data is distributed across nodes fundamentally affects performance, scalability, and operational characteristics. NewSQL systems employ sophisticated distribution strategies.
Range-Based Partitioning
Most NewSQL databases partition data into ranges (contiguous key spans):
-- Example: Users table partitioned by user_id

-- Range 1: user_id 0 - 999,999
--   - Stored on Nodes {A, B, C} (3 replicas)
--   - Leader: Node A

-- Range 2: user_id 1,000,000 - 1,999,999
--   - Stored on Nodes {B, C, D} (3 replicas)
--   - Leader: Node C

-- Range 3: user_id 2,000,000 - 2,999,999
--   - Stored on Nodes {A, D, E} (3 replicas)
--   - Leader: Node D

-- Query execution:
SELECT * FROM users WHERE user_id = 1500000;
-- Route directly to Range 2 leader (Node C)

SELECT * FROM users WHERE user_id BETWEEN 900000 AND 1100000;
-- Parallel scan: Range 1 AND Range 2

Automatic Range Splitting
As data grows, ranges are automatically split:

- When a range exceeds a size threshold (commonly on the order of hundreds of megabytes), it is split into two ranges at a midpoint key
- Hot ranges can also be split based on load, not just size
- Newly created ranges can then be moved to less-loaded nodes independently
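The splitting mechanic can be illustrated with a minimal sketch; the threshold and the (start, end, row_count) representation are simplifications for the example:

```python
# Hypothetical sketch of automatic range splitting: when a range's size
# exceeds a threshold, split it at its midpoint key into two ranges.

SPLIT_THRESHOLD = 4  # rows per range here; real systems use hundreds of MB

def maybe_split(ranges):
    """ranges: sorted list of (start_key, end_key, row_count) tuples."""
    out = []
    for start, end, count in ranges:
        if count > SPLIT_THRESHOLD and end - start > 1:
            mid = (start + end) // 2            # split at the midpoint key
            out.append((start, mid, count // 2))
            out.append((mid, end, count - count // 2))
        else:
            out.append((start, end, count))
    return out

ranges = [(0, 1000, 3), (1000, 2000, 6)]        # second range is oversized
print(maybe_split(ranges))
# [(0, 1000, 3), (1000, 1500, 3), (1500, 2000, 3)]
```

After a split, each half is an independent unit of replication and rebalancing, which is what lets the system move just the hot half elsewhere.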
Automatic Rebalancing
The system continuously monitors range distribution:

- Ranges (and range leaderships) are moved from overloaded nodes to underutilized ones
- When nodes are added, existing ranges migrate to them automatically
- When a node fails, its ranges are re-replicated from the surviving replicas
| Characteristic | Range-Based (NewSQL) | Hash-Based (Traditional Sharding) |
|---|---|---|
| Key ordering | Preserved (efficient range scans) | Destroyed (random distribution) |
| Hotspot handling | Automatic splitting | Manual resharding |
| Adding nodes | Automatic rebalancing | Full data redistribution |
| Range queries | Efficient (scan contiguous ranges) | Scatter-gather (all shards) |
| Point queries | O(log N) range lookup | O(1) hash lookup |
| Complexity | More sophisticated metadata | Simpler, predictable |
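The two routing schemes in the table can be sketched side by side, assuming hypothetical node names: range routing does an ordered O(log N) lookup over split points, so key order is preserved and a range scan touches only adjacent ranges, while hash routing scatters keys uniformly:

```python
import bisect

# Range-based routing: ordered split points, binary search per key
split_points = [1_000_000, 2_000_000]           # range boundaries
range_leaders = ["node-a", "node-c", "node-d"]  # leader per range

def route_range(key):
    return range_leaders[bisect.bisect_right(split_points, key)]

# Hash-based routing: uniform distribution, but key order is destroyed
def route_hash(key, n_shards=3):
    return f"shard-{hash(key) % n_shards}"

print(route_range(1_500_000))   # node-c  (single range leader)

# A range scan touches only the contiguous ranges it overlaps:
keys = range(900_000, 1_100_001, 50_000)
print(sorted({route_range(k) for k in keys}))   # ['node-a', 'node-c']

# The same scan under hashing can hit every shard (scatter-gather):
print(sorted({route_hash(k) for k in keys}))
```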
Interleaved Tables (Colocation)
For related tables accessed together, NewSQL databases offer interleaving (or colocation):
-- Parent table
CREATE TABLE customers (
id INT PRIMARY KEY,
name STRING
);
-- Child table interleaved with parent
CREATE TABLE orders (
id INT,
customer_id INT,
amount DECIMAL,
PRIMARY KEY (customer_id, id),
INTERLEAVE IN PARENT customers (customer_id)
);
With interleaving:

- Each customer's orders are stored physically adjacent to the parent customer row, in the same range
- A join of customers to their orders becomes a local, sequential read with no network hops
- Trade-off: very large child row sets can make ranges harder to split evenly
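Colocation falls out of key encoding: because orders are keyed by (customer_id, id), sorting the keyspace physically clusters each customer's orders right after the parent row. A minimal sketch with hypothetical tuple-encoded keys:

```python
# Sketch: interleaving via composite keys. Sorting the keyspace clusters
# each customer's orders directly after the parent customer row.

rows = [
    (("customers", 1), "Alice"),
    (("customers", 2), "Bob"),
    (("orders", 2, 10), 25.00),   # Bob's order
    (("orders", 1, 11), 99.50),   # Alice's order
    (("orders", 1, 12), 12.00),   # Alice's order
]

def interleaved_key(key):
    """Encode: parent key prefix first, then a parent/child marker."""
    if key[0] == "customers":
        return (key[1], 0)              # (customer_id, parent sorts first)
    return (key[1], 1, key[2])          # (customer_id, child marker, order_id)

for key, value in sorted(rows, key=lambda r: interleaved_key(r[0])):
    print(key, value)
# ('customers', 1) Alice       <- parent row
# ('orders', 1, 11) 99.5       <- her orders, physically adjacent
# ('orders', 1, 12) 12.0
# ('customers', 2) Bob
# ('orders', 2, 10) 25.0
```

Because the parent and its children occupy a contiguous key span, they end up in the same range, which is what makes the customer-orders join a local read.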
Maintaining consistency across distributed nodes requires sophisticated mechanisms for ordering operations and resolving conflicts.
Transaction Ordering with Timestamps
Every transaction receives a timestamp that determines its position in the global order. The challenge is generating globally consistent timestamps across multiple nodes without a central authority.
Approach 1: Hybrid Logical Clocks (HLC)
Used by CockroachDB and other systems without specialized hardware:
// Hybrid Logical Clock Structure
HLC = {
  physical: int64,  // Wall clock time in nanoseconds
  logical: int64    // Logical counter
}

// On local event (e.g., new transaction)
function localEvent(hlc):
  now = getCurrentWallTime()
  if now > hlc.physical:
    hlc.physical = now
    hlc.logical = 0
  else:
    hlc.logical += 1
  return hlc

// On receiving message with sender's HLC
function receiveEvent(hlc, senderHLC):
  now = getCurrentWallTime()
  if now > max(hlc.physical, senderHLC.physical):
    hlc.physical = now
    hlc.logical = 0
  else if hlc.physical == senderHLC.physical:
    hlc.logical = max(hlc.logical, senderHLC.logical) + 1
  else if hlc.physical > senderHLC.physical:
    hlc.logical += 1
  else:  // senderHLC.physical is max
    hlc.physical = senderHLC.physical
    hlc.logical = senderHLC.logical + 1
  return hlc

Approach 2: TrueTime (Google Spanner)
Google Spanner uses TrueTime—a globally synchronized time system based on GPS receivers and atomic clocks in each data center. TrueTime provides:

- TT.now(), which returns an interval [earliest, latest] guaranteed to contain the true absolute time
- A tightly bounded clock uncertainty, typically a few milliseconds
We'll explore TrueTime in detail in the Google Spanner page.
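Stepping back to Hybrid Logical Clocks: the update rules shown earlier translate directly to runnable Python. This is an illustrative sketch (wall time is injected so the run is deterministic), not CockroachDB's implementation; the property to observe is that timestamps never move backwards, even when a message arrives from a node whose clock runs ahead:

```python
# Runnable HLC sketch mirroring the pseudocode rules above.

class HLC:
    def __init__(self, now_fn):
        self.now = now_fn           # injected wall clock, for determinism
        self.physical = 0
        self.logical = 0

    def local_event(self):
        now = self.now()
        if now > self.physical:
            self.physical, self.logical = now, 0
        else:
            self.logical += 1       # wall clock didn't advance: bump counter
        return (self.physical, self.logical)

    def receive_event(self, sender):
        now = self.now()
        if now > max(self.physical, sender[0]):
            self.physical, self.logical = now, 0
        elif self.physical == sender[0]:
            self.logical = max(self.logical, sender[1]) + 1
        elif self.physical > sender[0]:
            self.logical += 1
        else:                        # sender's physical clock is ahead: adopt it
            self.physical, self.logical = sender[0], sender[1] + 1
        return (self.physical, self.logical)

clock = HLC(now_fn=lambda: 100)       # our wall clock is stuck at 100
print(clock.local_event())            # (100, 0)
print(clock.local_event())            # (100, 1)
print(clock.receive_event((250, 3)))  # (250, 4)  sender was ahead: adopt and bump
print(clock.local_event())            # (250, 5)  still monotonic
```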
Read Consistency Levels
NewSQL systems offer different read consistency options:
| Level | Guarantee | Performance | Use Case |
|---|---|---|---|
| Serializable | Reads reflect all prior committed transactions | May wait for uncertain transactions | Financial transactions, inventory |
| Read Committed | Reads see only committed data | Faster, no blocking | Most OLTP workloads |
| Stale Reads | Reads from snapshot N seconds ago | Fastest, any replica | Analytics, dashboards |
| Follower Reads | Reads from nearest replica (possibly stale) | Lowest latency, geo-distributed | User-facing reads where freshness isn't critical |
Applications can choose consistency levels per-query. A banking application might use serializable for balance checks but stale reads for transaction history display. This fine-grained control optimizes performance without sacrificing correctness where it matters.
Joins are where distributed SQL becomes complex. NewSQL systems employ multiple join strategies depending on data location and query characteristics.
Join Strategy Selection
The query optimizer evaluates:

- The estimated size of each input after filters are applied
- Where each table's data physically lives
- Which indexes are available on the join columns
-- Example: Join between orders and products

-- STRATEGY 1: Lookup Join (when orders is filtered)
SELECT o.id, p.name
FROM orders o
JOIN products p ON o.product_id = p.id
WHERE o.date = '2024-01-15';
-- Few orders match date filter
-- For each order, lookup product by id using index
-- Minimal data movement

-- STRATEGY 2: Broadcast Join (when products is small)
SELECT o.id, p.name
FROM orders o
JOIN products p ON o.product_id = p.id;
-- Products table is small (10,000 rows)
-- Broadcast entire products table to all order nodes
-- Each node joins locally

-- STRATEGY 3: Hash Join (when both tables are large)
SELECT o.id, u.name, SUM(o.amount)
FROM orders o
JOIN users u ON o.user_id = u.id
GROUP BY o.id, u.name;
-- Both tables are large, no useful indexes
-- Shuffle both tables by user_id
-- Perform hash join after co-location

Join Ordering Optimization
For multi-table joins, order matters enormously. A query joining 5 tables has 5! = 120 possible join orderings. The optimizer must:

- Estimate the cardinality of each intermediate result
- Prefer orderings that shrink intermediate results early
- Account for where each intermediate result must be shipped over the network
NewSQL optimizers use dynamic programming (similar to traditional optimizers) but with additional cost models for network transfer and distributed coordination.
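The shuffle hash join (Strategy 3 above) can be sketched as follows: both inputs are partitioned by the join key so matching rows land on the same worker, then each worker runs an ordinary hash join locally. Worker count and row layout here are illustrative:

```python
# Sketch of a distributed hash join: shuffle both inputs by join key,
# then hash-join locally on each worker.

from collections import defaultdict

def shuffle(rows, key_fn, n_workers):
    parts = defaultdict(list)
    for row in rows:
        parts[key_fn(row) % n_workers].append(row)  # co-locate by join key
    return parts

def local_hash_join(orders, users):
    by_user = {u["id"]: u["name"] for u in users}   # build side
    return [(o["id"], by_user[o["user_id"]], o["amount"])
            for o in orders if o["user_id"] in by_user]  # probe side

orders = [{"id": 1, "user_id": 7, "amount": 10},
          {"id": 2, "user_id": 8, "amount": 20},
          {"id": 3, "user_id": 7, "amount": 30}]
users = [{"id": 7, "name": "alice"}, {"id": 8, "name": "bob"}]

N = 2
o_parts = shuffle(orders, lambda o: o["user_id"], N)
u_parts = shuffle(users, lambda u: u["id"], N)

result = sorted(row for w in range(N)
                for row in local_hash_join(o_parts[w], u_parts[w]))
print(result)  # [(1, 'alice', 10), (2, 'bob', 20), (3, 'alice', 30)]
```

The shuffle step is the expensive part: both tables cross the network once, which is why the optimizer only picks this strategy when neither a lookup join nor a broadcast join is cheaper.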
The best join is one you never perform across nodes. Designing schemas with colocation in mind—interleaving child tables, choosing partition keys that match common join keys—can eliminate distributed join overhead for the most critical queries.
Secondary indexes enable efficient queries on non-primary-key columns. In distributed databases, implementing them correctly is non-trivial.
The Challenge
Consider a Users table partitioned by user_id (primary key):
CREATE TABLE users (
user_id INT PRIMARY KEY,
email STRING UNIQUE,
region STRING,
created_at TIMESTAMP
);
A query by email (WHERE email = 'alice@example.com') can't use the partition structure—the email could belong to any user_id, on any node.
Solution 1: Global Indexes
Create a separate, independently partitioned index:
// Global Index Structure for 'email'

// Index is partitioned by email (not user_id)
// Range A: emails 'a...' to 'm...' -> Nodes {1, 2, 3}
// Range B: emails 'n...' to 'z...' -> Nodes {4, 5, 6}

// Index entry format:
// { email: 'alice@example.com', user_id: 12345 }

// Query: WHERE email = 'alice@example.com'
// 1. Hash/range lookup in global email index
// 2. Find user_id = 12345
// 3. Fetch user row from users table using user_id

// Trade-offs:
// + Fast reads (single index lookup)
// - Writes must update both table AND index (cross-node transaction)
// - Index updates increase transaction latency

Solution 2: Local Indexes
Each node indexes only its own data:
Node 1 (users 0-999):
Local email index: { 'alice@example.com' -> 50 }
Node 2 (users 1000-1999):
Local email index: { 'bob@example.com' -> 1500 }

With local indexes, writes stay on a single node, but a query by email must fan out to every node (scatter-gather), since any node could hold the match.
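The read-path difference between the two solutions is fan-out. A hypothetical sketch: with a global index, one lookup answers an email query; with local indexes, every node must be consulted in the worst case:

```python
# Sketch: users partitioned by user_id across nodes; compare lookup fan-out
# for a global email index vs per-node local indexes.

nodes = {
    "node-1": {50: "alice@example.com"},      # user_id -> email
    "node-2": {1500: "bob@example.com"},
    "node-3": {2500: "carol@example.com"},
}

# Global index: one separately partitioned map from email -> user_id
global_index = {email: uid for data in nodes.values() for uid, email in data.items()}

def lookup_global(email):
    return global_index.get(email), 1          # (result, nodes contacted)

def lookup_local(email):
    contacted = 0
    for data in nodes.values():                # scatter to every node
        contacted += 1
        for uid, e in data.items():
            if e == email:
                return uid, contacted          # may stop early; worst case = all
    return None, contacted

print(lookup_global("bob@example.com"))        # (1500, 1)
print(lookup_local("carol@example.com"))       # (2500, 3)
```

The write-path trade-off is the mirror image: the global index makes every indexed write a cross-node transaction, while local index updates stay on the node that owns the row.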
Solution 3: Covering Indexes
Include additional columns in the index to avoid table lookup:
-- Covering index includes non-key columns
CREATE INDEX idx_users_email_covering
ON users (email)
STORING (name, region);

-- Query that can be answered from index alone:
SELECT name, region
FROM users
WHERE email = 'alice@example.com';

-- Execution:
-- 1. Lookup in global email index
-- 2. Index entry contains name and region
-- 3. No secondary fetch from users table needed

-- Trade-off: Larger index, but faster reads

| Index Type | Read Performance | Write Performance | Storage Cost |
|---|---|---|---|
| Global Index | O(1) lookup | Cross-node transaction | Separate replication |
| Local Index | Scatter to all nodes | Local update only | Stored with table data |
| Covering Index | No table lookup | Larger updates | Higher storage for extra columns |
| Partial Index | O(1) for filtered rows | Only indexed rows updated | Smaller index size |
NewSQL's ultimate promise is linear scalability—doubling nodes should approximately double throughput. Let's examine how this works and its practical limits.
Scaling Dimensions
NewSQL databases scale along multiple dimensions:

- Read throughput: add replicas and serve reads from followers
- Write throughput: add nodes; each new node takes leadership of some ranges
- Storage capacity: ranges rebalance onto new nodes automatically
- Geography: place replicas near users for lower latency
Factors That Limit Linear Scaling
Real-world workloads may not scale perfectly linearly due to:

1. Hotspots: skewed access concentrates load on a single range leader, which added nodes cannot absorb
2. Cross-Shard Transactions: each multi-node transaction pays coordination overhead that grows with the number of participants
3. Global Indexes: every indexed write becomes a cross-node transaction between the table and the index
4. Large Scans: full-table operations consume resources on every node, competing with point queries
| Workload Pattern | Scaling Behavior | Optimization Approach |
|---|---|---|
| Point reads by primary key | Near-perfect linear | Increase replica count |
| Point writes to distributed keys | Near-linear | Proper partition key selection |
| Range scans with partition filter | Linear within partition | Ensure filter uses partition key |
| Full table scans | Parallel but fixed overhead | Add covering indexes, materialized views |
| Multi-shard transactions | Sub-linear (coordination overhead) | Schema colocation, batch operations |
| Hotspot writes (single key) | No scaling (single leader) | Redesign schema, use sequences |
Production NewSQL deployments regularly demonstrate near-linear scaling. CockroachDB benchmarks show 1 million+ writes/second across 256 nodes. Google Spanner handles Google's advertising infrastructure—billions of transactions daily. The technology delivers on its promise for well-designed workloads.
We've explored the deep technical mechanisms that enable NewSQL databases to provide SQL semantics at distributed scale. Let's consolidate the key insights:

- Distributed query planning pushes filters, projections, and partial aggregations down to storage nodes to minimize data movement
- Transactions combine Two-Phase Commit with consensus-replicated coordinators, and optimizations like parallel commits cut commit latency
- Range-based partitioning with automatic splitting and rebalancing keeps data evenly distributed as clusters grow
- Hybrid Logical Clocks or TrueTime provide globally consistent transaction ordering without a central authority
- Join strategies (lookup, broadcast, hash) and index designs (global, local, covering) trade off read cost, write cost, and data movement
What's Next
With the architectural foundations understood, we'll now examine specific NewSQL implementations in detail.
You now understand how NewSQL databases achieve the combination of SQL semantics and horizontal scalability. The distributed query processing, transaction protocols, and data distribution strategies work together to deliver what was once considered impossible. Next, we'll see these concepts in action with Google Spanner.