Every successful database eventually faces a defining moment: the point where vertical scaling (adding more CPU, RAM, or faster storage to a single machine) can no longer satisfy growing demands. At this inflection point, organizations confront a fundamental architectural question—how do we continue scaling when we've maximized what a single server can deliver?
The answer, refined over decades of distributed systems research and battle-tested at companies like Google, Facebook, Amazon, and Netflix, is sharding—the practice of partitioning data horizontally across multiple independent database nodes. Sharding transforms a database from a monolithic bottleneck into a horizontally scalable distributed system capable of handling petabytes of data and millions of operations per second.
But sharding is not merely a technical implementation; it is an architectural commitment that fundamentally changes how applications interact with data, how failures propagate, and how the entire system evolves over time.
By the end of this page, you will understand what sharding is at its core, why it exists, how it differs from other scaling approaches, the fundamental terminology and concepts, and the architectural patterns that define sharded systems. You'll gain the foundational knowledge required to evaluate, design, and implement sharding strategies for large-scale distributed databases.
Sharding is a database architecture pattern that partitions data horizontally across multiple independent database instances, called shards. Each shard holds a distinct subset of the total data, and together, the shards comprise the complete dataset. Unlike replication (which creates copies of the same data across nodes), sharding divides the data—each record exists on exactly one shard.
Formal Definition:
Sharding is the horizontal partitioning of a database such that data is distributed across multiple autonomous database instances based on a deterministic function of one or more attributes (the shard key), enabling the system to scale storage capacity and throughput linearly with the number of shards.
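To make the definition concrete, here is a minimal sketch of a deterministic partition function (not any particular database's implementation; the shard count and key format are hypothetical). The essential property is that the same shard key value always maps to the same shard, so every record has exactly one home.

```python
# Minimal sketch of a deterministic, hash-based partition function.
# A stable hash (here MD5) is used so the mapping survives process
# restarts, unlike Python's built-in hash() for strings, which is salted.
import hashlib

NUM_SHARDS = 8  # hypothetical cluster size

def shard_for(shard_key: str) -> int:
    """Map a shard key value to a shard ID in [0, NUM_SHARDS)."""
    digest = hashlib.md5(shard_key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# The same key always routes to the same shard:
assert shard_for("user:42") == shard_for("user:42")
print(shard_for("user:42"), shard_for("user:43"))
```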
The critical insight is that sharding allows databases to scale out rather than scale up. Instead of buying a more powerful single server (which has physical and economic limits), you add more commodity servers, each handling a portion of the data and workload.
| Approach | Mechanism | Data Distribution | Primary Benefit | Limitation |
|---|---|---|---|---|
| Vertical Scaling | Upgrade single server hardware | All data on one node | Simplicity—no distribution complexity | Physical and cost limits of single machine |
| Replication | Copy data across multiple nodes | Same data on all replicas | Read scalability and fault tolerance | Write bottleneck—all writes go to primary |
| Sharding | Partition data across nodes | Different data on each shard | Read AND write scalability | Increased complexity, cross-shard operations |
| Sharding + Replication | Partition and replicate each shard | Each shard is independently replicated | Full scalability and fault tolerance | Maximum complexity, but maximum capability |
In production systems, sharding and replication are typically combined. Each shard is independently replicated (usually primary-replica or multi-primary), providing both horizontal scalability through sharding and fault tolerance through replication. This combination is the foundation of virtually all large-scale distributed databases.
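As a rough illustration of that combination (the hostnames and layout below are hypothetical), a shard map for a sharded-and-replicated cluster describes each shard as a small replica set: one primary that accepts writes plus replicas that serve reads and provide failover.

```python
# Hypothetical shard map for a cluster combining sharding with
# primary-replica replication: each shard is its own replica set.
shard_map = {
    0: {"primary": "db-shard0-a:5432",
        "replicas": ["db-shard0-b:5432", "db-shard0-c:5432"]},
    1: {"primary": "db-shard1-a:5432",
        "replicas": ["db-shard1-b:5432", "db-shard1-c:5432"]},
}

def write_endpoint(shard_id: int) -> str:
    # Writes always go to the owning shard's primary.
    return shard_map[shard_id]["primary"]

def read_endpoint(shard_id: int) -> str:
    # Reads can be served by any replica of the owning shard.
    replicas = shard_map[shard_id]["replicas"]
    return replicas[0] if replicas else shard_map[shard_id]["primary"]
```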
Understanding why sharding exists requires examining the fundamental constraints that databases face at scale. These constraints are not arbitrary—they arise from physics, economics, and the mathematical properties of distributed systems.
The Four Pillars of Sharding Motivation:
The Economics of Scale:
Beyond technical constraints, sharding is driven by economics. There is a non-linear cost curve for high-end hardware—a server with 2x the capacity costs significantly more than 2 servers with 1x capacity each. At extreme scales, vertical scaling becomes prohibitively expensive or simply impossible.
Consider a practical example:
| Requirement | Vertical Approach | Sharded Approach |
|---|---|---|
| 10TB storage | Single high-end server: $50,000 | 10 commodity servers × 1TB: $20,000 |
| 100TB storage | Specialized hardware: $500,000+ | 100 commodity servers: $200,000 |
| 1PB storage | Not feasible with single server | 1,000 commodity servers: ~$2,000,000 |
Sharding doesn't just enable scale—it makes large-scale data systems economically viable.
In cloud environments, sharding aligns perfectly with the pay-as-you-go model. You can start with a small number of shards and add more as data grows. Cloud instance types make horizontal scaling straightforward—spin up more instances rather than migrating to a larger (and more expensive) instance type.
Before diving deeper into sharding strategies and implementations, we must establish precise terminology. These terms form the vocabulary that database architects, engineers, and operators use when designing and discussing sharded systems.
Essential Sharding Terms:
| Term | Definition | Example |
|---|---|---|
| Shard | A single horizontal partition containing a subset of the total data. Each shard is an independent database instance. | Shard 1 contains users A-M; Shard 2 contains users N-Z. |
| Shard Key | The attribute(s) used to determine which shard holds a given record. The shard key is the foundation of the partitioning strategy. | Using user_id as shard key to partition user data. |
| Partition Function | The algorithm that maps shard key values to specific shards. Common functions include range, hash, and directory-based. | shard_id = hash(user_id) % num_shards |
| Shard Map / Routing Table | Metadata that tracks which shards exist and what data each contains. Essential for query routing. | Shard 1: user_id 1-1000; Shard 2: user_id 1001-2000. |
| Shard-Local Query | A query that can be satisfied by accessing only a single shard. These are the most efficient queries. | SELECT * FROM orders WHERE user_id = 42 (if user_id is shard key) |
| Cross-Shard Query | A query that must access multiple shards and combine results. These are expensive and should be minimized. | SELECT COUNT(*) FROM users WHERE signup_date > '2024-01-01' (requires all shards) |
| Scatter-Gather | The pattern of sending a query to multiple shards (scatter) and combining results (gather). Used for cross-shard operations. | Query all shards for top 10 users by score, then merge results. |
| Router / Query Router | A component that receives queries, determines which shard(s) to contact, and routes requests accordingly. | MongoDB's mongos, Vitess's VTGate. |
| Hotspot | A shard that receives disproportionately more traffic than others, creating a bottleneck. | All new users routed to Shard 1 due to sequential IDs. |
| Resharding | The process of changing the sharding scheme, typically to add shards or rebalance data distribution. | Doubling shard count from 16 to 32 to handle growth. |
| Colocation | Ensuring related data (e.g., a user and their orders) resides on the same shard to enable efficient local joins. | users and orders tables both sharded by user_id. |
Different database systems use different terminology. MySQL calls them 'partitions,' some systems call them 'chunks,' and others use 'tablets' or 'regions.' The concepts are equivalent, but always verify terminology in the context of your specific database system.
Sharded systems can be organized in several architectural patterns, each with distinct tradeoffs. Understanding these patterns is essential for designing systems that meet specific requirements for scalability, complexity, and operational characteristics.
Pattern 1: Client-Aware Sharding
In this architecture, the application is aware of the sharding scheme and directly connects to the appropriate shard for each query. The application embeds the partition function and shard map.
```python
# Client-Aware Sharding Example
class ShardedDatabaseClient:
    def __init__(self, shard_connections: dict):
        """
        Initialize with direct connections to all shards.
        shard_connections: {shard_id: connection_string}
        """
        self.shards = {}
        for shard_id, conn_string in shard_connections.items():
            self.shards[shard_id] = create_connection(conn_string)
        self.num_shards = len(shard_connections)

    def get_shard_id(self, user_id: int) -> int:
        """
        Partition function: hash-based sharding by user_id.
        This logic is embedded in the application.
        """
        return hash(user_id) % self.num_shards

    def get_user(self, user_id: int) -> dict:
        """
        Direct routing to the correct shard.
        """
        shard_id = self.get_shard_id(user_id)
        connection = self.shards[shard_id]
        return connection.query(f"SELECT * FROM users WHERE id = {user_id}")

    def update_user(self, user_id: int, data: dict):
        """
        Write goes directly to the owning shard.
        """
        shard_id = self.get_shard_id(user_id)
        connection = self.shards[shard_id]
        return connection.execute(
            f"UPDATE users SET ... WHERE id = {user_id}", data
        )
```

Pattern 2: Proxy-Based Sharding
A dedicated proxy or query router sits between applications and shards. The proxy understands the sharding scheme, routes queries, and handles scatter-gather for cross-shard queries. Applications treat the proxy as a single database endpoint.
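A rough sketch of the routing decision such a proxy makes (the class and method names are illustrative, not any specific product's API): if the query filters on the shard key, route it to a single shard; otherwise scatter it to all shards and gather the results.

```python
# Hypothetical query router: applications see one endpoint, the router
# decides which shard(s) actually execute the query.
class QueryRouter:
    def __init__(self, shards: dict):
        self.shards = shards            # {shard_id: connection}
        self.num_shards = len(shards)

    def route(self, sql: str, shard_key_value=None):
        if shard_key_value is not None:
            # Shard-local: only the owning shard is contacted.
            # (A production router would use a stable hash, not hash().)
            shard_id = hash(shard_key_value) % self.num_shards
            return self.shards[shard_id].query(sql)
        # Cross-shard: scatter to every shard, gather and combine results.
        results = []
        for conn in self.shards.values():
            results.extend(conn.query(sql))
        return results
```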
Pattern 3: Database-Native Sharding
Some database systems have sharding built into the core database engine. Examples include CockroachDB, Google Spanner, TiDB, and YugabyteDB. These systems handle shard management, routing, rebalancing, and distributed transactions internally.
| Architecture Pattern | Example Systems | Best For |
|---|---|---|
| Client-Aware | Custom implementations, early-stage sharding | Simple use cases, maximum performance |
| Proxy-Based | Vitess (MySQL), Citus (PostgreSQL), ProxySQL | Existing databases needing sharding |
| Database-Native | CockroachDB, Spanner, TiDB, YugabyteDB | Greenfield projects, automatic management |
The industry is moving toward database-native sharding solutions. Systems like CockroachDB and TiDB provide the SQL interface applications expect while handling all sharding complexity internally. This dramatically reduces operational burden and eliminates a whole category of errors related to application-level sharding logic.
The shard key is the single most important decision in a sharding implementation. A poorly chosen shard key can doom a system to hotspots, expensive cross-shard queries, and operational nightmares. A well-chosen shard key enables linear scalability, efficient queries, and graceful growth.
Shard Key Properties:
An ideal shard key has three critical properties:
- High cardinality: the key must have many distinct values so data can be spread across many shards.
- Even distribution: values should spread data and traffic uniformly across shards, avoiding hotspots.
- Query alignment: the key should match the dominant query patterns; if most queries filter by user_id, then user_id should be the shard key or part of it. This enables shard-local queries.

Compound Shard Keys:
When a single attribute doesn't satisfy all requirements, compound shard keys combine multiple attributes:
shard_key = (tenant_id, user_id)
This approach is common in multi-tenant systems where data must be isolated per tenant but also distributed within each tenant. The partition function might use:
shard_id = hash(tenant_id, user_id) % num_shards
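A minimal sketch of such a compound-key partition function (the function name and shard count are illustrative): the two attributes are combined before hashing, so any given (tenant, user) pair always routes to the same shard.

```python
# Sketch: compound shard key (tenant_id, user_id) hashed to a shard ID.
import hashlib

NUM_SHARDS = 16  # hypothetical

def shard_for_tenant_user(tenant_id: int, user_id: int) -> int:
    """Deterministically map a (tenant_id, user_id) pair to a shard."""
    key = f"{tenant_id}:{user_id}".encode("utf-8")
    return int(hashlib.md5(key).hexdigest(), 16) % NUM_SHARDS
```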
Shard Key Anti-Patterns:
| Anti-Pattern | Problem | Example |
|---|---|---|
| Low Cardinality | Limited distribution | Sharding by country (only ~200 values) |
| Monotonic Keys | All new data goes to one shard | Sharding by auto-increment id or created_at |
| Query Mismatch | Most queries become cross-shard | Sharding by user_id but querying by product_id |
| Mutable Keys | Records must be moved between shards | Sharding by status that changes over time |
Once chosen, the shard key is extremely difficult to change. Changing it requires resharding the entire dataset—a complex, risky, and expensive operation. Invest significant upfront effort in shard key design. Consult with team members, analyze query patterns deeply, and plan for future access patterns before committing.
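One way to sanity-check a candidate shard key against the anti-patterns above, sketched below with synthetic data, is to run the partition function over a sample of real key values and see how evenly records land on each shard.

```python
# Sketch: measure how evenly a candidate shard key distributes records.
from collections import Counter
import hashlib

NUM_SHARDS = 16  # hypothetical

def shard_for(value: str) -> int:
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16) % NUM_SHARDS

# Replace with a sample of real key values from production.
sample_keys = [f"user-{i}" for i in range(100_000)]

counts = Counter(shard_for(k) for k in sample_keys)
average = len(sample_keys) / NUM_SHARDS
largest = max(counts.values())
print(f"largest shard holds {largest / average:.2f}x the average load")  # ~1.0 is ideal
```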
Sharding is a powerful tool, but it's not free. Every benefit comes with corresponding costs. Understanding these tradeoffs is essential for making informed architectural decisions.
What Sharding Gives You:

- Near-linear scaling of storage capacity and throughput as shards are added
- Write scalability, not just read scalability, because each shard accepts its own writes
- The ability to build very large systems from commodity hardware at a manageable cost
The Cross-Shard Query Problem:
The most significant operational reality of sharding is the cross-shard query cost. Consider a query like:
SELECT user_id, SUM(amount)
FROM orders
GROUP BY user_id
ORDER BY SUM(amount) DESC
LIMIT 10;
If orders is sharded by user_id, this query:

- Must be sent to every shard, because the top spenders could live on any of them
- Requires each shard to compute totals for its own users and return its local top results
- Needs a gather step that merges the per-shard results, re-sorts them, and applies the LIMIT to produce the global top 10
For a single query, this is manageable. But if this pattern appears in hot code paths, performance degrades rapidly. Good shard key design minimizes cross-shard queries for common access patterns.
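A sketch of that scatter-gather flow in application code (the connection objects and their query method are placeholders): each shard returns its local top users, and the application merges, re-sorts, and truncates.

```python
# Sketch of scatter-gather for a cross-shard "top 10 users by spend" query.
# `shards` is assumed to be {shard_id: connection}; each connection.query()
# is assumed to return rows as (user_id, total_amount) tuples.
import heapq

SQL = """
    SELECT user_id, SUM(amount) AS total
    FROM orders
    GROUP BY user_id
    ORDER BY total DESC
    LIMIT 10
"""

def top_spenders(shards: dict, limit: int = 10):
    partial = []
    for conn in shards.values():          # scatter: every shard runs the query
        partial.extend(conn.query(SQL))   # each returns its local top rows
    # gather: merge the local winners, re-sort, and apply the global limit
    return heapq.nlargest(limit, partial, key=lambda row: row[1])
```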
Sharding adds significant complexity. If your data fits comfortably on a single server (< 1TB typically), if your query patterns require many cross-table joins, or if your traffic can be handled by read replicas, consider NOT sharding. Many systems over-shard prematurely, incurring complexity costs without corresponding benefits.
Understanding how major technology companies implement sharding provides valuable insights into practical design decisions. These examples demonstrate that sharding strategies vary based on data characteristics, access patterns, and business requirements.
Example 1: Instagram (Facebook)
Instagram shards user data by user_id. All data associated with a user—posts, likes, comments, followers—lives on the same shard. This design enables shard-local queries for the common access pattern: 'show me everything about user X.'
Example 2: Slack
Slack shards by team_id (workspace). All messages, channels, and members for a workspace are colocated. This aligns with Slack's access pattern where users primarily interact within a single workspace.
Example 3: Uber
Uber's geolocation data is sharded by geographic region. Cities or regions map to specific shards. This aligns with the locality of ride data—a ride in New York never needs to query data from Tokyo.
| Company | Shard Key | Strategy | Shard Count | Database |
|---|---|---|---|---|
| Instagram | user_id | Hash-based | Thousands | PostgreSQL + custom sharding |
| Slack | team_id | Directory-based | Hundreds | Vitess (MySQL) |
| Uber | Geographic region | Range-based | Thousands | Custom + Cassandra |
| | user_id | Hash-based | Thousands | MySQL + custom sharding |
| Notion | workspace_id | Hash-based | Growing | PostgreSQL + Citus |
Notice the common pattern: shard key aligns with the primary unit of isolation in the application domain. For consumer apps, it's often user/account. For B2B apps, it's tenant/workspace. For geographic services, it's location. The shard key reflects the natural boundaries of data access.
We've established the foundational understanding of database sharding. Let's consolidate the key takeaways before exploring specific sharding strategies in subsequent pages:

- Sharding partitions data horizontally across independent shards, scaling both reads and writes, and in production it is almost always combined with per-shard replication.
- The shard key and partition function determine where every record lives; choosing them well is the single most important sharding decision.
- Shard-local queries are cheap; cross-shard (scatter-gather) queries are expensive and should be minimized by design.
- Sharded systems follow one of three broad architecture patterns: client-aware, proxy-based, or database-native.
What's Next:
Now that we understand what sharding is and why it exists, we'll dive into the most critical aspect: shard key selection. The next page explores how to choose shard keys, analyze workloads, and avoid the common pitfalls that lead to hotspots and cross-shard bottlenecks.
You now understand the foundational concepts of database sharding—what it is, why it exists, and the core terminology and architecture patterns. This knowledge forms the basis for understanding shard key selection, partitioning strategies, and resharding operations covered in subsequent pages.