A single MongoDB server can handle impressive workloads—millions of documents, thousands of operations per second. But production systems demand more than raw performance. They require high availability (the database must remain operational when servers fail) and horizontal scalability (the system must grow beyond what any single machine can handle).
MongoDB addresses these requirements through two complementary mechanisms: replica sets, which keep redundant copies of data and fail over automatically for high availability, and sharded clusters, which partition data across multiple replica sets for horizontal scalability.
Understanding these architectures isn't optional for production MongoDB deployments. Misconfigured replica sets lead to data loss during failures. Poorly designed shard keys create performance nightmares that require application rewrites to fix. This page equips you to design and operate MongoDB clusters correctly from day one.
By the end of this page, you will understand replica set architecture including election mechanics and failover behavior, configure read/write concerns for your consistency requirements, design sharding strategies that distribute load evenly, and recognize operational patterns that prevent common cluster failures.
A replica set is a group of MongoDB processes (called mongod instances) that maintain the same data set. Replica sets provide redundancy and high availability, and they form the foundation of all MongoDB production deployments.
Core Replica Set Concepts:

- Primary: the single member that accepts all writes; clients direct write operations here.
- Secondaries: members that continuously replicate the primary's operations and can serve reads.
- Oplog: the operation log each member keeps; secondaries apply the primary's oplog entries to stay in sync.
- Heartbeats: members ping each other every 2 seconds to detect failures.
- Elections: when the primary becomes unreachable, the remaining members automatically vote in a new primary.
Standard Replica Set Topology
The minimum recommended production deployment is a 3-member replica set: one primary accepts all writes while two secondaries replicate them asynchronously. This provides fault tolerance for one member failure while maintaining a majority for elections.
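As a concrete starting point, here is a minimal mongosh sketch for bringing up such a 3-member set. The hostnames and replica set name are placeholders; each mongod must already be running with a matching --replSet option.

```javascript
// Minimal sketch: initiate a 3-member replica set (hostnames are hypothetical)
rs.initiate({
  _id: "rs0",  // Must match the --replSet name the mongod processes use
  members: [
    { _id: 0, host: "mongo-1.example.com:27017" },
    { _id: 1, host: "mongo-2.example.com:27017" },
    { _id: 2, host: "mongo-3.example.com:27017" }
  ]
});

// Verify: one member should report PRIMARY, the other two SECONDARY
rs.status().members.forEach(m => print(m.name, m.stateStr));
```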
The Oplog: Heart of Replication
The oplog (operation log) is a special capped collection in the local database that records every operation that modifies data. Understanding the oplog is crucial for operational awareness:
```javascript
// Connect to MongoDB and examine the oplog
use local

// View oplog status
db.oplog.rs.stats()
// Returns: size, maxSize, count of operations

// Sample the most recent oplog entries
db.oplog.rs.find().sort({ $natural: -1 }).limit(5)

// Example oplog entry for an insert:
{
  "ts": Timestamp(1705312200, 1),  // Timestamp + increment (unique per second)
  "t": NumberLong(1),              // Term (election epoch)
  "h": NumberLong("123456789"),    // Unique operation hash
  "v": 2,                          // Oplog version
  "op": "i",                       // Operation type: insert
  "ns": "mydb.users",              // Namespace (database.collection)
  "ui": UUID("..."),               // Collection UUID
  "o": {                           // Operation document
    "_id": ObjectId("..."),
    "name": "Alice",
    "email": "alice@example.com"
  }
}

// Operation types:
// "i" - insert
// "u" - update
// "d" - delete
// "c" - command (createCollection, dropCollection, etc.)
// "n" - no-op (heartbeat, used to advance replication)

// Example update oplog entry:
{
  "ts": Timestamp(1705312201, 1),
  "op": "u",
  "ns": "mydb.users",
  "o2": { "_id": ObjectId("...") },  // Query to find the document
  "o": {                             // The update operation
    "$v": 2,
    "diff": { "u": { "email": "newemail@example.com" } }
  }
}
```

The oplog is a capped collection with a fixed size. Once full, the oldest entries are overwritten. If a secondary falls too far behind (further than the oplog retains), it cannot catch up and requires a full resync. Size your oplog based on your write volume (typically 5-50 GB), and monitor replication lag relative to the oplog window.
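To act on that advice, you can check the oplog window (how much time the oplog currently covers) directly in mongosh. A short sketch:

```javascript
// Sketch: check how much time the oplog currently covers

// Human-readable summary: oplog size, used space, and time window
rs.printReplicationInfo()

// Programmatic access to the same numbers
const info = db.getReplicationInfo();
print("Oplog size (MB): " + info.logSizeMB);
print("Window (hours):  " + info.timeDiffHours);

// Per-member replication lag relative to the primary
rs.printSecondaryReplicationInfo()
```

If a secondary's lag ever approaches the window reported here, it is at risk of needing a full resync.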
When the primary becomes unavailable, the replica set automatically elects a new primary. This process is designed to complete quickly (typically 10-30 seconds) and requires no human intervention.
Election Triggers:

- The primary misses heartbeats and becomes unreachable to other members (after the election timeout, 10 seconds by default).
- An administrator steps the primary down with rs.stepDown() or triggers a reconfiguration with rs.reconfig().
- A network partition isolates the primary from a majority of voting members; the primary demotes itself to a secondary.
Election Mechanics
MongoDB uses a Raft-like consensus protocol with some modifications. The election process ensures that:

- At most one primary exists at any time.
- The new primary holds all writes that were acknowledged by a majority, so no majority-committed write is lost.
- A candidate can only win with votes from a majority of the voting members.
Here's how an election proceeds (the sketch below shows how to inspect the outcome):

1. A secondary notices the primary is unreachable after missing heartbeats for the election timeout.
2. It increments the term (the election epoch), declares itself a candidate, and requests votes.
3. Each member votes for the candidate if the candidate's oplog is at least as up to date as its own and it has not already voted in this term.
4. If the candidate gathers votes from a majority of voting members, it becomes the new primary.
5. The other members begin syncing from the new primary; when the old primary rejoins, any writes it never replicated are rolled back.
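You can observe the result of an election with rs.status(). A small mongosh sketch:

```javascript
// Sketch: inspect election state after a failover
const status = rs.status();

// The term increments with every election
print("Current term:", status.term);

// Which member is primary, and when it was elected
status.members.forEach(m => {
  if (m.stateStr === "PRIMARY") {
    print("Primary:", m.name, "elected at", m.electionDate);
  } else {
    print(m.stateStr + ":", m.name);
  }
});
```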
Understanding Majority
The "majority" requirement is critical and often misunderstood:
| Total Voting Members | Majority Needed | Tolerable Failures | Notes |
|---|---|---|---|
| 1 (standalone) | 1 | 0 | No redundancy; not recommended for production |
| 2 | 2 | 0 | If one fails, no majority—avoid this config |
| 3 | 2 | 1 | Minimum recommended production deployment |
| 4 | 3 | 1 | Same fault tolerance as 3; wastes a node |
| 5 | 3 | 2 | Good for geographically distributed deployments |
| 7 | 4 | 3 | Maximum recommended; election complexity increases |
Always deploy an odd number of voting members. With 4 members, a 2-2 split (network partition) leaves neither side with a majority—the cluster becomes read-only. With 5 members, a 2-3 split still has a majority. If you need 4 data-bearing nodes, make one a non-voting member or add an arbiter.
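If you do need that fourth data-bearing node, here is a configuration sketch (hostnames hypothetical; MongoDB requires that a non-voting member also has priority: 0):

```javascript
// Sketch: four data-bearing members, but only three voting members
rs.initiate({
  _id: "rs0",
  members: [
    { _id: 0, host: "mongo-1.example.com:27017" },
    { _id: 1, host: "mongo-2.example.com:27017" },
    { _id: 2, host: "mongo-3.example.com:27017" },
    // Full data copy, can serve reads, but never votes or becomes primary
    { _id: 3, host: "mongo-4.example.com:27017", votes: 0, priority: 0 }
  ]
});
```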
Member Priority and Election Preference
You can influence which members are preferred as primary using priority settings:
```javascript
// Replica set configuration with priorities
rs.initiate({
  _id: "myReplicaSet",
  members: [
    // High priority - preferred primary (e.g., a powerful server)
    { _id: 0, host: "mongo-primary.example.com:27017", priority: 10 },

    // Normal priority - can become primary
    { _id: 1, host: "mongo-secondary1.example.com:27017", priority: 5 },

    // Low priority - backup, rarely becomes primary
    { _id: 2, host: "mongo-secondary2.example.com:27017", priority: 1 },

    // Zero priority - can never become primary (DR site, analytics)
    { _id: 3, host: "mongo-analytics.example.com:27017", priority: 0 },

    // Arbiter - votes but holds no data
    { _id: 4, host: "mongo-arbiter.example.com:27017", arbiterOnly: true }
  ]
});

// Hidden member - not visible to clients, never a primary candidate
// Useful for dedicated backup or reporting servers
{
  _id: 5,
  host: "mongo-hidden.example.com:27017",
  priority: 0,
  hidden: true  // Won't appear in the isMaster/hello response
}

// Delayed member - lags replication for point-in-time recovery
{
  _id: 6,
  host: "mongo-delayed.example.com:27017",
  priority: 0,
  hidden: true,
  secondaryDelaySecs: 3600  // 1 hour behind (formerly slaveDelay)
}
```

During an election (typically 10-30 seconds), the replica set has no primary. Writes fail with "not master" (NotWritablePrimary) errors, so applications should be prepared with retry logic. Reads can continue from secondaries if the readPreference allows it, but may return stale data during this window.
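A minimal sketch of that retry logic with the Node.js driver. Recent drivers also enable retryable writes by default (retryWrites=true), which transparently retries many single-document writes once; the explicit loop below covers longer election windows. The attempt count and backoff are illustrative values, not canonical ones.

```javascript
const { MongoClient } = require('mongodb');

// Retryable writes absorb a single transient failure automatically
const client = new MongoClient(uri, { retryWrites: true });

// Explicit retry loop for critical writes during longer elections
async function insertWithRetry(collection, doc, attempts = 5) {
  for (let i = 0; i < attempts; i++) {
    try {
      return await collection.insertOne(doc, { writeConcern: { w: "majority" } });
    } catch (err) {
      // The driver labels failover-related errors as retryable
      const retryable = err.hasErrorLabel && err.hasErrorLabel("RetryableWriteError");
      if (!retryable || i === attempts - 1) throw err;
      await new Promise(r => setTimeout(r, 1000 * (i + 1)));  // Linear backoff
    }
  }
}
```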
MongoDB's consistency model is tunable. Write concern controls acknowledgment of writes, while read concern controls the consistency of reads. Understanding these concerns is essential for balancing durability, consistency, and performance.
Write Concern: Durability Guarantees
Write concern specifies how many replica set members must acknowledge a write before it's considered successful:
| Write Concern | Behavior | Durability | Latency |
|---|---|---|---|
| w: 0 | Fire and forget; no acknowledgment | Lowest - may be lost | Lowest |
| w: 1 | Primary acknowledges; may not yet be replicated (implicit default before MongoDB 5.0) | Low - survives primary restart | Low |
| w: 'majority' | Majority of replica set acknowledges | High - survives failover | Medium |
| w: <number> | Specific number of members acknowledge | Configurable | Variable |
| j: true | Primary journal flushed to disk | Higher within primary | Higher |
| w: 'majority', j: true | Majority + journaled | Highest | Highest |
```javascript
// Different write concern configurations

// Fire and forget - fastest, least durable
await collection.insertOne(
  { event: "pageview", timestamp: new Date() },
  { writeConcern: { w: 0 } }  // Logs, analytics where loss is acceptable
);

// w: 1 - acknowledged by the primary only
await collection.insertOne(
  { type: "user_action" },
  { writeConcern: { w: 1 } }  // Normal operations
);

// Majority - survives failover
await collection.insertOne(
  { type: "payment", amount: 100.00 },
  { writeConcern: { w: "majority" } }  // Critical transactions
);

// Majority + journaled - maximum durability
await collection.insertOne(
  { type: "bank_transfer", amount: 10000.00 },
  {
    writeConcern: {
      w: "majority",
      j: true,        // Wait for journal flush
      wtimeout: 5000  // Timeout in ms
    }
  }
);

// wtimeout prevents blocking forever if members are down.
// If the timeout expires, the write MAY still have succeeded on the primary;
// the application must handle this ambiguity.

// Set a default write concern at the connection level
const client = new MongoClient(uri, {
  writeConcern: { w: "majority", wtimeout: 5000 }
});
```

Read Concern: Consistency Guarantees
Read concern specifies the consistency and isolation properties of data returned by read operations:
| Read Concern | Behavior | Use Case |
|---|---|---|
| local (default) | Returns most recent data on the node | Low latency, may read uncommitted during failover |
| available | Returns data without consistency guarantees | Sharded clusters, orphaned document risk |
| majority | Returns data acknowledged by majority | Consistent reads, no rollback risk |
| linearizable | Reflects all successful majority writes | Single-document strong consistency |
| snapshot | Transactional snapshot isolation | Multi-document transactions |
```javascript
// Read concern configurations

// Local - fastest, may see rolled-back writes after failover
const doc = await collection.findOne(
  { userId: "12345" },
  { readConcern: { level: "local" } }
);

// Majority - guaranteed durable, may lag slightly
const durableDoc = await collection.findOne(
  { orderId: "ORD-123" },
  { readConcern: { level: "majority" } }
);

// Linearizable - strongest single-document guarantee
// Waits for all prior majority writes to be visible
// Much slower, use sparingly
const consistentDoc = await collection.findOne(
  { lockId: "critical-resource" },
  { readConcern: { level: "linearizable" } }
);

// Combining read and write concern for causal consistency
const session = client.startSession({ causalConsistency: true });
try {
  // Write with majority concern
  await collection.insertOne(
    { key: "value" },
    { session, writeConcern: { w: "majority" } }
  );

  // Read will see the write, even on a different node
  const result = await collection.findOne(
    { key: "value" },
    { session, readConcern: { level: "majority" } }
  );
} finally {
  session.endSession();
}
```

With w: 1 (the implicit default before MongoDB 5.0), writes acknowledged by only the primary can be lost if the primary fails before replicating them. When the old primary rejoins, it performs a rollback: those writes are saved to a rollback directory but removed from the data. If data loss is unacceptable, use w: 'majority'.
Read preference determines which replica set members a client routes read operations to. This enables distributing read load across the cluster and reading from geographically closer nodes.
| Mode | Behavior | Best For |
|---|---|---|
| primary (default) | All reads from primary | Strongest consistency, all reads see latest writes |
| primaryPreferred | Primary if available, else secondary | Consistency with fallback during failover |
| secondary | Read from secondaries only | Offload reads from primary, analytics workloads |
| secondaryPreferred | Secondaries preferred, primary if none available | Distribute reads, maintain availability |
| nearest | Lowest network latency member | Geographically distributed deployments |
```javascript
const { ReadPreference } = require('mongodb');

// Connection with a default read preference
const client = new MongoClient(uri, {
  readPreference: ReadPreference.SECONDARY_PREFERRED
});

// Per-operation read preference
const analytics = await collection.aggregate([
  { $match: { date: { $gte: lastMonth } } },
  { $group: { _id: "$category", total: { $sum: "$amount" } } }
], {
  readPreference: ReadPreference.SECONDARY  // Heavy analytics on a secondary
});

// Read preference with tags for data locality
// Tag members: { dc: "us-east", rack: "1" }, { dc: "us-west", rack: "2" }
const localResult = await collection.findOne(
  { userId: "12345" },
  {
    readPreference: new ReadPreference(
      ReadPreference.NEAREST,
      [
        { dc: "us-east" },  // Prefer US East
        { dc: "us-west" },  // Then US West
        { }                 // Then any member
      ]
    )
  }
);

// maxStalenessSeconds - don't read from very stale secondaries
const freshRead = await collection.findOne(
  { sessionId: "active-session" },
  {
    readPreference: new ReadPreference(
      ReadPreference.SECONDARY_PREFERRED,
      [],
      { maxStalenessSeconds: 30 }  // At most 30s behind the primary
    )
  }
);
```

Reading from secondaries distributes load but introduces staleness. For user-facing reads after the user's own writes, use primary or enable causal consistency with sessions. For analytics, reporting, and background jobs, secondary reads are ideal and reduce primary load.
When your data grows beyond what a single replica set can handle—whether due to storage limits, write throughput, or working set exceeding RAM—you need to distribute data across multiple servers. MongoDB's sharding distributes data across multiple replica sets, called shards.
Sharded Cluster Components:

- Shards: each shard is itself a replica set holding a subset of the data, so sharding builds on (rather than replaces) replication.
- Config servers: a dedicated replica set (CSRS) that stores cluster metadata, including which chunks live on which shards.
- mongos routers: stateless query routers that applications connect to; they direct each operation to the appropriate shards.
How Sharding Works
Data is divided into chunks, contiguous ranges of shard key values. The config servers track which chunks live on which shards. When you query (the explain sketch below shows the two routing modes):

1. The mongos router consults its cached chunk metadata from the config servers.
2. If the query includes the shard key, mongos sends it only to the shards that own matching chunks (a targeted query).
3. Otherwise, mongos broadcasts the query to every shard (scatter-gather) and merges the results before returning them.
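The difference shows up in a query's explain output. A sketch, assuming orders is sharded on { orderId: 1 }:

```javascript
// Targeted query: the filter contains the shard key, so mongos routes
// it to the single shard that owns the matching chunk
db.orders.find({ orderId: "ORD-123" }).explain().queryPlanner
// winningPlan.stage: "SINGLE_SHARD"

// Scatter-gather: no shard key in the filter, so every shard is queried
// and mongos merges the results
db.orders.find({ customerEmail: "alice@example.com" }).explain().queryPlanner
// winningPlan.stage: "SHARD_MERGE", with one entry per shard listed under shards
```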
Chunk Splitting and Balancing:
As data grows, chunks are split when they exceed the maximum size (default 128MB). The balancer runs periodically to move chunks between shards, maintaining roughly even distribution.
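A few mongosh helpers for observing and controlling the balancer. The maintenance-window example assumes you want migrations confined to off-peak hours; run these through a mongos:

```javascript
// Is the balancer enabled, and is a balancing round in progress?
sh.getBalancerState()    // true if enabled
sh.isBalancerRunning()   // reports whether chunk migrations are active

// Pause migrations during maintenance, then resume
sh.stopBalancer()
sh.startBalancer()

// Restrict migrations to an off-peak window (times are illustrative)
db.getSiblingDB("config").settings.updateOne(
  { _id: "balancer" },
  { $set: { activeWindow: { start: "01:00", stop: "05:00" } } },
  { upsert: true }
)
```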
```javascript
// Enable sharding on a database
sh.enableSharding("ecommerce")

// Shard a collection - CRITICAL: choose the shard key carefully!
// Changing it later means an expensive resharding operation (MongoDB 5.0+)

// Range-based sharding on orderId
sh.shardCollection(
  "ecommerce.orders",
  { orderId: 1 }  // Shard key: orderId ascending
);

// Hashed sharding for even distribution
sh.shardCollection(
  "ecommerce.products",
  { _id: "hashed" }  // Hash of _id for distribution
);

// Compound shard key - supports range queries on both fields
sh.shardCollection(
  "ecommerce.events",
  { tenantId: 1, timestamp: 1 }  // Multi-tenant time-series
);

// View shard distribution
sh.status()

// Output shows chunk distribution:
// --- Sharding Status ---
// shards:
//   { "_id": "shard0", "host": "shard0/...", state: 1 }
//   { "_id": "shard1", "host": "shard1/...", state: 1 }
//   { "_id": "shard2", "host": "shard2/...", state: 1 }
// databases:
//   { "_id": "ecommerce", "primary": "shard0", "partitioned": true }
//   ecommerce.orders chunks:
//     shard0: 42 chunks
//     shard1: 41 chunks
//     shard2: 41 chunks
```

The shard key is the most important decision in a sharded MongoDB deployment. Before MongoDB 5.0 it could not be changed without rebuilding the collection; 5.0 and later support resharding, but it is a slow, resource-intensive operation. A poor shard key leads to uneven distribution (hotspots), inefficient queries (scatter-gather), and operational nightmares.
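If you do discover a bad shard key in production, MongoDB 5.0+ can reshard in place with the reshardCollection command. It rewrites the whole collection in the background and needs substantial free disk and I/O headroom, so treat it as a recovery tool, not a substitute for up-front design. A sketch, with an illustrative new key:

```javascript
// Sketch: reshard an existing collection (MongoDB 5.0+)
db.adminCommand({
  reshardCollection: "ecommerce.orders",
  key: { customerId: 1, orderId: 1 }  // New shard key (illustrative)
});

// Monitor progress while the operation runs
db.getSiblingDB("admin").aggregate([
  { $currentOp: { allUsers: true, localOps: false } },
  { $match: { type: "op", "originatingCommand.reshardCollection": "ecommerce.orders" } }
])
```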
Ideal Shard Key Properties:

- High cardinality: many distinct values, far more than the number of shards.
- Even write distribution: no single value or narrow range receives a disproportionate share of inserts.
- Non-monotonic: values should not steadily increase or decrease, which would funnel all new writes into one chunk.
- Query isolation: common queries include the shard key, so mongos can target specific shards instead of scatter-gathering.
Common Shard Key Patterns:

- Hashed key (e.g., { _id: "hashed" }): spreads writes evenly, at the cost of turning range queries into scatter-gather.
- Compound key (e.g., { tenantId: 1, timestamp: 1 }): keeps one tenant's data together while spreading load across tenants.
- High-cardinality natural key (e.g., userId): works well when most queries filter on it and access is spread across values.
```javascript
// Analyze shard key distribution BEFORE sharding

// Check cardinality
const cardinality = await db.collection.aggregate([
  { $group: { _id: "$proposedShardKey" } },
  { $count: "distinctValues" }
]).toArray();
console.log("Distinct values:", cardinality[0].distinctValues);
// Goal: much higher than the expected number of shards

// Check distribution
const distribution = await db.collection.aggregate([
  { $group: { _id: "$proposedShardKey", count: { $sum: 1 } } },
  { $sort: { count: -1 } },
  { $limit: 20 }
]).toArray();
// Check: no single value dominates

// Check query patterns
// Run explain on common queries
const explain = await db.collection.find(
  { userId: "12345", timestamp: { $gte: lastWeek } }
).explain("executionStats");
// Check: the query includes shard key fields

// After sharding: monitor chunk distribution
db.orders.getShardDistribution()
// Shows data and chunk distribution per shard
// Look for imbalances > 20%
```

Monotonically increasing shard keys (timestamps, auto-increment IDs, ObjectIds) cause all new writes to target the current "last" chunk. While the balancer moves chunks, the shard receiving inserts is constantly overloaded. Use hashed sharding or compound keys starting with a high-cardinality, non-monotonic field.
Zone sharding allows you to control where specific data resides. This enables geographic data locality, compliance with data residency requirements, and tiered storage strategies.
Zone Use Cases:

- Geographic data locality: serve users from shards in the nearest region.
- Data residency compliance: pin EU customer data to EU-hosted shards (e.g., for GDPR).
- Tiered storage: keep recent, hot data on fast SSD-backed shards and older, cold data on cheaper hardware.
```javascript
// Create zones for geographic data locality
// Shard key: { region: 1, userId: 1 }

// Add shards to zones
sh.addShardToZone("shard-us-east", "US-Data")
sh.addShardToZone("shard-us-west", "US-Data")
sh.addShardToZone("shard-eu-west", "EU-Data")
sh.addShardToZone("shard-eu-central", "EU-Data")
sh.addShardToZone("shard-apac", "APAC-Data")

// Define zone ranges
// All documents with region "US" go to the US-Data zone
sh.updateZoneKeyRange(
  "ecommerce.customers",
  { region: "US", userId: MinKey },  // Range start
  { region: "US", userId: MaxKey },  // Range end
  "US-Data"
);

sh.updateZoneKeyRange(
  "ecommerce.customers",
  { region: "EU", userId: MinKey },
  { region: "EU", userId: MaxKey },
  "EU-Data"
);

sh.updateZoneKeyRange(
  "ecommerce.customers",
  { region: "APAC", userId: MinKey },
  { region: "APAC", userId: MaxKey },
  "APAC-Data"
);

// Tiered storage example
// Shard key: { createdMonth: 1, _id: 1 }
sh.addShardToZone("shard-ssd-1", "Hot-Storage")
sh.addShardToZone("shard-ssd-2", "Hot-Storage")
sh.addShardToZone("shard-hdd-1", "Cold-Storage")
sh.addShardToZone("shard-hdd-2", "Cold-Storage")

// Recent months on SSD
sh.updateZoneKeyRange(
  "logs.events",
  { createdMonth: "2024-01", _id: MinKey },
  { createdMonth: "2024-12", _id: MaxKey },
  "Hot-Storage"
);

// Older months on HDD (update zone ranges as time progresses)
sh.updateZoneKeyRange(
  "logs.events",
  { createdMonth: "2020-01", _id: MinKey },
  { createdMonth: "2023-12", _id: MaxKey },
  "Cold-Storage"
);
```

Design your shard key with zones in mind. The first field of a compound shard key often represents the zone dimension (region, tenant, time period). This enables efficient zone-based data placement while the second field provides cardinality for distribution within the zone.
MongoDB's replica sets and sharding provide a powerful foundation for building highly available, horizontally scalable systems. Let's consolidate the key operational knowledge:

- Deploy an odd number of voting members (typically 3 or 5) so any partition leaves one side with a clear majority.
- Size the oplog generously and monitor replication lag against the oplog window to avoid full resyncs.
- Use w: 'majority' for writes you cannot afford to lose, and pair it with majority reads or causally consistent sessions when users must read their own writes.
- Treat the shard key as a one-time, high-stakes decision: high cardinality, even write distribution, aligned with your query patterns.
- Avoid monotonically increasing shard keys, or hash them.
- Use zones for geographic locality, data residency, and storage tiering.
What's Next:
With MongoDB's architecture understood, we'll explore flexible schemas—one of the most powerful (and dangerous) features of document databases. You'll learn schema design patterns that maintain flexibility while preserving data integrity.
You now understand MongoDB's distributed architecture for production deployments. You can design replica sets for high availability, configure appropriate consistency levels, and make informed sharding decisions. Next, we'll dive into flexible schema design patterns.