When Netflix streams to 230 million subscribers across 190 countries, when Apple manages billions of iCloud data points, when Discord handles 4 million concurrent voice users, and when Instagram delivers feeds to 2 billion monthly users, they all rely on one database for their most demanding workloads: Apache Cassandra.
Cassandra is not just another NoSQL database. It represents a fundamentally different philosophy about distributed systems—one where single points of failure are architecturally impossible, where writes never block on reads, and where you trade away some of the consistency guarantees that relational databases provide in exchange for availability and partition tolerance that relational databases cannot match.
This page will give you a practitioner's understanding of Cassandra's architecture. You'll learn how it achieves its legendary fault tolerance, how data flows through the system, and how its consistency model enables you to tune between availability and consistency on a per-query basis. This knowledge is essential whether you're evaluating Cassandra for a project or designing systems that must operate at similar scale.
By the end of this page, you will understand Cassandra's masterless ring architecture, how the gossip protocol maintains cluster state without a coordinator, how data is partitioned and replicated across nodes, and how tunable consistency lets you make per-query tradeoffs between performance and correctness.
Cassandra's design philosophy is best understood through its origin story—a fusion of two landmark distributed systems papers that defined modern database architecture.
The Amazon Dynamo Influence (2007)
Amazon's Dynamo paper introduced revolutionary concepts that became Cassandra's distributed layer:

- Consistent hashing to partition data across a ring of peer nodes
- Gossip-based membership and failure detection with no central coordinator
- Hinted handoff, so writes succeed even while a replica is temporarily down
- Tunable, quorum-based replication with eventual consistency
The Google Bigtable Influence (2006)
Bigtable contributed Cassandra's data model and storage engine:

- The sparse, wide-column data model (rows with dynamic, sorted columns)
- Log-structured storage: an append-only commit log, in-memory memtables, and immutable on-disk SSTables
- Background compaction that merges SSTables and purges deleted data
The Synthesis: Facebook Inbox (2008)
Facebook engineers, led by Avinash Lakshman (original Dynamo author) and Prashant Malik, combined these approaches to handle Facebook's inbox search—a problem requiring:

- Very high write throughput (indexing every message as it arrived)
- Horizontal scaling to hundreds of millions of users
- High availability across geographically distributed datacenters, with no single point of failure
Cassandra makes a fundamental trade-off: it sacrifices the convenience of single-node transactions and complex queries to achieve something relational databases cannot—linear horizontal scalability with no single point of failure. Every node can handle reads and writes. Any node can fail without impacting availability.
Cassandra's CAP Theorem Position
Understanding Cassandra requires understanding CAP theorem trade-offs:
Cassandra is an AP system by default—it chooses availability and partition tolerance over strong consistency. During a network partition, Cassandra continues accepting reads and writes on both sides, potentially creating conflicting updates that are resolved later through conflict resolution (last-write-wins by default).
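Last-write-wins resolution can be sketched as a pure merge function. This is a simplified model: real Cassandra compares per-cell write timestamps (supplied by clients or coordinators) and, on an exact timestamp tie, compares the cell values themselves so that every replica converges on the same winner.

```javascript
// Minimal sketch of last-write-wins (LWW) conflict resolution.
// Each replica stores a value together with the timestamp of the
// write that produced it; merging two versions keeps the newer one.
function lwwMerge(a, b) {
  if (a.timestamp !== b.timestamp) {
    return a.timestamp > b.timestamp ? a : b;
  }
  // Tie-break deterministically so all replicas converge on the
  // same winner (Cassandra compares the cell values themselves).
  return a.value >= b.value ? a : b;
}

// Two sides of a network partition accepted conflicting writes:
const left  = { value: 'alice@old.com', timestamp: 100 };
const right = { value: 'alice@new.com', timestamp: 170 };

const winner = lwwMerge(left, right);
console.log(winner.value); // 'alice@new.com' — the later write wins
```

Because the merge is commutative and deterministic, replicas can apply it in any order and still agree—this is what lets both sides of a partition keep accepting writes.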
However, Cassandra's tunable consistency allows you to move along the AP-CP spectrum per query. With QUORUM or ALL consistency levels, you can achieve stronger consistency at the cost of availability.
| Characteristic | Cassandra | PostgreSQL | MongoDB |
|---|---|---|---|
| CAP Position | AP (tunable) | CP | CP (by default) |
| Scaling Model | Horizontal (shared-nothing) | Vertical + read replicas | Horizontal (sharded) |
| Write Model | Any node, no coordination | Single writer + replication | Primary per shard |
| Failure Mode | Continue serving | May block writes | Failover delay |
| Consistency | Tunable per query | ACID transactions | Configurable |
| Query Capability | Limited (partition key) | Full SQL | Rich queries |
Cassandra's defining architectural feature is its masterless ring topology. Unlike leader-follower systems (PostgreSQL, MongoDB primary), every Cassandra node is a peer—there is no single coordinator, no primary that handles all writes, and no single point of failure.
Consistent Hashing and the Token Ring
Cassandra uses consistent hashing to distribute data across nodes. Each node is assigned one or more tokens representing positions on a hash ring. To place a row, Cassandra hashes its partition key to a ring position, finds the node whose token range contains that position, and then walks clockwise to pick the remaining replicas:
```
CASSANDRA TOKEN RING (6 nodes, ring positions 0-100)

  Token 0  → Node A owns [0-17]
  Token 17 → Node B owns [17-34]
  Token 34 → Node C owns [34-51]
  Token 51 → Node D owns [51-68]
  Token 68 → Node E owns [68-85]
  Token 85 → Node F owns [85-0]   (wraps around the ring)

Example: Write "user_12345" with Replication Factor = 3

  1. hash("user_12345") = 45 (mapped to ring position)
  2. Position 45 falls in range [34-51] → Primary: Node C
  3. Replicate clockwise → Replica 1: Node D, Replica 2: Node E

  Data is now stored on: Node C (primary), Node D, Node E
  Any of these 3 nodes can serve reads for "user_12345"
```

Virtual Nodes (Vnodes)
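The ring walk in the example can be sketched in a few lines. The tokens here are the hypothetical 0-100 positions from the diagram (real Cassandra uses Murmur3 tokens spanning the full 64-bit range), and the sketch follows the diagram's convention that a node's range begins at its token—in Cassandra proper, a node owns the range *ending* at its token.

```javascript
// Sketch of replica selection on a token ring.
// Each entry is a node and the token where its range begins.
const ring = [
  { token: 0,  node: 'A' },
  { token: 17, node: 'B' },
  { token: 34, node: 'C' },
  { token: 51, node: 'D' },
  { token: 68, node: 'E' },
  { token: 85, node: 'F' },
]; // must stay sorted by token

function replicasFor(hash, rf) {
  // Owner = the node whose range begins at the last token <= hash
  // (wrapping to the highest token for positions before the first token).
  let idx = ring.length - 1;
  for (let i = 0; i < ring.length; i++) {
    if (ring[i].token <= hash) idx = i;
  }
  // Remaining replicas are the next rf-1 nodes clockwise.
  const replicas = [];
  for (let i = 0; i < rf; i++) {
    replicas.push(ring[(idx + i) % ring.length].node);
  }
  return replicas;
}

// hash("user_12345") = 45 in the example above:
console.log(replicasFor(45, 3)); // ['C', 'D', 'E']
```

Because placement is pure arithmetic over the shared ring state, any node can compute it locally—no lookup service or metadata master is consulted on the request path.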
Early Cassandra assigned a single token per node, which caused problems: data distribution was uneven unless tokens were calculated and rebalanced by hand, adding or removing a node meant streaming one huge contiguous range from a single neighbor, and an unlucky token layout could leave hotspot nodes owning far more data than the rest.
Virtual nodes solve these issues by assigning many tokens (typically 256) to each physical node. Benefits include:

- Data spreads evenly across the cluster without manual token management
- When a node joins or is rebuilt, it streams many small ranges from many peers in parallel instead of one large range from a single neighbor
- Heterogeneous hardware is easy to accommodate: give bigger machines proportionally more vnodes
Historically, the default num_tokens was 256. With 6 nodes and RF=3, each node then owns roughly 3 × (256/6) ≈ 128 token ranges, resulting in very even distribution. Cassandra 4.0 lowered the default to 16—its improved token allocation achieves even distribution with far fewer vnodes, and large vnode counts add repair and streaming overhead in big clusters.
Replication Strategies
Cassandra supports multiple replication strategies to control where replicas are placed:
SimpleStrategy: Replicas placed on consecutive nodes clockwise around the ring. Good for development, not recommended for production.
NetworkTopologyStrategy: Specify replication factor per datacenter. Cassandra places replicas on different racks within each datacenter to survive hardware failures.
```sql
-- Simple strategy (development only)
CREATE KEYSPACE user_data
WITH REPLICATION = {
  'class': 'SimpleStrategy',
  'replication_factor': 3
};

-- Network topology strategy (production)
CREATE KEYSPACE user_data
WITH REPLICATION = {
  'class': 'NetworkTopologyStrategy',
  'us-east': 3,   -- 3 replicas in US-East datacenter
  'us-west': 3,   -- 3 replicas in US-West datacenter
  'eu-west': 3    -- 3 replicas in EU-West datacenter
};

-- With rack awareness, replicas spread across racks
-- If us-east has racks rack1, rack2, rack3 → one replica per rack
-- This survives entire rack failures

-- Example: What happens to data for partition key "user_12345"?
-- 1. Hash maps to token range owned by node us-east-rack1-node1
-- 2. Replica placement in us-east: rack1-node1, rack2-node1, rack3-node2
-- 3. Replica placement in us-west: rack1-node2, rack2-node1, rack3-node1
-- 4. Replica placement in eu-west: rack1-node1, rack2-node2, rack3-node1
-- Total: 9 replicas across 3 datacenters for full global coverage
```

In a masterless system, how do nodes discover each other, detect failures, and coordinate cluster state? Cassandra uses a gossip protocol—a decentralized, epidemic-style communication mechanism inspired by how rumors spread through social networks.
How Gossip Works
Every second, each node in the cluster:

1. Increments its own heartbeat counter
2. Picks a random live node and exchanges state with it (the SYN → ACK → ACK2 sequence shown below)
3. Probabilistically also gossips to a seed node and to any node it currently believes is down (so recoveries are noticed)
This simple mechanism has powerful properties: state spreads epidemically, so cluster-wide knowledge converges in O(log N) rounds; network overhead stays small and constant regardless of cluster size; and there is no coordinator to fail—any surviving subset of nodes keeps gossiping.
```
GOSSIP ROUND (Every Second)

Node A's local state:

  Node | Generation | Heartbeat | Status | Load   | Schema Ver
  -----+------------+-----------+--------+--------+-----------
  A    | 1704067200 | 500       | NORMAL | 150 GB | abc123
  B    | 1704067100 | 480       | NORMAL | 155 GB | abc123
  C    | 1704066000 | 420       | NORMAL | 148 GB | abc123
  D    | 1704065000 | 0         | DEAD   | ???    | ???

Step 1: Node A randomly selects Node B

Step 2: Node A sends SYN message
  A → B: "I know about: A@gen1704067200/hb500, B@gen.../hb480,
          C@gen.../hb420, D is DEAD"

Step 3: Node B compares and responds with ACK
  B → A: "My C is newer: hb425. My D is ALIVE@hb5 (just recovered)!
          Also here's new node E I discovered: E@gen1704067000/hb10"

Step 4: Node A updates local state
  • C's heartbeat updated to 425
  • D marked as recovering/NORMAL
  • E added to known nodes list

Step 5: Node A sends ACK2 confirming receipt
```

Failure Detection: Phi Accrual Detection
Cassandra doesn't use simple timeout-based failure detection. Instead, it uses the Phi Accrual Failure Detector, which records the inter-arrival times of heartbeats from each peer, models their distribution over a sliding window, and computes a continuous suspicion value (phi) that grows the longer the next heartbeat is overdue. A node is marked down only when phi crosses a configurable threshold (phi_convict_threshold, default 8).
The beauty of phi accrual detection: it adapts to observed network conditions automatically. A peer across a slow WAN link accumulates a wider interval distribution, so the same delay that would convict a LAN neighbor barely raises its phi—false positives drop without anyone hand-tuning timeouts, and operators can still trade detection speed against accuracy through a single threshold.
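A toy version of the idea, assuming exponentially distributed heartbeat intervals for simplicity (the real detector fits a distribution over a sliding window of observed arrival times):

```javascript
// Minimal phi accrual sketch. phi rises continuously the longer a
// heartbeat is overdue, instead of flipping a binary up/down timeout.
class PhiDetector {
  constructor() { this.intervals = []; this.last = null; }

  heartbeat(now) {
    if (this.last !== null) {
      this.intervals.push(now - this.last);
      if (this.intervals.length > 100) this.intervals.shift(); // sliding window
    }
    this.last = now;
  }

  phi(now) {
    const mean =
      this.intervals.reduce((a, b) => a + b, 0) / this.intervals.length;
    const elapsed = now - this.last;
    // Assuming intervals ~ Exp(1/mean):
    //   P(interval > t) = e^(-t/mean)
    //   phi(t) = -log10(P) = t / (mean * ln 10)
    return elapsed / (mean * Math.LN10);
  }
}

const d = new PhiDetector();
for (let t = 0; t <= 5000; t += 1000) d.heartbeat(t); // steady 1s heartbeats

// Shortly after the last heartbeat, suspicion is low...
console.log(d.phi(5400).toFixed(2)); // "0.17"
// ...but it grows without bound as silence continues. With the default
// phi_convict_threshold of 8, this peer would be convicted after roughly
// 8 × mean × ln(10) ≈ 18.4 seconds of silence at a 1s mean interval.
```

Note how the conviction time scales with the *observed* mean interval: the same code yields a longer grace period for a peer whose heartbeats have historically been slower.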
In a 100-node cluster, gossip converges within O(log N) rounds, meaning cluster-wide knowledge propagates in ~7 seconds. This is remarkably fast for decentralized coordination and why Cassandra can operate without a centralized membership service.
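The O(log N) claim can be checked with a toy push-gossip simulation. This is a deliberate simplification—each informed node tells just one random peer per round, whereas real gossip exchanges full state both ways and contacts up to three peers, converging even faster. The seeded PRNG makes the run deterministic.

```javascript
// Toy push-gossip: count rounds until all N nodes have heard a rumor.
// Deterministic LCG so repeated runs give identical results.
function lcg(seed) {
  let s = seed >>> 0;
  return () => {
    s = (Math.imul(1103515245, s) + 12345) >>> 0;
    return s / 2 ** 32;
  };
}

function roundsToConverge(n, rand) {
  const informed = new Array(n).fill(false);
  informed[0] = true; // the rumor starts on one node
  let rounds = 0;
  while (informed.some(x => !x)) {
    rounds++;
    const speakers = informed
      .map((inf, i) => (inf ? i : -1))
      .filter(i => i >= 0);
    // Each informed node pushes the rumor to one random peer.
    for (const i of speakers) {
      informed[Math.floor(rand() * n)] = true;
    }
  }
  return rounds;
}

const rounds = roundsToConverge(100, lcg(42));
console.log(rounds); // grows like log N, not N
```

Since the informed set can at most double each round, 100 nodes need at least ⌈log₂ 100⌉ = 7 rounds; in practice the simulation lands around 10-12, matching the "seconds, not minutes" convergence described above.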
Seed Nodes
When a Cassandra node starts, it needs to discover the cluster. Seed nodes solve the bootstrap problem: a joining node contacts the seed addresses in its configuration, gossips with them to learn the full topology, and from then on treats seeds like any other peer. Seeds are otherwise ordinary nodes—they store data and serve requests, holding no special runtime role beyond being well-known gossip contact points.
Best Practices:

- Configure two to three seeds per datacenter, and list the same seeds on every node
- Don't make every node a seed—a small, stable set keeps gossip efficient
- A seed node will not auto-bootstrap (stream data) when joining, so designate seeds before loading data
Cassandra's data model appears similar to relational tables but operates on fundamentally different principles. Understanding these differences is critical to successful Cassandra implementations.
Keyspaces, Tables, and Primary Keys
Cassandra organizes data hierarchically: a keyspace (which carries replication settings) contains tables; each table is split into partitions (all rows sharing a partition key, stored together on the same replicas); and within a partition, rows are identified and ordered by clustering columns.
```sql
-- Create keyspace with replication settings
CREATE KEYSPACE ecommerce
WITH REPLICATION = {
  'class': 'NetworkTopologyStrategy',
  'dc1': 3,
  'dc2': 3
};

USE ecommerce;

-- Table: Orders by User (query: get user's orders by date)
CREATE TABLE orders_by_user (
  user_id UUID,
  order_date DATE,
  order_id UUID,
  status TEXT,
  total_amount DECIMAL,
  items LIST<FROZEN<item_type>>,
  -- Primary key explanation:
  --   (user_id) = Partition key - determines which node stores the data
  --   order_date, order_id = Clustering columns - determine sort order within partition
  PRIMARY KEY ((user_id), order_date, order_id)
) WITH CLUSTERING ORDER BY (order_date DESC, order_id ASC);

-- This design enables:
-- ✅ Get all orders for a user (single partition read)
-- ✅ Get user's orders in a date range (partition read with range filter)
-- ✅ Get specific order by user + date + order_id (single row read)

-- This design does NOT efficiently support:
-- ❌ Get all orders with status = 'SHIPPED' (requires full table scan)
-- ❌ Get all orders on a specific date across all users (requires full table scan)
-- ❌ Get order by order_id only (order_id is not the partition key)

-- For the above queries, create additional tables:
CREATE TABLE orders_by_status (
  status TEXT,
  order_date DATE,
  order_id UUID,
  user_id UUID,
  total_amount DECIMAL,
  PRIMARY KEY ((status), order_date, order_id)
) WITH CLUSTERING ORDER BY (order_date DESC);

CREATE TABLE orders_by_id (
  order_id UUID PRIMARY KEY,
  user_id UUID,
  order_date DATE,
  status TEXT,
  total_amount DECIMAL,
  items LIST<FROZEN<item_type>>
);
```

Partition Keys vs Clustering Columns
The distinction between partition keys and clustering columns is fundamental:
Partition Key (determines data placement): the partition key is hashed to a token, which selects the replica nodes for every row in that partition. Efficient queries must supply the full partition key with equality predicates—this is what lets the coordinator route a query to exactly the right nodes.
Clustering Columns (determines sort order): clustering columns define the physical sort order of rows within a partition, which makes range scans (for example, a date range) and ORDER BY along the clustering order cheap single-partition reads.
```sql
-- PATTERN 1: Simple Primary Key
-- Use when: Each row is independently accessed by a unique identifier
CREATE TABLE users (
  user_id UUID PRIMARY KEY,
  name TEXT,
  email TEXT
);
-- Efficient:   SELECT * FROM users WHERE user_id = ?
-- Inefficient: SELECT * FROM users WHERE email = ? (full table scan)

-- PATTERN 2: Composite Partition Key
-- Use when: Data should be co-located based on multiple fields
CREATE TABLE sensor_readings_by_region_date (
  region TEXT,
  reading_date DATE,
  sensor_id UUID,
  reading_time TIMESTAMP,
  value DOUBLE,
  PRIMARY KEY ((region, reading_date), sensor_id, reading_time)
);
-- Partitions by (region + date) → all readings for US-EAST on 2024-01-15 in one partition
-- Efficient: SELECT ... WHERE region = ? AND reading_date = ?
-- Also efficient: Add AND sensor_id = ? to narrow further

-- PATTERN 3: Time-Bucketed Data
-- Use when: Time-series data would create unbounded partitions
CREATE TABLE logs_by_service (
  service_name TEXT,
  time_bucket TEXT,      -- e.g., "2024-01-15-14" (hour bucket)
  log_time TIMESTAMP,
  log_id UUID,
  level TEXT,
  message TEXT,
  PRIMARY KEY ((service_name, time_bucket), log_time, log_id)
) WITH CLUSTERING ORDER BY (log_time DESC);
-- Each partition = 1 hour of logs for 1 service
-- Prevents unbounded partition growth
-- Query recent logs: filter on current time_bucket(s)

-- PATTERN 4: Reverse Lookups via Materialized Views
CREATE MATERIALIZED VIEW users_by_email AS
  SELECT * FROM users
  WHERE email IS NOT NULL AND user_id IS NOT NULL
  PRIMARY KEY (email, user_id);
-- Now efficient: SELECT * FROM users_by_email WHERE email = ?
-- Cassandra maintains the view automatically on writes
```

Partitions should not exceed 100MB or ~100K rows as a rule of thumb. Larger partitions cause slow reads, compaction problems, and memory pressure. Time-bucket your partition keys to prevent unbounded growth in time-series workloads.
Cassandra's killer feature for system designers is tunable consistency—the ability to choose consistency guarantees on a per-query basis. This allows you to make different tradeoffs for different operations within the same application.
Consistency Levels Explained
When writing or reading, you specify how many replicas must acknowledge the operation:
| Consistency Level | Writes: Acknowledged By | Reads: Returned By | Use Case |
|---|---|---|---|
| ANY | 1 node (can be hinted handoff) | N/A | Fire-and-forget writes |
| ONE | 1 replica node | 1 replica node | Low-latency, eventual consistency |
| TWO | 2 replica nodes | 2 replica nodes | Slightly stronger than ONE |
| THREE | 3 replica nodes | 3 replica nodes | Stronger guarantee |
| QUORUM | floor(RF / 2) + 1 nodes | floor(RF / 2) + 1 nodes | Strong consistency |
| LOCAL_QUORUM | Quorum in local DC | Quorum in local DC | Strong local, cross-DC async |
| EACH_QUORUM | Quorum in each DC | N/A | Strong multi-DC writes |
| ALL | All replica nodes | All replica nodes | Maximum consistency |
The Consistency Formula
Strong consistency requires that reads always see the most recent write. This is achieved when:
R + W > N (Read consistency + Write consistency > Replication Factor)
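The formula is easy to encode and check (a sketch; the level names match Cassandra's, with QUORUM computed as ⌊RF/2⌋ + 1):

```javascript
// How many replicas must acknowledge at a given consistency level?
function replicasRequired(level, rf) {
  switch (level) {
    case 'ONE':    return 1;
    case 'TWO':    return 2;
    case 'THREE':  return 3;
    case 'QUORUM': return Math.floor(rf / 2) + 1;
    case 'ALL':    return rf;
    default: throw new Error(`unknown level: ${level}`);
  }
}

// Strong consistency holds when any read replica set must overlap
// any write replica set: R + W > N guarantees at least one replica
// in the read set has the latest write.
function isStronglyConsistent(readLevel, writeLevel, rf) {
  return replicasRequired(readLevel, rf) + replicasRequired(writeLevel, rf) > rf;
}

console.log(isStronglyConsistent('QUORUM', 'QUORUM', 3)); // true  (2 + 2 > 3)
console.log(isStronglyConsistent('ONE', 'ONE', 3));       // false (1 + 1 < 3)
console.log(isStronglyConsistent('ONE', 'ALL', 3));       // true  (1 + 3 > 3)
```

The overlap argument is the whole proof: if the read set and write set together exceed N, the pigeonhole principle forces them to share at least one replica, and the coordinator resolves which copy is newest by timestamp.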
Common configurations:
```
Replication Factor N = 3

Configuration 1: QUORUM reads + QUORUM writes
├── W = 2, R = 2
├── R + W = 4 > 3 ✅ Strong consistency
├── Tolerates: 1 node failure for reads AND writes
└── Use case: Financial data, user authentication

Configuration 2: ONE reads + ALL writes
├── W = 3, R = 1
├── R + W = 4 > 3 ✅ Strong consistency
├── Tolerates: 2 node failures for reads, 0 for writes
└── Use case: Write-rarely, read-frequently data

Configuration 3: ALL reads + ONE writes
├── W = 1, R = 3
├── R + W = 4 > 3 ✅ Strong consistency
├── Tolerates: 0 node failures for reads, 2 for writes
└── Use case: Write-frequently, read-rarely (rare)

Configuration 4: ONE reads + ONE writes
├── W = 1, R = 1
├── R + W = 2 < 3 ❌ Eventual consistency
├── Tolerates: 2 node failures for reads AND writes
└── Use case: Analytics, logs, metrics where staleness is acceptable

Configuration 5: LOCAL_QUORUM reads + LOCAL_QUORUM writes
├── W = 2 (local DC), R = 2 (local DC)
├── Strong consistency within datacenter
├── Eventual consistency across datacenters
└── Use case: Multi-region with low latency requirements
```

Practical Consistency Patterns
Real applications often use different consistency levels for different operations:
```javascript
// E-commerce application consistency strategy

// User registration - must not lose data, strong write
await cassandra.execute(
  'INSERT INTO users (id, email, password_hash) VALUES (?, ?, ?)',
  [userId, email, hash],
  { consistency: ConsistencyLevel.QUORUM }
);

// User login check - must read latest password hash
const user = await cassandra.execute(
  'SELECT password_hash FROM users WHERE id = ?',
  [userId],
  { consistency: ConsistencyLevel.QUORUM }
);

// View product catalog - stale data is acceptable
const products = await cassandra.execute(
  'SELECT * FROM products WHERE category = ?',
  [category],
  { consistency: ConsistencyLevel.ONE } // Fast, eventual consistency
);

// Add to cart - needs to be durable
await cassandra.execute(
  'INSERT INTO cart_items (user_id, product_id, quantity) VALUES (?, ?, ?)',
  [userId, productId, quantity],
  { consistency: ConsistencyLevel.LOCAL_QUORUM }
);

// Page view analytics - cheapest acceptable write
await cassandra.execute(
  'UPDATE page_views SET count = count + 1 WHERE page_id = ?',
  [pageId],
  { consistency: ConsistencyLevel.ONE } // counter writes do not support ANY
);
```

For multi-datacenter deployments, LOCAL_QUORUM is the sweet spot for most workloads. It provides strong consistency within a datacenter with low latency, while replication to other DCs happens asynchronously. This is how Netflix, Apple, and Discord achieve both low latency and global availability.
Understanding the internal flow of reads and writes is essential for performance tuning and debugging. Cassandra's paths are optimized for write throughput while maintaining acceptable read performance.
The Write Path
When a client sends a write request:
```
CASSANDRA WRITE PATH

Client
  │  (1) Write request
  ▼
Coordinator Node        ← any node can coordinate (masterless)
  │  (2) Determine replica nodes (partition key → token → nodes)
  ├──────────────┬──────────────┐
  ▼              ▼              ▼
Replica 1      Replica 2      Replica 3      (more if RF > 3)

On each replica node:
  (3) Append to Commit Log (sequential write)
      └─ Durability: data survives a node crash
  (4) Write to MemTable (in-memory sorted structure for fast lookup)
  (5) Return success to coordinator

(6) Coordinator waits for consistency-level acknowledgments
    (QUORUM with RF=3: wait for 2 of 3 replicas)
(7) Return success to client

(Background) MemTable flushes to an SSTable when full
(Background) Compaction merges SSTables
```

Why Writes Are Fast:

- No read-before-write: a write never has to look up the existing value first
- The commit log is pure sequential I/O—no random disk seeks on the hot path
- MemTable updates happen entirely in memory
- Replicas acknowledge independently; there is no locking or cross-node coordination
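Steps 3-5 and the background flush can be sketched as a tiny log-structured engine. This is a toy model, not Cassandra's implementation: the commit log is an array, the memtable a Map sorted at flush time, and SSTables immutable Maps; all names are illustrative.

```javascript
// Toy LSM write path: commit log append → memtable insert → flush to SSTable.
class ToyStorageEngine {
  constructor(flushThreshold) {
    this.flushThreshold = flushThreshold;
    this.commitLog = [];       // sequential, append-only (durability)
    this.memtable = new Map(); // in-memory, recent writes
    this.sstables = [];        // immutable, newest last
  }

  write(key, value) {
    this.commitLog.push({ key, value });  // (3) sequential append
    this.memtable.set(key, value);        // (4) in-memory write
    if (this.memtable.size >= this.flushThreshold) this.flush();
    return 'OK';                          // (5) ack — no read needed first
  }

  flush() {
    // (Background) sort memtable entries into an immutable SSTable
    const sorted = [...this.memtable.entries()].sort(([a], [b]) =>
      a < b ? -1 : 1
    );
    this.sstables.push(new Map(sorted));
    this.memtable.clear();
    this.commitLog = []; // flushed data is now durable in the SSTable
  }

  read(key) {
    // Newest data shadows older: memtable first, then SSTables newest-first
    if (this.memtable.has(key)) return this.memtable.get(key);
    for (let i = this.sstables.length - 1; i >= 0; i--) {
      if (this.sstables[i].has(key)) return this.sstables[i].get(key);
    }
    return undefined;
  }
}

const engine = new ToyStorageEngine(2);
engine.write('user:1', 'alice');
engine.write('user:2', 'bob');   // hits the threshold → flushed to an SSTable
engine.write('user:3', 'carol'); // sits in the memtable
console.log(engine.sstables.length, engine.read('user:1')); // 1 alice
```

Notice that `write` never reads existing data—updates and inserts are the same blind append, which is exactly why the write path stays fast under heavy load.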
The Read Path
Reading is more complex because data may exist in multiple locations:
```
CASSANDRA READ PATH

Client
  │  (1) Read request
  ▼
Coordinator Node
  │  (2) Determine replica nodes + select the fastest
  │      (the dynamic snitch tracks response times)
  ├────────────────────────────────┐
  ▼                                ▼
Fastest Replica                  Other Replicas
(full data request)              (digest request)

On the fastest replica:
  (3) Check Row Cache        ← hot rows cached
        ↓ (miss)
  (4) Check MemTable         ← recent writes in memory
        ↓ (miss)
  (5) Check Bloom filters    ← which SSTables might contain this key?
        ↓
  (6) Read SSTable index     ← block index → disk read
        ↓
  (7) Merge results          ← combine all sources, return latest version

(8) Fastest replica returns full data; the others return a digest
    (a hash of the data)

Coordinator compares the digest with a hash of the full data:
  If match:    return data to client ✓
  If mismatch: Read Repair — fetch full data from all replicas, merge,
               and write the merged result back to out-of-date replicas
```

Read repair is opportunistic—it fixes inconsistencies when detected during reads. For proactive consistency, Cassandra also runs nodetool repair operations that compare all replicas using Merkle trees and synchronize differences. Production clusters typically run repair weekly.
We've explored Apache Cassandra's architecture at a practitioner's depth. Let's consolidate the key concepts that will inform your system design decisions:
When to Choose Cassandra:
✅ Write-heavy workloads (millions of writes/second)
✅ Always-on availability requirements (no maintenance windows)
✅ Linear horizontal scalability needs
✅ Multi-datacenter deployment for global availability
✅ Time-series, IoT, messaging, and activity logging
When to Avoid Cassandra:
❌ Complex queries requiring JOINs or aggregations
❌ Strong consistency requirements for every operation
❌ Small datasets where operational complexity isn't justified
❌ Teams without distributed systems expertise
What's Next:
We'll explore HBase—another wide-column store with different architectural choices. Where Cassandra prioritizes availability and write performance, HBase emphasizes strong consistency through its integration with Hadoop's HDFS and ZooKeeper. Understanding both enables you to choose the right tool for your specific requirements.
You now understand Cassandra's architecture at a level sufficient to design systems, evaluate tradeoffs, and troubleshoot issues. This knowledge positions you to make informed decisions about when Cassandra is the right choice and how to use it effectively.