When Netflix streams a video, dozens of data points are generated every second—playback position, buffering events, quality changes, user interactions. When a smart factory operates, thousands of sensors emit readings every millisecond. When a social network processes activity, billions of events flow through the system daily. These workloads share a common characteristic: they are write-dominant, generating far more writes than reads, often by orders of magnitude.
Traditional relational databases struggle with high write throughput because their B-tree storage engines optimize for reads at the expense of writes. Each write must find the correct position in a sorted structure, potentially triggering expensive page splits and random I/O. Wide-column stores use a fundamentally different approach—the Log-Structured Merge Tree (LSM-tree)—that inverts these trade-offs, achieving write throughput that traditional databases cannot match.
This page explores why wide-column stores are the natural choice for write-heavy workloads. We'll examine the LSM-tree architecture that makes this possible, understand the trade-offs involved, and learn patterns for designing systems that leverage high write throughput while maintaining acceptable read performance.
By the end of this page, you will understand why LSM-trees excel at writes, how to quantify and manage write amplification, patterns for time-series and event streaming workloads, and how to tune wide-column stores for optimal write performance while maintaining read efficiency.
Before appreciating why LSM-trees revolutionize write performance, we must understand why traditional databases struggle. The root cause lies in B-tree storage engines and their inherent design trade-offs.
B-Tree Write Operations
B-trees maintain data in sorted order across balanced tree structures. When you insert a new row, the engine must:

1. Traverse from the root to the correct leaf page, reading each page along the path.
2. Insert the row into the leaf page, splitting the page (and possibly its ancestors) if it is full.
3. Update every secondary index, each of which is its own B-tree requiring the same traversal.
4. Write the modified pages back to disk.
The critical issue is random I/O. Each write requires updating a specific page on disk, which is dramatically slower than sequential writes:
| Storage Type | Sequential Write | Random Write | Ratio |
|---|---|---|---|
| HDD (7200 RPM) | 150 MB/s | ~1 MB/s | 150x |
| SATA SSD | 500 MB/s | ~50 MB/s | 10x |
| NVMe SSD | 3,000 MB/s | ~500 MB/s | 6x |
| Cloud Block Storage (EBS gp3) | 125 MB/s | ~16,000 IOPS × 4KB = 64 MB/s | 2x |
Even on modern NVMe SSDs, sequential writes are significantly faster than random writes. On HDDs, the difference is catastrophic—150x slower for random access due to mechanical seek times.
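The EBS row in the table above can be sanity-checked with simple arithmetic. A small sketch (the function names are mine, not from any storage API):

```typescript
// Convert a random-write IOPS rating into an effective throughput figure,
// assuming each I/O operation moves one fixed-size block.
function iopsToMBps(iops: number, blockSizeKB: number): number {
  return (iops * blockSizeKB) / 1024; // KB/s → MB/s (1 MB = 1024 KB)
}

// Ratio of sequential to effective random throughput for a device.
function seqToRandomRatio(seqMBps: number, randMBps: number): number {
  return seqMBps / randMBps;
}

// EBS gp3 from the table: 16,000 IOPS × 4 KB blocks
const gp3Random = iopsToMBps(16_000, 4);            // 62.5 MB/s (the table's
                                                    // "64" uses 1 MB = 1000 KB)
const gp3Ratio = seqToRandomRatio(125, gp3Random);  // = 2x
```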
The Write Amplification Problem
B-trees also suffer from write amplification—a single logical write results in multiple physical writes:

- The row itself, appended to the write-ahead log for durability.
- The modified leaf page, rewritten in full even for a small row.
- Any split pages, plus parent pages updated to reference them.
- One page write (or more) per secondary index.
A single row insert might trigger 5-10 physical writes, each requiring random I/O. Under high write load, the storage subsystem becomes saturated.
```
B-TREE WRITE OPERATION

INSERT INTO users (id, name, email) VALUES (42, 'Alice', 'alice@ex.com')

Step 1: Append to WAL                        [Sequential Write ✓]
  └─ Write to transaction log

Step 2: Navigate B-tree                      [Random Reads - O(log n)]
  Root Page     [10, 30, 50]     → read page 1    (random I/O)
    42 > 30 && 42 < 50
  Internal Page [35, 40, 45]     → read page 47   (random I/O)
    42 > 40 && 42 < 45
  Leaf Page     [40, 41, _, _, _] → read page 123 (random I/O)

Step 3: Insert and write page                [Random Write - Worst Part!]
  Leaf Page     [40, 41, 42, _, _] → write page 123 back to disk

Step 4: Update indexes                       [More Random I/O per index]
  └─ email_idx: random seek + write
  └─ name_idx:  random seek + write

Total: 1 sequential write + 3-5 random reads + 2-4 random writes
At scale: random I/O becomes the bottleneck
```

B-trees are not bad—they're optimized for read-heavy workloads with moderate writes. Their sorted structure enables efficient range scans, point lookups, and secondary indexes. The problem arises only when write throughput requirements exceed what random I/O can sustain.
The Log-Structured Merge Tree (LSM-tree) inverts B-tree trade-offs to optimize for write throughput. Instead of updating data in place, LSM-trees treat all writes as sequential appends, deferring the cost of sorting to background processes.
Core LSM-Tree Principles

- **Append, never update in place.** All writes go to an append-only commit log and an in-memory structure; files on disk are never modified once written.
- **Buffer in memory.** Writes accumulate in a sorted MemTable until it reaches a size threshold.
- **Flush sequentially.** A full MemTable is written to disk in one sequential pass as an immutable, sorted SSTable.
- **Sort in the background.** Compaction merges SSTables asynchronously, keeping read costs bounded.

This design converts random writes into sequential writes:
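A toy model makes the write path concrete. This is a hedged sketch, not any real engine's implementation: a sorted in-memory map stands in for the MemTable, and a "flush" emits a sorted, immutable array standing in for an SSTable:

```typescript
// Minimal sketch of an LSM write path: writes land in an in-memory
// MemTable; when it exceeds a threshold it is frozen and flushed as a
// sorted, immutable "SSTable" (here just a sorted array of entries).
type Entry = [key: string, value: string];

class TinyLSM {
  private memtable = new Map<string, string>();
  readonly sstables: Entry[][] = []; // newest last

  constructor(private flushThreshold: number) {}

  put(key: string, value: string): void {
    this.memtable.set(key, value); // real engines use a skip list here
    if (this.memtable.size >= this.flushThreshold) this.flush();
  }

  // Newest data wins: check MemTable first, then SSTables newest → oldest.
  get(key: string): string | undefined {
    if (this.memtable.has(key)) return this.memtable.get(key);
    for (let i = this.sstables.length - 1; i >= 0; i--) {
      const hit = this.sstables[i].find(([k]) => k === key);
      if (hit) return hit[1];
    }
    return undefined;
  }

  // Flush = sort once, write out sequentially, start a fresh MemTable.
  flush(): void {
    if (this.memtable.size === 0) return;
    const sorted = [...this.memtable.entries()].sort(([a], [b]) =>
      a < b ? -1 : a > b ? 1 : 0
    );
    this.sstables.push(sorted);
    this.memtable = new Map();
  }
}
```

Note what never happens: no existing file is edited. Updates simply write a newer version, and the read path's "newest wins" rule makes old versions invisible until compaction discards them.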
```
LSM-TREE WRITE OPERATION

INSERT INTO events (id, timestamp, data) VALUES (...)

Step 1: Append to Commit Log (WAL)        [Sequential Write ✓]
  └─ Append-only log, no seeks
  └─ Durability: data survives crash even if MemTable lost

Step 2: Insert into MemTable              [In-Memory - No I/O!]
  MemTable (64-256 MB) — skip list / red-black tree
    [event_001] → data
    [event_002] → data   ← insert here (O(log n))
    [event_003] → data
    ...

Step 3: Return success to client          [Ack after WAL + MemTable]
  └─ Latency: ~1-5ms (mostly WAL sync time)

=== BACKGROUND (Asynchronous) ===

Step 4: MemTable Flush                    [Sequential Write ✓]
  When MemTable reaches threshold:
    - Freeze MemTable; new writes go to a new MemTable
    - Write contents sequentially to an SSTable (Sorted String Table):
        [event_001 → data]
        [event_002 → data]
        [event_003 → data]
        + Bloom Filter
        + Index Block

Step 5: Compaction (background)           [Sequential Read + Write]
  Merge SSTables to reduce file count and remove obsolete data

Summary: Only sequential I/O in the write path!
```

Why Sequential Writes Win
The performance advantage of LSM-trees comes from replacing random I/O with sequential I/O:

- The commit log is append-only—no seeks, ever.
- The MemTable insert touches no disk at all.
- The flush writes an entire SSTable in one sequential pass.
- Index blocks and Bloom filters are built once per SSTable, not once per insert.
Compared to B-trees:
| Operation | B-Tree | LSM-Tree |
|---|---|---|
| Write to storage | Random I/O (slow) | Sequential I/O (fast) |
| In-place updates | Yes (fragmentation) | No (immutable files) |
| Index maintenance | Per insert (slow) | Batch during compaction |
| Write latency | ~5-20ms | ~1-5ms |
| Theoretical max writes/s | ~10,000 (disk-bound) | ~100,000+ (memory-bound) |
A well-tuned LSM-tree database can achieve 10-100x higher write throughput than a B-tree database on the same hardware. This is why Cassandra, HBase, RocksDB, and LevelDB all use LSM-trees—write optimization is their primary design goal.
LSM-trees don't eliminate write amplification—they shift it from the foreground (during writes) to the background (during compaction). Understanding this trade-off is essential for capacity planning and performance tuning.
Write Amplification in LSM-Trees
Write amplification (WA) is the ratio of bytes written to storage versus bytes written by the application:
Write Amplification = Total Bytes Written to Disk / Application Bytes Written
In LSM-trees, data is written multiple times:

1. Once to the commit log (WAL) for durability.
2. Once when the MemTable flushes to an L0 SSTable.
3. Roughly once more at each level, as compaction merges it downward.
For a typical leveled LSM-tree, write amplification can reach 10-30x. A 1 KB write might eventually result in 10-30 KB of disk I/O.
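To make this concrete, here is a back-of-envelope helper using the simplified model above (roughly one extra rewrite per level; the function names and the one-rewrite-per-level assumption are mine—leveled compaction can rewrite considerably more per level in practice):

```typescript
// Simplified write-amplification model: 1x for the flush to L0, plus
// roughly one rewrite per level the data passes through during
// compaction. Real-world leveled compaction can rewrite up to the
// level size ratio per level, hence typical WA of 10-30x.
function writeAmplification(levels: number, rewritesPerLevel: number = 1): number {
  return 1 + levels * rewritesPerLevel;
}

// Physical bytes hitting storage for a given application write volume.
function physicalWrites(appBytes: number, wa: number): number {
  return appBytes * wa;
}

const wa = writeAmplification(10);      // 11: one flush + 10 rewrite passes
const disk = physicalWrites(1024, wa);  // a 1 KB write → ~11 KB of disk I/O
```

This is also the arithmetic behind SSD endurance planning: multiply your daily application write volume by the measured WA to get actual flash wear.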
```
LSM-TREE COMPACTION ARCHITECTURE

MEMORY
  Active MemTable (64MB)          Immutable MemTable (flushing)
  (accepting writes)              (being written to L0)
        │
        │ Flush (sequential write)
        ▼
LEVEL 0 (L0)
  SSTable files, possibly overlapping key ranges
  [64MB] [64MB] [64MB] [64MB]   (4 files, each from a flush)
        │
        │ Minor Compaction
        │ (Merge + Sort → Write Amplification +1)
        ▼
LEVEL 1 (L1)
  Non-overlapping key ranges, size = L0_size × ratio (10x)
  [~640 MB]
        │
        │ Compaction (Write Amplification +1)
        ▼
LEVEL 2 (L2)
  [~6.4 GB]
        │
        ▼ ... continues ...
LEVEL N (Largest)
  [~1 TB]

Write Amplification Calculation:
  • Data written to MemTable:  1x
  • Flushed to L0:             1x
  • Compacted L0 → L1:        ~1x (merged with L1 files)
  • Compacted L1 → L2:        ~1x
  • ... for each level ...

  Total WA ≈ 1 + (levels × rewrites per level)
  Typical range: 10-30x (leveled compaction can rewrite
  up to size_ratio× the data at each level)
```

Compaction Strategies
Different compaction strategies trade off write amplification, read amplification, and space amplification:
Size-Tiered Compaction (STCS)

Merges SSTables of similar size into larger ones. Write amplification is low, but reads may have to check many overlapping files, and large merges temporarily need substantial extra disk space. A good default for pure write-heavy workloads.

Leveled Compaction (LCS)

Organizes SSTables into levels with non-overlapping key ranges, so a read touches at most one SSTable per level. Read and space amplification drop sharply, at the cost of significantly higher write amplification. Suited to read-heavy tables with frequent updates.

Time-Window Compaction (TWCS)

Groups SSTables by time window and only compacts within a window. Once a window closes its data is never rewritten—and when a TTL expires, entire SSTables are dropped outright. The strategy of choice for time-series data with TTLs.
Write amplification matters especially on SSDs, which have limited write endurance. A 10x write amplification means your 100 TB/day write workload is actually writing 1 PB/day to SSDs. Choose compaction strategies and tune compaction frequency based on your SSD durability budget.
Wide-column stores are the natural choice for time-series and event streaming workloads. Their write optimization, flexible schema, and cell versioning align perfectly with these access patterns.
Time-Series Data Characteristics

Time-series workloads have a distinctive shape that wide-column stores fit naturally:

- Writes are append-only—new readings arrive constantly, and existing rows are rarely updated.
- Data arrives roughly in time order, and queries are almost always time-range scans.
- Recent data is read far more often than old data.
- Old data ages out, usually via TTL rather than explicit deletes.
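The data-modeling examples below lean on two small date helpers, `formatDate` and `generateDateBuckets`, which are assumed utilities rather than library functions. One possible sketch (daily buckets, UTC; a real codebase would use a date library):

```typescript
// Hypothetical implementation of the date helpers used in the examples.
// formatDate here supports only the 'YYYY-MM-DD' pattern used for daily
// time buckets.
function formatDate(d: Date, pattern: 'YYYY-MM-DD'): string {
  const y = d.getUTCFullYear();
  const m = String(d.getUTCMonth() + 1).padStart(2, '0');
  const day = String(d.getUTCDate()).padStart(2, '0');
  return `${y}-${m}-${day}`;
}

// Enumerate every daily bucket touched by [startTime, endTime], so a
// range query knows exactly which partitions to hit.
function generateDateBuckets(startTime: Date, endTime: Date): string[] {
  const buckets: string[] = [];
  const cursor = new Date(Date.UTC(
    startTime.getUTCFullYear(), startTime.getUTCMonth(), startTime.getUTCDate()
  ));
  while (cursor.getTime() <= endTime.getTime()) {
    buckets.push(formatDate(cursor, 'YYYY-MM-DD'));
    cursor.setUTCDate(cursor.getUTCDate() + 1);
  }
  return buckets;
}
```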
```typescript
// Time-series data modeling for IoT sensor data

// PATTERN 1: Time-bucketed partitions
// Prevents unbounded partition growth
// Enables efficient time-range queries

/*
CREATE TABLE sensor_data (
    sensor_id UUID,
    time_bucket TEXT,          -- e.g., "2024-01-15" (daily bucket)
    reading_time TIMESTAMP,
    temperature DOUBLE,
    humidity DOUBLE,
    pressure DOUBLE,
    PRIMARY KEY ((sensor_id, time_bucket), reading_time)
) WITH CLUSTERING ORDER BY (reading_time DESC)
  AND compaction = {'class': 'TimeWindowCompactionStrategy',
                    'compaction_window_size': 1,
                    'compaction_window_unit': 'DAYS'};
*/

// Why this works:
// 1. Partition key (sensor_id, time_bucket) bounds partition size
// 2. Clustering on reading_time DESC → recent data first
// 3. TWCS compaction → old time windows compact together, expire together

// Writing sensor data
async function writeSensorReading(
  sensorId: string,
  reading: SensorReading
): Promise<void> {
  const timeBucket = formatDate(reading.timestamp, 'YYYY-MM-DD');

  await cassandra.execute(
    `INSERT INTO sensor_data
     (sensor_id, time_bucket, reading_time, temperature, humidity, pressure)
     VALUES (?, ?, ?, ?, ?, ?)
     USING TTL 2592000`, // 30 days TTL
    [sensorId, timeBucket, reading.timestamp,
     reading.temperature, reading.humidity, reading.pressure],
    { consistency: ConsistencyLevel.ONE } // High throughput, eventual consistency
  );
}

// Querying recent data (single partition read)
async function getRecentReadings(
  sensorId: string,
  limit: number = 100
): Promise<SensorReading[]> {
  const today = formatDate(new Date(), 'YYYY-MM-DD');

  const result = await cassandra.execute(
    `SELECT * FROM sensor_data
     WHERE sensor_id = ? AND time_bucket = ?
     LIMIT ?`,
    [sensorId, today, limit],
    { consistency: ConsistencyLevel.ONE }
  );

  return result.rows;
}

// Querying time range (may span multiple partitions)
async function getReadingsInRange(
  sensorId: string,
  startTime: Date,
  endTime: Date
): Promise<SensorReading[]> {
  const buckets = generateDateBuckets(startTime, endTime);

  // Query each partition and aggregate
  const allReadings = await Promise.all(
    buckets.map(bucket =>
      cassandra.execute(
        `SELECT * FROM sensor_data
         WHERE sensor_id = ? AND time_bucket = ?
           AND reading_time >= ? AND reading_time <= ?`,
        [sensorId, bucket, startTime, endTime]
      )
    )
  );

  return allReadings.flatMap(r => r.rows);
}
```

Event Streaming Patterns
Event streaming workloads share similar characteristics but emphasize ordered processing and at-least-once delivery:
```typescript
// Event sourcing with wide-column store as event log

/*
CREATE TABLE events (
    aggregate_type TEXT,
    aggregate_id UUID,
    event_sequence BIGINT,
    event_time TIMESTAMP,
    event_type TEXT,
    payload BLOB,
    metadata MAP<TEXT, TEXT>,
    PRIMARY KEY ((aggregate_type, aggregate_id), event_sequence)
) WITH CLUSTERING ORDER BY (event_sequence ASC);

CREATE TABLE event_log (
    time_bucket TEXT,        -- e.g., "2024-01-15-14" (hourly)
    event_time TIMESTAMP,
    event_id UUID,
    aggregate_type TEXT,
    aggregate_id UUID,
    event_type TEXT,
    payload BLOB,
    PRIMARY KEY ((time_bucket), event_time, event_id)
) WITH CLUSTERING ORDER BY (event_time ASC);
*/

// Event store implementation
interface DomainEvent {
  eventId: string;
  aggregateType: string;
  aggregateId: string;
  eventType: string;
  payload: object;
  timestamp: Date;
  sequence: number;
}

class CassandraEventStore {
  // Append event atomically
  async appendEvent(event: DomainEvent): Promise<void> {
    const batch = new BatchStatement();

    // Write to aggregate-partitioned table (for reconstruction)
    batch.add(this.insertEventQuery, [
      event.aggregateType,
      event.aggregateId,
      event.sequence,
      event.timestamp,
      event.eventType,
      Buffer.from(JSON.stringify(event.payload)),
      {}
    ]);

    // Write to time-partitioned table (for streaming/replay)
    const timeBucket = formatDate(event.timestamp, 'YYYY-MM-DD-HH');
    batch.add(this.insertLogQuery, [
      timeBucket,
      event.timestamp,
      event.eventId,
      event.aggregateType,
      event.aggregateId,
      event.eventType,
      Buffer.from(JSON.stringify(event.payload))
    ]);

    // Logged batch for atomicity across tables
    await this.cassandra.execute(batch, {
      consistency: ConsistencyLevel.QUORUM
    });
  }

  // Reconstruct aggregate from events
  async getEventsForAggregate(
    aggregateType: string,
    aggregateId: string,
    fromSequence: number = 0
  ): Promise<DomainEvent[]> {
    const result = await this.cassandra.execute(
      `SELECT * FROM events
       WHERE aggregate_type = ? AND aggregate_id = ?
         AND event_sequence >= ?`,
      [aggregateType, aggregateId, fromSequence],
      { consistency: ConsistencyLevel.LOCAL_QUORUM }
    );

    return result.rows.map(row => this.mapRowToEvent(row));
  }

  // Stream events for processing (CDC-like pattern)
  async *getEventsSince(
    startTime: Date,
    batchSize: number = 1000
  ): AsyncGenerator<DomainEvent[]> {
    let currentBucket = formatDate(startTime, 'YYYY-MM-DD-HH');
    const endBucket = formatDate(new Date(), 'YYYY-MM-DD-HH');

    while (currentBucket <= endBucket) {
      const result = await this.cassandra.execute(
        `SELECT * FROM event_log
         WHERE time_bucket = ?
         ORDER BY event_time ASC`,
        [currentBucket]
      );

      yield result.rows.map(row => this.mapRowToEvent(row));
      currentBucket = incrementHour(currentBucket);
    }
  }
}
```

Wide-column stores excel at TTL-based data expiration. With Time-Window Compaction, expired data is cleaned up efficiently—entire SSTables containing only expired data are simply deleted. This is far more efficient than individual row deletes in B-tree databases.
Achieving maximum write throughput requires understanding the tuning knobs available in wide-column stores. While defaults are reasonable, high-throughput workloads benefit from deliberate configuration.
MemTable Configuration
The MemTable is the first staging area for writes. Larger MemTables:

- Flush less often, producing fewer, larger L0 SSTables and therefore less compaction work.
- Absorb more overwrites of hot keys in memory before anything reaches disk.
- Cost more heap, and lengthen commit log replay after a crash.
| Workload Type | MemTable Size | Reasoning |
|---|---|---|
| Default/Mixed | 64-128 MB | Balance between memory and flush frequency |
| Write-Heavy | 256-512 MB | Reduce flush frequency, batch more writes |
| Memory-Constrained | 32-64 MB | Sacrifice throughput for memory stability |
| High-Cardinality Columns | 128-256 MB | Larger MemTable to batch column families |
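A quick way to reason about the sizing table above: flush frequency is just sustained ingest rate divided by MemTable size. A back-of-envelope helper (the function name is mine):

```typescript
// Back-of-envelope: how often will the MemTable flush under a sustained
// write rate? More flushes mean more L0 SSTables and more compaction work.
function flushesPerHour(writeRateMBps: number, memtableSizeMB: number): number {
  return (writeRateMBps * 3600) / memtableSizeMB;
}

// 20 MB/s of writes into a 64 MB MemTable vs a 256 MB one:
const small = flushesPerHour(20, 64);   // 1125 flushes/hour
const large = flushesPerHour(20, 256);  // 281.25 flushes/hour — 4x fewer
```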
Commit Log / WAL Optimization
The commit log is often the bottleneck for write latency. Configuration options:
```yaml
# cassandra.yaml

# Commit log sync modes:
# - periodic: Sync every commitlog_sync_period_in_ms (default: 10000ms)
# - batch: Sync after each write, wait up to commitlog_sync_batch_window_in_ms for batching
# - group: Sync after each write, with group commit for concurrent writers

commitlog_sync: periodic
commitlog_sync_period_in_ms: 10000       # Sync every 10 seconds

# For a smaller data-loss window on crash (more frequent syncs):
# commitlog_sync: periodic
# commitlog_sync_period_in_ms: 1000      # Sync every 1 second

# For guaranteed durability (lowest throughput):
# commitlog_sync: batch
# commitlog_sync_batch_window_in_ms: 2   # Batch writes for 2ms then sync

# Commit log size (should handle burst writes)
commitlog_total_space_in_mb: 8192        # 8GB total commit log space

# Compression (reduces I/O at cost of CPU)
commitlog_compression:
  - class_name: LZ4Compressor
```

Concurrent Writes Configuration
Cassandra manages concurrency through thread pools. Key settings:
```yaml
# cassandra.yaml

# Native transport (CQL) settings
native_transport_max_threads: 128  # Threads handling client requests

# Concurrent writes
concurrent_writes: 32              # Threads for local writes (MemTable + commit log)
                                   # Rule of thumb: 8 × number of cores

# Concurrent compactors
concurrent_compactors: 4           # Threads for background compaction
                                   # Rule of thumb: min(cores, max(2, cores / 2))

# Batch size limits (prevent huge batches from causing issues)
batch_size_warn_threshold_in_kb: 5   # Log warning
batch_size_fail_threshold_in_kb: 50  # Reject batch
```

Write-optimized doesn't mean read-neglected. Wide-column stores provide mechanisms to maintain acceptable read performance even with high write throughput.
Read Amplification in LSM-Trees
The trade-off for fast writes is read amplification—potentially checking multiple locations for each read:

1. The active MemTable (and any immutable MemTables awaiting flush).
2. Every L0 SSTable, since their key ranges may overlap.
3. Up to one SSTable per deeper level.
Without mitigation, a read might check 10-20 files. Mitigations include:

- **Bloom filters**: a per-SSTable probabilistic filter that lets reads skip files that definitely don't contain the key.
- **Key and partition-index caches**: avoid re-reading index blocks for hot keys.
- **Compaction**: fewer, better-organized SSTables mean fewer candidate files per read.
- **Block and row caches**: serve repeated reads entirely from memory.
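Bloom filters are the workhorse mitigation. A toy version—not any engine's real implementation; the hash is a simple seeded FNV-1a variant chosen for brevity—shows the mechanics: a few hash probes over a bit array answer "definitely absent" or "possibly present":

```typescript
// Toy Bloom filter: k seeded hash probes set/check bits in a fixed-size
// array. "false" means the key is definitely not in the SSTable, so the
// read path can skip that file without touching disk; "true" means it
// *might* be there (false positives possible, false negatives impossible).
class BloomFilter {
  private bits: Uint8Array;

  constructor(private size: number, private hashes: number) {
    this.bits = new Uint8Array(size);
  }

  // Seeded FNV-1a-style string hash; real engines use murmur or xxHash.
  private hash(key: string, seed: number): number {
    let h = 2166136261 ^ seed;
    for (let i = 0; i < key.length; i++) {
      h ^= key.charCodeAt(i);
      h = Math.imul(h, 16777619);
    }
    return Math.abs(h) % this.size;
  }

  add(key: string): void {
    for (let s = 0; s < this.hashes; s++) this.bits[this.hash(key, s)] = 1;
  }

  mightContain(key: string): boolean {
    for (let s = 0; s < this.hashes; s++) {
      if (this.bits[this.hash(key, s)] === 0) return false;
    }
    return true;
  }
}
```

In an LSM engine, one filter is built per SSTable at flush time and kept in memory; every read consults the filter before opening the file, which is why `bloom_filter_fp_chance` trades memory directly for avoided disk reads.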
```yaml
# cassandra.yaml - Read Performance Settings

# Block cache (OS page cache is primary, this is JVM cache)
file_cache_size_in_mb: 512   # JVM file buffer cache

# Key cache (partition index entries)
key_cache_size_in_mb: 100
key_cache_save_period: 14400 # Save every 4 hours

# Row cache (use only for hot partition workloads)
row_cache_size_in_mb: 0      # Disabled by default, enable carefully
row_cache_save_period: 0

# Read-ahead for sequential scans
disk_optimization_strategy: ssd  # or 'spinning' for HDD

# Bloom filter settings (per-table)
# CREATE TABLE ... WITH bloom_filter_fp_chance = 0.01;  # 1% false positives
# Lower = more memory usage, fewer false reads
# Default: 0.01 (1%)

# sstable_preemptive_open_interval_in_mb: 50  # Preload SSTables for faster reads
```

Balancing Write and Read Performance
The art of wide-column store operation is balancing write throughput with read latency:
| Configuration | Write Impact | Read Impact |
|---|---|---|
| Larger MemTable | Higher throughput | Neutral to faster (recent data served from memory) |
| More aggressive compaction | Lower throughput (I/O used) | Faster reads (fewer files) |
| Higher bloom filter accuracy | More memory | Fewer false disk reads |
| Larger block cache | Memory used | Faster repeated reads |
| STCS compaction | Lower write amp | Higher read amp |
| LCS compaction | Higher write amp | Lower read amp |
Track pending compactions, SSTable count per table, read latency percentiles (p99, p999), and cache hit ratios. If pending compactions grow unboundedly, your write rate exceeds compaction capacity—either add nodes or reduce write rate.
We've explored why wide-column stores excel at write-heavy workloads and how to leverage this strength effectively. Let's consolidate the key insights:

- B-trees pay for writes with random I/O; LSM-trees make every disk write in the write path sequential.
- The cost doesn't disappear—it moves to background compaction, showing up as 10-30x write amplification.
- Compaction strategy is the main tuning lever: STCS for raw write throughput, LCS for read-heavy tables, TWCS for time-series with TTLs.
- Time-bucketed partition keys plus TTLs make wide-column stores a natural fit for time-series and event streaming workloads.
- Monitor pending compactions and SSTable counts—sustained growth means writes are outrunning compaction capacity.
What's Next:
We've covered the column-family model, explored Cassandra and HBase implementations, and understood write-optimized workloads. The final page of this module brings it all together with a comprehensive guide to use cases and trade-offs—helping you decide when wide-column stores are the right choice and when to look elsewhere.
You now understand why wide-column stores achieve 10-100x higher write throughput than traditional databases, how LSM-trees enable this, and the trade-offs involved. This knowledge enables you to design systems that leverage high write throughput while maintaining acceptable read performance.