Cassandra was born from a simple premise: writes are often more important than reads. While traditional databases optimize for reads (B-trees, in-place updates), many modern workloads—time-series data, event logging, user activity tracking, IoT sensors—generate writes at rates that overwhelm read-optimized architectures.
Apache Cassandra's write path is engineered from the ground up for extreme write throughput. A well-tuned Cassandra cluster can sustain hundreds of thousands of writes per second per node, with sub-millisecond latencies. This performance comes from a fundamentally different storage architecture: the Log-Structured Merge Tree (LSM Tree).
By the end of this page, you will understand: (1) The complete write path from client to disk, (2) Why LSM trees outperform B-trees for write-intensive workloads, (3) The role of commit logs, memtables, and SSTables, (4) How compaction works and the different compaction strategies, (5) The trade-offs of write optimization, and (6) How to tune Cassandra for maximum write performance.
When a write arrives at a Cassandra node, it follows a carefully designed path that prioritizes durability and speed:
Write Path Steps:
```
Cassandra Write Path
====================

Client sends write request
        │
        ▼
┌─────────────────────────────┐
│ Coordinator Node            │  ← Receives request, identifies replicas
└─────────────────────────────┘
        │
        ▼
┌─────────────────────────────┐
│ Forward to Replica Nodes    │  ← Parallel write to all replicas
└─────────────────────────────┘
        │
        ▼  (On each replica)
┌─────────────────────────────┐
│ 1. Write to Commit Log      │  ← Sequential append (fast!)
│    (append-only, durable)   │    Survives crashes
└─────────────────────────────┘
        │
        ▼
┌─────────────────────────────┐
│ 2. Write to Memtable        │  ← In-memory sorted structure
│    (in-memory, sorted)      │    Extremely fast
└─────────────────────────────┘
        │
        ▼
┌─────────────────────────────┐
│ 3. Return ACK to Client     │  ← Write complete!
│    (after CL satisfied)     │    (Disk flush is async)
└─────────────────────────────┘

Background (async):
┌─────────────────────────────┐
│ 4. Memtable → SSTable       │  ← When memtable fills
│    (flush to disk)          │    Sequential write
└─────────────────────────────┘
        │
        ▼
┌─────────────────────────────┐
│ 5. Compaction               │  ← Background merge of SSTables
│    (merge, remove tombstones)│   Reclaims space
└─────────────────────────────┘
```

Why This Is Fast:
No Random I/O: The commit log is append-only. No seeking, just sequential writes—the fastest operation on any storage device.
Memory-First: The memtable is an in-memory data structure. After the commit log write (for durability), the client can return. The actual disk write happens later.
No Read-Before-Write: Unlike B-trees that must read existing data to update in place, Cassandra just writes new data. Conflict resolution (last-write-wins) happens at read time.
Parallel Replication: Writes to replicas happen in parallel, not sequentially. Total latency is bounded by the slowest replica needed to satisfy the consistency level, not the sum of all replicas.
Decoupled Durability and Performance: The commit log ensures durability; memtable flushing is a background operation that doesn't block clients.
Typical write latency: ~0.5-2ms. Commit log append: ~0.1-0.5ms (sequential write). Memtable insert: ~0.01-0.1ms (in-memory). Network overhead: ~0.2-0.5ms. The commit log is the dominant factor; SSDs dramatically improve this.
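From the client's side all of this is invisible: the driver simply gets an acknowledgment once enough replicas have appended to their commit logs and updated their memtables. A minimal sketch with the DataStax Python driver, assuming a locally reachable cluster and a hypothetical demo.sensor_readings table:

```python
# Minimal sketch using the DataStax Python driver (pip install cassandra-driver).
# Assumes a local cluster and a hypothetical demo.sensor_readings table.
import uuid
from datetime import datetime, timezone

from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("demo")

# Prepared statements are parsed once; each execution is a lightweight write.
insert = session.prepare(
    "INSERT INTO sensor_readings (sensor_id, reading_time, value) VALUES (?, ?, ?)"
)
# The coordinator acks as soon as this many replicas have appended to their
# commit log and updated their memtable; no SSTable flush is on the critical path.
insert.consistency_level = ConsistencyLevel.LOCAL_QUORUM

session.execute(insert, (uuid.uuid4(), datetime.now(timezone.utc), 21.7))
cluster.shutdown()
```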
Traditional databases (PostgreSQL, MySQL) use B-trees for storage. B-trees are excellent for reads but suffer for write-intensive workloads. Cassandra uses Log-Structured Merge Trees (LSM trees), which flip this trade-off.
B-Tree Write Path:
```
B-Tree Write (Traditional Databases)
====================================

1. Find the page containing the key (multiple disk seeks)
2. Read the page into memory
3. Modify the page in memory
4. Write the page back to disk (random I/O)
5. Update index pages if necessary (more random I/O)
6. Write to WAL for durability (sequential, but still needed)

Total: 2-10+ disk I/O operations per write
       Random I/O dominates
       Write amplification from page splits

Performance: 1,000 - 10,000 writes/sec per node (typical)
```

LSM Tree Write Path:
```
LSM Tree Write (Cassandra)
==========================

1. Append to commit log (1 sequential disk write)
2. Insert into memtable (1 in-memory operation)
3. Done! (Return to client)

Background (not on critical path):
4. Flush memtable to SSTable when full (1 large sequential write)
5. Compaction merges SSTables (sequential reads and writes)

Total: 1 disk I/O operation per write
       100% sequential I/O
       No read-before-write

Performance: 100,000 - 500,000+ writes/sec per node (typical)
```

| Characteristic | B-Tree | LSM Tree |
|---|---|---|
| Write I/O pattern | Random | Sequential |
| Disk seeks per write | 2-10+ | 0-1 |
| Write amplification | High (page splits) | Lower (compaction) |
| Write throughput | ~10K/sec | ~500K/sec |
| Read complexity | O(log n) | O(log n) + SSTable count |
| Space amplification | Low (in-place) | Higher (temporary during compaction) |
| Update performance | Fast (in-place) | Fast (append; old version stays) |
LSM trees trade read performance for write performance. Reads may need to check multiple SSTables plus the memtable. Compaction mitigates this by merging SSTables, but it consumes CPU and I/O. Cassandra provides configurable compaction strategies to tune this trade-off.
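To make the contrast concrete, here is a toy single-node LSM write path in plain Python. It is a sketch of the idea, not Cassandra's implementation: one sequential log append, one in-memory insert, and a sorted flush only when the buffer fills.

```python
# Toy illustration of the LSM write path (not Cassandra's code): every write is
# one sequential log append plus one in-memory insert; sorted "SSTables" are
# only produced when the memtable fills up.
import json
import os


class ToyLSM:
    def __init__(self, data_dir, memtable_limit=1000):
        os.makedirs(data_dir, exist_ok=True)
        self.data_dir = data_dir
        self.memtable_limit = memtable_limit
        self.memtable = {}                      # key -> value, kept in memory
        self.sstables = []                      # paths of flushed, immutable files
        self.log = open(os.path.join(data_dir, "commit.log"), "a")

    def write(self, key, value):
        # 1. Durability: sequential append to the commit log.
        self.log.write(json.dumps([key, value]) + "\n")
        self.log.flush()                        # hand it to the OS; real Cassandra also controls fsync (sync modes)
        # 2. Speed: in-memory insert; no read-before-write, last write wins.
        self.memtable[key] = value
        # 3. Background work (done inline here for simplicity): flush when full.
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        path = os.path.join(self.data_dir, f"sstable-{len(self.sstables)}.json")
        with open(path, "w") as f:
            json.dump(sorted(self.memtable.items()), f)   # one large sequential write
        self.sstables.append(path)
        self.memtable = {}


db = ToyLSM("/tmp/toy_lsm")
for i in range(2500):
    db.write(f"user-{i}", {"visits": 1})
db.write("user-7", {"visits": 2})   # an "update" is just another append + memtable insert
print(len(db.sstables), "SSTables on disk,", len(db.memtable), "rows still in the memtable")
```

Note that after the last write, user-7 exists both in an older SSTable and in the memtable; nothing was modified in place. A B-tree engine would instead locate and rewrite the page holding that key, which is where the extra random I/O in the table above comes from.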
The commit log is Cassandra's durability mechanism. Every write is first appended to the commit log before any other processing. This ensures that if the node crashes, writes can be replayed from the commit log on recovery.
Commit Log Properties:

- Append-only: mutations are only ever written to the end of the current segment, never modified in place.
- Per-node and shared: a single commit log per node covers writes to every table on that node.
- Segmented: the log is a series of fixed-size segment files that are recycled once every memtable they cover has been flushed.
- Durability boundary: how much data a crash can lose is governed by the sync mode below.
Commit Log Sync Modes:
| Mode | Behavior | Durability | Performance |
|---|---|---|---|
| periodic (default) | Sync every commitlog_sync_period_in_ms (10s default) | May lose up to sync period of writes on crash | Highest throughput |
| batch | Block write until sync; sync when batch fills or timeout | Minimal data loss window | Adds ~2ms latency per write |
| group | Similar to batch but uses group commit | Minimal data loss window | Better than batch for high throughput |
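One way to get a feel for the periodic-versus-batch trade-off is to time sequential appends with and without an fsync per write. A toy measurement sketch (absolute numbers depend entirely on your disk):

```python
# Toy measurement of sequential appends with and without fsync per write.
# Roughly mirrors the periodic vs batch trade-off; numbers depend on the disk.
import os
import time


def timed_appends(path, count, fsync_each):
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND)
    start = time.perf_counter()
    for _ in range(count):
        os.write(fd, b"x" * 256)          # pretend this is a mutation record
        if fsync_each:
            os.fsync(fd)                  # batch-like: durable before returning
    if not fsync_each:
        os.fsync(fd)                      # periodic-like: one sync at the end
    os.close(fd)
    return time.perf_counter() - start


n = 5000
print("periodic-style:", timed_appends("/tmp/commitlog_periodic.bin", n, False))
print("batch-style:   ", timed_appends("/tmp/commitlog_batch.bin", n, True))
```

On NVMe both runs finish quickly; on spinning disks the per-write fsync run is dramatically slower, which is essentially the latency that batch-style syncing adds to every write.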
```yaml
# Commit log configuration in cassandra.yaml

# Sync mode: periodic, batch, or group
commitlog_sync: periodic

# For periodic mode: how often to sync (default 10s)
commitlog_sync_period_in_ms: 10000

# For batch mode: max time to wait before syncing
# commitlog_sync_batch_window_in_ms: 2

# Total space for commit log segments
commitlog_total_space_in_mb: 8192

# Each segment size (32MB typical)
commitlog_segment_size_in_mb: 32

# Compression for commit log (LZ4 recommended)
commitlog_compression:
  - class_name: LZ4Compressor
```

Recovery Process:
When a Cassandra node restarts:

1. It scans the commit log segments left on disk.
2. Mutations that were not yet persisted in an SSTable are replayed into fresh memtables.
3. The rebuilt memtables are flushed to disk and the replayed segments are recycled.
4. The node rejoins the cluster and begins serving traffic.
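To make replay concrete, here is a toy sketch in plain Python that rebuilds a memtable from the commit log written by the earlier ToyLSM example. It is illustrative only; Cassandra's log format and replay logic are far more involved.

```python
# Toy sketch of commit log replay on restart (not Cassandra code): re-read the
# append-only log sequentially and rebuild the in-memory state.
import json


def replay_commit_log(path):
    memtable = {}
    with open(path) as log:
        for line in log:                      # replay in original write order
            key, value = json.loads(line)
            memtable[key] = value             # later entries overwrite earlier ones
    return memtable


recovered = replay_commit_log("/tmp/toy_lsm/commit.log")
print(f"recovered {len(recovered)} live keys from the log")
```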
Best Practices:

- On spinning disks, give the commit log its own device so flush and compaction I/O don't disturb its sequential appends; on SSD/NVMe this matters far less.
- Enable LZ4 commit log compression to reduce I/O volume.
- Keep the default periodic sync for throughput; consider group or batch only if your replication factor and consistency level alone don't give you an acceptable data-loss window.
If the commit log disk fills up (because memtables aren't flushing fast enough), writes block until space is available. Monitor commitlog_pending_tasks and commitlog_size metrics. If these grow, you may need faster storage, more memory for memtables, or compaction tuning.
A memtable is Cassandra's in-memory write buffer for each table. When writes arrive (after commit log), they go into the appropriate table's memtable. Memtables are sorted data structures (skip lists in modern Cassandra) that enable fast writes and efficient reads.
Memtable Lifecycle:

1. A write (already recorded in the commit log) is inserted into the active memtable for its table.
2. The memtable grows as writes accumulate, kept sorted by partition and clustering key.
3. When a flush is triggered, the memtable is marked immutable and a new, empty memtable takes over for incoming writes.
4. The frozen memtable is written to disk as an SSTable in one sequential pass; its memory and the commit log segments it covered are then released.
Flush Triggers:
- The memtable space configured by memtable_heap_space_in_mb / memtable_offheap_space_in_mb is exhausted
- Total commit log size exceeds commitlog_total_space_in_mb (the oldest dirty memtables are flushed so segments can be recycled)
- A manual flush via nodetool flush
- Node drain or shutdown (nodetool drain)
```yaml
# Memtable configuration in cassandra.yaml

# Memory allocation for memtables (fraction of heap)
# Default: 0.25 (25% of heap)
memtable_heap_space_in_mb: 2048

# Off-heap memtable space (for native memory)
memtable_offheap_space_in_mb: 2048

# Flush threshold for individual memtable
# (Not commonly tuned; global settings usually preferred)

# Memtable implementation
# Options: skiplist (default), skiplist_sharded
memtable:
  class_name: org.apache.cassandra.db.memtable.SkipListMemtable
```

Memtable Performance Characteristics:

Inserts are O(log n) operations on an in-memory skip list, and reads always consult the memtable first, so recently written rows are served straight from memory.
Off-Heap Memtables:
Cassandra can store memtables off-heap (in native memory outside the JVM heap). Benefits:

- Less JVM garbage-collection pressure, since memtable contents no longer live on the heap
- Room for larger memtables without enlarging the heap, which means fewer flushes and fewer SSTables
Trade-off: Off-heap is slightly slower for writes and requires careful native memory management.
Larger memtables reduce flush frequency but increase memory pressure and recovery time. Smaller memtables flush more often, creating more SSTables. Balance based on write rate and available memory. For write-heavy workloads, larger memtables with off-heap storage often help.
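A back-of-the-envelope calculation, with assumed numbers, shows how memtable size translates into flush frequency and SSTable count:

```python
# Back-of-the-envelope memtable sizing (assumed numbers, not measurements):
# how often will the memtable space fill up and trigger a flush?
writes_per_second = 50_000
avg_mutation_bytes = 300          # serialized row size, assumed
memtable_space_mb = 4096          # heap + off-heap space allotted to memtables

bytes_per_second = writes_per_second * avg_mutation_bytes
seconds_per_flush = (memtable_space_mb * 1024 * 1024) / bytes_per_second
sstables_per_hour = 3600 / seconds_per_flush

print(f"~{seconds_per_flush:.0f}s between flushes, ~{sstables_per_hour:.0f} new SSTables/hour")
# Halving the memtable space doubles the flush rate and the number of SSTables
# compaction must merge; doubling it raises memory pressure and replay time.
```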
When a memtable flushes, it produces a Sorted String Table (SSTable)—an immutable, on-disk file containing a sorted sequence of key-value pairs.
SSTable Structure:
Each SSTable is actually multiple files:
| File Extension | Purpose | Description |
|---|---|---|
| -Data.db | Data file | Actual row data, sorted by partition and clustering key |
| -Index.db | Primary index | Partition key → position in Data.db |
| -Summary.db | Summary index | Sampled entries from Index.db for fast lookup |
| -Filter.db | Bloom filter | Probabilistic filter to avoid unnecessary reads |
| -CompressionInfo.db | Compression metadata | Chunk boundaries for compressed data |
| -Statistics.db | Statistics | Min/max values, tombstone counts, etc. |
| -TOC.txt | Table of contents | Lists component files for this SSTable |
```
SSTable Read Path
=================

Query: SELECT * FROM users WHERE user_id = 'alice-uuid'

1. Check Bloom Filter (-Filter.db)
   └─ If filter says "definitely not here" → skip this SSTable
   └─ If filter says "might be here" → continue

2. Check Summary Index (-Summary.db)
   └─ Find range that might contain 'alice-uuid'
   └─ Returns offset range in Index.db

3. Check Primary Index (-Index.db)
   └─ Binary search within range for 'alice-uuid'
   └─ Returns offset in Data.db

4. Read Data (-Data.db)
   └─ Seek to offset, read partition data
   └─ Decompress if needed
   └─ Return to caller

Bloom filter can skip entire SSTables!
This is why Bloom filter tuning matters for read performance.
```

Why Immutability Matters:
Immutable SSTables need no locks for concurrent readers, are written with purely sequential I/O, can be cached and compressed safely, and make snapshots cheap (a snapshot is just hard links to existing files).

SSTable Compaction:
The trade-off of immutability is that updates and deletes don't modify existing data; they write new versions. This means:

- An updated row can exist in several SSTables at once, with the newest timestamp winning at read time
- Deletes are recorded as tombstones, which themselves occupy space
- Disk usage temporarily grows beyond the size of the live data
Compaction is the background process that reclaims this space by merging SSTables.
Bloom filters are crucial for read performance. They let Cassandra skip SSTables that definitely don't contain the requested data. The bloom_filter_fp_chance setting (default 0.01 = 1% false positive) trades memory for accuracy. Lower values use more memory but skip more unnecessary disk reads.
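A toy Bloom filter in plain Python shows why a few bits per key let Cassandra rule out most SSTables without touching disk. Cassandra's real implementation differs, but the "no means no, maybe means check" behaviour is the same.

```python
# Toy Bloom filter (illustrative only; Cassandra's implementation differs).
# "No" answers are always correct; "maybe" answers are occasionally wrong.
import hashlib


class ToyBloomFilter:
    def __init__(self, num_bits=8192, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, key):
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))


# One filter per SSTable: check it before paying for any disk I/O.
sstable_filter = ToyBloomFilter()
for i in range(1000):
    sstable_filter.add(f"user-{i}")

print(sstable_filter.might_contain("user-42"))      # True (really present)
print(sstable_filter.might_contain("user-99999"))   # almost always False -> skip this SSTable
```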
Compaction merges multiple SSTables into fewer, larger SSTables. It removes tombstones (deleted data), merges cells with different timestamps, and reduces the number of SSTables that reads must check.
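The core of a compaction merge can be sketched in a few lines of Python. This is an illustration of the idea (last write wins, expired tombstones dropped), not Cassandra's actual code:

```python
# Toy sketch of a compaction merge (not Cassandra code): for each key keep only
# the newest (timestamp, value); drop entries shadowed by newer writes and purge
# tombstones once their grace period has passed.
import time

TOMBSTONE = object()          # stand-in for a deletion marker
GC_GRACE_SECONDS = 864000     # mirrors Cassandra's default gc_grace_seconds (10 days)


def compact(sstables, now=None):
    """sstables: list of dicts mapping key -> (write_timestamp, value)."""
    now = now or time.time()
    merged = {}
    for sstable in sstables:
        for key, (ts, value) in sstable.items():
            if key not in merged or ts > merged[key][0]:
                merged[key] = (ts, value)     # last write wins
    # Drop tombstones whose grace period has expired; they no longer need to be
    # kept around to suppress older copies of the row.
    return {
        k: v for k, v in merged.items()
        if not (v[1] is TOMBSTONE and now - v[0] > GC_GRACE_SECONDS)
    }


old = {"alice": (100.0, {"city": "Oslo"}), "bob": (110.0, {"city": "Lima"})}
new = {"alice": (200.0, {"city": "Bergen"}), "bob": (5_000_000.0, TOMBSTONE)}
print(compact([old, new], now=5_000_000.0 + GC_GRACE_SECONDS + 1))
# -> only alice's newest version survives; bob's old row and tombstone are gone
```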
Cassandra provides several compaction strategies, each optimized for different workloads:
Size-Tiered Compaction Strategy (STCS)
How it works:

STCS groups SSTables into buckets of similar size. When a bucket accumulates min_threshold SSTables (4 by default), they are merged into one larger SSTable; over time this produces a few very large SSTables plus a tail of smaller ones.
Best for:

Write-heavy and append-mostly workloads, where rows are rarely updated after they are written and read latency is less critical.
Trade-offs:

A partition's rows can be spread across many SSTables, so read performance is variable, and merging the largest SSTables can temporarily require roughly as much free disk space as the data being compacted.
Configuration:
```cql
ALTER TABLE mytable WITH compaction = {
    'class': 'SizeTieredCompactionStrategy',
    'min_threshold': 4,
    'max_threshold': 32
};
```
| Strategy | Write Amplification | Space Amplification | Read Performance | Best Use Case |
|---|---|---|---|---|
| STCS | Low (~4x) | High (2x) | Variable | Write-heavy, append-only |
| LCS | High (10-30x) | Low (~10%) | Excellent | Read-heavy, many updates |
| TWCS | Very Low | Depends on TTL | Good | Time-series with TTL |
| UCS | Medium | Medium | Good | Mixed/unknown workloads |
Changing compaction strategy on a live table triggers recompaction of all data—a resource-intensive operation. Test in non-production first. During recompaction, expect higher I/O and potential latency spikes. Plan for 2x disk space during the transition.
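It is usually better to pick the right strategy when the table is created. For example, a time-series table whose rows expire via TTL is a natural fit for TWCS; a hedged sketch using the Python driver (keyspace and table names are hypothetical):

```python
# Hedged sketch: switching a hypothetical time-series table to TWCS with the
# Python driver. On a live table this triggers recompaction, so test first.
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("metrics")

session.execute("""
    ALTER TABLE sensor_readings
    WITH compaction = {
        'class': 'TimeWindowCompactionStrategy',
        'compaction_window_unit': 'DAYS',
        'compaction_window_size': 1
    }
    AND default_time_to_live = 2592000   -- 30 days; whole windows expire together
""")
cluster.shutdown()
```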
Write amplification measures how many bytes are written to storage for each byte of data received from clients. In LSM trees, write amplification comes primarily from compaction.
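A rough worked example with assumed numbers:

```python
# Rough write-amplification arithmetic with assumed numbers (not a benchmark).
client_bytes = 100 * 1024**3          # 100 GB of client writes
commit_log = client_bytes             # every mutation is appended once
memtable_flush = client_bytes         # flushed once into initial SSTables
compaction_rewrites = 3               # assume STCS rewrites each byte ~3 more times

total_disk_writes = commit_log + memtable_flush + compaction_rewrites * client_bytes
write_amplification = total_disk_writes / client_bytes
print(write_amplification)            # -> 5.0 under these assumptions; LCS would be far higher
```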
Sources of Write Amplification:

- The commit log append (every mutation is written once for durability)
- The memtable flush that writes the data into its first SSTable
- Every compaction that later rewrites the same data into a new SSTable
Why Write Amplification Matters:

Every extra byte written consumes disk bandwidth and SSD endurance. Compaction I/O competes with client reads and writes, so high write amplification effectively lowers the throughput a node can sustain.
The Fundamental Trade-off:

Write amplification, read amplification, and space amplification cannot all be minimized at once. Each compaction strategy picks a different balance: STCS writes the least but leaves more SSTables to read and more duplicate data on disk, while LCS keeps reads and space tight at the cost of far more compaction writes.
Space Amplification:

Another dimension of the trade-off: obsolete row versions and tombstones occupy disk until compaction removes them, and compaction itself needs temporary working space. Under STCS, merging the largest SSTables can briefly require close to twice the table's size on disk.
Choosing the Right Strategy:
Consider:

- The write-to-read ratio of the table
- How often existing rows are updated or deleted
- Whether the data is time-ordered and expires via TTL
- How much spare disk space and I/O headroom the nodes have
Monitor 'nodetool compactionstats' and the pending_compactions metric. High pending compaction counts mean compaction can't keep up with writes. This leads to many SSTables, even slower reads, and eventually disk space issues. Tune concurrency with concurrent_compactors and compaction_throughput_mb_per_sec.
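A small script can watch this from the outside; a hedged sketch that shells out to nodetool (the exact output format varies between Cassandra versions, so treat the parsing as illustrative):

```python
# Hedged sketch: scrape pending compactions from `nodetool compactionstats`.
# Output format varies across Cassandra versions; adjust the parsing as needed.
import re
import subprocess

output = subprocess.run(
    ["nodetool", "compactionstats"], capture_output=True, text=True, check=True
).stdout

match = re.search(r"pending tasks:\s*(\d+)", output)
pending = int(match.group(1)) if match else 0
if pending > 100:                      # threshold is arbitrary; tune to your cluster
    print(f"WARNING: compaction is falling behind ({pending} pending tasks)")
```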
For write-intensive workloads, several tuning knobs can dramatically improve performance:
```yaml
# cassandra.yaml optimizations for write-heavy workloads

# Faster commit log sync (with some durability trade-off)
commitlog_sync: periodic
commitlog_sync_period_in_ms: 10000

# Increase memtable space (adjust based on heap size)
memtable_heap_space_in_mb: 4096
memtable_offheap_space_in_mb: 4096

# More compaction throughput
compaction_throughput_mb_per_sec: 128

# More concurrent compactors (if CPU allows)
concurrent_compactors: 4

# Use STCS for write-heavy tables
# ALTER TABLE mytable WITH compaction = {
#   'class': 'SizeTieredCompactionStrategy'
# };

# Compression (LZ4 for speed)
# Applied at table level with CREATE TABLE ... WITH compression = {...}
```

Client-Side Optimizations:
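Server-side settings are only half the story; the driver matters too. A hedged sketch of common client-side techniques with the DataStax Python driver (keyspace and table are hypothetical):

```python
# Illustrative client-side techniques: prepared statements, asynchronous writes,
# and unlogged batches restricted to a single partition.
import uuid
from datetime import datetime, timezone

from cassandra.cluster import Cluster
from cassandra.query import BatchStatement, BatchType

cluster = Cluster(["127.0.0.1"])          # token-aware routing is on by default
session = cluster.connect("demo")

# 1. Prepared statements: parsed once, then executed cheaply many times.
insert = session.prepare(
    "INSERT INTO events (device_id, event_time, payload) VALUES (?, ?, ?)"
)

# 2. Async execution: keep many writes in flight instead of waiting per round trip.
rows = [(uuid.uuid4(), datetime.now(timezone.utc), f"event-{i}") for i in range(100)]
futures = [session.execute_async(insert, row) for row in rows]
for future in futures:
    future.result()                        # surface any per-write errors

# 3. Batches: UNLOGGED and only for rows in the *same* partition; multi-partition
#    batches add coordinator overhead rather than speed.
device_id = uuid.uuid4()
batch = BatchStatement(batch_type=BatchType.UNLOGGED)
for i in range(10):
    batch.add(insert, (device_id, datetime.now(timezone.utc), f"burst-{i}"))
session.execute(batch)

cluster.shutdown()
```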
Optimal settings depend on your specific hardware, data model, and access patterns. Use tools like cassandra-stress or YCSB to benchmark different configurations. What works for one workload may not work for another.
We've explored Cassandra's write-optimized architecture in depth. Let's consolidate the key concepts:

- Every write is a sequential commit log append plus an in-memory memtable insert; nothing on the critical path does random I/O or a read-before-write.
- Memtables flush to immutable SSTables; compaction later merges SSTables, discards shadowed versions, and purges tombstones.
- LSM trees trade read-time work and background compaction for very high write throughput; B-trees make the opposite trade.
- The compaction strategy (STCS, LCS, TWCS, UCS) is the main lever for balancing write amplification, space amplification, and read performance.
- Tuning commit log sync, memtable sizing, and compaction throughput pushes write performance further, with durability and read-latency trade-offs to weigh.
What's Next:
With Cassandra's architecture now understood—from masterless design and gossip to tunable consistency, the wide-column model, and write-optimized storage—the final page brings it all together: When to Use Cassandra. We'll explore the decision framework for choosing Cassandra, common use cases, and situations where other databases might be more appropriate.
You now understand Cassandra's write-optimized architecture—the LSM tree foundation that enables industry-leading write performance. Next, we'll synthesize everything into practical guidance on when to choose Cassandra for your system design.