Cassandra was born from a simple premise: writes are often more important than reads. While traditional databases optimize for reads (B-trees, in-place updates), many modern workloads—time-series data, event logging, user activity tracking, IoT sensors—generate writes at rates that overwhelm read-optimized architectures.
Apache Cassandra's write path is engineered from the ground up for extreme write throughput. A well-tuned Cassandra cluster can sustain hundreds of thousands of writes per second per node, with sub-millisecond latencies. This performance comes from a fundamentally different storage architecture: the Log-Structured Merge Tree (LSM Tree).
By the end of this page, you will understand: (1) The complete write path from client to disk, (2) Why LSM trees outperform B-trees for write-intensive workloads, (3) The role of commit logs, memtables, and SSTables, (4) How compaction works and the different compaction strategies, (5) The trade-offs of write optimization, and (6) How to tune Cassandra for maximum write performance.
When a write arrives at a Cassandra node, it follows a carefully designed path that prioritizes durability and speed:
Write Path Steps:
```
Cassandra Write Path
====================

Client sends write request
        │
        ▼
┌─────────────────────────────┐
│ Coordinator Node            │  ← Receives request, identifies replicas
└─────────────────────────────┘
        │
        ▼
┌─────────────────────────────┐
│ Forward to Replica Nodes    │  ← Parallel write to all replicas
└─────────────────────────────┘
        │
        ▼  (On each replica)
┌─────────────────────────────┐
│ 1. Write to Commit Log      │  ← Sequential append (fast!)
│    (append-only, durable)   │    Survives crashes
└─────────────────────────────┘
        │
        ▼
┌─────────────────────────────┐
│ 2. Write to Memtable        │  ← In-memory sorted structure
│    (in-memory, sorted)      │    Extremely fast
└─────────────────────────────┘
        │
        ▼
┌─────────────────────────────┐
│ 3. Return ACK to Client     │  ← Write complete!
│    (after CL satisfied)     │    (Disk flush is async)
└─────────────────────────────┘

Background (async):
┌─────────────────────────────┐
│ 4. Memtable → SSTable       │  ← When memtable fills
│    (flush to disk)          │    Sequential write
└─────────────────────────────┘
        │
        ▼
┌─────────────────────────────┐
│ 5. Compaction               │  ← Background merge of SSTables
│    (merge, remove tombstones)│   Reclaims space
└─────────────────────────────┘
```

Why This Is Fast:
No Random I/O: The commit log is append-only. No seeking, just sequential writes—the fastest operation on any storage device.
Memory-First: The memtable is an in-memory data structure. After the commit log write (for durability), the client can return. The actual disk write happens later.
No Read-Before-Write: Unlike B-trees that must read existing data to update in place, Cassandra just writes new data. Conflict resolution (last-write-wins) happens at read time.
Parallel Replication: Writes to replicas happen in parallel, not sequentially. Total latency is bounded by the slowest replica needed to satisfy the consistency level, not the sum of all replicas.
Decoupled Durability and Performance: The commit log ensures durability; memtable flushing is a background operation that doesn't block clients.
Typical write latency: ~0.5-2ms. Commit log append: ~0.1-0.5ms (sequential write). Memtable insert: ~0.01-0.1ms (in-memory). Network overhead: ~0.2-0.5ms. The commit log is the dominant factor; SSDs dramatically improve this.
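From the client's side all of this is invisible: the driver simply gets an acknowledgment once enough replicas have appended to their commit logs and updated their memtables. A minimal sketch with the DataStax Python driver, assuming a locally reachable cluster and a hypothetical demo.sensor_readings table:

```python
# Minimal sketch using the DataStax Python driver (pip install cassandra-driver).
# Assumes a local cluster and a hypothetical demo.sensor_readings table.
import uuid
from datetime import datetime, timezone

from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("demo")

# Prepared statements are parsed once; each execution is a lightweight write.
insert = session.prepare(
    "INSERT INTO sensor_readings (sensor_id, reading_time, value) VALUES (?, ?, ?)"
)
# The coordinator acks as soon as this many replicas have appended to their
# commit log and updated their memtable; no SSTable flush is on the critical path.
insert.consistency_level = ConsistencyLevel.LOCAL_QUORUM

session.execute(insert, (uuid.uuid4(), datetime.now(timezone.utc), 21.7))
cluster.shutdown()
```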
Traditional databases (PostgreSQL, MySQL) use B-trees for storage. B-trees are excellent for reads but suffer for write-intensive workloads. Cassandra uses Log-Structured Merge Trees (LSM trees), which flip this trade-off.
B-Tree Write Path:
```
B-Tree Write (Traditional Databases)
====================================

1. Find the page containing the key (multiple disk seeks)
2. Read the page into memory
3. Modify the page in memory
4. Write the page back to disk (random I/O)
5. Update index pages if necessary (more random I/O)
6. Write to WAL for durability (sequential, but still needed)

Total: 2-10+ disk I/O operations per write
       Random I/O dominates
       Write amplification from page splits

Performance: 1,000 - 10,000 writes/sec per node (typical)
```

LSM Tree Write Path:
```
LSM Tree Write (Cassandra)
==========================

1. Append to commit log (1 sequential disk write)
2. Insert into memtable (1 in-memory operation)
3. Done! (Return to client)

Background (not on critical path):
4. Flush memtable to SSTable when full (1 large sequential write)
5. Compaction merges SSTables (sequential reads and writes)

Total: 1 disk I/O operation per write
       100% sequential I/O
       No read-before-write

Performance: 100,000 - 500,000+ writes/sec per node (typical)
```

| Characteristic | B-Tree | LSM Tree |
|---|---|---|
| Write I/O pattern | Random | Sequential |
| Disk seeks per write | 2-10+ | 0-1 |
| Write amplification | High (page splits) | Lower (compaction) |
| Write throughput | ~10K/sec | ~500K/sec |
| Read complexity | O(log n) | O(log n) + SSTable count |
| Space amplification | Low (in-place) | Higher (temporary during compaction) |
| Update performance | Fast (in-place) | Fast (append; old version stays) |
LSM trees trade read performance for write performance. Reads may need to check multiple SSTables plus the memtable. Compaction mitigates this by merging SSTables, but it consumes CPU and I/O. Cassandra provides configurable compaction strategies to tune this trade-off.
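To make the contrast concrete, here is a toy single-node LSM write path in plain Python. It is a sketch of the idea, not Cassandra's implementation: one sequential log append, one in-memory insert, and a sorted flush only when the buffer fills.

```python
# Toy illustration of the LSM write path (not Cassandra's code): every write is
# one sequential log append plus one in-memory insert; sorted "SSTables" are
# only produced when the memtable fills up.
import json
import os


class ToyLSM:
    def __init__(self, data_dir, memtable_limit=1000):
        os.makedirs(data_dir, exist_ok=True)
        self.data_dir = data_dir
        self.memtable_limit = memtable_limit
        self.memtable = {}                      # key -> value, kept in memory
        self.sstables = []                      # paths of flushed, immutable files
        self.log = open(os.path.join(data_dir, "commit.log"), "a")

    def write(self, key, value):
        # 1. Durability: sequential append to the commit log.
        self.log.write(json.dumps([key, value]) + "\n")
        self.log.flush()                        # hand it to the OS; real Cassandra also controls fsync (sync modes)
        # 2. Speed: in-memory insert; no read-before-write, last write wins.
        self.memtable[key] = value
        # 3. Background work (done inline here for simplicity): flush when full.
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        path = os.path.join(self.data_dir, f"sstable-{len(self.sstables)}.json")
        with open(path, "w") as f:
            json.dump(sorted(self.memtable.items()), f)   # one large sequential write
        self.sstables.append(path)
        self.memtable = {}


db = ToyLSM("/tmp/toy_lsm")
for i in range(2500):
    db.write(f"user-{i}", {"visits": 1})
db.write("user-7", {"visits": 2})   # an "update" is just another append + memtable insert
print(len(db.sstables), "SSTables on disk,", len(db.memtable), "rows still in the memtable")
```

Note that after the last write, user-7 exists both in an older SSTable and in the memtable; nothing was modified in place. A B-tree engine would instead locate and rewrite the page holding that key, which is where the extra random I/O in the table above comes from.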
The commit log is Cassandra's durability mechanism. Every write is first appended to the commit log before any other processing. This ensures that if the node crashes, writes can be replayed from the commit log on recovery.
Commit Log Properties:

- Append-only: mutations are only ever written to the end of the current segment, never modified in place.
- Per-node and shared: a single commit log per node covers writes to every table on that node.
- Segmented: the log is a series of fixed-size segment files that are recycled once every memtable they cover has been flushed.
- Durability boundary: how much data a crash can lose is governed by the sync mode below.
Commit Log Sync Modes:
| Mode | Behavior | Durability | Performance |
|---|---|---|---|
| periodic (default) | Sync every commitlog_sync_period_in_ms (10s default) | May lose up to sync period of writes on crash | Highest throughput |
| batch | Block write until sync; sync when batch fills or timeout | Minimal data loss window | Adds ~2ms latency per write |
| group | Similar to batch but uses group commit | Minimal data loss window | Better than batch for high throughput |
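One way to get a feel for the periodic-versus-batch trade-off is to time sequential appends with and without an fsync per write. A toy measurement sketch (absolute numbers depend entirely on your disk):

```python
# Toy measurement of sequential appends with and without fsync per write.
# Roughly mirrors the periodic vs batch trade-off; numbers depend on the disk.
import os
import time


def timed_appends(path, count, fsync_each):
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND)
    start = time.perf_counter()
    for _ in range(count):
        os.write(fd, b"x" * 256)          # pretend this is a mutation record
        if fsync_each:
            os.fsync(fd)                  # batch-like: durable before returning
    if not fsync_each:
        os.fsync(fd)                      # periodic-like: one sync at the end
    os.close(fd)
    return time.perf_counter() - start


n = 5000
print("periodic-style:", timed_appends("/tmp/commitlog_periodic.bin", n, False))
print("batch-style:   ", timed_appends("/tmp/commitlog_batch.bin", n, True))
```

On NVMe both runs finish quickly; on spinning disks the per-write fsync run is dramatically slower, which is essentially the latency that batch-style syncing adds to every write.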
```yaml
# Commit log configuration in cassandra.yaml

# Sync mode: periodic, batch, or group
commitlog_sync: periodic

# For periodic mode: how often to sync (default 10s)
commitlog_sync_period_in_ms: 10000

# For batch mode: max time to wait before syncing
# commitlog_sync_batch_window_in_ms: 2

# Total space for commit log segments
commitlog_total_space_in_mb: 8192

# Each segment size (32MB typical)
commitlog_segment_size_in_mb: 32

# Compression for commit log (LZ4 recommended)
commitlog_compression:
  - class_name: LZ4Compressor
```

Recovery Process:
When a Cassandra node restarts:

1. It scans the commit log segments left on disk.
2. Mutations that were not yet persisted in an SSTable are replayed into fresh memtables.
3. The rebuilt memtables are flushed to disk and the replayed segments are recycled.
4. The node rejoins the cluster and begins serving traffic.
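To make replay concrete, here is a toy sketch in plain Python that rebuilds a memtable from the commit log written by the earlier ToyLSM example. It is illustrative only; Cassandra's log format and replay logic are far more involved.

```python
# Toy sketch of commit log replay on restart (not Cassandra code): re-read the
# append-only log sequentially and rebuild the in-memory state.
import json


def replay_commit_log(path):
    memtable = {}
    with open(path) as log:
        for line in log:                      # replay in original write order
            key, value = json.loads(line)
            memtable[key] = value             # later entries overwrite earlier ones
    return memtable


recovered = replay_commit_log("/tmp/toy_lsm/commit.log")
print(f"recovered {len(recovered)} live keys from the log")
```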
Best Practices:

- On spinning disks, give the commit log its own device so flush and compaction I/O don't disturb its sequential appends; on SSD/NVMe this matters far less.
- Enable LZ4 commit log compression to reduce I/O volume.
- Keep the default periodic sync for throughput; consider group or batch only if your replication factor and consistency level alone don't give you an acceptable data-loss window.
If the commit log disk fills up (because memtables aren't flushing fast enough), writes block until space is available. Monitor commitlog_pending_tasks and commitlog_size metrics. If these grow, you may need faster storage, more memory for memtables, or compaction tuning.
A memtable is Cassandra's in-memory write buffer for each table. When writes arrive (after commit log), they go into the appropriate table's memtable. Memtables are sorted data structures (skip lists in modern Cassandra) that enable fast writes and efficient reads.
Memtable Lifecycle:

1. A write (already recorded in the commit log) is inserted into the active memtable for its table.
2. The memtable grows as writes accumulate, kept sorted by partition and clustering key.
3. When a flush is triggered, the memtable is marked immutable and a new, empty memtable takes over for incoming writes.
4. The frozen memtable is written to disk as an SSTable in one sequential pass; its memory and the commit log segments it covered are then released.
Flush Triggers:
- The memtable space configured by memtable_heap_space_in_mb / memtable_offheap_space_in_mb is exhausted
- Total commit log size exceeds commitlog_total_space_in_mb (the oldest dirty memtables are flushed so segments can be recycled)
- A manual flush via nodetool flush
- Node drain or shutdown (nodetool drain)
```yaml
# Memtable configuration in cassandra.yaml

# Memory allocation for memtables (fraction of heap)
# Default: 0.25 (25% of heap)
memtable_heap_space_in_mb: 2048

# Off-heap memtable space (for native memory)
memtable_offheap_space_in_mb: 2048

# Flush threshold for individual memtable
# (Not commonly tuned; global settings usually preferred)

# Memtable implementation
# Options: skiplist (default), skiplist_sharded
memtable:
  class_name: org.apache.cassandra.db.memtable.SkipListMemtable
```

Memtable Performance Characteristics:

Inserts are O(log n) operations on an in-memory skip list, and reads always consult the memtable first, so recently written rows are served straight from memory.
Off-Heap Memtables:
Cassandra can store memtables off-heap (in native memory outside the JVM heap). Benefits:

- Less JVM garbage-collection pressure, since memtable contents no longer live on the heap
- Room for larger memtables without enlarging the heap, which means fewer flushes and fewer SSTables
Trade-off: Off-heap is slightly slower for writes and requires careful native memory management.
Larger memtables reduce flush frequency but increase memory pressure and recovery time. Smaller memtables flush more often, creating more SSTables. Balance based on write rate and available memory. For write-heavy workloads, larger memtables with off-heap storage often help.
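A back-of-the-envelope calculation, with assumed numbers, shows how memtable size translates into flush frequency and SSTable count:

```python
# Back-of-the-envelope memtable sizing (assumed numbers, not measurements):
# how often will the memtable space fill up and trigger a flush?
writes_per_second = 50_000
avg_mutation_bytes = 300          # serialized row size, assumed
memtable_space_mb = 4096          # heap + off-heap space allotted to memtables

bytes_per_second = writes_per_second * avg_mutation_bytes
seconds_per_flush = (memtable_space_mb * 1024 * 1024) / bytes_per_second
sstables_per_hour = 3600 / seconds_per_flush

print(f"~{seconds_per_flush:.0f}s between flushes, ~{sstables_per_hour:.0f} new SSTables/hour")
# Halving the memtable space doubles the flush rate and the number of SSTables
# compaction must merge; doubling it raises memory pressure and replay time.
```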
When a memtable flushes, it produces a Sorted String Table (SSTable)—an immutable, on-disk file containing a sorted sequence of key-value pairs.
SSTable Structure:
Each SSTable is actually multiple files:
| File Extension | Purpose | Description |
|---|---|---|
| -Data.db | Data file | Actual row data, sorted by partition and clustering key |
| -Index.db | Primary index | Partition key → position in Data.db |
| -Summary.db | Summary index | Sampled entries from Index.db for fast lookup |
| -Filter.db | Bloom filter | Probabilistic filter to avoid unnecessary reads |
| -CompressionInfo.db | Compression metadata | Chunk boundaries for compressed data |
| -Statistics.db | Statistics | Min/max values, tombstone counts, etc. |
| -TOC.txt | Table of contents | Lists component files for this SSTable |
```
SSTable Read Path
=================

Query: SELECT * FROM users WHERE user_id = 'alice-uuid'

1. Check Bloom Filter (-Filter.db)
   └─ If filter says "definitely not here" → skip this SSTable
   └─ If filter says "might be here" → continue

2. Check Summary Index (-Summary.db)
   └─ Find range that might contain 'alice-uuid'
   └─ Returns offset range in Index.db

3. Check Primary Index (-Index.db)
   └─ Binary search within range for 'alice-uuid'
   └─ Returns offset in Data.db

4. Read Data (-Data.db)
   └─ Seek to offset, read partition data
   └─ Decompress if needed
   └─ Return to caller

Bloom filter can skip entire SSTables!
This is why Bloom filter tuning matters for read performance.
```

Why Immutability Matters:
Immutable SSTables need no locks for concurrent readers, are written with purely sequential I/O, can be cached and compressed safely, and make snapshots cheap (a snapshot is just hard links to existing files).

SSTable Compaction:
The trade-off of immutability is that updates and deletes don't modify existing data; they write new versions. This means:

- An updated row can exist in several SSTables at once, with the newest timestamp winning at read time
- Deletes are recorded as tombstones, which themselves occupy space
- Disk usage temporarily grows beyond the size of the live data
Compaction is the background process that reclaims this space by merging SSTables.
Bloom filters are crucial for read performance. They let Cassandra skip SSTables that definitely don't contain the requested data. The bloom_filter_fp_chance setting (default 0.01 = 1% false positive) trades memory for accuracy. Lower values use more memory but skip more unnecessary disk reads.
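A toy Bloom filter in plain Python shows why a few bits per key let Cassandra rule out most SSTables without touching disk. Cassandra's real implementation differs, but the "no means no, maybe means check" behaviour is the same.

```python
# Toy Bloom filter (illustrative only; Cassandra's implementation differs).
# "No" answers are always correct; "maybe" answers are occasionally wrong.
import hashlib


class ToyBloomFilter:
    def __init__(self, num_bits=8192, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, key):
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))


# One filter per SSTable: check it before paying for any disk I/O.
sstable_filter = ToyBloomFilter()
for i in range(1000):
    sstable_filter.add(f"user-{i}")

print(sstable_filter.might_contain("user-42"))      # True (really present)
print(sstable_filter.might_contain("user-99999"))   # almost always False -> skip this SSTable
```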
Compaction merges multiple SSTables into fewer, larger SSTables. It removes tombstones (deleted data), merges cells with different timestamps, and reduces the number of SSTables that reads must check.
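The core of a compaction merge can be sketched in a few lines of Python. This is an illustration of the idea (last write wins, expired tombstones dropped), not Cassandra's actual code:

```python
# Toy sketch of a compaction merge (not Cassandra code): for each key keep only
# the newest (timestamp, value); drop entries shadowed by newer writes and purge
# tombstones once their grace period has passed.
import time

TOMBSTONE = object()          # stand-in for a deletion marker
GC_GRACE_SECONDS = 864000     # mirrors Cassandra's default gc_grace_seconds (10 days)


def compact(sstables, now=None):
    """sstables: list of dicts mapping key -> (write_timestamp, value)."""
    now = now or time.time()
    merged = {}
    for sstable in sstables:
        for key, (ts, value) in sstable.items():
            if key not in merged or ts > merged[key][0]:
                merged[key] = (ts, value)     # last write wins
    # Drop tombstones whose grace period has expired; they no longer need to be
    # kept around to suppress older copies of the row.
    return {
        k: v for k, v in merged.items()
        if not (v[1] is TOMBSTONE and now - v[0] > GC_GRACE_SECONDS)
    }


old = {"alice": (100.0, {"city": "Oslo"}), "bob": (110.0, {"city": "Lima"})}
new = {"alice": (200.0, {"city": "Bergen"}), "bob": (5_000_000.0, TOMBSTONE)}
print(compact([old, new], now=5_000_000.0 + GC_GRACE_SECONDS + 1))
# -> only alice's newest version survives; bob's old row and tombstone are gone
```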
Cassandra provides several compaction strategies, each optimized for different workloads:
Size-Tiered Compaction Strategy (STCS)
How it works:

STCS groups SSTables into buckets of similar size. When a bucket accumulates min_threshold SSTables (4 by default), they are merged into one larger SSTable; over time this produces a few very large SSTables plus a tail of smaller ones.
Best for:

Write-heavy and append-mostly workloads, where rows are rarely updated after they are written and read latency is less critical.
Trade-offs:

A partition's rows can be spread across many SSTables, so read performance is variable, and merging the largest SSTables can temporarily require roughly as much free disk space as the data being compacted.
Configuration:
```cql
ALTER TABLE mytable WITH compaction = {
    'class': 'SizeTieredCompactionStrategy',
    'min_threshold': 4,
    'max_threshold': 32
};
```
| Strategy | Write Amplification | Space Amplification | Read Performance | Best Use Case |
|---|---|---|---|---|
| STCS | Low (~4x) | High (2x) | Variable | Write-heavy, append-only |
| LCS | High (10-30x) | Low (~10%) | Excellent | Read-heavy, many updates |
| TWCS | Very Low | Depends on TTL | Good | Time-series with TTL |
| UCS | Medium | Medium | Good | Mixed/unknown workloads |
Changing compaction strategy on a live table triggers recompaction of all data—a resource-intensive operation. Test in non-production first. During recompaction, expect higher I/O and potential latency spikes. Plan for 2x disk space during the transition.
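It is usually better to pick the right strategy when the table is created. For example, a time-series table whose rows expire via TTL is a natural fit for TWCS; a hedged sketch using the Python driver (keyspace and table names are hypothetical):

```python
# Hedged sketch: switching a hypothetical time-series table to TWCS with the
# Python driver. On a live table this triggers recompaction, so test first.
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("metrics")

session.execute("""
    ALTER TABLE sensor_readings
    WITH compaction = {
        'class': 'TimeWindowCompactionStrategy',
        'compaction_window_unit': 'DAYS',
        'compaction_window_size': 1
    }
    AND default_time_to_live = 2592000   -- 30 days; whole windows expire together
""")
cluster.shutdown()
```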
Write amplification measures how many bytes are written to storage for each byte of data received from clients. In LSM trees, write amplification comes primarily from compaction.
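A rough worked example with assumed numbers:

```python
# Rough write-amplification arithmetic with assumed numbers (not a benchmark).
client_bytes = 100 * 1024**3          # 100 GB of client writes
commit_log = client_bytes             # every mutation is appended once
memtable_flush = client_bytes         # flushed once into initial SSTables
compaction_rewrites = 3               # assume STCS rewrites each byte ~3 more times

total_disk_writes = commit_log + memtable_flush + compaction_rewrites * client_bytes
write_amplification = total_disk_writes / client_bytes
print(write_amplification)            # -> 5.0 under these assumptions; LCS would be far higher
```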
Sources of Write Amplification:

- The commit log append (every mutation is written once for durability)
- The memtable flush that writes the data into its first SSTable
- Every compaction that later rewrites the same data into a new SSTable
Why Write Amplification Matters:

Every extra byte written consumes disk bandwidth and SSD endurance. Compaction I/O competes with client reads and writes, so high write amplification effectively lowers the throughput a node can sustain.
The Fundamental Trade-off:

Write amplification, read amplification, and space amplification cannot all be minimized at once. Each compaction strategy picks a different balance: STCS writes the least but leaves more SSTables to read and more duplicate data on disk, while LCS keeps reads and space tight at the cost of far more compaction writes.
Space Amplification:

Another dimension of the trade-off: obsolete row versions and tombstones occupy disk until compaction removes them, and compaction itself needs temporary working space. Under STCS, merging the largest SSTables can briefly require close to twice the table's size on disk.
Choosing the Right Strategy:
Consider:

- The write-to-read ratio of the table
- How often existing rows are updated or deleted
- Whether the data is time-ordered and expires via TTL
- How much spare disk space and I/O headroom the nodes have
Monitor 'nodetool compactionstats' and the pending_compactions metric. High pending compaction counts mean compaction can't keep up with writes. This leads to many SSTables, even slower reads, and eventually disk space issues. Tune concurrency with concurrent_compactors and compaction_throughput_mb_per_sec.
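A small script can watch this from the outside; a hedged sketch that shells out to nodetool (the exact output format varies between Cassandra versions, so treat the parsing as illustrative):

```python
# Hedged sketch: scrape pending compactions from `nodetool compactionstats`.
# Output format varies across Cassandra versions; adjust the parsing as needed.
import re
import subprocess

output = subprocess.run(
    ["nodetool", "compactionstats"], capture_output=True, text=True, check=True
).stdout

match = re.search(r"pending tasks:\s*(\d+)", output)
pending = int(match.group(1)) if match else 0
if pending > 100:                      # threshold is arbitrary; tune to your cluster
    print(f"WARNING: compaction is falling behind ({pending} pending tasks)")
```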
For write-intensive workloads, several tuning knobs can dramatically improve performance:
```yaml
# cassandra.yaml optimizations for write-heavy workloads

# Faster commit log sync (with some durability trade-off)
commitlog_sync: periodic
commitlog_sync_period_in_ms: 10000

# Increase memtable space (adjust based on heap size)
memtable_heap_space_in_mb: 4096
memtable_offheap_space_in_mb: 4096

# More compaction throughput
compaction_throughput_mb_per_sec: 128

# More concurrent compactors (if CPU allows)
concurrent_compactors: 4

# Use STCS for write-heavy tables
# ALTER TABLE mytable WITH compaction = {
#   'class': 'SizeTieredCompactionStrategy'
# };

# Compression (LZ4 for speed)
# Applied at table level with CREATE TABLE ... WITH compression = {...}
```

Client-Side Optimizations:
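Server-side settings are only half the story; the driver matters too. A hedged sketch of common client-side techniques with the DataStax Python driver (keyspace and table are hypothetical):

```python
# Illustrative client-side techniques: prepared statements, asynchronous writes,
# and unlogged batches restricted to a single partition.
import uuid
from datetime import datetime, timezone

from cassandra.cluster import Cluster
from cassandra.query import BatchStatement, BatchType

cluster = Cluster(["127.0.0.1"])          # token-aware routing is on by default
session = cluster.connect("demo")

# 1. Prepared statements: parsed once, then executed cheaply many times.
insert = session.prepare(
    "INSERT INTO events (device_id, event_time, payload) VALUES (?, ?, ?)"
)

# 2. Async execution: keep many writes in flight instead of waiting per round trip.
rows = [(uuid.uuid4(), datetime.now(timezone.utc), f"event-{i}") for i in range(100)]
futures = [session.execute_async(insert, row) for row in rows]
for future in futures:
    future.result()                        # surface any per-write errors

# 3. Batches: UNLOGGED and only for rows in the *same* partition; multi-partition
#    batches add coordinator overhead rather than speed.
device_id = uuid.uuid4()
batch = BatchStatement(batch_type=BatchType.UNLOGGED)
for i in range(10):
    batch.add(insert, (device_id, datetime.now(timezone.utc), f"burst-{i}"))
session.execute(batch)

cluster.shutdown()
```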
Optimal settings depend on your specific hardware, data model, and access patterns. Use tools like cassandra-stress or YCSB to benchmark different configurations. What works for one workload may not work for another.
We've explored Cassandra's write-optimized architecture in depth. Let's consolidate the key concepts:

- Every write is a sequential commit log append plus an in-memory memtable insert; nothing on the critical path does random I/O or a read-before-write.
- Memtables flush to immutable SSTables; compaction later merges SSTables, discards shadowed versions, and purges tombstones.
- LSM trees trade read-time work and background compaction for very high write throughput; B-trees make the opposite trade.
- The compaction strategy (STCS, LCS, TWCS, UCS) is the main lever for balancing write amplification, space amplification, and read performance.
- Tuning commit log sync, memtable sizing, and compaction throughput pushes write performance further, with durability and read-latency trade-offs to weigh.
What's Next:
With Cassandra's architecture now understood—from masterless design and gossip to tunable consistency, the wide-column model, and write-optimized storage—the final page brings it all together: When to Use Cassandra. We'll explore the decision framework for choosing Cassandra, common use cases, and situations where other databases might be more appropriate.
You now understand Cassandra's write-optimized architecture—the LSM tree foundation that enables industry-leading write performance. Next, we'll synthesize everything into practical guidance on when to choose Cassandra for your system design.