When Netflix streams a video, dozens of data points are generated every second—playback position, buffering events, quality changes, user interactions. When a smart factory operates, thousands of sensors emit readings every millisecond. When a social network processes activity, billions of events flow through the system daily. These workloads share a common characteristic: they are write-dominant, generating far more writes than reads, often by orders of magnitude.
Traditional relational databases struggle with high write throughput because their B-tree storage engines optimize for reads at the expense of writes. Each write must find the correct position in a sorted structure, potentially triggering expensive page splits and random I/O. Wide-column stores use a fundamentally different approach—the Log-Structured Merge Tree (LSM-tree)—that inverts these trade-offs, achieving write throughput that traditional databases cannot match.
This page explores why wide-column stores are the natural choice for write-heavy workloads. We'll examine the LSM-tree architecture that makes this possible, understand the trade-offs involved, and learn patterns for designing systems that leverage high write throughput while maintaining acceptable read performance.
By the end of this page, you will understand why LSM-trees excel at writes, how to quantify and manage write amplification, patterns for time-series and event streaming workloads, and how to tune wide-column stores for optimal write performance while maintaining read efficiency.
Before appreciating why LSM-trees revolutionize write performance, we must understand why traditional databases struggle. The root cause lies in B-tree storage engines and their inherent design trade-offs.
B-Tree Write Operations
B-trees maintain data in sorted order across balanced tree structures. When you insert a new row, the engine must:

1. Traverse from the root to the correct leaf page, reading each page along the path.
2. Insert the row into the leaf page, splitting the page (and possibly its ancestors) if it is full.
3. Update every secondary index, each of which is its own B-tree requiring the same traversal.
4. Write the modified pages back to disk.
The critical issue is random I/O. Each write requires updating a specific page on disk, which is dramatically slower than sequential writes:
| Storage Type | Sequential Write | Random Write | Ratio |
|---|---|---|---|
| HDD (7200 RPM) | 150 MB/s | ~1 MB/s | 150x |
| SATA SSD | 500 MB/s | ~50 MB/s | 10x |
| NVMe SSD | 3,000 MB/s | ~500 MB/s | 6x |
| Cloud Block Storage (EBS gp3) | 125 MB/s | ~16,000 IOPS × 4KB = 64 MB/s | 2x |
Even on modern NVMe SSDs, sequential writes are significantly faster than random writes. On HDDs, the difference is catastrophic—150x slower for random access due to mechanical seek times.
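The EBS row in the table above can be sanity-checked with simple arithmetic. A small sketch (the function names are mine, not from any storage API):

```typescript
// Convert a random-write IOPS rating into an effective throughput figure,
// assuming each I/O operation moves one fixed-size block.
function iopsToMBps(iops: number, blockSizeKB: number): number {
  return (iops * blockSizeKB) / 1024; // KB/s → MB/s (1 MB = 1024 KB)
}

// Ratio of sequential to effective random throughput for a device.
function seqToRandomRatio(seqMBps: number, randMBps: number): number {
  return seqMBps / randMBps;
}

// EBS gp3 from the table: 16,000 IOPS × 4 KB blocks
const gp3Random = iopsToMBps(16_000, 4);            // 62.5 MB/s (the table's
                                                    // "64" uses 1 MB = 1000 KB)
const gp3Ratio = seqToRandomRatio(125, gp3Random);  // = 2x
```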
The Write Amplification Problem
B-trees also suffer from write amplification—a single logical write results in multiple physical writes:

- The row itself, appended to the write-ahead log for durability.
- The modified leaf page, rewritten in full even for a small row.
- Any split pages, plus parent pages updated to reference them.
- One page write (or more) per secondary index.
A single row insert might trigger 5-10 physical writes, each requiring random I/O. Under high write load, the storage subsystem becomes saturated.
```
B-TREE WRITE OPERATION

INSERT INTO users (id, name, email) VALUES (42, 'Alice', 'alice@ex.com')

Step 1: Append to WAL                        [Sequential Write ✓]
  └─ Write to transaction log

Step 2: Navigate B-tree                      [Random Reads - O(log n)]
  Root Page     [10, 30, 50]     → read page 1    (random I/O)
    42 > 30 && 42 < 50
  Internal Page [35, 40, 45]     → read page 47   (random I/O)
    42 > 40 && 42 < 45
  Leaf Page     [40, 41, _, _, _] → read page 123 (random I/O)

Step 3: Insert and write page                [Random Write - Worst Part!]
  Leaf Page     [40, 41, 42, _, _] → write page 123 back to disk

Step 4: Update indexes                       [More Random I/O per index]
  └─ email_idx: random seek + write
  └─ name_idx:  random seek + write

Total: 1 sequential write + 3-5 random reads + 2-4 random writes
At scale: random I/O becomes the bottleneck
```

B-trees are not bad—they're optimized for read-heavy workloads with moderate writes. Their sorted structure enables efficient range scans, point lookups, and secondary indexes. The problem arises only when write throughput requirements exceed what random I/O can sustain.
The Log-Structured Merge Tree (LSM-tree) inverts B-tree trade-offs to optimize for write throughput. Instead of updating data in place, LSM-trees treat all writes as sequential appends, deferring the cost of sorting to background processes.
Core LSM-Tree Principles

- **Append, never update in place.** All writes go to an append-only commit log and an in-memory structure; files on disk are never modified once written.
- **Buffer in memory.** Writes accumulate in a sorted MemTable until it reaches a size threshold.
- **Flush sequentially.** A full MemTable is written to disk in one sequential pass as an immutable, sorted SSTable.
- **Sort in the background.** Compaction merges SSTables asynchronously, keeping read costs bounded.

This design converts random writes into sequential writes:
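A toy model makes the write path concrete. This is a hedged sketch, not any real engine's implementation: a sorted in-memory map stands in for the MemTable, and a "flush" emits a sorted, immutable array standing in for an SSTable:

```typescript
// Minimal sketch of an LSM write path: writes land in an in-memory
// MemTable; when it exceeds a threshold it is frozen and flushed as a
// sorted, immutable "SSTable" (here just a sorted array of entries).
type Entry = [key: string, value: string];

class TinyLSM {
  private memtable = new Map<string, string>();
  readonly sstables: Entry[][] = []; // newest last

  constructor(private flushThreshold: number) {}

  put(key: string, value: string): void {
    this.memtable.set(key, value); // real engines use a skip list here
    if (this.memtable.size >= this.flushThreshold) this.flush();
  }

  // Newest data wins: check MemTable first, then SSTables newest → oldest.
  get(key: string): string | undefined {
    if (this.memtable.has(key)) return this.memtable.get(key);
    for (let i = this.sstables.length - 1; i >= 0; i--) {
      const hit = this.sstables[i].find(([k]) => k === key);
      if (hit) return hit[1];
    }
    return undefined;
  }

  // Flush = sort once, write out sequentially, start a fresh MemTable.
  flush(): void {
    if (this.memtable.size === 0) return;
    const sorted = [...this.memtable.entries()].sort(([a], [b]) =>
      a < b ? -1 : a > b ? 1 : 0
    );
    this.sstables.push(sorted);
    this.memtable = new Map();
  }
}
```

Note what never happens: no existing file is edited. Updates simply write a newer version, and the read path's "newest wins" rule makes old versions invisible until compaction discards them.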
```
LSM-TREE WRITE OPERATION

INSERT INTO events (id, timestamp, data) VALUES (...)

Step 1: Append to Commit Log (WAL)        [Sequential Write ✓]
  └─ Append-only log, no seeks
  └─ Durability: data survives crash even if MemTable lost

Step 2: Insert into MemTable              [In-Memory - No I/O!]
  MemTable (64-256 MB) — skip list / red-black tree
    [event_001] → data
    [event_002] → data   ← insert here (O(log n))
    [event_003] → data
    ...

Step 3: Return success to client          [Ack after WAL + MemTable]
  └─ Latency: ~1-5ms (mostly WAL sync time)

=== BACKGROUND (Asynchronous) ===

Step 4: MemTable Flush                    [Sequential Write ✓]
  When MemTable reaches threshold:
    - Freeze MemTable; new writes go to a new MemTable
    - Write contents sequentially to an SSTable (Sorted String Table):
        [event_001 → data]
        [event_002 → data]
        [event_003 → data]
        + Bloom Filter
        + Index Block

Step 5: Compaction (background)           [Sequential Read + Write]
  Merge SSTables to reduce file count and remove obsolete data

Summary: Only sequential I/O in the write path!
```

Why Sequential Writes Win
The performance advantage of LSM-trees comes from replacing random I/O with sequential I/O:

- The commit log is append-only—no seeks, ever.
- The MemTable insert touches no disk at all.
- The flush writes an entire SSTable in one sequential pass.
- Index blocks and Bloom filters are built once per SSTable, not once per insert.
Compared to B-trees:
| Operation | B-Tree | LSM-Tree |
|---|---|---|
| Write to storage | Random I/O (slow) | Sequential I/O (fast) |
| In-place updates | Yes (fragmentation) | No (immutable files) |
| Index maintenance | Per insert (slow) | Batch during compaction |
| Write latency | ~5-20ms | ~1-5ms |
| Theoretical max writes/s | ~10,000 (disk-bound) | ~100,000+ (memory-bound) |
A well-tuned LSM-tree database can achieve 10-100x higher write throughput than a B-tree database on the same hardware. This is why Cassandra, HBase, RocksDB, and LevelDB all use LSM-trees—write optimization is their primary design goal.
LSM-trees don't eliminate write amplification—they shift it from the foreground (during writes) to the background (during compaction). Understanding this trade-off is essential for capacity planning and performance tuning.
Write Amplification in LSM-Trees
Write amplification (WA) is the ratio of bytes written to storage versus bytes written by the application:
Write Amplification = Total Bytes Written to Disk / Application Bytes Written
In LSM-trees, data is written multiple times:

1. Once to the commit log (WAL) for durability.
2. Once when the MemTable flushes to an L0 SSTable.
3. Roughly once more at each level, as compaction merges it downward.
For a typical leveled LSM-tree, write amplification can reach 10-30x. A 1 KB write might eventually result in 10-30 KB of disk I/O.
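To make this concrete, here is a back-of-envelope helper using the simplified model above (roughly one extra rewrite per level; the function names and the one-rewrite-per-level assumption are mine—leveled compaction can rewrite considerably more per level in practice):

```typescript
// Simplified write-amplification model: 1x for the flush to L0, plus
// roughly one rewrite per level the data passes through during
// compaction. Real-world leveled compaction can rewrite up to the
// level size ratio per level, hence typical WA of 10-30x.
function writeAmplification(levels: number, rewritesPerLevel: number = 1): number {
  return 1 + levels * rewritesPerLevel;
}

// Physical bytes hitting storage for a given application write volume.
function physicalWrites(appBytes: number, wa: number): number {
  return appBytes * wa;
}

const wa = writeAmplification(10);      // 11: one flush + 10 rewrite passes
const disk = physicalWrites(1024, wa);  // a 1 KB write → ~11 KB of disk I/O
```

This is also the arithmetic behind SSD endurance planning: multiply your daily application write volume by the measured WA to get actual flash wear.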
```
LSM-TREE COMPACTION ARCHITECTURE

MEMORY
  Active MemTable (64MB)          Immutable MemTable (flushing)
  (accepting writes)              (being written to L0)
        │
        │ Flush (sequential write)
        ▼
LEVEL 0 (L0)
  SSTable files, possibly overlapping key ranges
  [64MB] [64MB] [64MB] [64MB]   (4 files, each from a flush)
        │
        │ Minor Compaction
        │ (Merge + Sort → Write Amplification +1)
        ▼
LEVEL 1 (L1)
  Non-overlapping key ranges, size = L0_size × ratio (10x)
  [~640 MB]
        │
        │ Compaction (Write Amplification +1)
        ▼
LEVEL 2 (L2)
  [~6.4 GB]
        │
        ▼ ... continues ...
LEVEL N (Largest)
  [~1 TB]

Write Amplification Calculation:
  • Data written to MemTable:  1x
  • Flushed to L0:             1x
  • Compacted L0 → L1:        ~1x (merged with L1 files)
  • Compacted L1 → L2:        ~1x
  • ... for each level ...

  Total WA ≈ 1 + (levels × rewrites per level)
  Typical range: 10-30x (leveled compaction can rewrite
  up to size_ratio× the data at each level)
```

Compaction Strategies
Different compaction strategies trade off write amplification, read amplification, and space amplification:
Size-Tiered Compaction (STCS)

Merges SSTables of similar size into larger ones. Write amplification is low, but reads may have to check many overlapping files, and large merges temporarily need substantial extra disk space. A good default for pure write-heavy workloads.

Leveled Compaction (LCS)

Organizes SSTables into levels with non-overlapping key ranges, so a read touches at most one SSTable per level. Read and space amplification drop sharply, at the cost of significantly higher write amplification. Suited to read-heavy tables with frequent updates.

Time-Window Compaction (TWCS)

Groups SSTables by time window and only compacts within a window. Once a window closes its data is never rewritten—and when a TTL expires, entire SSTables are dropped outright. The strategy of choice for time-series data with TTLs.
Write amplification matters especially on SSDs, which have limited write endurance. A 10x write amplification means your 100 TB/day write workload is actually writing 1 PB/day to SSDs. Choose compaction strategies and tune compaction frequency based on your SSD durability budget.
Wide-column stores are the natural choice for time-series and event streaming workloads. Their write optimization, flexible schema, and cell versioning align perfectly with these access patterns.
Time-Series Data Characteristics

Time-series workloads have a distinctive shape that wide-column stores fit naturally:

- Writes are append-only—new readings arrive constantly, and existing rows are rarely updated.
- Data arrives roughly in time order, and queries are almost always time-range scans.
- Recent data is read far more often than old data.
- Old data ages out, usually via TTL rather than explicit deletes.
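The data-modeling examples below lean on two small date helpers, `formatDate` and `generateDateBuckets`, which are assumed utilities rather than library functions. One possible sketch (daily buckets, UTC; a real codebase would use a date library):

```typescript
// Hypothetical implementation of the date helpers used in the examples.
// formatDate here supports only the 'YYYY-MM-DD' pattern used for daily
// time buckets.
function formatDate(d: Date, pattern: 'YYYY-MM-DD'): string {
  const y = d.getUTCFullYear();
  const m = String(d.getUTCMonth() + 1).padStart(2, '0');
  const day = String(d.getUTCDate()).padStart(2, '0');
  return `${y}-${m}-${day}`;
}

// Enumerate every daily bucket touched by [startTime, endTime], so a
// range query knows exactly which partitions to hit.
function generateDateBuckets(startTime: Date, endTime: Date): string[] {
  const buckets: string[] = [];
  const cursor = new Date(Date.UTC(
    startTime.getUTCFullYear(), startTime.getUTCMonth(), startTime.getUTCDate()
  ));
  while (cursor.getTime() <= endTime.getTime()) {
    buckets.push(formatDate(cursor, 'YYYY-MM-DD'));
    cursor.setUTCDate(cursor.getUTCDate() + 1);
  }
  return buckets;
}
```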
```typescript
// Time-series data modeling for IoT sensor data

// PATTERN 1: Time-bucketed partitions
// Prevents unbounded partition growth
// Enables efficient time-range queries

/*
CREATE TABLE sensor_data (
    sensor_id UUID,
    time_bucket TEXT,          -- e.g., "2024-01-15" (daily bucket)
    reading_time TIMESTAMP,
    temperature DOUBLE,
    humidity DOUBLE,
    pressure DOUBLE,
    PRIMARY KEY ((sensor_id, time_bucket), reading_time)
) WITH CLUSTERING ORDER BY (reading_time DESC)
  AND compaction = {'class': 'TimeWindowCompactionStrategy',
                    'compaction_window_size': 1,
                    'compaction_window_unit': 'DAYS'};
*/

// Why this works:
// 1. Partition key (sensor_id, time_bucket) bounds partition size
// 2. Clustering on reading_time DESC → recent data first
// 3. TWCS compaction → old time windows compact together, expire together

// Writing sensor data
async function writeSensorReading(
  sensorId: string,
  reading: SensorReading
): Promise<void> {
  const timeBucket = formatDate(reading.timestamp, 'YYYY-MM-DD');

  await cassandra.execute(
    `INSERT INTO sensor_data
     (sensor_id, time_bucket, reading_time, temperature, humidity, pressure)
     VALUES (?, ?, ?, ?, ?, ?)
     USING TTL 2592000`, // 30 days TTL
    [sensorId, timeBucket, reading.timestamp,
     reading.temperature, reading.humidity, reading.pressure],
    { consistency: ConsistencyLevel.ONE } // High throughput, eventual consistency
  );
}

// Querying recent data (single partition read)
async function getRecentReadings(
  sensorId: string,
  limit: number = 100
): Promise<SensorReading[]> {
  const today = formatDate(new Date(), 'YYYY-MM-DD');

  const result = await cassandra.execute(
    `SELECT * FROM sensor_data
     WHERE sensor_id = ? AND time_bucket = ?
     LIMIT ?`,
    [sensorId, today, limit],
    { consistency: ConsistencyLevel.ONE }
  );

  return result.rows;
}

// Querying time range (may span multiple partitions)
async function getReadingsInRange(
  sensorId: string,
  startTime: Date,
  endTime: Date
): Promise<SensorReading[]> {
  const buckets = generateDateBuckets(startTime, endTime);

  // Query each partition and aggregate
  const allReadings = await Promise.all(
    buckets.map(bucket =>
      cassandra.execute(
        `SELECT * FROM sensor_data
         WHERE sensor_id = ? AND time_bucket = ?
           AND reading_time >= ? AND reading_time <= ?`,
        [sensorId, bucket, startTime, endTime]
      )
    )
  );

  return allReadings.flatMap(r => r.rows);
}
```

Event Streaming Patterns
Event streaming workloads share similar characteristics but emphasize ordered processing and at-least-once delivery:
```typescript
// Event sourcing with wide-column store as event log

/*
CREATE TABLE events (
    aggregate_type TEXT,
    aggregate_id UUID,
    event_sequence BIGINT,
    event_time TIMESTAMP,
    event_type TEXT,
    payload BLOB,
    metadata MAP<TEXT, TEXT>,
    PRIMARY KEY ((aggregate_type, aggregate_id), event_sequence)
) WITH CLUSTERING ORDER BY (event_sequence ASC);

CREATE TABLE event_log (
    time_bucket TEXT,        -- e.g., "2024-01-15-14" (hourly)
    event_time TIMESTAMP,
    event_id UUID,
    aggregate_type TEXT,
    aggregate_id UUID,
    event_type TEXT,
    payload BLOB,
    PRIMARY KEY ((time_bucket), event_time, event_id)
) WITH CLUSTERING ORDER BY (event_time ASC);
*/

// Event store implementation
interface DomainEvent {
  eventId: string;
  aggregateType: string;
  aggregateId: string;
  eventType: string;
  payload: object;
  timestamp: Date;
  sequence: number;
}

class CassandraEventStore {
  // Append event atomically
  async appendEvent(event: DomainEvent): Promise<void> {
    const batch = new BatchStatement();

    // Write to aggregate-partitioned table (for reconstruction)
    batch.add(this.insertEventQuery, [
      event.aggregateType,
      event.aggregateId,
      event.sequence,
      event.timestamp,
      event.eventType,
      Buffer.from(JSON.stringify(event.payload)),
      {}
    ]);

    // Write to time-partitioned table (for streaming/replay)
    const timeBucket = formatDate(event.timestamp, 'YYYY-MM-DD-HH');
    batch.add(this.insertLogQuery, [
      timeBucket,
      event.timestamp,
      event.eventId,
      event.aggregateType,
      event.aggregateId,
      event.eventType,
      Buffer.from(JSON.stringify(event.payload))
    ]);

    // Logged batch for atomicity across tables
    await this.cassandra.execute(batch, {
      consistency: ConsistencyLevel.QUORUM
    });
  }

  // Reconstruct aggregate from events
  async getEventsForAggregate(
    aggregateType: string,
    aggregateId: string,
    fromSequence: number = 0
  ): Promise<DomainEvent[]> {
    const result = await this.cassandra.execute(
      `SELECT * FROM events
       WHERE aggregate_type = ? AND aggregate_id = ?
         AND event_sequence >= ?`,
      [aggregateType, aggregateId, fromSequence],
      { consistency: ConsistencyLevel.LOCAL_QUORUM }
    );

    return result.rows.map(row => this.mapRowToEvent(row));
  }

  // Stream events for processing (CDC-like pattern)
  async *getEventsSince(
    startTime: Date,
    batchSize: number = 1000
  ): AsyncGenerator<DomainEvent[]> {
    let currentBucket = formatDate(startTime, 'YYYY-MM-DD-HH');
    const endBucket = formatDate(new Date(), 'YYYY-MM-DD-HH');

    while (currentBucket <= endBucket) {
      const result = await this.cassandra.execute(
        `SELECT * FROM event_log
         WHERE time_bucket = ?
         ORDER BY event_time ASC`,
        [currentBucket]
      );

      yield result.rows.map(row => this.mapRowToEvent(row));
      currentBucket = incrementHour(currentBucket);
    }
  }
}
```

Wide-column stores excel at TTL-based data expiration. With Time-Window Compaction, expired data is cleaned up efficiently—entire SSTables containing only expired data are simply deleted. This is far more efficient than individual row deletes in B-tree databases.
Achieving maximum write throughput requires understanding the tuning knobs available in wide-column stores. While defaults are reasonable, high-throughput workloads benefit from deliberate configuration.
MemTable Configuration
The MemTable is the first staging area for writes. Larger MemTables:

- Flush less often, producing fewer, larger L0 SSTables and therefore less compaction work.
- Absorb more overwrites of hot keys in memory before anything reaches disk.
- Cost more heap, and lengthen commit log replay after a crash.
| Workload Type | MemTable Size | Reasoning |
|---|---|---|
| Default/Mixed | 64-128 MB | Balance between memory and flush frequency |
| Write-Heavy | 256-512 MB | Reduce flush frequency, batch more writes |
| Memory-Constrained | 32-64 MB | Sacrifice throughput for memory stability |
| High-Cardinality Columns | 128-256 MB | Larger MemTable to batch column families |
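A quick way to reason about the sizing table above: flush frequency is just sustained ingest rate divided by MemTable size. A back-of-envelope helper (the function name is mine):

```typescript
// Back-of-envelope: how often will the MemTable flush under a sustained
// write rate? More flushes mean more L0 SSTables and more compaction work.
function flushesPerHour(writeRateMBps: number, memtableSizeMB: number): number {
  return (writeRateMBps * 3600) / memtableSizeMB;
}

// 20 MB/s of writes into a 64 MB MemTable vs a 256 MB one:
const small = flushesPerHour(20, 64);   // 1125 flushes/hour
const large = flushesPerHour(20, 256);  // 281.25 flushes/hour — 4x fewer
```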
Commit Log / WAL Optimization
The commit log is often the bottleneck for write latency. Configuration options:
```yaml
# cassandra.yaml

# Commit log sync modes:
# - periodic: Sync every commitlog_sync_period_in_ms (default: 10000ms)
# - batch: Sync after each write, wait up to commitlog_sync_batch_window_in_ms for batching
# - group: Sync after each write, with group commit for concurrent writers

commitlog_sync: periodic
commitlog_sync_period_in_ms: 10000       # Sync every 10 seconds

# For a smaller data-loss window on crash (more frequent syncs):
# commitlog_sync: periodic
# commitlog_sync_period_in_ms: 1000      # Sync every 1 second

# For guaranteed durability (lowest throughput):
# commitlog_sync: batch
# commitlog_sync_batch_window_in_ms: 2   # Batch writes for 2ms then sync

# Commit log size (should handle burst writes)
commitlog_total_space_in_mb: 8192        # 8GB total commit log space

# Compression (reduces I/O at cost of CPU)
commitlog_compression:
  - class_name: LZ4Compressor
```

Concurrent Writes Configuration
Cassandra manages concurrency through thread pools. Key settings:
```yaml
# cassandra.yaml

# Native transport (CQL) settings
native_transport_max_threads: 128  # Threads handling client requests

# Concurrent writes
concurrent_writes: 32              # Threads for local writes (MemTable + commit log)
                                   # Rule of thumb: 8 × number of cores

# Concurrent compactors
concurrent_compactors: 4           # Threads for background compaction
                                   # Rule of thumb: min(cores, max(2, cores / 2))

# Batch size limits (prevent huge batches from causing issues)
batch_size_warn_threshold_in_kb: 5   # Log warning
batch_size_fail_threshold_in_kb: 50  # Reject batch
```

Write-optimized doesn't mean read-neglected. Wide-column stores provide mechanisms to maintain acceptable read performance even with high write throughput.
Read Amplification in LSM-Trees
The trade-off for fast writes is read amplification—potentially checking multiple locations for each read:

1. The active MemTable (and any immutable MemTables awaiting flush).
2. Every L0 SSTable, since their key ranges may overlap.
3. Up to one SSTable per deeper level.
Without mitigation, a read might check 10-20 files. Mitigations include:

- **Bloom filters**: a per-SSTable probabilistic filter that lets reads skip files that definitely don't contain the key.
- **Key and partition-index caches**: avoid re-reading index blocks for hot keys.
- **Compaction**: fewer, better-organized SSTables mean fewer candidate files per read.
- **Block and row caches**: serve repeated reads entirely from memory.
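Bloom filters are the workhorse mitigation. A toy version—not any engine's real implementation; the hash is a simple seeded FNV-1a variant chosen for brevity—shows the mechanics: a few hash probes over a bit array answer "definitely absent" or "possibly present":

```typescript
// Toy Bloom filter: k seeded hash probes set/check bits in a fixed-size
// array. "false" means the key is definitely not in the SSTable, so the
// read path can skip that file without touching disk; "true" means it
// *might* be there (false positives possible, false negatives impossible).
class BloomFilter {
  private bits: Uint8Array;

  constructor(private size: number, private hashes: number) {
    this.bits = new Uint8Array(size);
  }

  // Seeded FNV-1a-style string hash; real engines use murmur or xxHash.
  private hash(key: string, seed: number): number {
    let h = 2166136261 ^ seed;
    for (let i = 0; i < key.length; i++) {
      h ^= key.charCodeAt(i);
      h = Math.imul(h, 16777619);
    }
    return Math.abs(h) % this.size;
  }

  add(key: string): void {
    for (let s = 0; s < this.hashes; s++) this.bits[this.hash(key, s)] = 1;
  }

  mightContain(key: string): boolean {
    for (let s = 0; s < this.hashes; s++) {
      if (this.bits[this.hash(key, s)] === 0) return false;
    }
    return true;
  }
}
```

In an LSM engine, one filter is built per SSTable at flush time and kept in memory; every read consults the filter before opening the file, which is why `bloom_filter_fp_chance` trades memory directly for avoided disk reads.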
```yaml
# cassandra.yaml - Read Performance Settings

# Block cache (OS page cache is primary, this is JVM cache)
file_cache_size_in_mb: 512   # JVM file buffer cache

# Key cache (partition index entries)
key_cache_size_in_mb: 100
key_cache_save_period: 14400 # Save every 4 hours

# Row cache (use only for hot partition workloads)
row_cache_size_in_mb: 0      # Disabled by default, enable carefully
row_cache_save_period: 0

# Read-ahead for sequential scans
disk_optimization_strategy: ssd  # or 'spinning' for HDD

# Bloom filter settings (per-table)
# CREATE TABLE ... WITH bloom_filter_fp_chance = 0.01;  # 1% false positives
# Lower = more memory usage, fewer false reads
# Default: 0.01 (1%)

# sstable_preemptive_open_interval_in_mb: 50  # Preload SSTables for faster reads
```

Balancing Write and Read Performance
The art of wide-column store operation is balancing write throughput with read latency:
| Configuration | Write Impact | Read Impact |
|---|---|---|
| Larger MemTable | Higher throughput | Neutral to faster (recent data served from memory) |
| More aggressive compaction | Lower throughput (I/O used) | Faster reads (fewer files) |
| Higher bloom filter accuracy | More memory | Fewer false disk reads |
| Larger block cache | Memory used | Faster repeated reads |
| STCS compaction | Lower write amp | Higher read amp |
| LCS compaction | Higher write amp | Lower read amp |
Track pending compactions, SSTable count per table, read latency percentiles (p99, p999), and cache hit ratios. If pending compactions grow unboundedly, your write rate exceeds compaction capacity—either add nodes or reduce write rate.
We've explored why wide-column stores excel at write-heavy workloads and how to leverage this strength effectively. Let's consolidate the key insights:

- B-trees pay for writes with random I/O; LSM-trees make every disk write in the write path sequential.
- The cost doesn't disappear—it moves to background compaction, showing up as 10-30x write amplification.
- Compaction strategy is the main tuning lever: STCS for raw write throughput, LCS for read-heavy tables, TWCS for time-series with TTLs.
- Time-bucketed partition keys plus TTLs make wide-column stores a natural fit for time-series and event streaming workloads.
- Monitor pending compactions and SSTable counts—sustained growth means writes are outrunning compaction capacity.
What's Next:
We've covered the column-family model, explored Cassandra and HBase implementations, and understood write-optimized workloads. The final page of this module brings it all together with a comprehensive guide to use cases and trade-offs—helping you decide when wide-column stores are the right choice and when to look elsewhere.
You now understand why wide-column stores achieve 10-100x higher write throughput than traditional databases, how LSM-trees enable this, and the trade-offs involved. This knowledge enables you to design systems that leverage high write throughput while maintaining acceptable read performance.