Apache Cassandra is the most widely deployed wide-column store in production environments worldwide. Originally developed at Facebook for inbox search and released as open source in 2008, Cassandra has become the go-to choice for applications requiring massive write throughput, linear horizontal scalability, and always-on availability across datacenters.
Companies like Netflix, Apple, Instagram, Spotify, and Uber run Cassandra clusters with thousands of nodes, handling petabytes of data. Understanding Cassandra isn't just academic—it's preparation for real-world distributed system challenges.
This page moves from theory to practice. We'll design schemas, write queries, and explore the operational aspects that distinguish hobby projects from production systems.
By the end of this page, you will:

- Understand CQL (Cassandra Query Language) deeply
- Be able to design Cassandra schemas for common use cases
- Understand primary key design and its implications
- Know how to model data for query patterns
- Understand operational considerations like compaction and repair
CQL (Cassandra Query Language) looks superficially like SQL but has fundamentally different semantics. Understanding these differences is crucial for effective Cassandra development.
A keyspace is Cassandra's equivalent of a database schema. It defines the replication strategy, the number of replicas per datacenter, and whether writes are recorded in the commit log (durable writes).
```sql
-- Development keyspace (single datacenter, low replication)
CREATE KEYSPACE IF NOT EXISTS dev_app
WITH REPLICATION = {
    'class': 'SimpleStrategy',
    'replication_factor': 1
};

-- Production keyspace (multi-datacenter, high availability)
CREATE KEYSPACE IF NOT EXISTS prod_app
WITH REPLICATION = {
    'class': 'NetworkTopologyStrategy',
    'us-east-1': 3,   -- 3 replicas in US East
    'eu-west-1': 3,   -- 3 replicas in EU West
    'ap-south-1': 2   -- 2 replicas in Asia Pacific
}
AND DURABLE_WRITES = true;  -- Commit log enabled (default)

-- Switch to keyspace
USE prod_app;

-- View keyspace details
DESCRIBE KEYSPACE prod_app;
```

Unlike relational tables that model entities, Cassandra tables model query patterns. Each table is designed to answer specific queries efficiently.
Key Concepts:
| Term | Description |
|---|---|
| Partition Key | Determines data distribution; all rows with the same partition key are stored together on the same replica nodes |
| Clustering Columns | Sort data within a partition; enable range queries within partition |
| Primary Key | Combination of partition key + clustering columns; unique identifier |
| Static Columns | Shared across all rows in a partition |
| Regular Columns | Standard data columns |
```sql
-- Simple primary key: single-column partition key
CREATE TABLE users (
    user_id UUID PRIMARY KEY,
    email TEXT,
    name TEXT,
    created_at TIMESTAMP
);

-- Compound primary key: partition key + clustering column
-- Partition key: user_id
-- Clustering column: post_time (orders data within partition)
CREATE TABLE user_posts (
    user_id UUID,
    post_time TIMESTAMP,
    post_id UUID,
    content TEXT,
    PRIMARY KEY (user_id, post_time)
) WITH CLUSTERING ORDER BY (post_time DESC);  -- Most recent first

-- Composite partition key: multiple columns form the partition key
-- Partition key: (tenant_id, user_id) -- both required for lookups
-- Useful for multi-tenant applications
CREATE TABLE tenant_user_events (
    tenant_id UUID,
    user_id UUID,
    event_time TIMESTAMP,
    event_type TEXT,
    event_data TEXT,
    PRIMARY KEY ((tenant_id, user_id), event_time)
) WITH CLUSTERING ORDER BY (event_time DESC);

-- Multiple clustering columns: hierarchical sorting
CREATE TABLE sensor_readings (
    sensor_id UUID,
    year INT,
    month INT,
    day INT,
    reading_time TIMESTAMP,
    value DECIMAL,
    PRIMARY KEY (sensor_id, year, month, day, reading_time)
) WITH CLUSTERING ORDER BY (year DESC, month DESC, day DESC, reading_time DESC);
```

Primary key columns cannot be updated after insertion. If you need to 'change' a primary key value, you must delete the old row and insert a new one. This is by design—the primary key determines physical location, and changing it would require moving data.
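One concept from the table above has no example yet: static columns. Below is a minimal sketch using a hypothetical `cart_items` table; a static column is stored once per partition and shared by every row in it.

```sql
-- Hypothetical shopping-cart table illustrating a STATIC column
CREATE TABLE cart_items (
    cart_id UUID,             -- Partition key
    item_id UUID,             -- Clustering column
    quantity INT,
    cart_status TEXT STATIC,  -- One value per cart, shared by all items
    PRIMARY KEY (cart_id, item_id)
);

-- Updating the static column changes it for every row in the partition
UPDATE cart_items SET cart_status = 'checked_out' WHERE cart_id = ?;
```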
Cassandra supports rich data types including native types, collections, and user-defined types.
| Type | Description | Example |
|---|---|---|
| UUID | Universally unique identifier | 550e8400-e29b-41d4-a716-446655440000 |
| TIMEUUID | Time-ordered UUID (v1) | 50554d6e-29bb-11e5-b345-feff819cdc9f |
| TEXT/VARCHAR | UTF-8 string | 'Hello World' |
| INT/BIGINT | 32/64-bit integer | 42, 9223372036854775807 |
| DECIMAL | Variable precision | 3.14159265359 |
| DOUBLE | 64-bit floating point | 3.14159 |
| BOOLEAN | true/false | true |
| TIMESTAMP | Date+time in milliseconds | '2024-01-15T10:30:00+0000' |
| DATE | Date only | '2024-01-15' |
| TIME | Time only (nanoseconds) | '10:30:00.000000000' |
| BLOB | Binary data | 0xCAFEBABE |
| INET | IP address | '192.168.1.1' |
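Since TIMEUUID appears throughout the schemas that follow, here is a minimal sketch of how it combines uniqueness with time ordering, assuming a hypothetical `events` table.

```sql
-- now() generates a TIMEUUID at write time; toTimestamp() extracts
-- the embedded time for display
CREATE TABLE events (
    stream_id UUID,
    event_id TIMEUUID,
    payload TEXT,
    PRIMARY KEY (stream_id, event_id)
) WITH CLUSTERING ORDER BY (event_id DESC);

INSERT INTO events (stream_id, event_id, payload) VALUES (?, now(), 'created');

SELECT toTimestamp(event_id) AS occurred_at, payload
FROM events
WHERE stream_id = ?;
```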
```sql
-- LIST: Ordered collection, allows duplicates
-- Use for: Order history, comment threads
CREATE TABLE articles (
    article_id UUID PRIMARY KEY,
    title TEXT,
    tags LIST<TEXT>,        -- ['tech', 'database', 'cassandra']
    related_ids LIST<UUID>
);

-- Update list
UPDATE articles SET tags = tags + ['nosql'] WHERE article_id = ?;
UPDATE articles SET tags = ['new_tag'] + tags WHERE article_id = ?;  -- Prepend
UPDATE articles SET tags[0] = 'updated' WHERE article_id = ?;        -- By index

-- SET: Unordered collection, no duplicates
-- Use for: Unique tags, roles, permissions
CREATE TABLE user_settings (
    user_id UUID PRIMARY KEY,
    roles SET<TEXT>,        -- {'admin', 'user', 'moderator'}
    favorite_categories SET<TEXT>
);

-- Update set
UPDATE user_settings SET roles = roles + {'editor'} WHERE user_id = ?;
UPDATE user_settings SET roles = roles - {'user'} WHERE user_id = ?;

-- MAP: Key-value pairs
-- Use for: Attributes, metadata, preferences
CREATE TABLE products (
    product_id UUID PRIMARY KEY,
    name TEXT,
    attributes MAP<TEXT, TEXT>,  -- {'color': 'blue', 'size': 'XL'}
    prices MAP<TEXT, DECIMAL>    -- {'USD': 99.99, 'EUR': 89.99}
);

-- Update map
UPDATE products SET attributes['weight'] = '500g' WHERE product_id = ?;
UPDATE products SET attributes = attributes + {'material': 'cotton'} WHERE product_id = ?;
```

For complex nested structures, define custom types:
```sql
-- Create UDT for address
CREATE TYPE address (
    street TEXT,
    city TEXT,
    state TEXT,
    zip_code TEXT,
    country TEXT
);

-- Create UDT for contact info
CREATE TYPE contact_info (
    email TEXT,
    phone TEXT,
    address FROZEN<address>  -- Nested UDT must be frozen
);

-- Use UDTs in a table
CREATE TABLE customers (
    customer_id UUID PRIMARY KEY,
    name TEXT,
    billing_address FROZEN<address>,
    shipping_address FROZEN<address>,
    contact FROZEN<contact_info>
);

-- Insert with UDT
INSERT INTO customers (customer_id, name, billing_address)
VALUES (
    uuid(),
    'John Doe',
    {
        street: '123 Main St',
        city: 'New York',
        state: 'NY',
        zip_code: '10001',
        country: 'USA'
    }
);

-- Query UDT fields
SELECT name, billing_address.city FROM customers WHERE customer_id = ?;
```

The FROZEN keyword treats the entire collection/UDT as a blob—you must replace it entirely to update. Non-frozen collections allow element-level updates but have size limits and performance implications. Use FROZEN for UDTs in primary keys and when you typically update the whole structure.
Cassandra queries are constrained by primary key structure. Understanding what queries are efficient—and which are impossible—is essential for schema design.
```sql
-- Table for examples
CREATE TABLE orders (
    customer_id UUID,
    order_date DATE,
    order_time TIMESTAMP,
    order_id UUID,
    total DECIMAL,
    status TEXT,
    PRIMARY KEY (customer_id, order_date, order_time)
) WITH CLUSTERING ORDER BY (order_date DESC, order_time DESC);

-- ✅ EFFICIENT: Full partition key with equality
SELECT * FROM orders WHERE customer_id = ?;

-- ✅ EFFICIENT: Partition key + clustering column range
SELECT * FROM orders
WHERE customer_id = ?
AND order_date >= '2024-01-01' AND order_date <= '2024-01-31';

-- ✅ EFFICIENT: Partition key + prefix of clustering columns
SELECT * FROM orders
WHERE customer_id = ?
AND order_date = '2024-01-15';

-- ❌ INEFFICIENT: Missing partition key (full cluster scan)
SELECT * FROM orders WHERE order_date = '2024-01-15';
-- Error: Cannot execute this query... ALLOW FILTERING

-- ❌ INEFFICIENT: Skipping clustering column
SELECT * FROM orders
WHERE customer_id = ?
AND order_time > '2024-01-15T10:00:00';  -- Skipped order_date!
-- Error: PRIMARY KEY column "order_date" cannot be restricted

-- ⚠️ DANGEROUS: Non-key column filter
SELECT * FROM orders
WHERE customer_id = ?
AND status = 'pending'
ALLOW FILTERING;  -- Scans all rows for this customer!

-- ✅ EFFICIENT: Ordering matches clustering order
SELECT * FROM orders
WHERE customer_id = ?
ORDER BY order_date DESC, order_time DESC;

-- ❌ INVALID: Ordering conflicts with clustering
SELECT * FROM orders
WHERE customer_id = ?
ORDER BY order_date ASC;  -- Table ordered DESC!
-- Error: Order by currently only supports the ordering specified at table definition
```

ALLOW FILTERING lets Cassandra run otherwise-rejected queries by scanning rows and discarding non-matches, which is almost never acceptable in production.
When you need ALLOW FILTERING, redesign your schema instead. Create a new table optimized for that query pattern.
If your code contains ALLOW FILTERING, you almost certainly have a schema design problem. The rare exceptions are: small tables (hundreds of rows), one-time admin queries, or analytics on replicas. Never use it for user-facing production queries.
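For the dangerous status filter shown above, the redesign might look like the sketch below; the table name `orders_by_status` is hypothetical.

```sql
-- Status becomes part of the partition key, so the query is a
-- direct partition read instead of a filtered scan
CREATE TABLE orders_by_status (
    customer_id UUID,
    status TEXT,
    order_time TIMESTAMP,
    order_id UUID,
    total DECIMAL,
    PRIMARY KEY ((customer_id, status), order_time)
) WITH CLUSTERING ORDER BY (order_time DESC);

-- Replaces the ALLOW FILTERING query
SELECT * FROM orders_by_status
WHERE customer_id = ? AND status = 'pending';
```

The trade-off: when an order's status changes, the application must delete the row from the old partition and insert it into the new one.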
Pattern 1: Time-Series Partitioning
```sql
-- Problem: IoT sensor data with millions of readings
-- Query: Get recent readings for a sensor

-- Bad design: All readings in one partition
CREATE TABLE sensor_data_bad (
    sensor_id UUID,
    reading_time TIMESTAMP,
    value DECIMAL,
    PRIMARY KEY (sensor_id, reading_time)
);
-- Problem: Partition grows unbounded!

-- Good design: Time-bucketed partitions
CREATE TABLE sensor_data (
    sensor_id UUID,
    day DATE,  -- Bucket by day
    reading_time TIMESTAMP,
    value DECIMAL,
    PRIMARY KEY ((sensor_id, day), reading_time)
) WITH CLUSTERING ORDER BY (reading_time DESC);

-- Query today's readings for a sensor
SELECT * FROM sensor_data
WHERE sensor_id = ? AND day = toDate(now())
LIMIT 100;

-- Query last 7 days (requires 7 queries, one per partition)
-- Application-side: Loop over date range
-- Or use: IN clause (limited to small sets)
SELECT * FROM sensor_data
WHERE sensor_id = ?
AND day IN ('2024-01-15', '2024-01-14', '2024-01-13', ...)
LIMIT 100;
```

Cassandra schema design follows a disciplined process that starts with queries, not entities. This is fundamentally different from relational design.
Step 1: Define Your Queries
Before writing any schema, enumerate all queries your application needs:
Q1: Get user profile by user_id
Q2: Get all posts by a user, ordered by time (most recent first)
Q3: Get all comments on a post, ordered by time
Q4: Get user's timeline (posts from users they follow)
Q5: Get trending posts in a category
Step 2: Design Tables for Each Query
Each query typically becomes one table. Yes, this means denormalization. Embrace it.
Step 3: Analyze Access Patterns
For each table: estimate the partition size and how it grows over time, the read/write ratio, and how hot the busiest partitions will be.
```sql
-- Q1: Get user profile by user_id
-- Simple key-value lookup
CREATE TABLE users (
    user_id UUID PRIMARY KEY,
    username TEXT,
    email TEXT,
    bio TEXT,
    profile_image_url TEXT,
    created_at TIMESTAMP
);

-- Q2: Get all posts by a user, ordered by time
-- Partition by user, cluster by time
CREATE TABLE posts_by_user (
    user_id UUID,
    post_id TIMEUUID,
    content TEXT,
    media_urls SET<TEXT>,
    PRIMARY KEY (user_id, post_id)
) WITH CLUSTERING ORDER BY (post_id DESC);

-- Counter columns cannot mix with regular columns,
-- so like counts live in a dedicated counter table
CREATE TABLE post_like_counts (
    post_id TIMEUUID PRIMARY KEY,
    like_count COUNTER
);

-- Q3: Get all comments on a post, ordered by time
-- Partition by post, cluster by time
CREATE TABLE comments_by_post (
    post_id TIMEUUID,
    comment_id TIMEUUID,
    user_id UUID,
    content TEXT,
    PRIMARY KEY (post_id, comment_id)
) WITH CLUSTERING ORDER BY (comment_id ASC);  -- Oldest first

-- Q4: Get user's timeline
-- Denormalized timeline per user (fan-out on write)
CREATE TABLE timeline_by_user (
    user_id UUID,
    post_id TIMEUUID,
    author_id UUID,
    author_username TEXT,  -- Denormalized! Avoids a join
    content_preview TEXT,  -- First 200 chars
    PRIMARY KEY (user_id, post_id)
) WITH CLUSTERING ORDER BY (post_id DESC)
AND default_time_to_live = 2592000;  -- 30 days TTL

-- Q5: Get trending posts in a category
-- Bucketed by time window + category
CREATE TABLE trending_posts (
    time_bucket TEXT,  -- '2024-01-15T10' (hourly bucket)
    category TEXT,
    score DECIMAL,     -- Trending score
    post_id TIMEUUID,
    author_username TEXT,
    content_preview TEXT,
    PRIMARY KEY ((time_bucket, category), score, post_id)
) WITH CLUSTERING ORDER BY (score DESC, post_id DESC);
```

When data is denormalized across tables, updates require touching multiple tables. Strategies:
1. Logged Batches (Keeping Denormalized Tables in Sync)
```sql
-- Logged batch: all statements are eventually applied together
-- (atomic), even across partitions, at the cost of extra coordination.
-- Isolation is only guaranteed within a single partition.
-- Use when: denormalized tables must stay in sync
BEGIN BATCH
    INSERT INTO posts_by_user (user_id, post_id, content) VALUES (?, ?, ?);
    INSERT INTO timeline_by_user (user_id, post_id, author_id, author_username, content_preview)
    VALUES (?, ?, ?, ?, ?);
APPLY BATCH;

-- UNLOGGED batch: No atomicity guarantee, slightly faster
-- Use when: updates affect different partitions but you want them
-- sent together (reduces round trips)
BEGIN UNLOGGED BATCH
    INSERT INTO posts_by_user (user_id, post_id, content) VALUES (?, ?, ?);
    INSERT INTO posts_by_category (category, post_id, user_id, content) VALUES (?, ?, ?, ?);
APPLY BATCH;
```

2. Eventual Consistency with Async Writers
For high-throughput scenarios, accept that denormalized views may be slightly behind:
User creates post:
1. Write to posts_by_user (synchronous, visible immediately)
2. Queue message to update timelines, trending, etc.
3. Background workers fan out the updates (eventual; see the sketch below)
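The statements a worker issues during fan-out might look like this sketch; the `followers_by_user` table is a hypothetical addition for looking up who follows the author.

```sql
-- Hypothetical follower lookup: who should receive this post?
SELECT follower_id FROM followers_by_user WHERE user_id = ?;

-- Then, once per follower (issued asynchronously by the worker):
INSERT INTO timeline_by_user
    (user_id, post_id, author_id, author_username, content_preview)
VALUES (?, ?, ?, ?, ?);
```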
3. Materialized Views (Use Cautiously)
Cassandra supports materialized views that sync with the base table automatically, but they have serious limitations: they add overhead to every base-table write, can silently drift out of sync, complicate repair, and have long been flagged as experimental by the project.
Production recommendation: Prefer explicit denormalization over materialized views. It's more predictable and debuggable.
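For reference, the syntax looks like the sketch below (the view name is illustrative). Every base-table primary key column must appear in the view's primary key, and filtered columns require IS NOT NULL.

```sql
-- A view Cassandra maintains automatically; contrast with the explicit
-- users_by_email lookup table shown later on this page
CREATE MATERIALIZED VIEW users_by_email_mv AS
    SELECT user_id, email, username FROM users
    WHERE email IS NOT NULL AND user_id IS NOT NULL
    PRIMARY KEY (email, user_id);
```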
Always estimate partition sizes. A partition should typically be under 100MB. If you expect unbounded growth (like all user posts forever), bucket by time. Calculate: rows × row_size × expected_growth. If it exceeds limits, add bucketing to partition key.
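A worked version of that calculation for the sensor schema from earlier, with illustrative numbers:

```sql
-- 1 reading/second              = 86,400 rows/day per sensor
-- ~50 bytes per row (assumed)   = ~4.3 MB/day per sensor
-- Unbucketed: 365 days x 4.3 MB = ~1.6 GB per partition, far past 100MB
-- Bucketed by day (as in sensor_data): each partition stays near 4.3 MB
```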
Cassandra's default eventually-consistent model doesn't guarantee strict ordering of operations. Lightweight Transactions (LWT) provide linearizable consistency—at a cost.
```sql
-- IF NOT EXISTS: Insert only if row doesn't exist
-- Use for: User registration (unique email)
INSERT INTO users (user_id, email, username)
VALUES (uuid(), 'alice@example.com', 'alice')
IF NOT EXISTS;
-- Returns: [applied] = true/false

-- Result handling:
-- [applied] | user_id | email             | username
-- true      | null    | null              | null      <- Success
-- false     | abc-123 | alice@example.com | alice     <- Already exists

-- IF condition: Update only if condition matches
-- Use for: Optimistic locking, state transitions
UPDATE orders
SET status = 'shipped'
WHERE order_id = ?
IF status = 'processing';  -- Only if currently 'processing'

-- IF EXISTS: Update only if row exists
UPDATE users
SET email = 'newemail@example.com'
WHERE user_id = ?
IF EXISTS;

-- Multiple conditions (optimistic concurrency: read the current
-- quantity first, then write the decremented value only if it is
-- unchanged; CQL cannot compute quantity - 1 server-side for
-- non-counter columns)
UPDATE inventory
SET quantity = ?       -- New value computed client-side
WHERE product_id = ?
IF quantity = ?        -- Value read before the update (and > 0)
AND reserved = false;  -- Not reserved

-- Delete with condition
DELETE FROM sessions
WHERE user_id = ?
IF session_token = ?;  -- Validate token before delete
```

LWT uses the Paxos consensus protocol, which is expensive:
| Aspect | Regular Write | LWT Write |
|---|---|---|
| Round Trips | 1 | 4 (prepare, promise, propose, commit) |
| Latency | ~2-5ms | ~20-50ms |
| Throughput | High | Low (contention-limited) |
| Lock Scope | None | Partition-level |
Guidelines for LWT: reserve it for operations where correctness genuinely requires linearizability (uniqueness checks, state-machine transitions, token validation); keep contended partitions small; and keep LWT off hot partitions in the request path.
Multiple concurrent LWT operations on the same partition serialize, creating contention. A high-traffic registration flow is fine when the email is the partition key: each IF NOT EXISTS check contends only with other attempts on that exact email. But if your LWTs all target one shared partition, such as a counter-like row, they queue behind each other and create a severe bottleneck.
Native secondary indexes in Cassandra have severe limitations. Understanding these limitations—and alternatives—is crucial.
Cassandra supports creating indexes on non-primary-key columns:
```sql
-- Create secondary index
CREATE INDEX users_email_idx ON users (email);

-- Now this query works without ALLOW FILTERING
SELECT * FROM users WHERE email = 'alice@example.com';

-- Index on a collection column
CREATE INDEX articles_tags_idx ON articles (tags);

-- Query the collection index
SELECT * FROM articles WHERE tags CONTAINS 'cassandra';

-- Index on map keys, values, or entries
CREATE INDEX product_attrs_keys_idx ON products (KEYS(attributes));
CREATE INDEX product_attrs_values_idx ON products (VALUES(attributes));
CREATE INDEX product_attrs_entries_idx ON products (ENTRIES(attributes));
```

1. Distributed Index Architecture
Unlike relational B-tree indexes, Cassandra's secondary indexes are local to each node. A query on an indexed column must fan out to every node (or token range), consult each node's local index, and merge the results on the coordinator.
This is essentially a full cluster scan.
2. Anti-Patterns
| Scenario | Problem |
|---|---|
| High-cardinality columns | Index becomes as large as data |
| Low-cardinality columns | Few values → hot spots |
| Frequently updated columns | Index maintenance overhead |
| Large result sets | Memory pressure on coordinator |
3. When Secondary Indexes Are Acceptable
Secondary indexes can be acceptable when the query also restricts the partition key, so the index lookup touches only the replicas for that partition, for example `WHERE user_id = ? AND status = 'active'`, and when the indexed column has medium cardinality and result sets stay small.

1. Denormalized Lookup Tables
Create a table with the query column as partition key:
```sql
-- Primary table
CREATE TABLE users (
    user_id UUID PRIMARY KEY,
    email TEXT,
    username TEXT
);

-- Lookup table: email → user_id
CREATE TABLE users_by_email (
    email TEXT PRIMARY KEY,
    user_id UUID
);

-- Insert requires both tables (batch or async)
BEGIN BATCH
    INSERT INTO users (user_id, email, username) VALUES (?, ?, ?);
    INSERT INTO users_by_email (email, user_id) VALUES (?, ?);
APPLY BATCH;

-- Query: Find user by email
-- Step 1: Get user_id from lookup table
SELECT user_id FROM users_by_email WHERE email = ?;
-- Step 2: Get full user record
SELECT * FROM users WHERE user_id = ?;
```

2. SASI Indexes (Experimental)
SASI (SSTable Attached Secondary Index) performs better for certain use cases, notably prefix/LIKE queries and sparse data, but it remains experimental and is not recommended for critical production workloads.
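A sketch of SASI usage on the users table; note that recent Cassandra releases ship with SASI disabled by default, so it must be enabled in cassandra.yaml first.

```sql
-- PREFIX mode supports LIKE 'prefix%' queries
CREATE CUSTOM INDEX users_username_sasi ON users (username)
USING 'org.apache.cassandra.index.sasi.SASIIndex'
WITH OPTIONS = { 'mode': 'PREFIX' };

SELECT * FROM users WHERE username LIKE 'ali%';
```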
3. External Search Systems
For complex search requirements (full-text search, faceting, fuzzy matching), feed the data into a dedicated search system such as Elasticsearch or Apache Solr and keep Cassandra as the source of truth.
Default to denormalized lookup tables. They're explicit, predictable, and scale. Use secondary indexes only when: queries are rare, result sets are small, and the indexed column has medium cardinality. For search use cases, integrate Elasticsearch.
Running Cassandra in production requires understanding key operational aspects.
Compaction merges SSTables and removes obsolete data. Strategy choice impacts read/write performance:
| Strategy | Best For | Trade-offs |
|---|---|---|
| SizeTieredCompactionStrategy (STCS) | Write-heavy, mixed workloads | Higher space amplification |
| LeveledCompactionStrategy (LCS) | Read-heavy, consistent latency | Higher write amplification |
| TimeWindowCompactionStrategy (TWCS) | Time-series with TTL | Efficient TTL, bucketed compaction |
```sql
-- Create table with a specific compaction strategy
CREATE TABLE metrics (
    sensor_id UUID,
    hour TIMESTAMP,
    value DECIMAL,
    PRIMARY KEY (sensor_id, hour)
) WITH compaction = {
    'class': 'TimeWindowCompactionStrategy',
    'compaction_window_unit': 'HOURS',
    'compaction_window_size': 1
}
AND default_time_to_live = 604800;  -- 7 days TTL

-- Change compaction strategy (online migration)
ALTER TABLE users WITH compaction = {
    'class': 'LeveledCompactionStrategy',
    'sstable_size_in_mb': 160
};

-- Review recent compaction history
SELECT * FROM system.compaction_history;
```

Regular repair ensures replicas stay in sync:
```bash
# Full repair (Merkle tree comparison)
nodetool repair -full keyspace_name

# Incremental repair (only unrepaired data)
nodetool repair keyspace_name

# Repair a specific table
nodetool repair keyspace_name table_name

# Check repair/streaming status
nodetool netstats
```
Repair Scheduling:

- Run a full repair on every node at least once within gc_grace_seconds (default 10 days)
- Stagger repairs across nodes and off-peak hours to limit the load impact

Key metrics and alert thresholds to monitor:

| Metric | Warning Threshold | Critical Threshold |
|---|---|---|
| Read Latency (p99) | > 10ms | > 100ms |
| Write Latency (p99) | > 5ms | > 50ms |
| Compaction Pending | > 50 | > 200 |
| Heap Usage | > 75% | > 90% |
| SSTable Count/Table | > 20 | > 50 |
| Dropped Mutations | > 0 | - |
Cassandra uses tombstones (deletion markers) that persist for gc_grace_seconds (default 10 days). If you don't repair within this window, deleted data can 'resurrect.' For tables with TTL-based deletion, ensure repairs run well within this window. Reduce gc_grace_seconds cautiously—it affects deletion propagation.
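If you do tune it, the setting is per table; the five-day value below is only illustrative:

```sql
-- Only safe if every node is repaired well within the new window
ALTER TABLE sensor_data WITH gc_grace_seconds = 432000;  -- 5 days
```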
We've covered extensive ground on Apache Cassandra, from CQL fundamentals to production operations. The key insights: design tables around your queries rather than your entities, embrace denormalization, choose primary keys that bound partition growth, avoid ALLOW FILTERING and unscoped secondary indexes, and treat compaction, repair, and monitoring as first-class concerns.
What's Next:
Now that we understand Cassandra's mechanics, the next page explores Time-Series Data—a workload where column-family stores truly shine, powering IoT, monitoring, and analytics applications.
You now have practical, hands-on knowledge of Apache Cassandra. These patterns—query-driven schema design, denormalization, and operational awareness—are directly applicable to production Cassandra deployments at any scale.