Every architectural choice involves trade-offs. Key-value stores achieve their remarkable simplicity and performance by deliberately omitting features that traditional databases provide. Understanding these limitations is essential for making sound decisions about when to use—and when to avoid—key-value stores.
The limitations aren't bugs or missing features; they're fundamental consequences of the key-value data model. Attempting to work around them often results in complex, fragile systems that would be better served by a different database type.
By the end of this page, you will understand the fundamental limitations of key-value stores, recognize warning signs that a key-value store is the wrong choice, and know how to work within these constraints or choose alternatives.
The most fundamental limitation is that key-value stores provide no way to query data by anything other than the exact key. There's no SQL, no query optimizer, no way to say "find all users where status = 'active'".
What this means in practice:
```python
import json

# What you CAN do:
user = kv_store.get("user:123")  # Exact key lookup

# What you CANNOT do:
# SQL equivalent: SELECT * FROM users WHERE email = 'alice@example.com'
# Key-value: No built-in way to do this!

# Workaround: Build and maintain secondary indexes manually
class UserStore:
    def __init__(self, kv):
        self.kv = kv

    def create_user(self, user: dict):
        user_id = user["id"]
        # Primary storage
        self.kv.set(f"user:{user_id}", json.dumps(user))
        # Manual secondary index
        self.kv.set(f"user:email:{user['email']}", user_id)

    def get_by_email(self, email: str) -> dict:
        # Two lookups required
        user_id = self.kv.get(f"user:email:{email}")
        if not user_id:
            return None
        return json.loads(self.kv.get(f"user:{user_id}"))

    def update_email(self, user_id: str, new_email: str):
        user = json.loads(self.kv.get(f"user:{user_id}"))
        old_email = user["email"]
        # Must update both the primary data AND the index
        user["email"] = new_email
        self.kv.set(f"user:{user_id}", json.dumps(user))
        self.kv.delete(f"user:email:{old_email}")
        self.kv.set(f"user:email:{new_email}", user_id)
        # If any step fails, data is inconsistent!
```

Every secondary index you build is a maintenance burden. You must update indexes on every write, handle partial failures, and ensure consistency. The more indexes you need, the more attractive a document database or relational database becomes.
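The partial-failure danger is easy to demonstrate. The sketch below simulates a crash between the primary write and the index write, using a dict-backed stand-in for the store (the `DictKV` class and the sample keys are assumptions for illustration, not a real client):

```python
import json

class DictKV:
    """Dict-backed stand-in for a key-value client (illustrative only)."""
    def __init__(self):
        self._data = {}
    def set(self, key, value):
        self._data[key] = value
    def get(self, key):
        return self._data.get(key)
    def delete(self, key):
        self._data.pop(key, None)

kv = DictKV()
kv.set("user:123", json.dumps({"id": "123", "email": "a@old.com"}))
kv.set("user:email:a@old.com", "123")

# Simulate a partial failure: primary record updated, index writes never run
user = json.loads(kv.get("user:123"))
user["email"] = "a@new.com"
kv.set("user:123", json.dumps(user))   # step 1 succeeds
# -- process crashes here: delete old index / set new index never happen --

# The index now disagrees with the primary record:
print(kv.get("user:email:a@new.com"))  # None — lookup by new email fails
print(kv.get("user:email:a@old.com"))  # 123  — stale index entry remains
```

Without multi-key transactions, closing this window requires either a store-side atomic construct (such as a Lua script in Redis) or a background repair job that scans for dangling index entries.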
No support for relational operations. Key-value stores have no concept of relationships between entities. There's no foreign key, no JOIN operation, no referential integrity.
The consequences:
```python
import json

# SQL: One query with JOIN
# SELECT u.name, o.id, o.total
# FROM users u JOIN orders o ON u.id = o.user_id
# WHERE u.id = 123

# Key-value: Multiple queries required
def get_user_with_orders(user_id: str):
    # Query 1: Get the user
    user = json.loads(kv.get(f"user:{user_id}"))
    # Query 2: Get the list of order IDs
    order_ids = json.loads(kv.get(f"user:{user_id}:orders") or "[]")
    # Query 3+: Get each order (or batch them with MGET)
    orders = []
    for order_id in order_ids:
        orders.append(json.loads(kv.get(f"order:{order_id}")))
    return {"user": user, "orders": orders}
    # Minimum 3 round-trips vs 1 SQL query!

# Alternative: Denormalize (duplicate user info inside each order)
# Trades storage and consistency for fewer queries
```

Denormalization is acceptable when: (1) the duplicated data rarely changes, (2) eventual consistency is acceptable, and (3) read performance matters more than storage efficiency. For frequently changing relational data, use a relational database.
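The per-order loop is where latency piles up; most key-value clients offer a multi-get that fetches a batch of keys in one round-trip, like Redis MGET. A runnable sketch against a dict-backed stand-in (the `DictKV` class and the sample data are assumptions for illustration):

```python
import json

class DictKV:
    """Dict-backed stand-in for a key-value client (illustrative only)."""
    def __init__(self):
        self._data = {}
    def set(self, key, value):
        self._data[key] = value
    def get(self, key):
        return self._data.get(key)
    def mget(self, keys):
        # One round-trip for many keys, mirroring Redis MGET semantics
        return [self._data.get(k) for k in keys]

kv = DictKV()
kv.set("user:123", json.dumps({"id": "123", "name": "Alice"}))
kv.set("user:123:orders", json.dumps(["o1", "o2"]))
kv.set("order:o1", json.dumps({"id": "o1", "total": 10}))
kv.set("order:o2", json.dumps({"id": "o2", "total": 25}))

def get_user_with_orders(user_id):
    user = json.loads(kv.get(f"user:{user_id}"))                      # round-trip 1
    order_ids = json.loads(kv.get(f"user:{user_id}:orders") or "[]")  # round-trip 2
    raw = kv.mget([f"order:{oid}" for oid in order_ids])              # round-trip 3
    return {"user": user, "orders": [json.loads(r) for r in raw]}

result = get_user_with_orders("123")
print(len(result["orders"]))  # 2 — all orders fetched in 3 round-trips total
```

Batching caps the round-trips at three regardless of order count, but it is still three dependent round-trips where SQL needs one.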
No COUNT, SUM, AVG, GROUP BY. Key-value stores cannot compute aggregations across data. Every aggregate value must be pre-computed and maintained manually.
| Operation | SQL | Key-Value Approach |
|---|---|---|
| Count users | SELECT COUNT(*) FROM users | Maintain stats:users:count, increment on create, decrement on delete |
| Sum revenue | SELECT SUM(amount) FROM orders | Maintain stats:revenue:total, increment on each order |
| Group by status | SELECT status, COUNT(*) GROUP BY status | Maintain separate counters: stats:users:status:active, stats:users:status:inactive |
| Average order | SELECT AVG(amount) FROM orders | Maintain sum AND count, compute ratio on read |
Problems with pre-computed aggregates: counters drift whenever a write path fails partway or skips an update; concurrent increments race unless the store provides an atomic increment; and a question you didn't anticipate (say, revenue by region) cannot be answered retroactively without scanning every key.
If you need ad-hoc analytics, reporting, or business intelligence, key-value stores are fundamentally wrong. Use a relational database, data warehouse, or specialized analytics database.
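The counter approach from the table above can be made concrete with a dict-backed stand-in. Key names like `stats:orders:count` follow the table; the `DictKV` class and its `incrby` method (modeled on Redis INCRBY, which is atomic server-side) are assumptions for illustration:

```python
class DictKV:
    """Dict-backed stand-in for a key-value client (illustrative only)."""
    def __init__(self):
        self._data = {}
    def incrby(self, key, amount=1):
        # Single-key increment; the real Redis INCRBY is atomic
        self._data[key] = self._data.get(key, 0) + amount
        return self._data[key]
    def get(self, key):
        return self._data.get(key, 0)

kv = DictKV()

def record_order(amount):
    # Every aggregate must be maintained on the write path, forever
    kv.incrby("stats:orders:count", 1)
    kv.incrby("stats:revenue:total", amount)

for amount in (10, 20, 30):
    record_order(amount)

# Average = maintained sum / maintained count, computed at read time
count = kv.get("stats:orders:count")
total = kv.get("stats:revenue:total")
print(total / count)  # 20.0
```

Note what's missing: if `record_order` ever crashes between the two increments, count and total silently diverge, and there is no query that can recompute either one.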
In-memory stores are bounded by RAM. Redis and Memcached keep all data in memory. Your dataset size is limited by available RAM, which is orders of magnitude more expensive than disk storage.
| Storage Type | Cost per GB/month | Latency | Relative Cost |
|---|---|---|---|
| Redis Cloud | $30-100 | ~100μs | 100x |
| SSD Cloud Storage | $0.10-0.30 | ~1ms | 1x |
| HDD Cloud Storage | $0.02-0.05 | ~10ms | 0.1x |
| Object Storage (S3) | $0.02 | ~50ms | 0.1x |
The practical implication is that every byte counts, so memory-conscious habits are mandatory:
```python
# Memory-conscious practices

# 1. Always set a TTL on ephemeral data
redis.setex("cache:query:abc", 3600, result)  # Expires in 1 hour

# 2. Use appropriate data structures
# A hash with small values uses less memory than individual keys
redis.hset("user:123", mapping={"name": "Alice", "email": "a@b.com"})
# vs
redis.set("user:123:name", "Alice")
redis.set("user:123:email", "a@b.com")

# 3. Compress large values before storing
import zlib
compressed = zlib.compress(large_json.encode())
redis.set("large:data", compressed)

# 4. Use memory-efficient data types
# - Small hashes/sets/lists use ziplist encoding (dense)
# - Configure hash-max-ziplist-entries and similar settings

# 5. Monitor memory usage
info = redis.info("memory")
used_memory = info["used_memory"]
maxmemory = info["maxmemory"]
usage_percent = (used_memory / maxmemory) * 100
```

When Redis reaches maxmemory with an eviction policy like 'allkeys-lru', it silently deletes keys. This is fine for cache workloads but disastrous if you're using Redis as a primary database. Always monitor memory and plan for growth.
No ACID transactions across keys. While single-key operations are atomic, multi-key operations lack true transactional guarantees. Redis MULTI/EXEC provides atomicity but not isolation or rollback.
```python
# Redis transaction limitation: No rollback on command errors

pipe = redis.pipeline()
pipe.multi()
pipe.set("key1", "value1")   # Will succeed
pipe.incr("key2")            # Will FAIL if key2 is not a number
pipe.set("key3", "value3")   # Still executes!
results = pipe.execute(raise_on_error=False)
# Results: [True, ResponseError, True]
# key1 and key3 are set, key2 failed
# No automatic rollback!

# Workaround: Use Lua for true atomic operations
lua_script = """
local current = redis.call('GET', KEYS[1])
if current then
    local new_value = tonumber(current) + 1
    redis.call('SET', KEYS[1], new_value)
    redis.call('SET', KEYS[2], new_value)
    return new_value
else
    return nil
end
"""
# The entire script executes atomically, with the ability to abort
```

Replication is asynchronous by default. In distributed key-value stores, writes may be acknowledged before replicating to all nodes. This means replicas can serve stale reads, and acknowledged writes can be lost if the primary fails before replicating them:
| System | Default Consistency | Strong Consistency Option |
|---|---|---|
| Redis Replication | Async (eventual) | WAIT command (blocks until replicated) |
| Redis Cluster | Async (eventual) | None built-in |
| DynamoDB | Eventually consistent | Strongly consistent reads (2x cost) |
| Cassandra | Tunable | ALL/QUORUM write + read |
If a Redis primary fails before replicating recent writes to replicas, those writes are lost permanently. For truly critical data, either use synchronous replication (WAIT) at the cost of latency, or use a different database with stronger durability guarantees.
Simple interface, complex operations. While the API is simple, running key-value stores in production introduces operational challenges.
Consider managed services (AWS ElastiCache, Redis Cloud, Azure Cache) to offload operational complexity. They handle replication, failover, patching, and monitoring, letting you focus on application logic.
The bottom line:
Key-value stores are specialized tools, not universal solutions. They excel at lookup-by-key patterns with simple data models. When you need complex queries, relationships, aggregations, or strong consistency, traditional relational databases remain the better choice.
The best architectures often use both: a relational database as the source of truth, with key-value stores for caching, sessions, and real-time features. This polyglot persistence approach leverages each tool's strengths.
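The usual way to combine the two is the cache-aside pattern: read from the key-value store first, fall back to the relational database on a miss, and populate the cache on the way back. A minimal runnable sketch in which `cache`, `db`, and `db_reads` are dict-backed stand-ins (assumptions for illustration, not real clients):

```python
# Stand-ins for a Redis client and a relational database (illustrative only)
cache = {}
db = {"user:123": {"id": "123", "name": "Alice"}}
db_reads = 0

def get_user(user_id):
    global db_reads
    key = f"user:{user_id}"
    if key in cache:            # 1. Try the key-value store first
        return cache[key]
    db_reads += 1               # 2. Miss: read the source of truth
    user = db.get(key)
    if user is not None:
        cache[key] = user       # 3. Populate the cache (with a TTL in real Redis)
    return user

get_user("123")   # miss: hits the database
get_user("123")   # hit: served from the cache
print(db_reads)   # 1
```

The relational database stays authoritative, so cache loss is a performance event rather than a data-loss event; in production you would also set a TTL and invalidate or update the cached entry on writes.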
Congratulations! You've completed the Key-Value Stores module. You now understand the fundamental concepts, data modeling, Redis as a canonical example, ideal use cases, and honest limitations. You're equipped to make informed decisions about when key-value stores are the right tool for your architecture.