NoSQL databases aren't a rebellion against SQL—they're specialized tools for specific problems. Just as you wouldn't use a screwdriver when you need a hammer, you shouldn't use a key-value store for transactional data or a relational database for time-series metrics.
The key to choosing NoSQL wisely is understanding what problems each category solves better than relational databases. NoSQL wins when its specific optimizations align with your specific requirements—not as a general replacement for SQL, but as the optimal choice for particular use cases.
This page will help you recognize when NoSQL isn't just acceptable but genuinely better. We'll explore the requirements, access patterns, and system characteristics that point decisively toward non-relational databases.
By the end of this page, you will have clear frameworks for identifying NoSQL-appropriate use cases. You'll understand which requirements favor each NoSQL category, recognize data models suited for non-relational storage, and be able to articulate why NoSQL is the right choice when it is.
The original motivation for NoSQL was horizontal scalability—distributing data across many machines to handle loads beyond any single server's capacity. If your scale requirements genuinely exceed what a well-optimized SQL database can handle, NoSQL systems designed for distribution become compelling.
What Massive Scale Looks Like:
| Metric | SQL Comfortable Zone | NoSQL Territory |
|---|---|---|
| Data volume | < 10 TB per node | > 10 TB, growing rapidly |
| Writes per second | < 10,000 per node | > 100,000 sustained |
| Reads per second | < 100,000 with replicas | > 1 million per second |
| Geographic distribution | Single region with replicas | Multi-region active-active |
| Response time at scale | < 100ms P95 | < 10ms P99 guaranteed |
Why Horizontal Scale Is Hard for SQL:
Relational databases assume:
- A single authoritative node (or tightly coupled cluster) where all data is reachable for joins
- ACID transactions coordinated through a single write-ahead log and lock manager
- Queries that may touch any table at any time
These assumptions make distributed operation complex. Sharding SQL databases requires:
- Choosing a shard key and routing every query to the right shard, usually in application code
- Giving up (or painfully reimplementing) cross-shard joins and transactions
- Rebalancing data by hand as shards grow unevenly
NoSQL databases were designed with distribution as a first principle, not an afterthought. They accept limitations (no joins, limited transactions) in exchange for linear scalability.
Most companies claiming they need NoSQL for scale could run on a single PostgreSQL instance. Be honest about whether you're at Google/Netflix scale or just anticipating it. Premature optimization for scale you don't have wastes engineering effort.
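NoSQL systems sidestep the sharding problems above by making key-to-node routing a built-in primitive. A common technique is consistent hashing; here is a minimal sketch (node names and virtual-node count are illustrative, not any particular database's implementation):

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Maps keys to nodes. Adding or removing a node moves only ~1/N
    of the keys, which is what makes rebalancing cheap."""

    def __init__(self, nodes, vnodes=100):
        # Each physical node gets many virtual points on the ring
        # so keys spread evenly.
        self.ring = []  # sorted list of (hash, node)
        for node in nodes:
            for i in range(vnodes):
                self.ring.append((self._hash(f"{node}:{i}"), node))
        self.ring.sort()

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def node_for(self, key):
        # Walk clockwise to the first virtual node at or after the key's hash
        idx = bisect.bisect(self.ring, (self._hash(key), "")) % len(self.ring)
        return self.ring[idx][1]

ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
owner = ring.node_for("user:1001")  # deterministic routing, no coordinator
```

Because every client can compute the owner of a key locally, there is no central router to scale or shard by hand, which is exactly the property sharded SQL lacks.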
Schema flexibility is often cited as a NoSQL advantage, but it's frequently misapplied. The question isn't whether you want flexibility—it's whether your domain requires it.
Legitimate Schema Flexibility Use Cases:
// Product catalog with truly different attributes per product type
// Document store (MongoDB) handles this naturally

// Electronics product
{
  "_id": "prod_Electronics_001",
  "type": "electronics",
  "name": "4K Smart TV",
  "price": 799.99,
  "brand": "Samsung",
  // Electronics-specific attributes
  "specs": {
    "screen_size_inches": 55,
    "resolution": "3840x2160",
    "refresh_rate_hz": 120,
    "smart_platform": "Tizen",
    "hdmi_ports": 4,
    "power_consumption_watts": 150
  }
}

// Clothing product - completely different spec structure
{
  "_id": "prod_Clothing_001",
  "type": "clothing",
  "name": "Cotton T-Shirt",
  "price": 29.99,
  "brand": "Uniqlo",
  // Clothing-specific attributes
  "variants": [
    { "size": "S", "color": "White", "sku": "SHIRT-S-WHT", "stock": 50 },
    { "size": "M", "color": "White", "sku": "SHIRT-M-WHT", "stock": 100 },
    { "size": "L", "color": "Black", "sku": "SHIRT-L-BLK", "stock": 75 }
  ],
  "material": "100% Cotton",
  "care_instructions": "Machine wash cold"
}

// Book product - yet another structure
{
  "_id": "prod_Book_001",
  "type": "book",
  "name": "Designing Data-Intensive Applications",
  "price": 45.99,
  "brand": "O'Reilly Media",
  // Book-specific attributes
  "isbn": "978-1449373320",
  "author": "Martin Kleppmann",
  "pages": 624,
  "format": "Paperback",
  "publication_date": "2017-03-16",
  "language": "English"
}
When Flexibility Is Misapplied:
Many teams choose NoSQL for 'flexibility' when their data is actually structured:
// BAD: Using schema flexibility as excuse for laziness
// All users have the same fields—this should be SQL!
{ "_id": "u1", "name": "Alice", "email": "a@x.com" }
{ "_id": "u2", "Name": "Bob", "EMAIL": "b@x.com" } // Inconsistent!
{ "_id": "u3", "name": "Charlie" } // Missing email!
This isn't schema flexibility—it's schema chaos. A relational table with well-defined columns would prevent these inconsistencies.
Ask: 'If I wrote a SQL schema for this, would 80%+ of columns be either always-null or used differently for different record types?' If yes, document databases make sense. If no, you probably want a relational schema with good modeling.
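Even when a document store is the right choice, teams typically guard against the schema-chaos failure mode above with an application-level check before writes. A minimal sketch, assuming a hypothetical `validate_user` helper and the core user fields shown earlier:

```python
# Hypothetical application-level schema check; field names match the
# user documents in the example above.
REQUIRED_USER_FIELDS = {"_id": str, "name": str, "email": str}

def validate_user(doc):
    """Return a list of problems; empty list means the document is OK."""
    errors = []
    for field, ftype in REQUIRED_USER_FIELDS.items():
        if field not in doc:
            errors.append(f"missing field: {field}")
        elif not isinstance(doc[field], ftype):
            errors.append(f"wrong type for {field}")
    # Catch case-variant duplicates like "Name" vs "name"
    lowered = [k.lower() for k in doc]
    if len(set(lowered)) != len(lowered):
        errors.append("case-variant duplicate keys")
    return errors

ok = validate_user({"_id": "u1", "name": "Alice", "email": "a@x.com"})
bad = validate_user({"_id": "u3", "name": "Charlie"})  # missing email
```

MongoDB also supports server-side JSON Schema validation per collection; the point is the same either way: flexibility should be a deliberate choice per field, not the absence of any contract.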
Key-value stores like Redis and Memcached deliver sub-millisecond responses through in-memory storage and minimal query overhead. When access speed is the primary concern and data fits specific patterns, these stores are unbeatable.
Where Key-Value Stores Excel:
# Session storage with expiration
SET session:abc123 '{"user_id":1001,"role":"admin"}' EX 3600

# Rate limiting (fixed window per minute)
INCR rate:user:1001:minute:202401151530
EXPIRE rate:user:1001:minute:202401151530 60

# Caching with cache-aside pattern
GET cache:user:1001:profile  # Check cache first
# If miss, load from DB and cache
SET cache:user:1001:profile '{...}' EX 300

# Leaderboard with sorted set
ZADD leaderboard:global 15000 "player:1001"
ZADD leaderboard:global 14500 "player:1002"
ZREVRANGE leaderboard:global 0 9 WITHSCORES  # Top 10

# Real-time counters
INCR stats:page_views:2024-01-15
PFADD unique_visitors:2024-01-15 "user:1001"  # HyperLogLog

# Pub/sub for real-time updates
PUBLISH chat:room:42 '{"sender":"alice","message":"Hello!"}'

# Simple queue
LPUSH queue:emails '{"to":"user@example.com","subject":"..."}'
BRPOP queue:emails 30  # Blocking pop with timeout
Performance Comparison:
| Operation | PostgreSQL | Redis (in-memory) |
|---|---|---|
| Simple key lookup | ~1-5ms | ~0.1-0.5ms |
| Increment counter | ~2-10ms | ~0.1ms |
| Write + read | ~5-20ms | ~0.2-0.5ms |
| Batch 100 reads | ~10-50ms | ~1-2ms (pipelining) |
Although Redis offers persistence options, it is optimized for in-memory workloads. Use it alongside a primary database (SQL or NoSQL), not as a replacement. Data in Redis should be rebuildable from the primary store.
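The cache-aside pattern from the Redis commands above can be sketched in-process. This is a toy model, with a dict standing in for Redis and a hypothetical `load_profile_from_db` standing in for the primary database:

```python
import time

cache = {}  # key -> (value, expires_at); stands in for Redis

def cache_get(key):
    entry = cache.get(key)
    if entry and entry[1] > time.monotonic():
        return entry[0]
    cache.pop(key, None)  # drop expired or missing entry
    return None

def cache_set(key, value, ttl_seconds):
    cache[key] = (value, time.monotonic() + ttl_seconds)

def load_profile_from_db(user_id):
    # Stand-in for the (slow) primary-database query
    return {"user_id": user_id, "role": "admin"}

def get_profile(user_id):
    key = f"cache:user:{user_id}:profile"
    profile = cache_get(key)                      # 1. check cache first
    if profile is None:
        profile = load_profile_from_db(user_id)   # 2. miss: go to primary DB
        cache_set(key, profile, ttl_seconds=300)  # 3. populate with a TTL
    return profile
```

The TTL is the safety valve: even if invalidation logic is wrong, stale entries age out, and the primary store remains the source of truth.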
Time-series data—metrics, logs, IoT sensor readings, financial ticks—has characteristics that don't fit well in traditional relational models. Wide-column and specialized time-series databases handle these workloads more efficiently.
Time-Series Data Characteristics:
- Writes are append-only and arrive in rough time order
- Recent data is queried constantly; older data is rarely touched
- Queries aggregate over time ranges rather than fetching individual rows
- Old data expires in bulk after a retention window
Why SQL Struggles with Time-Series:
-- SQL table for metrics
CREATE TABLE metrics (
id BIGINT PRIMARY KEY,
metric_name VARCHAR(100),
timestamp TIMESTAMP,
    value DOUBLE PRECISION,
tags JSONB
);
-- Problem 1: Massive index overhead
-- Every insert updates B-tree indexes
-- Problem 2: Inefficient time-range queries
SELECT * FROM metrics
WHERE metric_name = 'cpu_usage'
AND timestamp BETWEEN '2024-01-15' AND '2024-01-16';
-- Scans index, then random I/O to fetch rows
-- Problem 3: Retention requires DELETE
DELETE FROM metrics WHERE timestamp < NOW() - INTERVAL '30 days';
-- Slow, creates vacuum pressure, fragmentation
Time-Series Database Optimizations:
- Time-partitioned storage (chunks), so retention drops whole partitions instead of deleting rows
- Columnar layout and compression tuned for timestamped values
- Append-optimized write paths instead of per-insert B-tree updates
- Built-in downsampling and retention policies
| Database | Type | Best For |
|---|---|---|
| InfluxDB | Purpose-built time-series | Metrics, IoT, monitoring |
| TimescaleDB | PostgreSQL extension | SQL + time-series, hybrid needs |
| Prometheus | Metrics collection | Kubernetes/container metrics |
| Cassandra | Wide-column | Massive scale time-series |
| ClickHouse | Columnar analytics | Log analysis, real-time analytics |
| QuestDB | High-performance TSDB | Financial ticks, low latency |
TimescaleDB is a PostgreSQL extension that adds time-series optimizations while keeping SQL. If you need time-series features but want to stay in the PostgreSQL ecosystem, it's an excellent middle ground—you get chunks, compression, and retention policies with familiar SQL.
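The chunk-based retention these databases use can be sketched as a simplified model: bucket points by day, then expire old data by dropping whole buckets, avoiding the slow row-by-row DELETE shown earlier:

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Each chunk holds one day of points. Dropping a chunk is O(1) per chunk,
# unlike a DELETE that touches every expired row in a B-tree table.
chunks = defaultdict(list)  # date -> list of (timestamp, value)

def write_point(ts, value):
    chunks[ts.date()].append((ts, value))

def apply_retention(now, days=30):
    cutoff = (now - timedelta(days=days)).date()
    for day in [d for d in chunks if d < cutoff]:
        del chunks[day]  # whole-chunk drop: no vacuum, no fragmentation

now = datetime(2024, 1, 15)
write_point(now, 0.42)                        # recent point: kept
write_point(now - timedelta(days=45), 0.37)   # past retention: dropped
apply_retention(now)
```

Real time-series databases add compression and indexing per chunk, but the retention story is essentially this: expiry becomes a metadata operation, not a table scan.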
Graph databases shine when relationships are the primary query target—not just a means to join data, but the actual answer you're seeking. If your questions are 'how is A connected to B?' or 'what's within 3 hops of C?', graph databases outperform relational by orders of magnitude.
Graph Database Use Cases:
- Social networks (friends, followers, mutual connections)
- Recommendation engines ("people who bought X also bought...")
- Fraud detection (rings of accounts sharing devices, cards, or addresses)
- Knowledge graphs and dependency analysis
Performance Comparison for Graph Queries:
Consider: 'Find all friends-of-friends of user Alice'
-- SQL approach for friends-of-friends
SELECT DISTINCT f2.friend_id
FROM friendships f1
JOIN friendships f2 ON f1.friend_id = f2.user_id
WHERE f1.user_id = 'alice'
  AND f2.friend_id != 'alice';

-- With 1M users, 100 avg friends each:
-- f1: Fetch 100 friends of Alice
-- f2: For each friend, index scan to find THEIR friends
-- Total: ~100 index lookups, joining potentially 10,000 results
-- Time: 50-500ms depending on indexes and data locality

-- For 3 hops (friends of friends of friends):
SELECT DISTINCT f3.friend_id
FROM friendships f1
JOIN friendships f2 ON f1.friend_id = f2.user_id
JOIN friendships f3 ON f2.friend_id = f3.user_id
WHERE f1.user_id = 'alice'
  AND f3.friend_id != 'alice';

-- Now we're looking at millions of intermediate rows
-- Time: seconds to minutes

-- Graph database (Neo4j Cypher):
-- 2 hops
MATCH (alice:Person {name: 'Alice'})-[:FRIEND*2]-(fof:Person)
WHERE alice <> fof
RETURN DISTINCT fof

-- 3 hops
MATCH (alice:Person {name: 'Alice'})-[:FRIEND*3]-(fof:Person)
WHERE alice <> fof
RETURN DISTINCT fof

-- Graph DB: Direct pointer traversal, no index lookups per hop
-- Time: milliseconds for 2-3 hops, even with millions of users
Key Insight:
In SQL, each additional hop adds a join whose cost grows with the size of the friendships table. In graph databases, queries scale with the size of the result—the number of relationships actually traversed—not with the total database size. If Alice has 100 friends, finding friends-of-friends takes roughly the same time whether the database has 1,000 users or 1 billion.
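The scaling behavior can be illustrated with a plain adjacency-list traversal, a toy model of what a graph engine does with direct pointers (the sample graph is invented for illustration):

```python
def friends_of_friends(graph, start, hops=2):
    """graph: dict mapping user -> set of friends.
    Work grows with the edges visited, not with total graph size --
    the property graph databases exploit."""
    frontier, seen = {start}, {start}
    for _ in range(hops):
        # Expand one hop, skipping anyone already reached
        frontier = {f for u in frontier for f in graph.get(u, set())} - seen
        seen |= frontier
    return frontier

graph = {
    "alice": {"bob", "carol"},
    "bob":   {"alice", "dave"},
    "carol": {"alice", "erin"},
    "dave":  {"bob"},
    "erin":  {"carol"},
}
fof = friends_of_friends(graph, "alice")  # == {"dave", "erin"}
```

Adding a million unrelated users to `graph` would not change the number of set operations performed here, whereas the SQL joins above would see every index grow.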
Graph databases aren't for everything with relationships. Standard one-to-many (user has orders, orders have items) is perfectly served by SQL. Graphs shine for variable-depth traversals, path-finding, and pattern matching across relationships—not for simple joins.
Document databases are ideal when your domain naturally consists of self-contained documents that are typically read and written as units. If you find yourself constantly fetching an entity with all its related data, documents may reduce complexity.
Document-Friendly Patterns:
- Self-contained entities read and written as a unit (orders, posts, profiles)
- Hierarchical data that would otherwise span many joined tables
- Denormalized snapshots (prices, addresses) frozen at write time
- Per-type structural variation, as in the product catalog earlier
The Document Boundary Decision:
The key design question is: What's the natural unit of read/write?
// GOOD Document Boundary: Blog Post with Comments
// - Posts are always displayed with their comments
// - Comments only matter in context of their post
// - Atomic operations: add comment, update post
{
  "_id": "post_12345",
  "title": "Introduction to MongoDB",
  "content": "...",
  "author": { "id": "user_001", "name": "Alice" },  // Embedded
  "comments": [
    { "author": "bob", "text": "Great article!", "date": "..." },
    { "author": "carol", "text": "Very helpful", "date": "..." }
  ],
  "tags": ["mongodb", "nosql", "tutorial"]
}

// BAD Document Boundary: Same structure for heavy commenting
// - Thousands of comments per post
// - Document grows without bound (16MB limit)
// - Can't query individual comments efficiently
// - Better: Store comments in separate collection with post_id reference

// GOOD Document Boundary: E-commerce Order
// - Order displayed with all line items
// - Snapshot of prices at order time (denormalized)
// - Rarely updated after creation
{
  "_id": "order_99999",
  "customer_id": "cust_555",
  "status": "shipped",
  "items": [
    { "product_id": "p1", "name": "Widget", "qty": 2, "price": 19.99 },
    { "product_id": "p2", "name": "Gadget", "qty": 1, "price": 49.99 }
  ],
  "shipping_address": { ... },
  "total": 89.97
}

// BAD Document Boundary: Inventory in Order
// - Inventory changes frequently, not per-order
// - Would need to update thousands of orders on price change
// - Better: Reference product_id, keep inventory separate

Embed when data is read together and updates are rare. Reference when data is shared across documents, grows unbounded, or updates frequently. Getting this wrong leads to either massive duplication or N+1 query problems.
Some systems prioritize availability over consistency—it's better to return potentially stale data than to refuse a request. If your application can tolerate temporary inconsistencies but cannot tolerate downtime, AP (Availability + Partition tolerance) systems like Cassandra or DynamoDB may be the right choice.
When Eventual Consistency Is Acceptable:
| Use Case | Consistency Need | Suitable Model |
|---|---|---|
| Social media feed | Eventual (delay ok) | AP / NoSQL |
| View counts | Eventual (approximate ok) | AP / NoSQL |
| Shopping cart | Read-your-writes | Either with care |
| User preferences | Eventual (low contention) | AP / NoSQL |
| Financial transactions | Strong (must be exact) | CP / SQL |
| Inventory counts | Strong (avoid oversell) | CP / SQL |
| Session data | Read-your-writes | Either |
| Audit logs | Eventual (append-only) | AP / NoSQL |
Cassandra and DynamoDB: Designed for Availability:
These databases replicate data across multiple nodes and continue operating even when nodes fail:
Cassandra with RF=3 (replication factor 3):
- Data written to 3 nodes
- Can lose 1 node and still have full availability
- Can lose 2 nodes and still serve reads (at CL=ONE)
DynamoDB:
- Automatically replicates across availability zones
- Global tables replicate across regions
- Continues serving even during regional outages
Consistency Levels Trade-off:
Modern NoSQL databases let you tune consistency per-operation. Use strong consistency for critical operations (order placement) and eventual consistency for non-critical reads (product recommendations). You don't have to pick one model for everything.
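Per-operation tuning works because of quorum arithmetic: with replication factor N, a read of R replicas is guaranteed to overlap a write acknowledged by W replicas whenever R + W > N. A toy sketch (class and key names hypothetical, not any real client API):

```python
class QuorumStore:
    """Toy replicated store: N replicas, per-operation read/write quorums."""

    def __init__(self, n=3):
        self.replicas = [dict() for _ in range(n)]  # each replica: key -> (version, value)
        self.n = n

    def write(self, key, value, version, w):
        # Acknowledge after w replicas; the remaining replicas lag behind
        for replica in self.replicas[:w]:
            replica[key] = (version, value)

    def read(self, key, r):
        # Read the LAST r replicas -- the ones most likely to have lagged --
        # and return the newest version seen
        hits = [rep[key] for rep in self.replicas[-r:] if key in rep]
        return max(hits)[1] if hits else None

store = QuorumStore(n=3)
store.write("order:1", "placed", version=1, w=2)
strong = store.read("order:1", r=2)  # R + W = 4 > 3: overlap guaranteed
stale = store.read("order:1", r=1)   # R + W = 3: may miss the write
```

Cassandra's QUORUM consistency level and DynamoDB's `ConsistentRead` flag are production versions of the same dial: pay extra replica round-trips only on the operations that need the guarantee.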
Let's consolidate into a decision framework. NoSQL wins when specific requirements align with its strengths—not as a general replacement for SQL.
Choose Key-Value Stores When:
- Access is always by known key, with no ad hoc queries
- Sub-millisecond latency matters (caching, sessions, counters, leaderboards)
- Data is rebuildable from a primary store
Choose Document Stores When:
- Entities are self-contained and read/written as units
- Structure genuinely varies per record type
- Denormalized snapshots fit the domain better than joins
Choose Wide-Column Stores When:
- Write throughput is massive and append-heavy (events, time-series)
- Access patterns are known up front and key-driven
- Availability matters more than strong consistency
Choose Graph Databases When:
- Relationships are the query target, not just a join path
- Queries involve variable-depth traversal or path-finding
- Pattern matching across connections drives the product
The most common pattern in mature systems is SQL + selective NoSQL: PostgreSQL for core transactional data, Redis for caching and real-time features, Elasticsearch for search, maybe Cassandra for massive event logs. NoSQL extends capabilities; it rarely replaces them entirely.
We've covered the scenarios where NoSQL databases genuinely outperform relational alternatives. Let's consolidate the key takeaways:
- Genuine horizontal scale favors NoSQL; most workloads still fit a well-tuned single SQL node
- Schema flexibility is for truly variant data, not an excuse to skip modeling
- Key-value stores win on raw latency; time-series and graph workloads have purpose-built stores
- Eventual consistency is a feature, not a flaw, when availability outranks exactness
- Mature systems pair SQL with selective NoSQL rather than replacing it outright
What's Next:
Now that we understand when to choose SQL and when to choose NoSQL, we'll explore Polyglot Persistence—the practice of using multiple database types within a single system, each optimized for its specific use case.
You now have decision frameworks for identifying NoSQL-appropriate use cases. This knowledge helps you choose the right database for the right job rather than following trends. Next, we'll see how these choices combine in real-world polyglot architectures.