Data model fit tells you which NoSQL category aligns with your data's shape. But selecting the right database requires a second lens: query pattern analysis. Two applications might have identical data models but radically different query requirements. A user profile system with fast point lookups has different needs than one requiring complex aggregations across millions of profiles.
The database that perfectly matches your data's shape may falter when faced with query patterns it wasn't optimized for. Cassandra handles time-series data brilliantly—unless your queries need ad-hoc aggregations across arbitrary time ranges. MongoDB models documents naturally—unless you need sub-millisecond full-text search across millions of documents.
Understanding query patterns is the second essential filter in database selection. This page provides a systematic framework for analyzing your application's query requirements and matching them to database capabilities.
By the end of this page, you will understand the major categories of query patterns, how each NoSQL database type handles (or fails to handle) each pattern, and how to create a query profile that guides database selection. You'll learn to see beyond 'this database stores my data' to 'this database answers my questions efficiently.'
Every database operation falls into one of several fundamental query pattern categories. Different databases excel at different patterns—understanding this taxonomy is essential for informed selection.
The Seven Fundamental Query Patterns:
| Pattern | Description | Example Query | Optimization Priority |
|---|---|---|---|
| Point Lookup | Retrieve a single entity by unique identifier | Get user by ID | Latency (sub-ms) |
| Range Scan | Retrieve multiple entities within a bounded range | Orders in date range | Throughput, sequential I/O |
| Secondary Index Lookup | Find entities by non-primary attributes | Users by email | Index maintenance cost |
| Aggregation | Compute derived values across entities | Total sales by region | Compute, memory |
| Full-Text Search | Find documents by text content | Articles containing 'kubernetes' | Index size, relevance |
| Graph Traversal | Navigate relationships between entities | Friends of friends | Relationship locality |
| Join/Reference Resolution | Combine data from multiple entities | Orders with customer details | Data locality, round trips |
Why Pattern Classification Matters:
Each database architecture makes fundamental trade-offs that favor certain patterns over others.
The database that's 'best' depends entirely on which patterns dominate your workload. A workload with 90% point lookups and 10% aggregations should choose differently than one with 50% aggregations and 50% range scans.
Before selecting a database, profile your expected query distribution. What percentage of operations fall into each pattern? Which patterns are latency-sensitive? Which can tolerate higher latency? This workload profile guides database selection more than any abstract capability comparison.
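The profiling step above can be sketched as a simple weighted scoring exercise. The pattern weights and per-database fit scores below are illustrative assumptions for the sake of the example, not benchmark data:

```javascript
// Workload profile: fraction of operations in each query pattern.
const workload = {
  pointLookup: 0.6,
  rangeScan: 0.25,
  aggregation: 0.15,
};

// Hypothetical fit scores per pattern (0 = poor, 1 = excellent).
const fitScores = {
  redis:     { pointLookup: 1.0, rangeScan: 0.4, aggregation: 0.1 },
  mongodb:   { pointLookup: 0.8, rangeScan: 0.7, aggregation: 0.7 },
  cassandra: { pointLookup: 0.9, rangeScan: 0.9, aggregation: 0.2 },
};

// Weighted score: how well a database's strengths line up with the workload.
function scoreDatabase(profile, fits) {
  return Object.entries(profile)
    .reduce((sum, [pattern, weight]) => sum + weight * (fits[pattern] ?? 0), 0);
}

// Rank candidates for this workload, best first.
const ranked = Object.entries(fitScores)
  .map(([name, fits]) => [name, scoreDatabase(workload, fits)])
  .sort((a, b) => b[1] - a[1]);
```

The numbers matter less than the exercise: forcing yourself to write down the weights exposes which patterns actually dominate.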
Point lookup—retrieving a single entity by its unique identifier—is the most fundamental database operation. Every database supports it, but performance characteristics vary significantly.
How Different NoSQL Types Handle Point Lookups:
| Database Type | Typical Latency | Mechanism | Scaling Behavior |
|---|---|---|---|
| Key-Value (Redis) | < 0.5ms | In-memory hash table | O(1), memory-bound |
| Key-Value (DynamoDB) | < 5ms | Partitioned B-tree, SSD | O(1), partition-aware |
| Document (MongoDB) | < 5ms | B-tree index on _id | O(log n) per shard |
| Wide-Column (Cassandra) | < 10ms | SSTable + memtable + bloom filter | O(1), partition-key based |
| Graph (Neo4j) | < 5ms | Native ID lookup | O(1) for ID, varies for property lookup |
Point Lookup Optimization Strategies:
1. Key Design for Locality
In distributed databases, key design determines which node handles a lookup. Well-designed keys co-locate related data:
// DynamoDB - Partition key determines the node
{
  PK: "USER#user123",
  SK: "PROFILE"
}

// Cassandra - Partition key for node placement, clustering key for sort order
PRIMARY KEY ((user_id), created_at)
2. Caching Hot Keys
Point lookups on frequently accessed entities benefit enormously from caching:
// Pattern: Cache-aside for hot entities
async function getUser(userId) {
  let user = await cache.get(`user:${userId}`);
  if (!user) {
    user = await database.findById(userId);
    await cache.set(`user:${userId}`, user, { ttl: 300 }); // cache for 5 minutes
  }
  return user;
}
3. Batch Lookups
When you need multiple point lookups, batch them to reduce round trips:
// Bad: N sequential lookups
const users = [];
for (const id of userIds) {
  users.push(await db.findById(id)); // N round trips
}

// Good: Single batch lookup
const users = await db.findByIds(userIds); // 1 round trip
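One practical wrinkle: batch APIs usually cap the number of keys per request (DynamoDB's BatchGetItem, for example, accepts at most 100 items). A sketch of client-side chunking, where `db.findByIds` is the hypothetical batch call from the example above:

```javascript
// Split a list of IDs into batches that respect the backend's per-request limit.
function chunk(items, size) {
  const chunks = [];
  for (let i = 0; i < items.length; i += size) {
    chunks.push(items.slice(i, i + size));
  }
  return chunks;
}

// Issue the batches in parallel; each batch is one round trip.
async function batchGet(db, ids, batchSize = 100) {
  const batches = chunk(ids, batchSize);
  const results = await Promise.all(batches.map((b) => db.findByIds(b)));
  return results.flat();
}
```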
If your workload is 80%+ point lookups with simple values, key-value stores are nearly optimal. The additional features of document stores (secondary indexes, queries) and relational databases (joins, transactions) add overhead you don't need. Redis, Memcached, or DynamoDB will outperform more feature-rich databases for pure key-based access.
Range scans retrieve multiple entities within a defined boundary—typically time-based or sorted by some ordering key. This pattern is fundamental to analytics, audit logs, time-series data, and any scenario where you need 'all entities between X and Y.'
Range Scan Design Patterns:
1. Partition-Aware Range Queries
In distributed databases, range scans that span partitions require scatter-gather—queries to multiple nodes. Design partitioning to keep common ranges on single partitions:
-- Cassandra: Partition by user, cluster by time
CREATE TABLE user_events (
  user_id UUID,
  event_time TIMESTAMP,
  event_type TEXT,
  payload BLOB,
  PRIMARY KEY ((user_id), event_time)
) WITH CLUSTERING ORDER BY (event_time DESC);

-- Efficient: range within a single partition
SELECT * FROM user_events
WHERE user_id = ?
  AND event_time > '2024-01-01' AND event_time < '2024-01-31';

-- Inefficient: cross-partition range (full cluster scan)
SELECT * FROM user_events
WHERE event_time > '2024-01-01'; -- No partition key!
2. Time-Bucketing
For time-series data, bucket partition keys by time period to bound partition growth and enable time-range deletes:
-- Partition key includes a date bucket
PRIMARY KEY ((sensor_id, date_bucket), timestamp)

-- Buckets limit partition size and enable efficient purging:
-- date_bucket = '2024-01'  (monthly bucket)
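A small helper can derive the bucket component of the partition key at write time; the `YYYY-MM` format matches the monthly bucket above, and the function names are illustrative:

```javascript
// Derive a monthly date_bucket ('YYYY-MM') from an event timestamp (UTC).
function monthBucket(date) {
  const y = date.getUTCFullYear();
  const m = String(date.getUTCMonth() + 1).padStart(2, '0');
  return `${y}-${m}`;
}

// Build the composite partition key (sensor_id, date_bucket) for a write.
function partitionKey(sensorId, timestamp) {
  return { sensor_id: sensorId, date_bucket: monthBucket(timestamp) };
}
```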
3. Pagination for Large Ranges
Never return unbounded result sets. Use pagination tokens:
// Server: Return a page with a continuation token
{
  "data": [...100 items...],
  "nextToken": "eyJsYXN0X2lkIjoiYWJjMTIzIn0="
}

// Client: Request the next page
GET /events?after=eyJsYXN0X2lkIjoiYWJjMTIzIn0=
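The `nextToken` above is an opaque cursor — one common convention, assumed here, is base64-encoded JSON. A minimal Node.js sketch (the `{ last_id }` cursor shape is an assumption):

```javascript
// Encode a cursor object into an opaque continuation token.
function encodeToken(cursor) {
  return Buffer.from(JSON.stringify(cursor)).toString('base64');
}

// Decode a continuation token back into a cursor object.
function decodeToken(token) {
  return JSON.parse(Buffer.from(token, 'base64').toString('utf8'));
}
```

Keeping the token opaque lets the server change the cursor format (last ID, sort key, shard hint) without breaking clients.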
Range scans concentrated on recent data create hot partitions. If all queries target 'last 24 hours' for a time-series partitioned by day, today's partition receives all traffic. Strategies include sub-partitioning by additional dimensions (user_id + date) or write-time randomization with read-time scatter-gather.
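The write-time randomization idea can be sketched as key salting: writers append a random salt so load spreads across N sub-partitions, and readers enumerate every salt (the scatter-gather side). The key format and salt count are illustrative:

```javascript
// Number of sub-partitions a hot bucket is spread across.
const SALT_COUNT = 8;

// Write side: pick a random salt so writes spread across sub-partitions.
function saltedKey(dateBucket) {
  const salt = Math.floor(Math.random() * SALT_COUNT);
  return `${dateBucket}#${salt}`;
}

// Read side: enumerate every salted key for the bucket (scatter-gather).
function keysForBucket(dateBucket) {
  return Array.from({ length: SALT_COUNT }, (_, i) => `${dateBucket}#${i}`);
}
```

The trade-off is explicit: writes get N times the headroom, reads pay N queries per bucket.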
Applications rarely access data only by primary key. 'Find user by email,' 'get orders by status,' 'list products by category'—these queries require secondary indexes (DynamoDB's term for the global variant is Global Secondary Index, or GSI).
Secondary indexes are the first point where NoSQL databases diverge significantly from relational databases. SQL databases have decades of index optimization; NoSQL databases implement secondary indexes with varying degrees of maturity and different trade-offs.
Secondary Index Trade-offs:
| Factor | Consideration |
|---|---|
| Write amplification | Each write may update multiple indexes, increasing write latency and cost |
| Consistency | Indexes may lag behind primary data (eventual consistency) |
| Query capability | What index types are supported? Unique? Sparse? Compound? |
| Flexibility | Can you add/drop indexes dynamically without downtime? |
| Cost model | Some cloud databases charge per index write (DynamoDB charges for GSI writes) |
Secondary Index Support by Database Type:
| Database | Secondary Index Support | Trade-offs |
|---|---|---|
| Redis | Sorted sets, manual indexing | No automatic indexes; application manages consistency |
| DynamoDB | GSIs (projected), LSIs (within partition) | GSI writes charged separately; eventual consistency |
| MongoDB | Rich indexes (single, compound, multikey, text, geo) | Strong support; index overhead on writes |
| Cassandra | Secondary indexes, materialized views | Limited utility; often requires denormalization instead |
| Neo4j | Node property indexes, full-text indexes | Good for property lookup; relationship-based access preferred |
Secondary Index Design Patterns:
1. Denormalization Over Indexes (Wide-Column)
In Cassandra and similar wide-column stores, secondary indexes perform poorly at scale. The preferred pattern is query-driven denormalization—maintaining multiple tables, each optimized for a specific query:
-- Instead of one table with a secondary index on email:
--   users: PRIMARY KEY (user_id) + secondary index on email
-- Create two tables:
CREATE TABLE users_by_id (
  user_id UUID PRIMARY KEY,
  email TEXT,
  name TEXT
);

CREATE TABLE users_by_email (
  email TEXT PRIMARY KEY,
  user_id UUID,
  name TEXT
);

-- Application maintains both on write
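The dual write noted in the last comment might look like this on the application side; `db.batch` is a hypothetical client API standing in for a Cassandra logged BATCH:

```javascript
// Build the rows for both query-specific tables from one user object.
function userRows(user) {
  return [
    {
      table: 'users_by_id',
      row: { user_id: user.userId, email: user.email, name: user.name },
    },
    {
      table: 'users_by_email',
      row: { email: user.email, user_id: user.userId, name: user.name },
    },
  ];
}

// One logged batch keeps both tables in step on every write.
async function createUser(db, user) {
  await db.batch(userRows(user));
}
```

The cost of query-driven denormalization lives here: every new query table adds another row to this function, and the application owns their consistency.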
2. Sparse Indexes for Rare Attributes
MongoDB sparse indexes include only documents where the indexed field exists:
// Only index documents that have a 'premium_until' field
db.users.createIndex(
  { premium_until: 1 },
  { sparse: true } // Documents without the field are excluded from the index
);
3. Covered Queries (Projecting from Index)
Include queried fields in the index to avoid entity lookup:
// DynamoDB GSI with projection
{
  IndexName: "email-index",
  KeySchema: [{ AttributeName: "email", KeyType: "HASH" }],
  Projection: {
    ProjectionType: "INCLUDE",
    NonKeyAttributes: ["name", "status"] // Projected so these queries are covered
  }
}
If your query patterns require many different secondary indexes, flexible ad-hoc queries, or complex filtering, you may have outgrown key-value and wide-column models. Document databases handle moderate indexing needs; for truly flexible querying, consider adding a search engine (Elasticsearch) or using a relational database for that data subset.
Aggregation queries compute derived values from multiple entities: sums, counts, averages, groupings, and more complex statistical functions. This query pattern is where NoSQL databases diverge most dramatically from relational databases.
Why NoSQL Struggles with Aggregation:
NoSQL databases distribute data across nodes for scalability. Aggregation requires collecting data from all nodes where relevant entities reside:
[Node 1] → partial aggregate 1 ─┐
[Node 2] → partial aggregate 2 ─┼─→ [Coordinator] → final result
[Node 3] → partial aggregate 3 ─┘
This scatter-gather pattern adds latency and network overhead. Relational databases on single nodes can aggregate directly; distributed NoSQL databases pay coordination costs.
| Database Type | Aggregation Capability | Use Case Fit |
|---|---|---|
| Key-Value | None native; application-level or external | Not suitable for aggregation workloads |
| Document (MongoDB) | Aggregation pipeline (powerful) | Moderate aggregation; real-time analytics |
| Wide-Column (Cassandra) | Limited; counter tables, client-side aggregation | Pre-aggregated rollups; not ad-hoc aggregation |
| Time-Series (InfluxDB) | Built-in grouping, windowing, downsampling | Excellent for time-based aggregations |
| Graph (Neo4j) | Aggregation over traversal results | Limited; relationship-centric, not entity aggregation |
Aggregation Design Patterns:
1. Pre-Computed Aggregates (Write-Time Rollups)
Compute aggregates at write time, not query time:
// Instead of: SELECT COUNT(*) FROM orders WHERE date = '2024-01-15'
// Maintain a counter document:
{
  _id: "order-count:2024-01-15",
  count: 1547 // Incremented on each order insert
}

// On order insert:
db.counters.updateOne(
  { _id: `order-count:${orderDate}` },
  { $inc: { count: 1 } },
  { upsert: true }
);
Trade-off: Write complexity increases; queries become O(1) lookups.
2. Materialized Views (Wide-Column)
Maintain pre-aggregated tables that update on write:
CREATE TABLE daily_sales_by_product (
  product_id UUID,
  date DATE,
  total_quantity COUNTER,
  total_revenue_cents COUNTER,  -- counters are bigint; store revenue in cents
  PRIMARY KEY ((product_id, date))
);

-- Application maintains on each sale
-- (in CQL, increment syntax works only on counter columns)
UPDATE daily_sales_by_product
SET total_quantity = total_quantity + ?, total_revenue_cents = total_revenue_cents + ?
WHERE product_id = ? AND date = ?;
3. Lambda Architecture (Batch + Real-Time)
For complex analytics, use a dual-path architecture:
                     ┌──→ [Spark Streaming] ──→ [Real-time aggregates] ──┐
Events ──→ [Kafka] ──┤                                                   ├──→ [Query API] (merge)
                     └──→ [Spark Batch] ──────→ [Historical aggregates] ─┘
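At the serving layer, the two paths are merged. One common convention — assumed here — is that batch results are authoritative for the windows they cover, with real-time results filling in windows the batch job hasn't reached yet:

```javascript
// Merge per-window aggregates from both paths of a Lambda architecture.
// Batch values win where present; real-time fills the remaining windows.
function mergeAggregates(batch, realtime) {
  return { ...realtime, ...batch };
}
```

Spreading `batch` last means any window present in both maps takes its batch value, which corrects real-time drift once the batch job catches up.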
If aggregation queries are central to your application (dashboards, analytics, reporting), NoSQL databases alone may be insufficient. Consider a dedicated analytics store: ClickHouse for real-time analytics, BigQuery/Redshift for complex batch analytics, or TimescaleDB if aggregations are time-centric. Polyglot persistence—operational database + analytics database—is often the right answer.
Full-text search—finding documents by word or phrase within text content—is a specialized query pattern that most NoSQL databases handle poorly as a primary capability.
NoSQL Full-Text Search Capabilities:
| Database | Full-Text Capability | Appropriate For |
|---|---|---|
| MongoDB | Text indexes with $text operator | Basic search; not production-grade for search apps |
| Redis (RediSearch) | Full-text module with indexing | Simple search with low latency; limited analytics |
| Cassandra | SASI indexes (limited) | Very basic; not recommended for real search |
| Neo4j | Full-text indexes on properties | Supplementary search; graph traversal remains primary |
| Elasticsearch | Native full-text (purpose-built) | Production search applications; complex relevance |
Full-Text Search Architecture Patterns:
1. Dedicated Search Engine (Recommended)
Treat search as a separate concern with its own specialized database:
┌─────────────────────────────────────────────────────────────┐
│ Application Layer │
├──────────────────┬─────────────────┬────────────────────────┤
│ CRUD Writes │ Point Lookups │ Search Queries │
│ ↓ │ ↓ │ ↓ │
│ ┌──────────┐ │ ┌────────┐ │ ┌──────────────┐ │
│ │ MongoDB │────┼───│MongoDB │ │ │Elasticsearch │ │
│ │ (primary)│ │ │ reads │ │ │ (search) │ │
│ └────┬─────┘ │ └────────┘ │ └──────────────┘ │
│ │ │ │ ↑ │
│ └──────────┼─────────────────┼──────────┘ │
│ │ Sync via CDC or Change Streams │
└──────────────────┴─────────────────┴────────────────────────┘
2. Change Data Capture for Sync
Keep search indexes synchronized via CDC:
// MongoDB Change Stream → Elasticsearch sync
// fullDocument: 'updateLookup' makes update events carry the full document
const changeStream = collection.watch([], { fullDocument: 'updateLookup' });

changeStream.on('change', async (change) => {
  if (change.operationType === 'insert' || change.operationType === 'update') {
    await esClient.index({
      index: 'products',
      id: change.documentKey._id,
      body: transform(change.fullDocument)
    });
  } else if (change.operationType === 'delete') {
    await esClient.delete({
      index: 'products',
      id: change.documentKey._id
    });
  }
});
3. Search-Primary for Content Applications
For content-heavy applications (CMS, e-commerce catalogs), Elasticsearch can be the primary store:
Writes → Elasticsearch (primary) → Async backup to cold storage
Reads → Elasticsearch
Search → Elasticsearch
// Use Elasticsearch features for all access patterns
// Accept eventual consistency for writes
// No secondary sync needed
This simplifies architecture but requires accepting Elasticsearch's trade-offs (eventual consistency, manual reindexing on schema changes).
MongoDB's text indexes handle basic search adequately for internal tools, admin panels, and low-volume search features. If search quality isn't critical and you're searching thousands (not millions) of documents without complex relevance requirements, MongoDB text search avoids the complexity of a separate search cluster.
Graph traversal queries navigate relationships between entities: finding paths, discovering connected components, computing relationship-based scores. This pattern is fundamentally different from entity-centric queries and requires specialized handling.
Graph Traversal Query Types:
| Pattern | Description | Complexity | Example |
|---|---|---|---|
| Direct neighbors | 1-hop connected nodes | O(degree) | Direct friends of User A |
| Variable-depth traversal | N-hop connected nodes | O(branching^depth) | Friends within 3 degrees |
| Shortest path | Minimum hops between nodes | O(V + E) | Connection path between users |
| Pattern matching | Subgraphs matching structure | Expensive | Users who follow accounts that follow X |
| Centrality/PageRank | Node importance scoring | Iterative | Most influential accounts |
Handling Graph Patterns in Non-Graph Databases:
1. Adjacency List Pattern (Document Store)
Store connections as arrays within documents:
// User document with embedded connections
{
_id: "user-123",
name: "Alice",
following: ["user-456", "user-789", "user-012"],
followers: ["user-333", "user-444"]
}
// Query: Find who Alice follows
db.users.findOne({ _id: "user-123" }).following
// Query: Find mutual followers (2 queries + app logic)
Works for shallow traversals but becomes expensive for multi-hop queries.
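The '2 queries + app logic' step above reduces to a client-side set intersection over the two fetched arrays:

```javascript
// Intersect two connection lists (e.g., two users' follower arrays)
// to find the mutual connections, preserving the order of the first list.
function mutual(listA, listB) {
  const setB = new Set(listB);
  return listA.filter((id) => setB.has(id));
}
```

This is cheap for a single hop; the cost compounds once each hop requires fetching and intersecting ever-larger lists.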
2. Edge Table Pattern (Relational/Wide-Column)
Store edges as rows:
CREATE TABLE follows (
  follower_id UUID,
  following_id UUID,
  created_at TIMESTAMP,
  PRIMARY KEY ((follower_id), following_id)
);

-- Query: Who does user-123 follow?
SELECT following_id FROM follows WHERE follower_id = 'user-123';

-- Multi-hop requires recursive queries or application-level iteration
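The application-level iteration can be sketched as a breadth-first expansion, where `getFollowing` stands in for the one-hop SELECT above (each hop costs another round of queries):

```javascript
// Breadth-first expansion over an edge table, up to `depth` hops.
// getFollowing(id) returns the one-hop neighbors of a node.
function multiHop(getFollowing, startId, depth) {
  let frontier = new Set([startId]);
  const seen = new Set([startId]);
  for (let hop = 0; hop < depth; hop++) {
    const next = new Set();
    for (const id of frontier) {
      for (const neighbor of getFollowing(id)) {
        if (!seen.has(neighbor)) {
          seen.add(neighbor);
          next.add(neighbor);
        }
      }
    }
    frontier = next; // each hop is another batch of edge-table queries
  }
  seen.delete(startId); // exclude the starting node from the result
  return [...seen];
}
```

The O(branching^depth) cost from the table above shows up directly here: the frontier, and with it the query count, can grow multiplicatively per hop.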
3. Graph Database for Complex Traversals (Recommended)
If multi-hop traversals are frequent and depth varies, use a graph database:
// Neo4j: Find friends-of-friends
MATCH (me:User {id: 'user-123'})-[:FOLLOWS*2]->(fof:User)
WHERE NOT (me)-[:FOLLOWS]->(fof)
RETURN DISTINCT fof.name LIMIT 10;

// Shortest path between users
MATCH path = shortestPath(
  (a:User {id: 'user-123'})-[:FOLLOWS*]-(b:User {id: 'user-456'})
)
RETURN path;
Many applications have some graph-like queries but are not graph-centric. In these cases, a hybrid approach works: use a document or relational database as the primary store, and sync a subset of data to a graph database for relationship-heavy queries. This avoids forcing all data into a graph model when only 10% of queries need graph traversal.
Query patterns are the second essential lens for database selection. Data model fit determines which database categories can represent your data; query pattern analysis determines which can efficiently answer your questions.
The key insight: no NoSQL database excels at all query patterns. Database selection requires prioritizing which patterns matter most and accepting trade-offs for secondary patterns.
What's next:
With data model fit and query patterns understood, the next page examines consistency requirements—how your application's tolerance for stale reads, write conflicts, and eventual consistency further constrains viable database choices. Consistency is where CAP theorem trade-offs become concrete architectural decisions.
You now understand how to analyze query patterns as a critical input to database selection. The next step is building a query profile for your application—cataloging which patterns dominate your workload and which databases efficiently support them. This analysis, combined with data model fit, narrows the field to databases that genuinely match your requirements.