Data model fit tells you which NoSQL category aligns with your data's shape. But selecting the right database requires a second lens: query pattern analysis. Two applications might have identical data models but radically different query requirements. A user profile system with fast point lookups has different needs than one requiring complex aggregations across millions of profiles.
The database that perfectly matches your data's shape may falter when faced with query patterns it wasn't optimized for. Cassandra handles time-series data brilliantly—unless your queries need ad-hoc aggregations across arbitrary time ranges. MongoDB models documents naturally—unless you need sub-millisecond full-text search across millions of documents.
Understanding query patterns is the second essential filter in database selection. This page provides a systematic framework for analyzing your application's query requirements and matching them to database capabilities.
By the end of this page, you will understand the major categories of query patterns, how each NoSQL database type handles (or fails to handle) each pattern, and how to create a query profile that guides database selection. You'll learn to see beyond 'this database stores my data' to 'this database answers my questions efficiently.'
Every database operation falls into one of several fundamental query pattern categories. Different databases excel at different patterns—understanding this taxonomy is essential for informed selection.
The Seven Fundamental Query Patterns:
| Pattern | Description | Example Query | Optimization Priority |
|---|---|---|---|
| Point Lookup | Retrieve a single entity by unique identifier | Get user by ID | Latency (sub-ms) |
| Range Scan | Retrieve multiple entities within a bounded range | Orders in date range | Throughput, sequential I/O |
| Secondary Index Lookup | Find entities by non-primary attributes | Users by email | Index maintenance cost |
| Aggregation | Compute derived values across entities | Total sales by region | Compute, memory |
| Full-Text Search | Find documents by text content | Articles containing 'kubernetes' | Index size, relevance |
| Graph Traversal | Navigate relationships between entities | Friends of friends | Relationship locality |
| Join/Reference Resolution | Combine data from multiple entities | Orders with customer details | Data locality, round trips |
Why Pattern Classification Matters:
Each database architecture makes fundamental trade-offs that favor certain patterns over others.
The database that's 'best' depends entirely on which patterns dominate your workload. A workload with 90% point lookups and 10% aggregations should choose differently than one with 50% aggregations and 50% range scans.
Before selecting a database, profile your expected query distribution. What percentage of operations fall into each pattern? Which patterns are latency-sensitive? Which can tolerate higher latency? This workload profile guides database selection more than any abstract capability comparison.
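The profiling step above can be sketched as a simple weighted scoring exercise. The pattern weights and per-database fit scores below are illustrative assumptions for the sake of the example, not benchmark data:

```javascript
// Workload profile: fraction of operations in each query pattern.
const workload = {
  pointLookup: 0.6,
  rangeScan: 0.25,
  aggregation: 0.15,
};

// Hypothetical fit scores per pattern (0 = poor, 1 = excellent).
const fitScores = {
  redis:     { pointLookup: 1.0, rangeScan: 0.4, aggregation: 0.1 },
  mongodb:   { pointLookup: 0.8, rangeScan: 0.7, aggregation: 0.7 },
  cassandra: { pointLookup: 0.9, rangeScan: 0.9, aggregation: 0.2 },
};

// Weighted score: how well a database's strengths line up with the workload.
function scoreDatabase(profile, fits) {
  return Object.entries(profile)
    .reduce((sum, [pattern, weight]) => sum + weight * (fits[pattern] ?? 0), 0);
}

// Rank candidates for this workload, best first.
const ranked = Object.entries(fitScores)
  .map(([name, fits]) => [name, scoreDatabase(workload, fits)])
  .sort((a, b) => b[1] - a[1]);
```

The numbers matter less than the exercise: forcing yourself to write down the weights exposes which patterns actually dominate.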
Point lookup—retrieving a single entity by its unique identifier—is the most fundamental database operation. Every database supports it, but performance characteristics vary significantly.
How Different NoSQL Types Handle Point Lookups:
| Database Type | Typical Latency | Mechanism | Scaling Behavior |
|---|---|---|---|
| Key-Value (Redis) | < 0.5ms | In-memory hash table | O(1), memory-bound |
| Key-Value (DynamoDB) | < 5ms | Partitioned B-tree, SSD | O(1), partition-aware |
| Document (MongoDB) | < 5ms | B-tree index on _id | O(log n) per shard |
| Wide-Column (Cassandra) | < 10ms | SSTable + memtable + bloom filter | O(1), partition-key based |
| Graph (Neo4j) | < 5ms | Native ID lookup | O(1) for ID, varies for property lookup |
Point Lookup Optimization Strategies:
1. Key Design for Locality
In distributed databases, key design determines which node handles a lookup. Well-designed keys co-locate related data:
// DynamoDB - Partition key determines the node
{
  PK: "USER#user123",
  SK: "PROFILE"
}

// Cassandra - Partition key for node placement, clustering key for sort order
PRIMARY KEY ((user_id), created_at)
2. Caching Hot Keys
Point lookups on frequently accessed entities benefit enormously from caching:
// Pattern: Cache-aside for hot entities
async function getUser(userId) {
  let user = await cache.get(`user:${userId}`);
  if (!user) {
    user = await database.findById(userId);
    await cache.set(`user:${userId}`, user, { ttl: 300 }); // cache for 5 minutes
  }
  return user;
}
3. Batch Lookups
When you need multiple point lookups, batch them to reduce round trips:
// Bad: N sequential lookups
const users = [];
for (const id of userIds) {
  users.push(await db.findById(id)); // N round trips
}

// Good: Single batch lookup
const users = await db.findByIds(userIds); // 1 round trip
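One practical wrinkle: batch APIs usually cap the number of keys per request (DynamoDB's BatchGetItem, for example, accepts at most 100 items). A sketch of client-side chunking, where `db.findByIds` is the hypothetical batch call from the example above:

```javascript
// Split a list of IDs into batches that respect the backend's per-request limit.
function chunk(items, size) {
  const chunks = [];
  for (let i = 0; i < items.length; i += size) {
    chunks.push(items.slice(i, i + size));
  }
  return chunks;
}

// Issue the batches in parallel; each batch is one round trip.
async function batchGet(db, ids, batchSize = 100) {
  const batches = chunk(ids, batchSize);
  const results = await Promise.all(batches.map((b) => db.findByIds(b)));
  return results.flat();
}
```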
If your workload is 80%+ point lookups with simple values, key-value stores are nearly optimal. The additional features of document stores (secondary indexes, queries) and relational databases (joins, transactions) add overhead you don't need. Redis, Memcached, or DynamoDB will outperform more feature-rich databases for pure key-based access.
Range scans retrieve multiple entities within a defined boundary—typically time-based or sorted by some ordering key. This pattern is fundamental to analytics, audit logs, time-series data, and any scenario where you need 'all entities between X and Y.'
Range Scan Design Patterns:
1. Partition-Aware Range Queries
In distributed databases, range scans that span partitions require scatter-gather—queries to multiple nodes. Design partitioning to keep common ranges on single partitions:
-- Cassandra: Partition by user, cluster by time
CREATE TABLE user_events (
  user_id UUID,
  event_time TIMESTAMP,
  event_type TEXT,
  payload BLOB,
  PRIMARY KEY ((user_id), event_time)
) WITH CLUSTERING ORDER BY (event_time DESC);

-- Efficient: range within a single partition
SELECT * FROM user_events
WHERE user_id = ?
  AND event_time > '2024-01-01' AND event_time < '2024-01-31';

-- Inefficient: cross-partition range (full cluster scan)
SELECT * FROM user_events
WHERE event_time > '2024-01-01'; -- No partition key!
2. Time-Bucketing
For time-series data, bucket partition keys by time period to bound partition growth and enable time-range deletes:
-- Partition key includes a date bucket
PRIMARY KEY ((sensor_id, date_bucket), timestamp)

-- Buckets limit partition size and enable efficient purging:
-- date_bucket = '2024-01'  (monthly bucket)
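A small helper can derive the bucket component of the partition key at write time; the `YYYY-MM` format matches the monthly bucket above, and the function names are illustrative:

```javascript
// Derive a monthly date_bucket ('YYYY-MM') from an event timestamp (UTC).
function monthBucket(date) {
  const y = date.getUTCFullYear();
  const m = String(date.getUTCMonth() + 1).padStart(2, '0');
  return `${y}-${m}`;
}

// Build the composite partition key (sensor_id, date_bucket) for a write.
function partitionKey(sensorId, timestamp) {
  return { sensor_id: sensorId, date_bucket: monthBucket(timestamp) };
}
```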
3. Pagination for Large Ranges
Never return unbounded result sets. Use pagination tokens:
// Server: Return a page with a continuation token
{
  "data": [...100 items...],
  "nextToken": "eyJsYXN0X2lkIjoiYWJjMTIzIn0="
}

// Client: Request the next page
GET /events?after=eyJsYXN0X2lkIjoiYWJjMTIzIn0=
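The `nextToken` above is an opaque cursor — one common convention, assumed here, is base64-encoded JSON. A minimal Node.js sketch (the `{ last_id }` cursor shape is an assumption):

```javascript
// Encode a cursor object into an opaque continuation token.
function encodeToken(cursor) {
  return Buffer.from(JSON.stringify(cursor)).toString('base64');
}

// Decode a continuation token back into a cursor object.
function decodeToken(token) {
  return JSON.parse(Buffer.from(token, 'base64').toString('utf8'));
}
```

Keeping the token opaque lets the server change the cursor format (last ID, sort key, shard hint) without breaking clients.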
Range scans concentrated on recent data create hot partitions. If all queries target 'last 24 hours' for a time-series partitioned by day, today's partition receives all traffic. Strategies include sub-partitioning by additional dimensions (user_id + date) or write-time randomization with read-time scatter-gather.
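The write-time randomization idea can be sketched as key salting: writers append a random salt so load spreads across N sub-partitions, and readers enumerate every salt (the scatter-gather side). The key format and salt count are illustrative:

```javascript
// Number of sub-partitions a hot bucket is spread across.
const SALT_COUNT = 8;

// Write side: pick a random salt so writes spread across sub-partitions.
function saltedKey(dateBucket) {
  const salt = Math.floor(Math.random() * SALT_COUNT);
  return `${dateBucket}#${salt}`;
}

// Read side: enumerate every salted key for the bucket (scatter-gather).
function keysForBucket(dateBucket) {
  return Array.from({ length: SALT_COUNT }, (_, i) => `${dateBucket}#${i}`);
}
```

The trade-off is explicit: writes get N times the headroom, reads pay N queries per bucket.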
Applications rarely access data only by primary key. 'Find user by email,' 'get orders by status,' 'list products by category'—these queries require secondary indexes (DynamoDB's term for the global variant is Global Secondary Index, or GSI).
Secondary indexes are the first point where NoSQL databases diverge significantly from relational databases. SQL databases have decades of index optimization; NoSQL databases implement secondary indexes with varying degrees of maturity and different trade-offs.
Secondary Index Trade-offs:
| Factor | Consideration |
|---|---|
| Write amplification | Each write may update multiple indexes, increasing write latency and cost |
| Consistency | Indexes may lag behind primary data (eventual consistency) |
| Query capability | What index types are supported? Unique? Sparse? Compound? |
| Flexibility | Can you add/drop indexes dynamically without downtime? |
| Cost model | Some cloud databases charge per index write (DynamoDB charges for GSI writes) |
Secondary Index Support by Database Type:
| Database | Secondary Index Support | Trade-offs |
|---|---|---|
| Redis | Sorted sets, manual indexing | No automatic indexes; application manages consistency |
| DynamoDB | GSIs (projected), LSIs (within partition) | GSI writes charged separately; eventual consistency |
| MongoDB | Rich indexes (single, compound, multikey, text, geo) | Strong support; index overhead on writes |
| Cassandra | Secondary indexes, materialized views | Limited utility; often requires denormalization instead |
| Neo4j | Node property indexes, full-text indexes | Good for property lookup; relationship-based access preferred |
Secondary Index Design Patterns:
1. Denormalization Over Indexes (Wide-Column)
In Cassandra and similar wide-column stores, secondary indexes perform poorly at scale. The preferred pattern is query-driven denormalization—maintaining multiple tables, each optimized for a specific query:
-- Instead of one table with a secondary index on email:
--   users: PRIMARY KEY (user_id) + secondary index on email
-- Create two tables:
CREATE TABLE users_by_id (
  user_id UUID PRIMARY KEY,
  email TEXT,
  name TEXT
);

CREATE TABLE users_by_email (
  email TEXT PRIMARY KEY,
  user_id UUID,
  name TEXT
);

-- Application maintains both on write
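The dual write noted in the last comment might look like this on the application side; `db.batch` is a hypothetical client API standing in for a Cassandra logged BATCH:

```javascript
// Build the rows for both query-specific tables from one user object.
function userRows(user) {
  return [
    {
      table: 'users_by_id',
      row: { user_id: user.userId, email: user.email, name: user.name },
    },
    {
      table: 'users_by_email',
      row: { email: user.email, user_id: user.userId, name: user.name },
    },
  ];
}

// One logged batch keeps both tables in step on every write.
async function createUser(db, user) {
  await db.batch(userRows(user));
}
```

The cost of query-driven denormalization lives here: every new query table adds another row to this function, and the application owns their consistency.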
2. Sparse Indexes for Rare Attributes
MongoDB sparse indexes include only documents where the indexed field exists:
// Only index documents that have a 'premium_until' field
db.users.createIndex(
  { premium_until: 1 },
  { sparse: true } // Documents without the field are excluded from the index
);
3. Covered Queries (Projecting from Index)
Include queried fields in the index to avoid entity lookup:
// DynamoDB GSI with projection
{
  IndexName: "email-index",
  KeySchema: [{ AttributeName: "email", KeyType: "HASH" }],
  Projection: {
    ProjectionType: "INCLUDE",
    NonKeyAttributes: ["name", "status"] // Projected so these queries are covered
  }
}
If your query patterns require many different secondary indexes, flexible ad-hoc queries, or complex filtering, you may have outgrown key-value and wide-column models. Document databases handle moderate indexing needs; for truly flexible querying, consider adding a search engine (Elasticsearch) or using a relational database for that data subset.
Aggregation queries compute derived values from multiple entities: sums, counts, averages, groupings, and more complex statistical functions. This query pattern is where NoSQL databases diverge most dramatically from relational databases.
Why NoSQL Struggles with Aggregation:
NoSQL databases distribute data across nodes for scalability. Aggregation requires collecting data from all nodes where relevant entities reside:
[Node 1] → partial aggregate 1 ─┐
[Node 2] → partial aggregate 2 ─┼─→ [Coordinator] → final result
[Node 3] → partial aggregate 3 ─┘
This scatter-gather pattern adds latency and network overhead. Relational databases on single nodes can aggregate directly; distributed NoSQL databases pay coordination costs.
| Database Type | Aggregation Capability | Use Case Fit |
|---|---|---|
| Key-Value | None native; application-level or external | Not suitable for aggregation workloads |
| Document (MongoDB) | Aggregation pipeline (powerful) | Moderate aggregation; real-time analytics |
| Wide-Column (Cassandra) | Limited; counter tables, client-side aggregation | Pre-aggregated rollups; not ad-hoc aggregation |
| Time-Series (InfluxDB) | Built-in grouping, windowing, downsampling | Excellent for time-based aggregations |
| Graph (Neo4j) | Aggregation over traversal results | Limited; relationship-centric, not entity aggregation |
Aggregation Design Patterns:
1. Pre-Computed Aggregates (Write-Time Rollups)
Compute aggregates at write time, not query time:
// Instead of: SELECT COUNT(*) FROM orders WHERE date = '2024-01-15'
// Maintain a counter document:
{
  _id: "order-count:2024-01-15",
  count: 1547 // Incremented on each order insert
}

// On order insert:
db.counters.updateOne(
  { _id: `order-count:${orderDate}` },
  { $inc: { count: 1 } },
  { upsert: true }
);
Trade-off: Write complexity increases; queries become O(1) lookups.
2. Materialized Views (Wide-Column)
Maintain pre-aggregated tables that update on write:
CREATE TABLE daily_sales_by_product (
  product_id UUID,
  date DATE,
  total_quantity COUNTER,
  total_revenue_cents COUNTER,  -- counters are bigint; store revenue in cents
  PRIMARY KEY ((product_id, date))
);

-- Application maintains on each sale
-- (in CQL, increment syntax works only on counter columns)
UPDATE daily_sales_by_product
SET total_quantity = total_quantity + ?, total_revenue_cents = total_revenue_cents + ?
WHERE product_id = ? AND date = ?;
3. Lambda Architecture (Batch + Real-Time)
For complex analytics, use a dual-path architecture:
                     ┌──→ [Spark Streaming] ──→ [Real-time aggregates] ──┐
Events ──→ [Kafka] ──┤                                                   ├──→ [Query API] (merge)
                     └──→ [Spark Batch] ──────→ [Historical aggregates] ─┘
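At the serving layer, the two paths are merged. One common convention — assumed here — is that batch results are authoritative for the windows they cover, with real-time results filling in windows the batch job hasn't reached yet:

```javascript
// Merge per-window aggregates from both paths of a Lambda architecture.
// Batch values win where present; real-time fills the remaining windows.
function mergeAggregates(batch, realtime) {
  return { ...realtime, ...batch };
}
```

Spreading `batch` last means any window present in both maps takes its batch value, which corrects real-time drift once the batch job catches up.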
If aggregation queries are central to your application (dashboards, analytics, reporting), NoSQL databases alone may be insufficient. Consider a dedicated analytics store: ClickHouse for real-time analytics, BigQuery/Redshift for complex batch analytics, or TimescaleDB if aggregations are time-centric. Polyglot persistence—operational database + analytics database—is often the right answer.
Full-text search—finding documents by word or phrase within text content—is a specialized query pattern that most NoSQL databases handle poorly as a primary capability.
NoSQL Full-Text Search Capabilities:
| Database | Full-Text Capability | Appropriate For |
|---|---|---|
| MongoDB | Text indexes with $text operator | Basic search; not production-grade for search apps |
| Redis (RediSearch) | Full-text module with indexing | Simple search with low latency; limited analytics |
| Cassandra | SASI indexes (limited) | Very basic; not recommended for real search |
| Neo4j | Full-text indexes on properties | Supplementary search; graph traversal remains primary |
| Elasticsearch | Native full-text (purpose-built) | Production search applications; complex relevance |
Full-Text Search Architecture Patterns:
1. Dedicated Search Engine (Recommended)
Treat search as a separate concern with its own specialized database:
┌─────────────────────────────────────────────────────────────┐
│ Application Layer │
├──────────────────┬─────────────────┬────────────────────────┤
│ CRUD Writes │ Point Lookups │ Search Queries │
│ ↓ │ ↓ │ ↓ │
│ ┌──────────┐ │ ┌────────┐ │ ┌──────────────┐ │
│ │ MongoDB │────┼───│MongoDB │ │ │Elasticsearch │ │
│ │ (primary)│ │ │ reads │ │ │ (search) │ │
│ └────┬─────┘ │ └────────┘ │ └──────────────┘ │
│ │ │ │ ↑ │
│ └──────────┼─────────────────┼──────────┘ │
│ │ Sync via CDC or Change Streams │
└──────────────────┴─────────────────┴────────────────────────┘
2. Change Data Capture for Sync
Keep search indexes synchronized via CDC:
// MongoDB Change Stream → Elasticsearch sync
// fullDocument: 'updateLookup' makes update events carry the full document
const changeStream = collection.watch([], { fullDocument: 'updateLookup' });

changeStream.on('change', async (change) => {
  if (change.operationType === 'insert' || change.operationType === 'update') {
    await esClient.index({
      index: 'products',
      id: change.documentKey._id,
      body: transform(change.fullDocument)
    });
  } else if (change.operationType === 'delete') {
    await esClient.delete({
      index: 'products',
      id: change.documentKey._id
    });
  }
});
3. Search-Primary for Content Applications
For content-heavy applications (CMS, e-commerce catalogs), Elasticsearch can be the primary store:
Writes → Elasticsearch (primary) → Async backup to cold storage
Reads → Elasticsearch
Search → Elasticsearch
// Use Elasticsearch features for all access patterns
// Accept eventual consistency for writes
// No secondary sync needed
This simplifies architecture but requires accepting Elasticsearch's trade-offs (eventual consistency, manual reindexing on schema changes).
MongoDB's text indexes handle basic search adequately for internal tools, admin panels, and low-volume search features. If search quality isn't critical and you're searching thousands (not millions) of documents without complex relevance requirements, MongoDB text search avoids the complexity of a separate search cluster.
Graph traversal queries navigate relationships between entities: finding paths, discovering connected components, computing relationship-based scores. This pattern is fundamentally different from entity-centric queries and requires specialized handling.
Graph Traversal Query Types:
| Pattern | Description | Complexity | Example |
|---|---|---|---|
| Direct neighbors | 1-hop connected nodes | O(degree) | Direct friends of User A |
| Variable-depth traversal | N-hop connected nodes | O(branching^depth) | Friends within 3 degrees |
| Shortest path | Minimum hops between nodes | O(V + E) | Connection path between users |
| Pattern matching | Subgraphs matching structure | Expensive | Users who follow accounts that follow X |
| Centrality/PageRank | Node importance scoring | Iterative | Most influential accounts |
Handling Graph Patterns in Non-Graph Databases:
1. Adjacency List Pattern (Document Store)
Store connections as arrays within documents:
// User document with embedded connections
{
_id: "user-123",
name: "Alice",
following: ["user-456", "user-789", "user-012"],
followers: ["user-333", "user-444"]
}
// Query: Find who Alice follows
db.users.findOne({ _id: "user-123" }).following
// Query: Find mutual followers (2 queries + app logic)
Works for shallow traversals but becomes expensive for multi-hop queries.
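The '2 queries + app logic' step above reduces to a client-side set intersection over the two fetched arrays:

```javascript
// Intersect two connection lists (e.g., two users' follower arrays)
// to find the mutual connections, preserving the order of the first list.
function mutual(listA, listB) {
  const setB = new Set(listB);
  return listA.filter((id) => setB.has(id));
}
```

This is cheap for a single hop; the cost compounds once each hop requires fetching and intersecting ever-larger lists.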
2. Edge Table Pattern (Relational/Wide-Column)
Store edges as rows:
CREATE TABLE follows (
  follower_id UUID,
  following_id UUID,
  created_at TIMESTAMP,
  PRIMARY KEY ((follower_id), following_id)
);

-- Query: Who does user-123 follow?
SELECT following_id FROM follows WHERE follower_id = 'user-123';

-- Multi-hop requires recursive queries or application-level iteration
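The application-level iteration can be sketched as a breadth-first expansion, where `getFollowing` stands in for the one-hop SELECT above (each hop costs another round of queries):

```javascript
// Breadth-first expansion over an edge table, up to `depth` hops.
// getFollowing(id) returns the one-hop neighbors of a node.
function multiHop(getFollowing, startId, depth) {
  let frontier = new Set([startId]);
  const seen = new Set([startId]);
  for (let hop = 0; hop < depth; hop++) {
    const next = new Set();
    for (const id of frontier) {
      for (const neighbor of getFollowing(id)) {
        if (!seen.has(neighbor)) {
          seen.add(neighbor);
          next.add(neighbor);
        }
      }
    }
    frontier = next; // each hop is another batch of edge-table queries
  }
  seen.delete(startId); // exclude the starting node from the result
  return [...seen];
}
```

The O(branching^depth) cost from the table above shows up directly here: the frontier, and with it the query count, can grow multiplicatively per hop.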
3. Graph Database for Complex Traversals (Recommended)
If multi-hop traversals are frequent and depth varies, use a graph database:
// Neo4j: Find friends-of-friends
MATCH (me:User {id: 'user-123'})-[:FOLLOWS*2]->(fof:User)
WHERE NOT (me)-[:FOLLOWS]->(fof)
RETURN DISTINCT fof.name LIMIT 10;

// Shortest path between users
MATCH path = shortestPath(
  (a:User {id: 'user-123'})-[:FOLLOWS*]-(b:User {id: 'user-456'})
)
RETURN path;
Many applications have some graph-like queries but are not graph-centric. In these cases, a hybrid approach works: use a document or relational database as the primary store, and sync a subset of data to a graph database for relationship-heavy queries. This avoids forcing all data into a graph model when only 10% of queries need graph traversal.
Query patterns are the second essential lens for database selection. Data model fit determines which database categories can represent your data; query pattern analysis determines which can efficiently answer your questions.
The key insight: no NoSQL database excels at all query patterns. Database selection requires prioritizing which patterns matter most and accepting trade-offs for secondary patterns.
What's next:
With data model fit and query patterns understood, the next page examines consistency requirements—how your application's tolerance for stale reads, write conflicts, and eventual consistency further constrains viable database choices. Consistency is where CAP theorem trade-offs become concrete architectural decisions.
You now understand how to analyze query patterns as a critical input to database selection. The next step is building a query profile for your application—cataloging which patterns dominate your workload and which databases efficiently support them. This analysis, combined with data model fit, narrows the field to databases that genuinely match your requirements.