NoSQL Overview: Understanding the NoSQL Paradigm

Level: Intermediate

Duration: 60 mins

Topic: NoSQL Overview

Use Case Selection: Matching Databases to Requirements

The Art of Database Selection

Choosing a database is one of the most consequential technical decisions in a project. It influences application architecture, operational complexity, scalability paths, and even hiring requirements. A poor choice creates friction for years; a good choice becomes invisible infrastructure that simply works.

Yet database selection is often treated superficially—following trends, copying other companies, or defaulting to familiar technology. The result: teams wrestling with databases unsuited to their actual access patterns.

Effective database selection requires systematic analysis of requirements, honest assessment of trade-offs, and recognition that there's rarely a single "best" answer.

This page provides frameworks and heuristics for matching NoSQL (and relational) databases to real-world requirements.

What You Will Learn

By the end of this page, you will have practical frameworks for database selection, understand the key factors that drive technology choices, and be able to evaluate databases against specific use case requirements. You'll see real-world examples demonstrating how workload characteristics map to database choices.

The Selection Framework

Before evaluating specific databases, establish a framework for understanding your requirements. The following dimensions drive database selection decisions:

1. Data Model Requirements

Questions to ask:

What does the data look like? Tables, documents, graphs, key-value pairs?
How variable is the schema? Fixed attributes or highly heterogeneous?
Are relationships central to the data model?
Is the data naturally hierarchical/nested?

2. Query Patterns

Questions to ask:

What queries will the application run? Simple lookups, range scans, aggregations, traversals?
Are queries predictable or ad-hoc?
Do queries span multiple entities/tables?
What's the read/write ratio?

3. Consistency Requirements

Questions to ask:

Is strong consistency required (financial transactions), or is eventual consistency acceptable (social feeds)?
Does the application need transactions spanning multiple records?
What happens if reads return slightly stale data?

4. Scale Requirements

Questions to ask:

What's the data volume now? In 1 year? In 5 years?
What's the query load (queries per second)?
Is write throughput or read throughput more critical?
Does scale need to be global (multi-region)?

5. Operational Considerations

Questions to ask:

What's the team's expertise?
Should the database be managed or self-hosted?
What's the budget (licensing, infrastructure, personnel)?
What's the existing ecosystem?

Requirements Mapping Matrix
Requirement	Relational	Key-Value	Document	Column-Family	Graph
Complex transactions	★★★★★	★☆☆☆☆	★★★☆☆	★★☆☆☆	★★★☆☆
Flexible schema	★★☆☆☆	★★★★★	★★★★★	★★★★☆	★★★★☆
Simple lookups	★★★☆☆	★★★★★	★★★★☆	★★★★★	★★★☆☆
Complex queries	★★★★★	★☆☆☆☆	★★★★☆	★★★☆☆	★★★☆☆
Relationship queries	★★★☆☆	★☆☆☆☆	★★☆☆☆	★☆☆☆☆	★★★★★
Write throughput	★★★☆☆	★★★★★	★★★★☆	★★★★★	★★★☆☆
Horizontal scale	★★☆☆☆	★★★★★	★★★★☆	★★★★★	★★★☆☆
Strong consistency	★★★★★	★★★☆☆	★★★☆☆	★★★☆☆	★★★★☆
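
One way to put the matrix to work is to weight each requirement by how much it matters to your project and total the stars per category. The sketch below is only an illustration: the star values are transcribed from the matrix above, while the `rank_categories` helper, the 0-5 weight scale, and the sample `order_processing` weights are assumptions introduced here. Treat the result as a conversation starter, not a verdict.

```python
# Star ratings transcribed from the Requirements Mapping Matrix (1-5 stars).
CATEGORIES = ["Relational", "Key-Value", "Document", "Column-Family", "Graph"]

MATRIX = {
    "Complex transactions": [5, 1, 3, 2, 3],
    "Flexible schema":      [2, 5, 5, 4, 4],
    "Simple lookups":       [3, 5, 4, 5, 3],
    "Complex queries":      [5, 1, 4, 3, 3],
    "Relationship queries": [3, 1, 2, 1, 5],
    "Write throughput":     [3, 5, 4, 5, 3],
    "Horizontal scale":     [2, 5, 4, 5, 3],
    "Strong consistency":   [5, 3, 3, 3, 4],
}


def rank_categories(weights):
    """Return (category, score) pairs sorted best-first.

    `weights` maps a requirement name to its importance
    (a hypothetical 0-5 scale chosen for this sketch).
    """
    totals = [0] * len(CATEGORIES)
    for requirement, weight in weights.items():
        for i, stars in enumerate(MATRIX[requirement]):
            totals[i] += weight * stars
    return sorted(zip(CATEGORIES, totals), key=lambda pair: pair[1], reverse=True)


# Hypothetical weights for an order-processing workload.
order_processing = {
    "Complex transactions": 5,
    "Strong consistency": 5,
    "Complex queries": 3,
}
ranking = rank_categories(order_processing)
```

With these sample weights the relational category ranks first and key-value ranks last, which matches the guidance in the sections that follow.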

Start with Access Patterns

The most important factor in database selection is access patterns—how you read and write data. A database optimized for your access patterns will outperform a theoretically 'better' database that doesn't match your usage. Start with: 'What are the 5 most common queries?' and 'What's the write pattern?'

When to Choose Relational Databases

Before examining NoSQL choices, recognize that relational databases remain the right choice for many workloads. NoSQL isn't a replacement—it's an alternative for specific scenarios.

Relational Databases Excel When:

Complex transactions are required: Banking, inventory, order processing—any scenario where partial updates are unacceptable.

Ad-hoc querying is common: Business intelligence, reporting, analytics on structured data. SQL's expressiveness is unmatched for exploratory queries.

Data integrity is paramount: Healthcare records, financial audits, regulatory compliance—domains where data validity is non-negotiable.

Relationships are complex but well-defined: When data fits naturally into normalized tables with foreign key relationships.

The team has SQL expertise: Familiarity reduces errors and speeds development.

Strong Relational Use Cases

•Financial systems — Core banking, payment processing, accounting. ACID transactions are mandatory.
•E-commerce transactions — Order processing, inventory management where consistency prevents overselling.
•ERP and CRM — Enterprise systems with complex relationships between entities.
•Content management (structured) — When content has well-defined schemas and complex queries.
•Reporting and BI — Ad-hoc queries, aggregations, joins across multiple tables.

Don't Abandon Relational Without Reason

PostgreSQL and MySQL are highly capable, well-understood, and continuously improving. PostgreSQL supports JSON documents and full-text search natively, and it can scale horizontally through extensions such as Citus. Don't switch to NoSQL because it's trendy; switch because your specific requirements demand trade-offs that NoSQL provides.

Selecting Key-Value Stores

Key-value stores are appropriate when access patterns are simple—lookups by known keys—and performance is critical.

Decision Criteria for Key-Value Stores

Choose Key-Value When

•All access is by known key — You always know the identifier before querying.
•Sub-millisecond latency is required — Hot path operations must be blazingly fast.
•Data is naturally key-addressed — Sessions, caches, feature flags, rate limits.
•Value structure doesn't need server-side querying — The database treats values as opaque.
•Simplicity is prioritized — Simple data model, simple operations, simple scaling.

Common Key-Value Use Cases

Session Management

Key: session:{session_id}
Value: {user_id, expires, permissions, metadata}
Operations: GET (validate session), SET (create/update), DELETE (logout)
Requirements: Sub-millisecond reads, TTL expiration, high throughput
Choice: Redis, Memcached, DynamoDB

Caching Layer

Key: cache:{entity}:{id}  (e.g., cache:user:12345)
Value: Serialized entity from primary database
Operations: GET (cache hit/miss), SET (populate cache), DELETE (invalidate)
Requirements: Speed, TTL, eventual consistency acceptable
Choice: Redis, Memcached
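
The caching pattern above is usually implemented as "cache-aside": read the cache first, fall back to the primary database on a miss, then populate the cache with a TTL. The following is a minimal sketch of that flow. The `TTLCache` class is an in-memory stand-in written for this example, not a Redis or Memcached client, and `get_user` and `load_from_db` are hypothetical names.

```python
import time


class TTLCache:
    """In-memory stand-in for a key-value cache (illustration only)."""

    def __init__(self, clock=time.monotonic):
        self._data = {}      # key -> (value, expires_at)
        self._clock = clock  # injectable so expiry can be tested

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None      # cache miss
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._data[key]  # lazily evict the expired entry
            return None
        return value

    def set(self, key, value, ttl_seconds):
        self._data[key] = (value, self._clock() + ttl_seconds)

    def delete(self, key):
        self._data.pop(key, None)  # invalidate; no error if absent


def get_user(user_id, cache, load_from_db, ttl_seconds=60):
    """Cache-aside read: try the cache, fall back to the primary database."""
    key = f"cache:user:{user_id}"
    cached = cache.get(key)
    if cached is not None:
        return cached
    user = load_from_db(user_id)        # hypothetical loader for the primary store
    cache.set(key, user, ttl_seconds)   # populate for later reads
    return user
```

Because entries expire on their own, a reader may see data up to `ttl_seconds` old, which is the "eventual consistency acceptable" requirement in practice. Call `delete` after a write when staleness matters.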

Rate Limiting

Key: ratelimit:{user_id}:{window}
Value: Request count in current window
Operations: INCR (atomic increment), GET (check limit), EXPIRE (window reset)
Requirements: Atomic operations, TTL, very high throughput
Choice: Redis
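
The key scheme above describes a fixed-window counter. As a rough sketch of the logic, the class below keeps the counters in a plain dictionary that stands in for the key-value store; `FixedWindowRateLimiter` and its parameters are names invented for this example. In a real deployment the increment would be the store's atomic INCR and old windows would be removed by EXPIRE, neither of which this single-process sketch provides.

```python
import time


class FixedWindowRateLimiter:
    """Fixed-window rate limiter over an in-memory counter map (illustration only)."""

    def __init__(self, limit, window_seconds, clock=time.time):
        self.limit = limit
        self.window_seconds = window_seconds
        self._clock = clock
        self._counters = {}  # key -> request count; stands in for the key-value store

    def allow(self, user_id):
        # The window number changes every `window_seconds`, so a new key
        # (and therefore a fresh counter) is used for each window.
        window = int(self._clock() // self.window_seconds)
        key = f"ratelimit:{user_id}:{window}"
        # Equivalent of INCR: create at 1 or increment the existing count.
        count = self._counters.get(key, 0) + 1
        self._counters[key] = count
        return count <= self.limit
```

A fixed window allows short bursts at window boundaries (up to twice the limit across two adjacent windows); sliding-window or token-bucket variants trade simplicity for smoother limiting.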

Feature Flags

Key: feature:{feature_name}
Value: {enabled, percentage_rollout, user_whitelist}
Operations: GET (check flag state)
Requirements: Fast reads, rare writes, simple caching
Choice: Redis, etcd (for distributed config)
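
Once the flag value has been fetched by key, the rollout decision itself is made in application code. Here is one possible way to evaluate the `{enabled, percentage_rollout, user_whitelist}` value shown above. The `is_enabled` function and the hashing scheme are assumptions made for this sketch; hashing the flag name together with the user ID simply gives each user a stable bucket from 0 to 99.

```python
import hashlib


def is_enabled(flag, flag_name, user_id):
    """Decide whether a feature is on for a user.

    `flag` is the stored value, e.g.
    {"enabled": True, "percentage_rollout": 25, "user_whitelist": ["alice"]}.
    """
    if not flag["enabled"]:
        return False
    if user_id in flag["user_whitelist"]:
        return True
    # Hash flag name + user ID so the same user always lands in the same bucket,
    # and different flags bucket users independently.
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # 0..99
    return bucket < flag["percentage_rollout"]
```

Because the bucket is derived deterministically, a user does not flip between on and off across requests, and raising `percentage_rollout` only ever adds users.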

Key-Value Store Selection Guide
Requirement	Best Choice	Why
In-memory speed, data structures	Redis	Rich data structures: lists, sets, sorted sets, streams
Pure caching, simplicity	Memcached	Simpler, multi-threaded, pure cache semantics
Managed, serverless scaling	DynamoDB	Auto-scaling, no operational overhead, pay-per-request
Distributed config, coordination	etcd	Strong consistency via Raft, Kubernetes-native
High availability, eventual consistency	Riak KV	Dynamo-inspired, masterless architecture

Selecting Document Databases

Document databases are appropriate when data is naturally document-shaped with varied attributes, and queries go beyond simple key lookups.

Decision Criteria for Document Databases

Choose Document DB When

•Data has varied attributes — Product catalogs where electronics and clothing have different fields.
•Schema evolves frequently — Rapid development cycles, changing requirements.
•Queries operate on document fields — Filter by price, search by name, aggregate by category.
•Related data embeds naturally — User with nested address, preferences, roles.
•Development velocity is prioritized — JSON in, JSON out; minimal mapping layers.

Common Document Database Use Cases

Content Management System

{
    "_id": "article_12345",
    "title": "Understanding NoSQL Databases",
    "author": {"name": "Alice", "bio": "..."},
    "body": "...",
    "tags": ["database", "nosql", "tutorial"],
    "metadata": {"views": 1234, "published": "2024-01-15"},
    "comments": [{"user": "bob", "text": "Great article!"}]
}

Queries: By tag, by author, full-text search, recent articles
Choice: MongoDB (rich queries), Couchbase (caching + documents)

Product Catalog

{
    "_id": "sku_12345",
    "name": "Wireless Headphones",
    "category": ["electronics", "audio"],
    "price": 149.99,
    "attributes": {
        "battery_life": "40 hours",
        "driver_size": "40mm"
    }
}

Queries: By category, price range, attribute filters, text search
Choice: MongoDB, Elasticsearch (if search is primary)
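
To make "queries operate on document fields" concrete, the sketch below runs the catalog filters over plain Python dictionaries. It is not a database client: `find_products` is a hypothetical helper, and the second product is invented here to show that documents in one collection can carry different attributes. A document database evaluates the same kind of predicate server-side, ideally with indexes.

```python
catalog = [
    {
        "_id": "sku_12345",
        "name": "Wireless Headphones",
        "category": ["electronics", "audio"],
        "price": 149.99,
        "attributes": {"battery_life": "40 hours", "driver_size": "40mm"},
    },
    {   # hypothetical second product with a different attribute set
        "_id": "sku_67890",
        "name": "Cotton T-Shirt",
        "category": ["clothing"],
        "price": 19.99,
        "attributes": {"size": "M", "material": "cotton"},
    },
]


def find_products(docs, category=None, min_price=None, max_price=None, attributes=None):
    """Filter documents by category membership, price range, and exact attribute values."""
    results = []
    for doc in docs:
        if category is not None and category not in doc.get("category", []):
            continue
        price = doc.get("price")
        if min_price is not None and (price is None or price < min_price):
            continue
        if max_price is not None and (price is None or price > max_price):
            continue
        doc_attrs = doc.get("attributes", {})
        # A missing attribute never matches, so heterogeneous documents are safe to scan.
        if attributes and any(doc_attrs.get(k) != v for k, v in attributes.items()):
            continue
        results.append(doc)
    return results


audio_under_200 = find_products(catalog, category="audio", max_price=200)
```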

User Profiles

{
    "_id": "user_12345",
    "email": "alice@example.com",
    "preferences": {"theme": "dark", "notifications": true},
    "sessions": [{"device": "mobile", "last_active": "..."}]
}

Access pattern: Usually single-document reads/writes
Choice: MongoDB, DynamoDB (if simple access patterns)

Document Database Selection Guide
Requirement	Best Choice	Why
Rich queries, aggregations, transactions	MongoDB	Most complete feature set, ACID transactions
Real-time mobile sync, offline-first	Firestore, CouchDB	Built-in sync, conflict resolution
Hybrid caching + document	Couchbase	Memcached-compatible caching layer built-in
AWS ecosystem, serverless	DynamoDB	Managed, auto-scaling, tight AWS integration
Search-first with documents	Elasticsearch	Optimized for full-text search, analytics

Selecting Column-Family Databases

Column-family databases are appropriate for time-series data, high write throughput, and workloads with well-defined query patterns.

Decision Criteria for Column-Family Databases

Choose Column-Family When

•Write throughput is critical — Ingesting millions of events per second.
•Data has time-series characteristics — Metrics, logs, sensor readings, event streams.
•Data is wide and sparse — Many optional columns that vary by record.
•Query patterns are known upfront — Can model data specifically for queries.
•Massive scale is required — Petabytes of data across hundreds of nodes.
•Eventual consistency is acceptable — Consistency is tunable per query, but the default favors availability over consistency (AP).

Common Column-Family Use Cases

IoT Sensor Data

Primary Key: (device_id), timestamp
Columns: sensor readings, status flags, metadata
Query: "Last 24 hours of readings for device X"
Write: Append-only, millions of records/second
Choice: Cassandra, ScyllaDB, TimescaleDB

Metrics and Monitoring

Primary Key: (metric_name, time_bucket), timestamp
Columns: value, tags, aggregates
Query: "Average CPU for server Y in last hour"
Write: High-cardinality metrics from thousands of hosts
Choice: Cassandra, TimescaleDB, InfluxDB
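
The `time_bucket` component of the key above keeps any single partition from growing without bound, at the cost of the application having to know which buckets a time range touches. The sketch below illustrates that bookkeeping under the assumption of hourly buckets; the bucket size and the function names are choices made for this example, not something the key design prescribes.

```python
from datetime import datetime, timedelta, timezone

BUCKET = timedelta(hours=1)  # assumed bucket size for this sketch


def time_bucket(ts):
    """Truncate a timestamp to the start of its hourly bucket."""
    return ts.replace(minute=0, second=0, microsecond=0)


def partitions_for_range(metric_name, start, end):
    """List the (metric_name, time_bucket) partitions that cover [start, end]."""
    bucket = time_bucket(start)
    keys = []
    while bucket <= end:
        keys.append((metric_name, bucket))
        bucket += BUCKET
    return keys


start = datetime(2024, 1, 15, 10, 20, tzinfo=timezone.utc)
end = start + timedelta(hours=1)
keys = partitions_for_range("cpu.usage", start, end)  # "cpu.usage" is a made-up metric name
```

A "last hour" query that starts mid-bucket reads two partitions, as here. Smaller buckets keep partitions small but increase the number of partitions each query must read.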

Activity Feeds

Primary Key: (user_id), timestamp
Columns: activity type, actor, object, metadata
Query: "Recent 50 activities for user X"
Write: Fan-out events across millions of users
Choice: Cassandra (used by Instagram, Netflix)
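
"Fan-out on write" means each activity is copied into every follower's feed at write time, so that reading a feed touches a single partition. The sketch below imitates that with a dictionary of lists; `publish`, `recent`, and the in-memory `feeds` map are illustrative stand-ins introduced here, not a Cassandra API.

```python
from collections import defaultdict

# user_id -> list of (timestamp, activity); one list plays the role of one partition.
feeds = defaultdict(list)


def publish(actor, followers, timestamp, activity_type, obj):
    """Fan-out on write: copy the activity into each follower's feed."""
    for follower in followers:
        feeds[follower].append(
            (timestamp, {"actor": actor, "type": activity_type, "object": obj})
        )


def recent(user_id, limit=50):
    """Return the newest activities first, as a timestamp-descending clustering order would."""
    newest_first = sorted(feeds[user_id], key=lambda entry: entry[0], reverse=True)
    return [activity for _, activity in newest_first[:limit]]


publish("alice", ["bob", "carol"], 1, "post", "p1")
publish("dave", ["bob"], 2, "like", "p1")
```

The trade-off is write amplification: one activity from an account with a million followers becomes a million writes, which is why this pattern is paired with stores built for high write throughput.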

Messaging and Chat History

Primary Key: (conversation_id), message_timestamp
Columns: sender, content, attachments, read_status
Query: "Messages in conversation X, last 100"
Write: Real-time message delivery
Choice: Cassandra, ScyllaDB

Column-Family Database Selection Guide
Requirement	Best Choice	Why
General-purpose wide-column, proven scale	Apache Cassandra	Battle-tested at Netflix, Apple; large community
Cassandra-compatible, higher performance	ScyllaDB	C++ reimplementation; vendor benchmarks claim up to 10x throughput
Hadoop ecosystem integration	HBase	Built on HDFS, integrates with Spark, Hive
Managed, Google-scale	Cloud Bigtable	Managed, integrates with GCP data ecosystem
Purpose-built time-series	TimescaleDB, InfluxDB	Optimized for time-series queries, retention policies

Column-Family Requires Expertise

Column-family databases require careful data modeling—you must design tables around query patterns before writing code. This is a different discipline from relational modeling. Invest in learning or hire expertise; poor data modeling in Cassandra leads to full-table scans and performance disasters.

Selecting Graph Databases

Graph databases are appropriate when relationships between entities are the primary focus of queries.

Decision Criteria for Graph Databases

Choose Graph DB When

•Queries traverse relationships — Friends-of-friends, shortest path, pattern matching.
•Relationship depth varies — Variable-length paths are common queries.
•Data is naturally graph-shaped — Social networks, knowledge bases, network topologies.
•Recommendations are central — "People who liked X also liked Y" patterns.
•Fraud detection, impact analysis — Finding connected patterns across entities.

Common Graph Database Use Cases

Social Network Features

Nodes: Person, Post, Group, Event
Edges: FOLLOWS, LIKES, MEMBER_OF, ATTENDS
Queries: 
  - Friends-of-friends not yet connected
  - Influencer identification (high-degree nodes)
  - Community detection
Choice: Neo4j (feature-rich), Neptune (managed AWS)
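
The first query, friends-of-friends not yet connected, is a two-hop traversal with two exclusions. A minimal in-memory sketch follows. The adjacency dictionary and the names in it are made up for illustration; a graph database would express the same traversal in its query language and execute it against stored edges.

```python
# Hypothetical FOLLOWS edges: person -> set of people they follow.
follows = {
    "alice": {"bob", "carol"},
    "bob": {"dave"},
    "carol": {"dave", "erin", "alice"},
    "dave": set(),
    "erin": set(),
}


def friends_of_friends(graph, person):
    """People two hops away who are not already followed and are not the person."""
    direct = graph.get(person, set())
    candidates = set()
    for friend in direct:
        candidates |= graph.get(friend, set())  # second hop
    return candidates - direct - {person}


suggestions = friends_of_friends(follows, "alice")
```

In a relational schema the same question needs a self-join on the edge table per hop, which is why deeper or variable-depth versions of this query favor a graph model.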

Recommendation Engine

Nodes: User, Product, Category
Edges: PURCHASED, VIEWED, SIMILAR_TO
Queries:
  - "Products bought by people who bought X"
  - "Shortest path between user preferences and product"
  - Collaborative filtering via graph
Choice: Neo4j, TigerGraph (analytics scale)

Knowledge Graph / Semantic Web

Nodes: Entity (Person, Place, Concept)
Edges: Relationships with types and properties
Queries:
  - "Find all people connected to Company X within 3 hops"
  - Pattern matching for entities
Choice: Neo4j, Neptune (RDF/SPARQL support)
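
A "within 3 hops" query is a bounded breadth-first search. The sketch below shows the idea on a small, invented adjacency list; `within_hops` is a hypothetical helper, and the edges are listed in both directions to model an undirected relationship.

```python
from collections import deque


def within_hops(graph, start, max_hops):
    """Return {node: hop_distance} for nodes reachable within max_hops of start."""
    distances = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distances[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in graph.get(node, ()):
            if neighbor not in distances:  # first visit is the shortest distance in BFS
                distances[neighbor] = distances[node] + 1
                queue.append(neighbor)
    del distances[start]
    return distances


# Hypothetical chain of connections, listed in both directions.
connections = {
    "Company X": ["ann"],
    "ann": ["Company X", "ben"],
    "ben": ["ann", "cy"],
    "cy": ["ben", "dee"],
    "dee": ["cy"],
}
nearby = within_hops(connections, "Company X", 3)
```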

Fraud Detection

Nodes: Account, Device, Transaction, IP Address
Edges: USES, TRANSACTED_WITH, LOGGED_FROM
Queries:
  - "Find accounts sharing devices with known fraudsters"
  - Ring detection among accounts
  - Abnormal relationship patterns
Choice: Neo4j, TigerGraph, Amazon Neptune
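
The first fraud query amounts to walking from known fraudulent accounts to their devices and back out to other accounts. As an illustrative sketch only, with invented account and device IDs and a hypothetical `accounts_sharing_devices` helper:

```python
# Hypothetical USES edges: account -> set of devices it has used.
uses = {
    "acct_1": {"dev_a"},
    "acct_2": {"dev_a", "dev_b"},
    "acct_3": {"dev_c"},
    "acct_4": {"dev_b"},
}
known_fraud = {"acct_1"}


def accounts_sharing_devices(uses, known_fraud):
    """Accounts (not already flagged) that used any device a flagged account used."""
    fraud_devices = set()
    for account in known_fraud:
        fraud_devices |= uses.get(account, set())
    return {
        account
        for account, devices in uses.items()
        if account not in known_fraud and devices & fraud_devices
    }


suspects = accounts_sharing_devices(uses, known_fraud)
```

This version stops after one account-device-account step, so `acct_4`, which only shares a device with `acct_2`, is not returned. Following such chains further is the ring-detection query, where graph traversal becomes more valuable.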

Graph Database Selection Guide
Requirement	Best Choice	Why
Enterprise graph, rich features	Neo4j	Most mature, Cypher language, great tooling
AWS managed, multi-model	Amazon Neptune	Managed, supports Gremlin and SPARQL
Real-time analytics at scale	TigerGraph	Optimized for iterative analytics, massive graphs
Multi-model (document + graph)	ArangoDB	Unified AQL for documents and graphs
Open source, distributed	JanusGraph	Supports Cassandra/HBase backends, TinkerPop standard

Real-World Selection Examples

Let's walk through realistic decision processes for common scenarios.

Example 1: E-Commerce Platform

Requirements:

Product catalog: 1M products, varied attributes
Order processing: ACID transactions, inventory consistency
User sessions: Sub-millisecond authentication checks
Product search: Full-text, faceted filtering
Recommendations: "Similar products" based on purchase history

Decision:

PostgreSQL: Orders, inventory (transactions required)
MongoDB: Product catalog (flexible schemas, rich queries)
Redis: Session cache (speed critical)
Elasticsearch: Search (purpose-built for full-text)
Neo4j or PostgreSQL: Recommendations (depends on complexity)

Rationale: This is polyglot persistence, where each database handles what it does best. Alternatively, use PostgreSQL for everything if scale is modest and the team prefers simplicity.

Example 2: IoT Analytics Platform

Requirements:

Sensor ingestion: 100K devices, 10 readings/second each = 1M writes/second
Time-series queries: "Average temperature for device X in last 24 hours"
Real-time alerting: Detecting anomalies as they happen
Device metadata: Configuration, location, ownership
Historical analytics: Aggregate queries across months of data

Decision:

Cassandra or TimescaleDB: Time-series sensor data (write throughput, time-range queries)
Redis: Real-time alerting state, recent values cache
PostgreSQL: Device metadata (relational, rarely changes)
Spark + Cassandra: Historical analytics (batch processing)

Rationale: Sustaining 1M writes/second rules out a conventional single-node RDBMS for sensor data. Cassandra's partition design, with rows grouped by device and ordered by timestamp, fits time-series data naturally.
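
It helps to extend the ingestion arithmetic from the requirements into a daily volume before committing to a design. The device count and reading rate below come from the requirements; the 100 bytes per row is purely an assumption for this back-of-the-envelope sketch, so substitute a measured figure for your own schema.

```python
DEVICES = 100_000
READINGS_PER_DEVICE_PER_SECOND = 10
SECONDS_PER_DAY = 86_400
ASSUMED_BYTES_PER_ROW = 100  # assumption for illustration; measure your real row size


def writes_per_second(devices, readings_per_second):
    return devices * readings_per_second


def rows_per_day(write_rate):
    return write_rate * SECONDS_PER_DAY


def raw_bytes_per_day(rows, bytes_per_row):
    # Raw payload only: excludes replication, indexes, and compression.
    return rows * bytes_per_row


rate = writes_per_second(DEVICES, READINGS_PER_DEVICE_PER_SECOND)
daily_rows = rows_per_day(rate)
daily_bytes = raw_bytes_per_day(daily_rows, ASSUMED_BYTES_PER_ROW)
```

Under that assumption the platform ingests 86.4 billion rows and roughly 8.64 TB of raw data per day before replication, which is why retention policies and downsampling belong in the design from the start.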

Example 3: Social Media Startup

Requirements:

User profiles: Varied attributes, profile customization
Social graph: Following, blocking, mutual friends queries
Activity feed: "What are my friends doing?" reverse-chronological
Messaging: Direct messages, group chats
Scale: Plan for 10M users, must scale to 100M

Decision:

MongoDB: User profiles (flexible schema, rich queries)
Neo4j or Cassandra: Social graph (Neo4j if graph queries dominate, Cassandra if scale dominates)
Cassandra: Activity feed (fan-out write, time-ordered reads)
Cassandra: Messaging (time-ordered, partition by conversation)
Redis: Caching layer throughout

Rationale: At startup scale, MongoDB could handle the social graph, but planning for 100M users pushes toward specialized solutions. Instagram, for example, has publicly described running Cassandra in production for workloads such as feeds.

Start Simple, Specialize Later

Many successful applications start with a single database (often PostgreSQL or MongoDB) and add specialized databases as scaling demands. Don't prematurely optimize with polyglot persistence—it adds operational complexity. Specialize when you have concrete evidence that a specialized database solves a real problem.

Anti-Patterns and Pitfalls

Learning from common mistakes is as valuable as understanding best practices.

Database Selection Anti-Patterns

•Trend-driven selection — Choosing a database because it's popular or a famous company uses it. What works for Google may not work for you.
•Resume-driven development — Selecting technology to learn something new rather than to solve the problem optimally.
•Premature specialization — Using 5 databases when 1 would suffice. Operational complexity has real costs.
•Ignoring operational reality — Choosing a database without considering who will operate it. Self-hosted distributed databases require expertise.
•Underestimating relational — Dismissing RDBMS as 'legacy' when PostgreSQL with JSON columns might solve the problem.
•Overestimating scale needs — Designing for Google scale when you have 1,000 users. Optimize for current reality plus 10x, not 10,000x.
•Ignoring consistency needs — Accepting eventual consistency without understanding the application-level implications.
•Data model mismatch — Forcing a graph into documents or documents into tables because that's what you know.

The Hidden Cost of Complexity

Every database you add requires: monitoring, backups, security configuration, capacity planning, incident response, and team training. A single well-chosen database often beats three "optimal" ones. Measure the complexity cost against the performance benefit.

Summary: Principled Database Selection

We've established frameworks and heuristics for matching databases to requirements. The key is systematic analysis, not intuition or trends.

Key Takeaways

•Start with access patterns — How you query data is the primary driver of database selection.
•Don't dismiss relational — PostgreSQL remains excellent for many workloads; NoSQL isn't always the answer.
•Match the data model — Use key-value for key-addressed data, documents for semi-structured, graphs for relationships, column-family for time-series.
•Consider consistency requirements — The right choice depends on whether eventual consistency is acceptable.
•Factor in operations — A database your team can't operate well is worse than a suboptimal one they understand.
•Start simple, specialize when needed — Begin with one database; add specialized ones when concrete requirements demand it.
•Polyglot has costs — Multiple databases means multiple systems to manage; justify the complexity.

Module Complete:

This concludes our exploration of NoSQL databases at the overview level. You now understand what NoSQL databases are, why they emerged, the theoretical foundations (CAP/BASE), the four primary categories, and how to select the right database for specific use cases.

Subsequent modules will dive deep into each NoSQL category: key-value stores, document databases, column-family databases, and graph databases—exploring their architectures, query languages, and practical implementation patterns.

Module Complete

You now have a comprehensive understanding of the NoSQL landscape and practical frameworks for database selection. You can analyze requirements systematically, evaluate database categories against specific use cases, and avoid common selection pitfalls. You're prepared to dive deeper into specific NoSQL database categories in the following modules.

5 / 5

Loading learning content...

Database Management SystemsNoSQL Overview

NoSQL Overview: Understanding the NoSQL Paradigm

LevelIntermediate

Duration60 mins

TopicNoSQL Overview

5 / 5

Use Case Selection: Matching Databases to Requirements

The Art of Database Selection

Effective database selection requires systematic analysis of requirements, honest assessment of trade-offs, and recognition that there's rarely a single "best" answer.

This page provides frameworks and heuristics for matching NoSQL (and relational) databases to real-world requirements.

What You Will Learn

The Selection Framework

Before evaluating specific databases, establish a framework for understanding your requirements. The following dimensions drive database selection decisions:

1. Data Model Requirements

Questions to ask:

What does the data look like? Tables, documents, graphs, key-value pairs?
How variable is the schema? Fixed attributes or highly heterogeneous?
Are relationships central to the data model?
Is the data naturally hierarchical/nested?

2. Query Patterns

Questions to ask:

What queries will the application run? Simple lookups, range scans, aggregations, traversals?
Are queries predictable or ad-hoc?
Do queries span multiple entities/tables?
What's the read/write ratio?

3. Consistency Requirements

Questions to ask:

Is strong consistency required (financial transactions) or acceptable eventual consistency (social feeds)?
Does the application need transactions spanning multiple records?
What happens if reads return slightly stale data?

4. Scale Requirements

Questions to ask:

What's the data volume now? In 1 year? In 5 years?
What's the query load (queries per second)?
Is write throughput or read throughput more critical?
Does scale need to be global (multi-region)?

5. Operational Considerations

Questions to ask:

What's the team's expertise?
Should the database be managed or self-hosted?
What's the budget (licensing, infrastructure, personnel)?
What's the existing ecosystem?

Requirements Mapping Matrix
Requirement	Relational	Key-Value	Document	Column-Family	Graph
Complex transactions	★★★★★	★☆☆☆☆	★★★☆☆	★★☆☆☆	★★★☆☆
Flexible schema	★★☆☆☆	★★★★★	★★★★★	★★★★☆	★★★★☆
Simple lookups	★★★☆☆	★★★★★	★★★★☆	★★★★★	★★★☆☆
Complex queries	★★★★★	★☆☆☆☆	★★★★☆	★★★☆☆	★★★☆☆
Relationship queries	★★★☆☆	★☆☆☆☆	★★☆☆☆	★☆☆☆☆	★★★★★
Write throughput	★★★☆☆	★★★★★	★★★★☆	★★★★★	★★★☆☆
Horizontal scale	★★☆☆☆	★★★★★	★★★★☆	★★★★★	★★★☆☆
Strong consistency	★★★★★	★★★☆☆	★★★☆☆	★★★☆☆	★★★★☆

Start with Access Patterns

When to Choose Relational Databases

Before examining NoSQL choices, recognize that relational databases remain the right choice for many workloads. NoSQL isn't a replacement—it's an alternative for specific scenarios.

Relational Databases Excel When:

Complex transactions are required: Banking, inventory, order processing—any scenario where partial updates are unacceptable.

Ad-hoc querying is common: Business intelligence, reporting, analytics on structured data. SQL's expressiveness is unmatched for exploratory queries.

Data integrity is paramount: Healthcare records, financial audits, regulatory compliance—domains where data validity is non-negotiable.

Relationships are complex but well-defined: When data fits naturally into normalized tables with foreign key relationships.

The team has SQL expertise: Familiarity reduces errors and speeds development.

Strong Relational Use Cases

•Financial systems — Core banking, payment processing, accounting. ACID transactions are mandatory.
•e-commerce transactions — Order processing, inventory management where consistency prevents overselling.
•ERP and CRM — Enterprise systems with complex relationships between entities.
•Content management (structured) — When content has well-defined schemas and complex queries.
•Reporting and BI — Ad-hoc queries, aggregations, joins across multiple tables.

Don't Abandon Relational Without Reason

Selecting Key-Value Stores

Key-value stores are appropriate when access patterns are simple—lookups by known keys—and performance is critical.

Decision Criteria for Key-Value Stores

Choose Key-Value When

•All access is by known key — You always know the identifier before querying.
•Sub-millisecond latency is required — Hot path operations must be blazingly fast.
•Data is naturally key-addressed — Sessions, caches, feature flags, rate limits.
•Value structure doesn't need server-side querying — The database treats values as opaque.
•Simplicity is prioritized — Simple data model, simple operations, simple scaling.

Common Key-Value Use Cases

Session Management

Key: session:{session_id}
Value: {user_id, expires, permissions, metadata}
Operations: GET (validate session), SET (create/update), DELETE (logout)
Requirements: Sub-millisecond reads, TTL expiration, high throughput
Choice: Redis, Memcached, DynamoDB

Caching Layer

Key: cache:{entity}:{id}  (e.g., cache:user:12345)
Value: Serialized entity from primary database
Operations: GET (cache hit/miss), SET (populate cache), DELETE (invalidate)
Requirements: Speed, TTL, eventual consistency acceptable
Choice: Redis, Memcached

Rate Limiting

Key: ratelimit:{user_id}:{window}
Value: Request count in current window
Operations: INCR (atomic increment), GET (check limit), EXPIRE (window reset)
Requirements: Atomic operations, TTL, very high throughput
Choice: Redis

Feature Flags

Key: feature:{feature_name}
Value: {enabled, percentage_rollout, user_whitelist}
Operations: GET (check flag state)
Requirements: Fast reads, rare writes, simple caching
Choice: Redis, etcd (for distributed config)

Key-Value Store Selection Guide
Requirement	Best Choice	Why
In-memory speed, data structures	Redis	Rich data structures: lists, sets, sorted sets, streams
Pure caching, simplicity	Memcached	Simpler, multi-threaded, pure cache semantics
Managed, serverless scaling	DynamoDB	Auto-scaling, no operational overhead, pay-per-request
Distributed config, coordination	etcd	Strong consistency via Raft, Kubernetes-native
High availability, eventual consistency	Riak KV	Dynamo-inspired, masterless architecture

Selecting Document Databases

Document databases are appropriate when data is naturally document-shaped with varied attributes, and queries go beyond simple key lookups.

Decision Criteria for Document Databases

Choose Document DB When

•Data has varied attributes — Product catalogs where electronics and clothing have different fields.
•Schema evolves frequently — Rapid development cycles, changing requirements.
•Queries operate on document fields — Filter by price, search by name, aggregate by category.
•Related data embeds naturally — User with nested address, preferences, roles.
•Development velocity is prioritized — JSON in, JSON out; minimal mapping layers.

Common Document Database Use Cases

Content Management System

{
    "_id": "article_12345",
    "title": "Understanding NoSQL Databases",
    "author": {"name": "Alice", "bio": "..."},
    "body": "...",
    "tags": ["database", "nosql", "tutorial"],
    "metadata": {"views": 1234, "published": "2024-01-15"},
    "comments": [{"user": "bob", "text": "Great article!"}]
}

Queries: By tag, by author, full-text search, recent articles Choice: MongoDB (rich queries), Couchbase (caching + documents)

Product Catalog

{
    "_id": "sku_12345",
    "name": "Wireless Headphones",
    "category": ["electronics", "audio"],
    "price": 149.99,
    "attributes": {
        "battery_life": "40 hours",
        "driver_size": "40mm"
    }
}

Queries: By category, price range, attribute filters, text search Choice: MongoDB, Elasticsearch (if search is primary)

User Profiles

{
    "_id": "user_12345",
    "email": "alice@example.com",
    "preferences": {"theme": "dark", "notifications": true},
    "sessions": [{"device": "mobile", "last_active": "..."}]
}

Access pattern: Usually single-document reads/writes Choice: MongoDB, DynamoDB (if simple access patterns)

Document Database Selection Guide
Requirement	Best Choice	Why
Rich queries, aggregations, transactions	MongoDB	Most complete feature set, ACID transactions
Real-time mobile sync, offline-first	Firestore, CouchDB	Built-in sync, conflict resolution
Hybrid caching + document	Couchbase	Memcached-compatible caching layer built-in
AWS ecosystem, serverless	DynamoDB	Managed, auto-scaling, tight AWS integration
Search-first with documents	Elasticsearch	Optimized for full-text search, analytics

Selecting Column-Family Databases

Column-family databases are appropriate for time-series data, high write throughput, and workloads with well-defined query patterns.

Decision Criteria for Column-Family Databases

Choose Column-Family When

•Write throughput is critical — Ingesting millions of events per second.
•Data has time-series characteristics — Metrics, logs, sensor readings, event streams.
•Data is wide and sparse — Many optional columns that vary by record.
•Query patterns are known upfront — Can model data specifically for queries.
•Massive scale is required — Petabytes of data across hundreds of nodes.
•Eventually consistency is acceptable — Tunable per-query, but AP-default.

Common Column-Family Use Cases

IoT Sensor Data

Primary Key: (device_id), timestamp
Columns: sensor readings, status flags, metadata
Query: "Last 24 hours of readings for device X"
Write: Append-only, millions of records/second
Choice: Cassandra, ScyllaDB, TimescaleDB

Metrics and Monitoring

Primary Key: (metric_name, time_bucket), timestamp
Columns: value, tags, aggregates
Query: "Average CPU for server Y in last hour"
Write: High-cardinality metrics from thousands of hosts
Choice: Cassandra, TimescaleDB, InfluxDB

Activity Feeds

Primary Key: (user_id), timestamp
Columns: activity type, actor, object, metadata
Query: "Recent 50 activities for user X"
Write: Fan-out events across millions of users
Choice: Cassandra (used by Instagram, Netflix)

Messaging and Chat History

Primary Key: (conversation_id), message_timestamp
Columns: sender, content, attachments, read_status
Query: "Messages in conversation X, last 100"
Write: Real-time message delivery
Choice: Cassandra, ScyllaDB

Column-Family Database Selection Guide
Requirement	Best Choice	Why
General-purpose wide-column, proven scale	Apache Cassandra	Battle-tested at Netflix, Apple; large community
Cassandra-compatible, higher performance	ScyllaDB	C++ reimplementation, 10x performance claims
Hadoop ecosystem integration	HBase	Built on HDFS, integrates with Spark, Hive
Managed, Google-scale	Cloud Bigtable	Managed, integrates with GCP data ecosystem
Purpose-built time-series	TimescaleDB, InfluxDB	Optimized for time-series queries, retention policies

Column-Family Requires Expertise

Selecting Graph Databases

Graph databases are appropriate when relationships between entities are the primary focus of queries.

Decision Criteria for Graph Databases

Choose Graph DB When

•Queries traverse relationships — Friends-of-friends, shortest path, pattern matching.
•Relationship depth varies — Variable-length paths are common queries.
•Data is naturally graph-shaped — Social networks, knowledge bases, network topologies.
•Recommendations are central — "People who liked X also liked Y" patterns.
•Fraud detection, impact analysis — Finding connected patterns across entities.

Common Graph Database Use Cases

Social Network Features

Nodes: Person, Post, Group, Event
Edges: FOLLOWS, LIKES, MEMBER_OF, ATTENDS
Queries: 
  - Friends-of-friends not yet connected
  - Influencer identification (high-degree nodes)
  - Community detection
Choice: Neo4j (feature-rich), Neptune (managed AWS)
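
The "friends-of-friends not yet connected" query can be sketched over a plain adjacency map. This only illustrates the traversal logic; a graph database would express it in its own query language. The sample graph is hypothetical and treats the relationship as symmetric.

```python
def friends_of_friends(graph: dict[str, set[str]], person: str) -> set[str]:
    """People two hops away who are not already direct connections."""
    direct = graph.get(person, set())
    candidates = set()
    for friend in direct:
        candidates |= graph.get(friend, set())
    # Drop existing connections and the person themselves.
    return candidates - direct - {person}


# Hypothetical symmetric friendship graph.
graph = {
    "ann": {"bob", "cara"},
    "bob": {"ann", "dan"},
    "cara": {"ann", "dan", "eve"},
    "dan": {"bob", "cara"},
    "eve": {"cara"},
}
suggestions = friends_of_friends(graph, "ann")
```

Each additional hop multiplies the number of nodes touched, which is why this kind of query becomes expensive as repeated joins in a tabular model.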

Recommendation Engine

Nodes: User, Product, Category
Edges: PURCHASED, VIEWED, SIMILAR_TO
Queries:
  - "Products bought by people who bought X"
  - "Shortest path between user preferences and product"
  - Collaborative filtering via graph
Choice: Neo4j, TigerGraph (analytics scale)
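
As a rough illustration of "products bought by people who bought X", the sketch below counts co-purchases over a small in-memory map. It is a simplified stand-in for a two-hop PURCHASED traversal; the purchase data and the `also_bought` helper are hypothetical.

```python
from collections import Counter

# Hypothetical PURCHASED edges: user -> set of products.
purchases = {
    "u1": {"laptop", "mouse"},
    "u2": {"laptop", "keyboard", "mouse"},
    "u3": {"laptop", "mouse"},
    "u4": {"phone"},
}


def also_bought(purchases: dict[str, set[str]], product: str, top_n: int = 3) -> list[tuple[str, int]]:
    """Rank other products by how many buyers of `product` also bought them."""
    counts = Counter()
    for items in purchases.values():
        if product in items:                   # hop 1: users who bought the product
            counts.update(items - {product})   # hop 2: everything else they bought
    return counts.most_common(top_n)


recommended = also_bought(purchases, "laptop")
```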

Knowledge Graph / Semantic Web

Nodes: Entity (Person, Place, Concept)
Edges: Relationships with types and properties
Queries:
  - "Find all people connected to Company X within 3 hops"
  - Pattern matching for entities
Choice: Neo4j, Neptune (RDF/SPARQL support)
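
A bounded-depth query such as "all people connected to Company X within 3 hops" is a breadth-first search with a hop limit. The sketch below shows that logic over an undirected adjacency map; the sample entities, the `node_type` lookup, and the function names are assumptions for illustration.

```python
from collections import deque


def within_hops(graph: dict[str, set[str]], start: str, max_hops: int) -> dict[str, int]:
    """Breadth-first search returning {node: hop distance} for nodes within max_hops."""
    distances = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distances[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in graph.get(node, ()):
            if neighbor not in distances:
                distances[neighbor] = distances[node] + 1
                queue.append(neighbor)
    del distances[start]
    return distances


# Hypothetical knowledge graph: a chain from Company X through four people.
graph = {
    "company_x": {"ann"},
    "ann": {"company_x", "bob"},
    "bob": {"ann", "cara"},
    "cara": {"bob", "dan"},
    "dan": {"cara"},
}
node_type = {"company_x": "Company", "ann": "Person", "bob": "Person",
             "cara": "Person", "dan": "Person"}

people_nearby = {
    name: hops
    for name, hops in within_hops(graph, "company_x", max_hops=3).items()
    if node_type[name] == "Person"
}
```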

Fraud Detection

Nodes: Account, Device, Transaction, IP Address
Edges: USES, TRANSACTED_WITH, LOGGED_FROM
Queries:
  - "Find accounts sharing devices with known fraudsters"
  - Ring detection among accounts
  - Abnormal relationship patterns
Choice: Neo4j, TigerGraph, Amazon Neptune
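
The first query, "find accounts sharing devices with known fraudsters", amounts to following USES edges from a flagged account to a device and back out to other accounts. A minimal in-memory sketch, with hypothetical account and device identifiers:

```python
from collections import defaultdict

# Hypothetical USES edges: (account, device).
uses = [
    ("acct_1", "dev_a"), ("acct_2", "dev_a"),
    ("acct_3", "dev_b"), ("acct_4", "dev_b"),
    ("acct_4", "dev_c"), ("acct_5", "dev_c"),
]
known_fraudsters = {"acct_1"}


def accounts_sharing_devices(uses, known_fraudsters):
    """Map each non-flagged account to the devices it shares with a known fraudster."""
    accounts_by_device = defaultdict(set)
    for account, device in uses:
        accounts_by_device[device].add(account)

    flagged = defaultdict(set)
    for device, accounts in accounts_by_device.items():
        if accounts & known_fraudsters:  # device touched by a known fraudster
            for account in accounts - known_fraudsters:
                flagged[account].add(device)
    return dict(flagged)


suspects = accounts_sharing_devices(uses, known_fraudsters)
```

This covers only the direct, one-device case. Ring detection and multi-hop patterns require deeper traversals, which is where a graph engine earns its place.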

Graph Database Selection Guide
Requirement	Best Choice	Why
Enterprise graph, rich features	Neo4j	Most mature, Cypher language, great tooling
AWS managed, multi-model	Amazon Neptune	Managed, supports Gremlin and SPARQL
Real-time analytics at scale	TigerGraph	Optimized for iterative analytics, massive graphs
Multi-model (document + graph)	ArangoDB	Unified AQL for documents and graphs
Open source, distributed	JanusGraph	Supports Cassandra/HBase backends, TinkerPop standard

Real-World Selection Examples

Let's walk through realistic decision processes for common scenarios.

Example 1: E-Commerce Platform

Requirements:

Product catalog: 1M products, varied attributes
Order processing: ACID transactions, inventory consistency
User sessions: Sub-millisecond authentication checks
Product search: Full-text, faceted filtering
Recommendations: "Similar products" based on purchase history

Decision:

PostgreSQL: Orders, inventory (transactions required)
MongoDB: Product catalog (flexible schemas, rich queries)
Redis: Session cache (speed critical)
Elasticsearch: Search (purpose-built for full-text)
Neo4j or PostgreSQL: Recommendations (depends on complexity)

Rationale: This is polyglot persistence, in which each database handles what it does best. Alternatively, use PostgreSQL for everything if scale is modest and the team prefers simplicity.

Example 2: IoT Analytics Platform

Requirements:

Sensor ingestion: 100K devices, 10 readings/second each = 1M writes/second
Time-series queries: "Average temperature for device X in last 24 hours"
Real-time alerting: Detecting anomalies as they happen
Device metadata: Configuration, location, ownership
Historical analytics: Aggregate queries across months of data

Decision:

Cassandra or TimescaleDB: Time-series sensor data (write throughput, time-range queries)
Redis: Real-time alerting state, recent values cache
PostgreSQL: Device metadata (relational, rarely changes)
Spark + Cassandra: Historical analytics (batch processing)

Rationale: The write throughput requirement rules out a conventional single-node RDBMS for sensor data. Cassandra's partitioned, append-friendly design fits time-series writes and time-range reads naturally.
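
The ingestion figures in the requirements can be checked with a quick back-of-the-envelope calculation. The device count and reading rate come from the requirements above; the 50-byte reading size is an assumption chosen only to illustrate the storage arithmetic, and it ignores replication, indexes, and compression.

```python
DEVICES = 100_000
READINGS_PER_DEVICE_PER_SECOND = 10
SECONDS_PER_DAY = 86_400
BYTES_PER_READING = 50  # assumption: raw size of one reading

writes_per_second = DEVICES * READINGS_PER_DEVICE_PER_SECOND   # 1,000,000
rows_per_day = writes_per_second * SECONDS_PER_DAY             # 86,400,000,000
raw_bytes_per_day = rows_per_day * BYTES_PER_READING
raw_tb_per_day = raw_bytes_per_day / 1e12                      # roughly 4.32 TB/day

print(f"{writes_per_second:,} writes/s, {rows_per_day:,} rows/day, "
      f"{raw_tb_per_day:.2f} TB/day raw")
```

Under these assumptions the platform accumulates tens of billions of rows per day, which is why retention policies and downsampling matter as much as raw write throughput.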

Example 3: Social Media Startup

Requirements:

User profiles: Varied attributes, profile customization
Social graph: Following, blocking, mutual friends queries
Activity feed: "What are my friends doing?" reverse-chronological
Messaging: Direct messages, group chats
Scale: Plan for 10M users, must scale to 100M

Decision:

MongoDB: User profiles (flexible schema, rich queries)
Neo4j or Cassandra: Social graph (Neo4j if graph queries dominate, Cassandra if scale dominates)
Cassandra: Activity feed (fan-out write, time-ordered reads)
Cassandra: Messaging (time-ordered, partition by conversation)
Redis: Caching layer throughout

Rationale: At startup scale, MongoDB could handle the social graph, but planning for 100M users pushes toward specialized solutions. Instagram, for example, has publicly described running feed and activity workloads on Cassandra.

Start Simple, Specialize Later

Anti-Patterns and Pitfalls

Learning from common mistakes is as valuable as understanding best practices.

Database Selection Anti-Patterns

•Trend-driven selection — Choosing a database because it's popular or a famous company uses it. What works for Google may not work for you.
•Resume-driven development — Selecting technology to learn something new rather than to solve the problem optimally.
•Premature specialization — Using 5 databases when 1 would suffice. Operational complexity has real costs.
•Ignoring operational reality — Choosing a database without considering who will operate it. Self-hosted distributed databases require expertise.
•Underestimating relational — Dismissing RDBMS as 'legacy' when PostgreSQL with JSON columns might solve the problem.
•Overestimating scale needs — Designing for Google scale when you have 1,000 users. Optimize for current reality plus 10x, not 10,000x.
•Ignoring consistency needs — Accepting eventual consistency without understanding the application-level implications.
•Data model mismatch — Forcing a graph into documents or documents into tables because that's what you know.

The Hidden Cost of Complexity

Summary: Principled Database Selection

We've established frameworks and heuristics for matching databases to requirements. The key is systematic analysis, not intuition or trends.

Key Takeaways

•Start with access patterns — How you query data is the primary driver of database selection.
•Don't dismiss relational — PostgreSQL remains excellent for many workloads; NoSQL isn't always the answer.
•Match the data model — Use key-value for key-addressed data, documents for semi-structured, graphs for relationships, column-family for time-series.
•Consider consistency requirements — The right choice depends on whether eventual consistency is acceptable.
•Factor in operations — A database your team can't operate well is worse than a suboptimal one they understand.
•Start simple, specialize when needed — Begin with one database; add specialized ones when concrete requirements demand it.
•Polyglot has costs — Multiple databases means multiple systems to manage; justify the complexity.
