Every system has data that must persist beyond a single request: user profiles, orders, transactions, messages, analytics. How you store this data—which database technologies you choose—has profound implications for performance, scalability, consistency, and operational complexity.
Database selection is one of the hardest architectural decisions to change later. Migrating data between different database types is expensive, risky, and often requires significant downtime or complex synchronization logic. Making the right choice upfront—or at least an informed choice—prevents painful migrations down the road.
This page covers the landscape of database technologies, the criteria for selection, and guidance for matching database types to different use cases. You'll learn to think systematically about data storage decisions rather than defaulting to familiar technologies.
By the end of this page, you will be able to:

- Understand the major categories of databases and their tradeoffs
- Evaluate databases based on data models, access patterns, and consistency requirements
- Apply selection criteria to common system design scenarios
- Recognize when to use polyglot persistence (multiple database types)
Before diving into selection criteria, let's survey the major database categories and what they're designed for.
Examples: PostgreSQL, MySQL, SQL Server, Oracle
Data model: Tables with rows and columns, relationships through foreign keys, schema-enforced structure.
Strengths: ACID transactions, strong consistency, expressive ad-hoc queries (joins, aggregations), mature tooling, and decades of operational knowledge.
Weaknesses: horizontal scaling is hard, schema changes require migrations, and joins grow expensive at very large scale.
Best for: Transactional systems, complex queries, data with clear relationships, systems requiring strong consistency.
Examples: MongoDB, CouchDB, Amazon DocumentDB
Data model: JSON/BSON documents with nested structures, flexible schema.
Strengths: flexible schema, data stored the way applications consume it (nested structures), straightforward horizontal scaling, and fast development iteration.
Weaknesses: limited join support, denormalization duplicates data that must be kept in sync, and weaker cross-document transaction guarantees.
Best for: Content management, catalogs, user profiles, data with variable structure, rapid development.
Examples: Redis, Amazon DynamoDB, Memcached, etcd
Data model: Simple key → value pairs; values can be strings, JSON, or binary.
Strengths: extremely fast lookups (often sub-millisecond), a trivially simple data model, and easy horizontal scaling.
Weaknesses: access is by key only (no secondary queries or relationships), and value contents are opaque to the database.
Best for: Caching, session storage, shopping carts, rate limiting, feature flags.
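To make this concrete, here is a minimal session-storage sketch using the redis-py client. It assumes a Redis instance on localhost; the key names and TTL are illustrative, not prescriptive.

```python
import json
import redis

# Assumes a local Redis instance; connection details are illustrative.
r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# Store a session as a JSON string with a 30-minute TTL.
session = {"user_id": "user_456", "roles": ["member"]}
r.set("session:abc123", json.dumps(session), ex=1800)

# Retrieve by key -- the only access path a key-value store offers.
raw = r.get("session:abc123")
if raw is not None:
    print(json.loads(raw))
```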
Examples: Cassandra, HBase, Google Bigtable, ScyllaDB
Data model: Rows with potentially many columns, column families, designed for sparse data.
Strengths: massive write throughput, near-linear horizontal scaling, tunable consistency, and efficient storage of sparse, time-ordered data.
Weaknesses: queries must follow the partition key design, no joins or ad-hoc queries, and query-first data modeling is hard to change later.
Best for: Time-series data, IoT sensor data, logging, recommendation data, high-volume writes.
Examples: Neo4j, Amazon Neptune, JanusGraph, TigerGraph
Data model: Nodes and edges with properties, optimized for relationship traversal.
Strengths: relationship traversals stay fast regardless of total data size, expressive pattern-matching query languages (e.g., Cypher), and natural modeling of connected data.
Weaknesses: harder to scale horizontally, smaller ecosystems, and a poor fit for simple tabular or aggregate-heavy workloads.
Best for: Social networks, fraud detection, knowledge graphs, recommendation engines, network analysis.
| Category | Data Model | Scaling | Consistency | Query Pattern |
|---|---|---|---|---|
| Relational | Tables/Rows | Vertical (primarily) | Strong (ACID) | Complex SQL |
| Document | JSON documents | Horizontal | Tunable | By field, flexible |
| Key-Value | Key → Value | Horizontal | Varies | By key only |
| Wide-Column | Rows/Columns | Horizontal | Eventual (tunable) | By partition key |
| Graph | Nodes/Edges | Varies | Varies | Traversals, patterns |
Selecting a database requires evaluating multiple dimensions. Different criteria matter more for different systems.
Question: How does my data naturally structure itself? Tabular rows with relationships suggest relational; self-contained nested records suggest documents; simple lookups suggest key-value; densely connected data suggests a graph.
Question: How will data be read and written? Consider the read/write ratio, whether access is by key, by arbitrary fields, or by traversal, and whether writes are append-only.
Question: How critical is immediate consistency? Payments and inventory usually demand strong consistency; feeds, caches, and analytics often tolerate eventual consistency.
Question: How much data and traffic do you expect? Gigabytes fit almost anywhere; terabytes with heavy write traffic push you toward horizontally scalable stores.
Question: What can your team operate effectively? A familiar database run well usually beats an exotic one run poorly; managed services can shift this calculus.
Relational databases remain the default choice for many systems, and for good reason. Let's explore when they excel and their limitations.
Strong consistency is essential: payments, inventory, and bookings, where a stale read or lost write causes real harm.
Data has complex relationships: many-to-many associations and multi-entity queries map naturally onto foreign keys and joins.
Query flexibility is needed: ad-hoc reporting and evolving access patterns benefit from SQL's expressiveness.
Schema enforcement is desired: the database rejects malformed data instead of letting it silently accumulate.
Vertical scaling (scale-up): move to a bigger machine; simple, but it has a ceiling and remains a single point of failure.
Read replicas: copies of the primary that serve read traffic; writes still go to the primary, and replication lag introduces slight staleness.
Connection pooling: reuse a fixed set of database connections instead of opening one per request (see the sketch after this list).
Sharding (horizontal partitioning): split data across servers by key; it scales writes but adds significant application and operational complexity.
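Here is a small connection-pooling sketch using psycopg2's built-in pool. The DSN and pool sizes are placeholder values to be tuned per workload.

```python
from psycopg2 import pool

# Placeholder DSN; pool sizes depend on workload and database limits.
db_pool = pool.SimpleConnectionPool(
    2, 10, dsn="dbname=app user=app password=secret host=localhost"
)

conn = db_pool.getconn()  # borrow a connection instead of opening a new one
try:
    with conn.cursor() as cur:
        cur.execute("SELECT 1")
        print(cur.fetchone())
finally:
    db_pool.putconn(conn)  # return it to the pool for reuse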
| Database | Strengths | Managed Options | Best For |
|---|---|---|---|
| PostgreSQL | Feature-rich, extensible, JSON support, advanced types | RDS, Cloud SQL, Aurora, Supabase | General purpose, geospatial, analytics |
| MySQL | Widely deployed, simple, fast reads, mature ecosystem | RDS, Cloud SQL, Aurora, PlanetScale | Web applications, read-heavy workloads |
| SQL Server | Enterprise features, Windows integration, BI tools | Azure SQL, RDS | Enterprise, Microsoft ecosystem |
| CockroachDB | Distributed SQL, global scale, PostgreSQL compatible | CockroachDB Cloud | Global apps needing SQL semantics |
When in doubt, PostgreSQL is an excellent default. It handles JSON documents, geospatial data, full-text search, and traditional relational workloads. It's battle-tested, well-documented, and has extensive managed service options.
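For a taste of that versatility, here is a sketch (via psycopg2, with invented table and column names) showing a JSONB attribute query and basic full-text search against the same PostgreSQL table.

```python
import psycopg2

conn = psycopg2.connect("dbname=app user=app host=localhost")  # placeholder DSN
with conn, conn.cursor() as cur:
    # Query a flexible JSONB column with the containment operator.
    cur.execute(
        "SELECT name FROM products WHERE attributes @> %s::jsonb",
        ('{"color": "red"}',),
    )
    print(cur.fetchall())

    # Basic full-text search using built-in tsvector/tsquery functions.
    cur.execute(
        "SELECT name FROM products "
        "WHERE to_tsvector('english', description) @@ plainto_tsquery('english', %s)",
        ("wireless widget",),
    )
    print(cur.fetchall())
```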
"NoSQL" encompasses diverse database types. Let's examine when each type shines.
When to choose: entities are self-contained with variable structure, you typically read and write whole records, and the schema evolves quickly.
Design considerations: model around access patterns, embed data that is read together, and accept some duplication in exchange for single-read access.
Example fit: E-commerce product catalog with variable attributes per category
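A minimal pymongo sketch illustrates the fit: two products in the same collection carry entirely different attributes, and both remain queryable. The collection and field names are invented for the example.

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # assumes a local MongoDB
catalog = client.shop.products

# Documents in one collection can have different shapes per category.
catalog.insert_one({"name": "T-Shirt", "category": "apparel",
                    "attributes": {"size": "M", "color": "red"}})
catalog.insert_one({"name": "SSD", "category": "electronics",
                    "attributes": {"capacity_gb": 512, "interface": "NVMe"}})

# Query into nested, optional fields without schema migrations.
for doc in catalog.find({"attributes.color": "red"}):
    print(doc["name"])
```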
When to choose: access is always by a known key, latency matters more than query flexibility, and values are self-contained.
Redis specifics: in-memory with optional persistence, rich value types (strings, hashes, lists, sets, sorted sets), built-in TTL expiration, and pub/sub.
DynamoDB specifics: fully managed, effectively unlimited scale, single-digit-millisecond latency, per-request pricing, and optional strongly consistent reads.
When to choose: write volume is enormous, data is naturally time-ordered or partitionable, and the query patterns are known in advance.
Design considerations: design tables around queries (query-first modeling), choose partition keys that spread load evenly, and avoid unbounded partition growth.
Example fit: IoT sensor data (billions of readings), messaging history, audit logs
When to choose: the questions you ask are about relationships (paths, neighborhoods, patterns) rather than individual records.
Design considerations: model entities as nodes and interactions as typed edges; traversals that would require many SQL self-joins are the sweet spot.
Example fit: Social network friend-of-friend queries, recommendation graph, fraud ring detection
| Use Case | Best Choice | Rationale |
|---|---|---|
| Session storage | Redis | Fast, TTL support, simple key-value |
| User profiles | MongoDB / DynamoDB | Flexible schema, document-oriented |
| Leaderboards | Redis | Sorted sets, O(log n) operations |
| Time-series metrics | Cassandra / TimescaleDB | High write throughput, time-ordered |
| Social connections | Neo4j | Relationship traversals, graph queries |
| Shopping cart | Redis / DynamoDB | Fast access, simple structure |
| Activity feeds | Cassandra | Write-heavy, time-ordered access |
| Search index | Elasticsearch | Full-text search, aggregations |
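The leaderboard row above is worth unpacking: Redis sorted sets keep members ordered by score, so both updates and top-N reads stay cheap. A small sketch (key and member names invented):

```python
import redis

r = redis.Redis(decode_responses=True)  # assumes a local Redis

# Add or update player scores; ZADD is O(log n) per member.
r.zadd("leaderboard", {"alice": 3200, "bob": 2950, "carol": 4100})
r.zincrby("leaderboard", 150, "bob")  # bob gains 150 points

# Top 3 players, highest score first.
print(r.zrevrange("leaderboard", 0, 2, withscores=True))
```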
Beyond the major categories, specialized databases address specific use cases with optimized designs.
Examples: InfluxDB, TimescaleDB, Prometheus, QuestDB
Optimized for: high-rate appends of timestamped data, time-range queries, downsampling, retention policies, and compression.
Use cases: Infrastructure monitoring, IoT analytics, financial tick data
Examples: Elasticsearch, OpenSearch, Meilisearch, Algolia
Optimized for: tokenized full-text search, relevance ranking, fuzzy matching, faceting, and aggregations.
Use cases: E-commerce search, log analysis (ELK stack), content search
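As a sketch of what this looks like in practice (using the official Elasticsearch Python client's 8.x-style API; the index and field names are invented):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumes a local cluster

# Index a document; Elasticsearch analyzes text fields for search.
es.index(index="products", id="1",
         document={"name": "Wireless Widget",
                   "description": "A handy wireless widget"})

# Full-text match query with relevance-ranked results.
resp = es.search(index="products", query={"match": {"description": "wireless"}})
for hit in resp["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["name"])
```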
Examples: ClickHouse, Snowflake, BigQuery, Redshift, Apache Druid
Optimized for: columnar storage and analytical (OLAP) queries that scan and aggregate billions of rows.
Use cases: Business intelligence, data warehousing, real-time analytics dashboards
Examples: Apache Kafka, Pulsar, Amazon Kinesis
Optimized for: append-only event logs, high-throughput publish/subscribe, durable ordering within partitions, and replayable streams.
Use cases: Event-driven architecture, stream processing, change data capture
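A minimal producer sketch using the kafka-python client (the topic name and payload are illustrative):

```python
import json
from kafka import KafkaProducer

# Assumes a local broker; events are serialized as JSON bytes.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Append an event to the log; consumers can replay it later.
producer.send("order-events", {"order_id": "order_123", "status": "shipped"})
producer.flush()  # block until the event is actually delivered
```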
Examples: Amazon S3, Google Cloud Storage, Azure Blob Storage, MinIO
Optimized for: large binary objects, effectively unlimited capacity, very high durability, and low cost per gigabyte.
Use cases: Media storage, backups, data lakes, static assets
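A common pattern is uploading once and handing out time-limited download links. A boto3 sketch (the bucket and key names are invented):

```python
import boto3

s3 = boto3.client("s3")  # assumes AWS credentials are configured

# Write-once: upload the object.
s3.upload_file("invoice.pdf", "my-app-assets", "invoices/order_123.pdf")

# Read-many: generate a presigned URL valid for one hour.
url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": "my-app-assets", "Key": "invoices/order_123.pdf"},
    ExpiresIn=3600,
)
print(url)
```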
Specialized databases can outperform general-purpose databases for their specific use cases by orders of magnitude. Using PostgreSQL for full-text search 'works', but Elasticsearch is dramatically faster and more capable at search scale. Using MySQL for time-series 'works', but InfluxDB handles the write volume and query patterns far more efficiently.
Modern systems often use multiple database technologies, each chosen for specific use cases. This approach is called polyglot persistence.
Different access patterns require different optimizations: the engine that makes transactional writes safe is rarely the one that makes full-text search or analytical scans fast.
Single-database limitations: forcing every workload into one engine means each workload gets, at best, a compromise.
┌─────────────────────────────────────────────────────────────┐
│ E-commerce Platform │
├─────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ PostgreSQL │ │ MongoDB │ │ Redis │ │
│ │ │ │ │ │ │ │
│ │ • Orders │ │ • Product │ │ • Sessions │ │
│ │ • Payments │ │ Catalog │ │ • Cart │ │
│ │ • Inventory │ │ • Reviews │ │ • Cache │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │Elasticsearch│ │ ClickHouse │ │ S3 │ │
│ │ │ │ │ │ │ │
│ │ • Product │ │ • Analytics │ │ • Product │ │
│ │ Search │ │ • Reports │ │ Images │ │
│ │ • Logs │ │ • BI Data │ │ • Invoices │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
│ │
└─────────────────────────────────────────────────────────────┘
Each database handles what it does best: PostgreSQL guards transactional data (orders, payments, inventory), MongoDB holds the flexible product catalog and reviews, Redis serves sessions and hot cache data, Elasticsearch powers product search, ClickHouse crunches analytics, and S3 stores images and invoices.
The hardest part of polyglot persistence is keeping data synchronized. If product data lives in both PostgreSQL (inventory) and Elasticsearch (search), you need robust synchronization—typically via events or CDC (Change Data Capture). Plan this carefully.
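One common shape for that synchronization is consuming change events and reindexing. The sketch below is purely illustrative: the topic name, event format, and handler logic are hypothetical, not the API of any specific CDC tool.

```python
import json
from kafka import KafkaConsumer
from elasticsearch import Elasticsearch

# Hypothetical setup: product changes from PostgreSQL arrive on a Kafka
# topic (e.g., emitted by a CDC pipeline) and are mirrored to Elasticsearch.
consumer = KafkaConsumer(
    "product-changes",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
es = Elasticsearch("http://localhost:9200")

for event in consumer:
    change = event.value  # assumed shape: {"id": ..., "op": ..., "doc": ...}
    if change["op"] == "delete":
        es.delete(index="products", id=change["id"])
    else:  # create or update: indexing by id acts as an upsert
        es.index(index="products", id=change["id"], document=change["doc"])
```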
The CAP theorem states that in a distributed system experiencing a network partition, you can have either Consistency or Availability, but not both.
Consistency (C): Every read receives the most recent write or an error.
Availability (A): Every request receives a response (not necessarily the latest data).
Partition tolerance (P): System continues operating despite network failures between nodes.
Network partitions happen. In distributed systems, P is non-negotiable. So the real choice is between:
CP systems: Prioritize consistency, may become unavailable during partitions
AP systems: Prioritize availability, may serve stale data during partitions
The PACELC theorem extends CAP: Even when the system is running normally (no partition), there's a trade-off between Latency and Consistency.
PACELC: If (P)artition, choose (A)vailability or (C)onsistency; Else, choose (L)atency or (C)onsistency.
Example categorizations appear in the table below.
Choose based on business reality: ask what hurts more, serving slightly stale data or refusing requests entirely during a partition.
| Category | During Partition | Examples | Use When |
|---|---|---|---|
| CP | Consistent but may be unavailable | PostgreSQL, MongoDB (strict), Google Spanner | Financial, inventory, coordination |
| AP | Available but may be inconsistent | Cassandra, DynamoDB, CouchDB | High availability critical, eventual consistency OK |
Many modern databases offer tunable consistency. Cassandra lets you choose consistency level per query. DynamoDB offers strong consistency reads. MongoDB lets you configure write and read concerns. This flexibility lets you make different trade-offs for different operations.
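For example, with the DataStax Python driver you can set Cassandra's consistency level per statement, and with boto3 you can request a strongly consistent DynamoDB read. The keyspace, table, and key names here are invented.

```python
import boto3
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

# Cassandra: require a quorum of replicas for this one query.
session = Cluster(["127.0.0.1"]).connect("shop")  # assumes a local node/keyspace
stmt = SimpleStatement(
    "SELECT balance FROM accounts WHERE id = %s",
    consistency_level=ConsistencyLevel.QUORUM,
)
row = session.execute(stmt, ("acct_1",)).one()

# DynamoDB: opt into a strongly consistent read for this one call.
table = boto3.resource("dynamodb").Table("accounts")
item = table.get_item(Key={"id": "acct_1"}, ConsistentRead=True)
```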
How you model data significantly impacts database choice and system performance. Different databases require different modeling approaches.
Principles: normalize to eliminate duplication, model entities as tables and relationships as foreign keys, and combine data with joins at query time.
Example: Order System
users: id, name, email
orders: id, user_id (FK), created_at, status
order_items: id, order_id (FK), product_id (FK), quantity, price
products: id, name, description, base_price
Joins combine these at query time for complete order views.
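A sketch of that query-time join via psycopg2; the columns mirror the schema above, and the DSN is a placeholder.

```python
import psycopg2

conn = psycopg2.connect("dbname=app host=localhost")  # placeholder DSN
with conn, conn.cursor() as cur:
    # Reassemble a complete order view from the normalized tables.
    cur.execute("""
        SELECT u.name, o.status, p.name, oi.quantity, oi.price
        FROM orders o
        JOIN users u        ON u.id = o.user_id
        JOIN order_items oi ON oi.order_id = o.id
        JOIN products p     ON p.id = oi.product_id
        WHERE o.id = %s
    """, ("order_123",))
    for row in cur.fetchall():
        print(row)
```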
Principles: denormalize and embed related data so a single document satisfies a read; duplicate selectively and update duplicates when the source changes.
Example: Order System
{
"_id": "order_123",
"user": {
"id": "user_456",
"name": "Alice",
"email": "alice@example.com"
},
"items": [
{
"productId": "prod_789",
"name": "Widget",
"quantity": 2,
"price": 29.99
}
],
"total": 59.98,
"status": "shipped"
}
No joins needed—one read gets everything. Trade-off: if user changes email, multiple documents need updating.
Principles: start from the queries, choose a partition key that groups data read together, and use clustering keys to order rows within a partition.
Example: User Activity Timeline
Table: user_activity
Partition Key: user_id
Clustering Key: timestamp (DESC)
Row: (user_123, 2024-01-15T10:00:00, "logged_in")
Row: (user_123, 2024-01-15T09:45:00, "viewed_product")
Query: "Get last 20 activities for user_123" is extremely fast.
Principles: model entities as nodes, relationships as typed edges, and attributes as properties on either; let queries follow the edges.
Example: Social Network
(User:Alice)-[:FOLLOWS]->(User:Bob)
(User:Bob)-[:FOLLOWS]->(User:Carol)
(User:Alice)-[:FOLLOWS]->(User:Carol)
Query: "Find friends of friends of Alice" traverses edges efficiently.
Let's apply database selection principles to a real-time messaging platform (like Slack or Discord).
Requirements: real-time message delivery, persistent and searchable message history, presence indicators, file attachments, and channels with memberships and permissions.
1. User and Channel Metadata
Characteristics: Relational, moderate volume, transactional operations
Access patterns: lookups by user or channel ID, membership and permission checks on every send, and occasional transactional updates (joining channels, changing roles).
Selection: PostgreSQL
2. Message Storage
Characteristics: Append-only, time-ordered, high volume, accessed by channel/conversation
Access patterns: append a message to a channel, then read that channel's recent messages in reverse chronological order, with pagination for scrollback.
Selection: Cassandra or ScyllaDB
Alternative: PostgreSQL with table partitioning works at smaller scale
3. Message Search
Characteristics: Full-text, relevance-ranked, filtered
Access patterns: keyword queries scoped to channels or users, filtered by date, ranked by relevance.
Selection: Elasticsearch
Sync strategy: CDC from Cassandra → Kafka → Elasticsearch
4. User Presence
Characteristics: Highly volatile, ephemeral, sub-second access
Access patterns: status updates on connect/disconnect, frequent heartbeat refreshes, and fan-out reads of who is online in a channel.
Selection: Redis
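A presence sketch using Redis key TTLs: a heartbeat refreshes the key, and expiry marks the user offline automatically. The key naming is illustrative.

```python
import redis

r = redis.Redis(decode_responses=True)  # assumes a local Redis

def heartbeat(user_id: str) -> None:
    # Refresh presence; if heartbeats stop, the key expires in 60 seconds.
    r.set(f"presence:{user_id}", "online", ex=60)

def is_online(user_id: str) -> bool:
    return r.exists(f"presence:{user_id}") == 1
```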
5. Message Attachments
Characteristics: Binary, large, write-once-read-many
Access patterns: upload once, then many downloads from geographically distributed clients.
Selection: S3 + CDN
┌─────────────────────────────────────────────────────────────────┐
│                       MESSAGING PLATFORM                         │
│                                                                  │
│  ┌─────────────────┐  ┌────────────────┐  ┌───────────────┐     │
│  │   PostgreSQL    │  │   Cassandra    │  │     Redis     │     │
│  │                 │  │                │  │               │     │
│  │ • Users         │  │ • Messages     │  │ • Presence    │     │
│  │ • Channels      │  │ • Reactions    │  │ • Sessions    │     │
│  │ • Memberships   │  │ • Read markers │  │ • Rate limits │     │
│  │ • Permissions   │  │                │  │ • Caching     │     │
│  └─────────────────┘  └────────────────┘  └───────────────┘     │
│                                                                  │
│  ┌─────────────────┐  ┌────────────────┐  ┌───────────────┐     │
│  │  Elasticsearch  │  │     Kafka      │  │      S3       │     │
│  │                 │  │                │  │               │     │
│  │ • Message       │  │ • Events       │  │ • Files       │     │
│  │   search        │  │ • CDC stream   │  │ • Images      │     │
│  │ • Audit logs    │  │ • Sync         │  │ • Avatars     │     │
│  └─────────────────┘  └────────────────┘  └───────────────┘     │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘

Database selection is a critical architectural decision with long-lasting implications. Let's consolidate the key principles:

- Start from data characteristics and access patterns, not from familiar technology.
- When in doubt, a solid relational database (PostgreSQL) is a safe default.
- Reach for specialized databases when workloads like search, time-series, or analytics demand them.
- Polyglot persistence is powerful, but budget for data synchronization and added operational load.
- Understand each database's consistency trade-offs (CAP/PACELC) and match them to business reality.
- Favor technologies your team can operate effectively.
Module Complete:
You've now covered all aspects of high-level design: component identification, system architecture diagrams, data flow diagrams, API design, and database selection. Together, these skills enable you to translate requirements into a coherent, implementable system architecture—the core deliverable of the high-level design phase.
You now understand how to evaluate and select database technologies for different system requirements. This skill—matching storage solutions to data characteristics and access patterns—is fundamental to building systems that perform well at scale.