Every system has data that must persist beyond a single request: user profiles, orders, transactions, messages, analytics. How you store this data—which database technologies you choose—has profound implications for performance, scalability, consistency, and operational complexity.
Database selection is one of the hardest architectural decisions to change later. Migrating data between different database types is expensive, risky, and often requires significant downtime or complex synchronization logic. Making the right choice upfront—or at least an informed choice—prevents painful migrations down the road.
This page covers the landscape of database technologies, the criteria for selection, and guidance for matching database types to different use cases. You'll learn to think systematically about data storage decisions rather than defaulting to familiar technologies.
By the end of this page, you will be able to:

- Understand the major categories of databases and their tradeoffs
- Evaluate databases based on data models, access patterns, and consistency requirements
- Apply selection criteria to common system design scenarios
- Recognize when to use polyglot persistence (multiple database types)
Before diving into selection criteria, let's survey the major database categories and what they're designed for.
Examples: PostgreSQL, MySQL, SQL Server, Oracle
Data model: Tables with rows and columns, relationships through foreign keys, schema-enforced structure.
Strengths: ACID transactions, strong consistency, expressive ad-hoc queries (joins, aggregations), mature tooling, and decades of operational knowledge.
Weaknesses: horizontal scaling is hard, schema changes require migrations, and joins grow expensive at very large scale.
Best for: Transactional systems, complex queries, data with clear relationships, systems requiring strong consistency.
Examples: MongoDB, CouchDB, Amazon DocumentDB
Data model: JSON/BSON documents with nested structures, flexible schema.
Strengths: flexible schema, data stored the way applications consume it (nested structures), straightforward horizontal scaling, and fast development iteration.
Weaknesses: limited join support, denormalization duplicates data that must be kept in sync, and weaker cross-document transaction guarantees.
Best for: Content management, catalogs, user profiles, data with variable structure, rapid development.
Examples: Redis, Amazon DynamoDB, Memcached, etcd
Data model: Simple key → value pairs; values can be strings, JSON, or binary.
Strengths: extremely fast lookups (often sub-millisecond), a trivially simple data model, and easy horizontal scaling.
Weaknesses: access is by key only (no secondary queries or relationships), and value contents are opaque to the database.
Best for: Caching, session storage, shopping carts, rate limiting, feature flags.
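To make this concrete, here is a minimal session-storage sketch using the redis-py client. It assumes a Redis instance on localhost; the key names and TTL are illustrative, not prescriptive.

```python
import json
import redis

# Assumes a local Redis instance; connection details are illustrative.
r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# Store a session as a JSON string with a 30-minute TTL.
session = {"user_id": "user_456", "roles": ["member"]}
r.set("session:abc123", json.dumps(session), ex=1800)

# Retrieve by key -- the only access path a key-value store offers.
raw = r.get("session:abc123")
if raw is not None:
    print(json.loads(raw))
```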
Examples: Cassandra, HBase, Google Bigtable, ScyllaDB
Data model: Rows with potentially many columns, column families, designed for sparse data.
Strengths: massive write throughput, near-linear horizontal scaling, tunable consistency, and efficient storage of sparse, time-ordered data.
Weaknesses: queries must follow the partition key design, no joins or ad-hoc queries, and query-first data modeling is hard to change later.
Best for: Time-series data, IoT sensor data, logging, recommendation data, high-volume writes.
Examples: Neo4j, Amazon Neptune, JanusGraph, TigerGraph
Data model: Nodes and edges with properties, optimized for relationship traversal.
Strengths: relationship traversals stay fast regardless of total data size, expressive pattern-matching query languages (e.g., Cypher), and natural modeling of connected data.
Weaknesses: harder to scale horizontally, smaller ecosystems, and a poor fit for simple tabular or aggregate-heavy workloads.
Best for: Social networks, fraud detection, knowledge graphs, recommendation engines, network analysis.
| Category | Data Model | Scaling | Consistency | Query Pattern |
|---|---|---|---|---|
| Relational | Tables/Rows | Vertical (primarily) | Strong (ACID) | Complex SQL |
| Document | JSON documents | Horizontal | Tunable | By field, flexible |
| Key-Value | Key → Value | Horizontal | Varies | By key only |
| Wide-Column | Rows/Columns | Horizontal | Eventual (tunable) | By partition key |
| Graph | Nodes/Edges | Varies | Varies | Traversals, patterns |
Selecting a database requires evaluating multiple dimensions. Different criteria matter more for different systems.
Question: How does my data naturally structure itself? Tabular rows with relationships suggest relational; self-contained nested records suggest documents; simple lookups suggest key-value; densely connected data suggests a graph.
Question: How will data be read and written? Consider the read/write ratio, whether access is by key, by arbitrary fields, or by traversal, and whether writes are append-only.
Question: How critical is immediate consistency? Payments and inventory usually demand strong consistency; feeds, caches, and analytics often tolerate eventual consistency.
Question: How much data and traffic do you expect? Gigabytes fit almost anywhere; terabytes with heavy write traffic push you toward horizontally scalable stores.
Question: What can your team operate effectively? A familiar database run well usually beats an exotic one run poorly; managed services can shift this calculus.
Relational databases remain the default choice for many systems, and for good reason. Let's explore when they excel and their limitations.
Strong consistency is essential: payments, inventory, and bookings, where a stale read or lost write causes real harm.
Data has complex relationships: many-to-many associations and multi-entity queries map naturally onto foreign keys and joins.
Query flexibility is needed: ad-hoc reporting and evolving access patterns benefit from SQL's expressiveness.
Schema enforcement is desired: the database rejects malformed data instead of letting it silently accumulate.
Vertical scaling (scale-up): move to a bigger machine; simple, but it has a ceiling and remains a single point of failure.
Read replicas: copies of the primary that serve read traffic; writes still go to the primary, and replication lag introduces slight staleness.
Connection pooling: reuse a fixed set of database connections instead of opening one per request (see the sketch after this list).
Sharding (horizontal partitioning): split data across servers by key; it scales writes but adds significant application and operational complexity.
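Here is a small connection-pooling sketch using psycopg2's built-in pool. The DSN and pool sizes are placeholder values to be tuned per workload.

```python
from psycopg2 import pool

# Placeholder DSN; pool sizes depend on workload and database limits.
db_pool = pool.SimpleConnectionPool(
    2, 10, dsn="dbname=app user=app password=secret host=localhost"
)

conn = db_pool.getconn()  # borrow a connection instead of opening a new one
try:
    with conn.cursor() as cur:
        cur.execute("SELECT 1")
        print(cur.fetchone())
finally:
    db_pool.putconn(conn)  # return it to the pool for reuse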
| Database | Strengths | Managed Options | Best For |
|---|---|---|---|
| PostgreSQL | Feature-rich, extensible, JSON support, advanced types | RDS, Cloud SQL, Aurora, Supabase | General purpose, geospatial, analytics |
| MySQL | Widely deployed, simple, fast reads, mature ecosystem | RDS, Cloud SQL, Aurora, PlanetScale | Web applications, read-heavy workloads |
| SQL Server | Enterprise features, Windows integration, BI tools | Azure SQL, RDS | Enterprise, Microsoft ecosystem |
| CockroachDB | Distributed SQL, global scale, PostgreSQL compatible | CockroachDB Cloud | Global apps needing SQL semantics |
When in doubt, PostgreSQL is an excellent default. It handles JSON documents, geospatial data, full-text search, and traditional relational workloads. It's battle-tested, well-documented, and has extensive managed service options.
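For a taste of that versatility, here is a sketch (via psycopg2, with invented table and column names) showing a JSONB attribute query and basic full-text search against the same PostgreSQL table.

```python
import psycopg2

conn = psycopg2.connect("dbname=app user=app host=localhost")  # placeholder DSN
with conn, conn.cursor() as cur:
    # Query a flexible JSONB column with the containment operator.
    cur.execute(
        "SELECT name FROM products WHERE attributes @> %s::jsonb",
        ('{"color": "red"}',),
    )
    print(cur.fetchall())

    # Basic full-text search using built-in tsvector/tsquery functions.
    cur.execute(
        "SELECT name FROM products "
        "WHERE to_tsvector('english', description) @@ plainto_tsquery('english', %s)",
        ("wireless widget",),
    )
    print(cur.fetchall())
```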
"NoSQL" encompasses diverse database types. Let's examine when each type shines.
When to choose: entities are self-contained with variable structure, you typically read and write whole records, and the schema evolves quickly.
Design considerations: model around access patterns, embed data that is read together, and accept some duplication in exchange for single-read access.
Example fit: E-commerce product catalog with variable attributes per category
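A minimal pymongo sketch illustrates the fit: two products in the same collection carry entirely different attributes, and both remain queryable. The collection and field names are invented for the example.

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # assumes a local MongoDB
catalog = client.shop.products

# Documents in one collection can have different shapes per category.
catalog.insert_one({"name": "T-Shirt", "category": "apparel",
                    "attributes": {"size": "M", "color": "red"}})
catalog.insert_one({"name": "SSD", "category": "electronics",
                    "attributes": {"capacity_gb": 512, "interface": "NVMe"}})

# Query into nested, optional fields without schema migrations.
for doc in catalog.find({"attributes.color": "red"}):
    print(doc["name"])
```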
When to choose: access is always by a known key, latency matters more than query flexibility, and values are self-contained.
Redis specifics: in-memory with optional persistence, rich value types (strings, hashes, lists, sets, sorted sets), built-in TTL expiration, and pub/sub.
DynamoDB specifics: fully managed, effectively unlimited scale, single-digit-millisecond latency, per-request pricing, and optional strongly consistent reads.
When to choose: write volume is enormous, data is naturally time-ordered or partitionable, and the query patterns are known in advance.
Design considerations: design tables around queries (query-first modeling), choose partition keys that spread load evenly, and avoid unbounded partition growth.
Example fit: IoT sensor data (billions of readings), messaging history, audit logs
When to choose: the questions you ask are about relationships (paths, neighborhoods, patterns) rather than individual records.
Design considerations: model entities as nodes and interactions as typed edges; traversals that would require many SQL self-joins are the sweet spot.
Example fit: Social network friend-of-friend queries, recommendation graph, fraud ring detection
| Use Case | Best Choice | Rationale |
|---|---|---|
| Session storage | Redis | Fast, TTL support, simple key-value |
| User profiles | MongoDB / DynamoDB | Flexible schema, document-oriented |
| Leaderboards | Redis | Sorted sets, O(log n) operations |
| Time-series metrics | Cassandra / TimescaleDB | High write throughput, time-ordered |
| Social connections | Neo4j | Relationship traversals, graph queries |
| Shopping cart | Redis / DynamoDB | Fast access, simple structure |
| Activity feeds | Cassandra | Write-heavy, time-ordered access |
| Search index | Elasticsearch | Full-text search, aggregations |
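The leaderboard row above is worth unpacking: Redis sorted sets keep members ordered by score, so both updates and top-N reads stay cheap. A small sketch (key and member names invented):

```python
import redis

r = redis.Redis(decode_responses=True)  # assumes a local Redis

# Add or update player scores; ZADD is O(log n) per member.
r.zadd("leaderboard", {"alice": 3200, "bob": 2950, "carol": 4100})
r.zincrby("leaderboard", 150, "bob")  # bob gains 150 points

# Top 3 players, highest score first.
print(r.zrevrange("leaderboard", 0, 2, withscores=True))
```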
Beyond the major categories, specialized databases address specific use cases with optimized designs.
Examples: InfluxDB, TimescaleDB, Prometheus, QuestDB
Optimized for: high-rate appends of timestamped data, time-range queries, downsampling, retention policies, and compression.
Use cases: Infrastructure monitoring, IoT analytics, financial tick data
Examples: Elasticsearch, OpenSearch, Meilisearch, Algolia
Optimized for: tokenized full-text search, relevance ranking, fuzzy matching, faceting, and aggregations.
Use cases: E-commerce search, log analysis (ELK stack), content search
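As a sketch of what this looks like in practice (using the official Elasticsearch Python client's 8.x-style API; the index and field names are invented):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumes a local cluster

# Index a document; Elasticsearch analyzes text fields for search.
es.index(index="products", id="1",
         document={"name": "Wireless Widget",
                   "description": "A handy wireless widget"})

# Full-text match query with relevance-ranked results.
resp = es.search(index="products", query={"match": {"description": "wireless"}})
for hit in resp["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["name"])
```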
Examples: ClickHouse, Snowflake, BigQuery, Redshift, Apache Druid
Optimized for: columnar storage and analytical (OLAP) queries that scan and aggregate billions of rows.
Use cases: Business intelligence, data warehousing, real-time analytics dashboards
Examples: Apache Kafka, Pulsar, Amazon Kinesis
Optimized for: append-only event logs, high-throughput publish/subscribe, durable ordering within partitions, and replayable streams.
Use cases: Event-driven architecture, stream processing, change data capture
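A minimal producer sketch using the kafka-python client (the topic name and payload are illustrative):

```python
import json
from kafka import KafkaProducer

# Assumes a local broker; events are serialized as JSON bytes.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Append an event to the log; consumers can replay it later.
producer.send("order-events", {"order_id": "order_123", "status": "shipped"})
producer.flush()  # block until the event is actually delivered
```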
Examples: Amazon S3, Google Cloud Storage, Azure Blob Storage, MinIO
Optimized for: large binary objects, effectively unlimited capacity, very high durability, and low cost per gigabyte.
Use cases: Media storage, backups, data lakes, static assets
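A common pattern is uploading once and handing out time-limited download links. A boto3 sketch (the bucket and key names are invented):

```python
import boto3

s3 = boto3.client("s3")  # assumes AWS credentials are configured

# Write-once: upload the object.
s3.upload_file("invoice.pdf", "my-app-assets", "invoices/order_123.pdf")

# Read-many: generate a presigned URL valid for one hour.
url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": "my-app-assets", "Key": "invoices/order_123.pdf"},
    ExpiresIn=3600,
)
print(url)
```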
Specialized databases can outperform general-purpose databases for their specific use cases by orders of magnitude. Using PostgreSQL for full-text search 'works', but Elasticsearch is dramatically faster and more capable at search scale. Using MySQL for time-series 'works', but InfluxDB handles the write volume and query patterns far more efficiently.
Modern systems often use multiple database technologies, each chosen for specific use cases. This approach is called polyglot persistence.
Different access patterns require different optimizations: the engine that makes transactional writes safe is rarely the one that makes full-text search or analytical scans fast.
Single-database limitations: forcing every workload into one engine means each workload gets, at best, a compromise.
┌─────────────────────────────────────────────────────────────┐
│ E-commerce Platform │
├─────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ PostgreSQL │ │ MongoDB │ │ Redis │ │
│ │ │ │ │ │ │ │
│ │ • Orders │ │ • Product │ │ • Sessions │ │
│ │ • Payments │ │ Catalog │ │ • Cart │ │
│ │ • Inventory │ │ • Reviews │ │ • Cache │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │Elasticsearch│ │ ClickHouse │ │ S3 │ │
│ │ │ │ │ │ │ │
│ │ • Product │ │ • Analytics │ │ • Product │ │
│ │ Search │ │ • Reports │ │ Images │ │
│ │ • Logs │ │ • BI Data │ │ • Invoices │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
│ │
└─────────────────────────────────────────────────────────────┘
Each database handles what it does best: PostgreSQL guards transactional data (orders, payments, inventory), MongoDB holds the flexible product catalog and reviews, Redis serves sessions and hot cache data, Elasticsearch powers product search, ClickHouse crunches analytics, and S3 stores images and invoices.
The hardest part of polyglot persistence is keeping data synchronized. If product data lives in both PostgreSQL (inventory) and Elasticsearch (search), you need robust synchronization—typically via events or CDC (Change Data Capture). Plan this carefully.
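One common shape for that synchronization is consuming change events and reindexing. The sketch below is purely illustrative: the topic name, event format, and handler logic are hypothetical, not the API of any specific CDC tool.

```python
import json
from kafka import KafkaConsumer
from elasticsearch import Elasticsearch

# Hypothetical setup: product changes from PostgreSQL arrive on a Kafka
# topic (e.g., emitted by a CDC pipeline) and are mirrored to Elasticsearch.
consumer = KafkaConsumer(
    "product-changes",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
es = Elasticsearch("http://localhost:9200")

for event in consumer:
    change = event.value  # assumed shape: {"id": ..., "op": ..., "doc": ...}
    if change["op"] == "delete":
        es.delete(index="products", id=change["id"])
    else:  # create or update: indexing by id acts as an upsert
        es.index(index="products", id=change["id"], document=change["doc"])
```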
The CAP theorem states that in a distributed system experiencing a network partition, you can have either Consistency or Availability, but not both.
Consistency (C): Every read receives the most recent write or an error.
Availability (A): Every request receives a response (not necessarily the latest data).
Partition tolerance (P): System continues operating despite network failures between nodes.
Network partitions happen. In distributed systems, P is non-negotiable. So the real choice is between:
CP systems: Prioritize consistency, may become unavailable during partitions
AP systems: Prioritize availability, may serve stale data during partitions
The PACELC theorem extends CAP: Even when the system is running normally (no partition), there's a trade-off between Latency and Consistency.
PACELC: If (P)artition, choose (A)vailability or (C)onsistency; Else, choose (L)atency or (C)onsistency.
Example categorizations appear in the table below.
Choose based on business reality: ask what hurts more, serving slightly stale data or refusing requests entirely during a partition.
| Category | During Partition | Examples | Use When |
|---|---|---|---|
| CP | Consistent but may be unavailable | PostgreSQL, MongoDB (strict), Google Spanner | Financial, inventory, coordination |
| AP | Available but may be inconsistent | Cassandra, DynamoDB, CouchDB | High availability critical, eventual consistency OK |
Many modern databases offer tunable consistency. Cassandra lets you choose consistency level per query. DynamoDB offers strong consistency reads. MongoDB lets you configure write and read concerns. This flexibility lets you make different trade-offs for different operations.
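For example, with the DataStax Python driver you can set Cassandra's consistency level per statement, and with boto3 you can request a strongly consistent DynamoDB read. The keyspace, table, and key names here are invented.

```python
import boto3
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

# Cassandra: require a quorum of replicas for this one query.
session = Cluster(["127.0.0.1"]).connect("shop")  # assumes a local node/keyspace
stmt = SimpleStatement(
    "SELECT balance FROM accounts WHERE id = %s",
    consistency_level=ConsistencyLevel.QUORUM,
)
row = session.execute(stmt, ("acct_1",)).one()

# DynamoDB: opt into a strongly consistent read for this one call.
table = boto3.resource("dynamodb").Table("accounts")
item = table.get_item(Key={"id": "acct_1"}, ConsistentRead=True)
```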
How you model data significantly impacts database choice and system performance. Different databases require different modeling approaches.
Principles: normalize to eliminate duplication, model entities as tables and relationships as foreign keys, and combine data with joins at query time.
Example: Order System
users: id, name, email
orders: id, user_id (FK), created_at, status
order_items: id, order_id (FK), product_id (FK), quantity, price
products: id, name, description, base_price
Joins combine these at query time for complete order views.
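A sketch of that query-time join via psycopg2; the columns mirror the schema above, and the DSN is a placeholder.

```python
import psycopg2

conn = psycopg2.connect("dbname=app host=localhost")  # placeholder DSN
with conn, conn.cursor() as cur:
    # Reassemble a complete order view from the normalized tables.
    cur.execute("""
        SELECT u.name, o.status, p.name, oi.quantity, oi.price
        FROM orders o
        JOIN users u        ON u.id = o.user_id
        JOIN order_items oi ON oi.order_id = o.id
        JOIN products p     ON p.id = oi.product_id
        WHERE o.id = %s
    """, ("order_123",))
    for row in cur.fetchall():
        print(row)
```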
Principles: denormalize and embed related data so a single document satisfies a read; duplicate selectively and update duplicates when the source changes.
Example: Order System
{
"_id": "order_123",
"user": {
"id": "user_456",
"name": "Alice",
"email": "alice@example.com"
},
"items": [
{
"productId": "prod_789",
"name": "Widget",
"quantity": 2,
"price": 29.99
}
],
"total": 59.98,
"status": "shipped"
}
No joins needed—one read gets everything. Trade-off: if user changes email, multiple documents need updating.
Principles: start from the queries, choose a partition key that groups data read together, and use clustering keys to order rows within a partition.
Example: User Activity Timeline
Table: user_activity
Partition Key: user_id
Clustering Key: timestamp (DESC)
Row: (user_123, 2024-01-15T10:00:00, "logged_in")
Row: (user_123, 2024-01-15T09:45:00, "viewed_product")
Query: "Get last 20 activities for user_123" is extremely fast.
Principles: model entities as nodes, relationships as typed edges, and attributes as properties on either; let queries follow the edges.
Example: Social Network
(User:Alice)-[:FOLLOWS]->(User:Bob)
(User:Bob)-[:FOLLOWS]->(User:Carol)
(User:Alice)-[:FOLLOWS]->(User:Carol)
Query: "Find friends of friends of Alice" traverses edges efficiently.
Let's apply database selection principles to a real-time messaging platform (like Slack or Discord).
Requirements: real-time message delivery, persistent and searchable message history, presence indicators, file attachments, and channels with memberships and permissions.
1. User and Channel Metadata
Characteristics: Relational, moderate volume, transactional operations
Access patterns: lookups by user or channel ID, membership and permission checks on every send, and occasional transactional updates (joining channels, changing roles).
Selection: PostgreSQL
2. Message Storage
Characteristics: Append-only, time-ordered, high volume, accessed by channel/conversation
Access patterns: append a message to a channel, then read that channel's recent messages in reverse chronological order, with pagination for scrollback.
Selection: Cassandra or ScyllaDB
Alternative: PostgreSQL with table partitioning works at smaller scale
3. Message Search
Characteristics: Full-text, relevance-ranked, filtered
Access patterns: keyword queries scoped to channels or users, filtered by date, ranked by relevance.
Selection: Elasticsearch
Sync strategy: CDC from Cassandra → Kafka → Elasticsearch
4. User Presence
Characteristics: Highly volatile, ephemeral, sub-second access
Access patterns: status updates on connect/disconnect, frequent heartbeat refreshes, and fan-out reads of who is online in a channel.
Selection: Redis
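A presence sketch using Redis key TTLs: a heartbeat refreshes the key, and expiry marks the user offline automatically. The key naming is illustrative.

```python
import redis

r = redis.Redis(decode_responses=True)  # assumes a local Redis

def heartbeat(user_id: str) -> None:
    # Refresh presence; if heartbeats stop, the key expires in 60 seconds.
    r.set(f"presence:{user_id}", "online", ex=60)

def is_online(user_id: str) -> bool:
    return r.exists(f"presence:{user_id}") == 1
```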
5. Message Attachments
Characteristics: Binary, large, write-once-read-many
Access patterns: upload once, then many downloads from geographically distributed clients.
Selection: S3 + CDN
┌─────────────────────────────────────────────────────────────────┐
│                       MESSAGING PLATFORM                         │
│                                                                  │
│  ┌─────────────────┐  ┌────────────────┐  ┌───────────────┐     │
│  │   PostgreSQL    │  │   Cassandra    │  │     Redis     │     │
│  │                 │  │                │  │               │     │
│  │ • Users         │  │ • Messages     │  │ • Presence    │     │
│  │ • Channels      │  │ • Reactions    │  │ • Sessions    │     │
│  │ • Memberships   │  │ • Read markers │  │ • Rate limits │     │
│  │ • Permissions   │  │                │  │ • Caching     │     │
│  └─────────────────┘  └────────────────┘  └───────────────┘     │
│                                                                  │
│  ┌─────────────────┐  ┌────────────────┐  ┌───────────────┐     │
│  │  Elasticsearch  │  │     Kafka      │  │      S3       │     │
│  │                 │  │                │  │               │     │
│  │ • Message       │  │ • Events       │  │ • Files       │     │
│  │   search        │  │ • CDC stream   │  │ • Images      │     │
│  │ • Audit logs    │  │ • Sync         │  │ • Avatars     │     │
│  └─────────────────┘  └────────────────┘  └───────────────┘     │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘

Database selection is a critical architectural decision with long-lasting implications. Let's consolidate the key principles:

- Start from data characteristics and access patterns, not from familiar technology.
- When in doubt, a solid relational database (PostgreSQL) is a safe default.
- Reach for specialized databases when workloads like search, time-series, or analytics demand them.
- Polyglot persistence is powerful, but budget for data synchronization and added operational load.
- Understand each database's consistency trade-offs (CAP/PACELC) and match them to business reality.
- Favor technologies your team can operate effectively.
Module Complete:
You've now covered all aspects of high-level design: component identification, system architecture diagrams, data flow diagrams, API design, and database selection. Together, these skills enable you to translate requirements into a coherent, implementable system architecture—the core deliverable of the high-level design phase.
You now understand how to evaluate and select database technologies for different system requirements. This skill—matching storage solutions to data characteristics and access patterns—is fundamental to building systems that perform well at scale.