System Design Questions for Database Systems

Level: Advanced

Duration: 90 mins

Topic: System Design Questions

Database Selection

The Most Critical System Design Decision

In any system design interview, the question of which database to use often emerges early and carries enormous weight. This isn't a superficial technology choice—it's a foundational architectural decision that influences every subsequent aspect of your design: data models, query patterns, consistency guarantees, scaling strategies, operational complexity, and ultimately, whether the system succeeds or fails under load.

The interviewer isn't looking for you to name-drop technologies. They want to see systematic reasoning—a structured approach that weighs requirements against capabilities, understands trade-offs, and arrives at a justified recommendation. This is what separates junior candidates who pick databases based on familiarity from senior engineers who select them based on principled analysis.

What You Will Learn

By the end of this page, you will possess a comprehensive framework for database selection in system design interviews. You'll understand how to analyze requirements, map them to database capabilities, evaluate multiple candidates, articulate trade-offs, and present a well-reasoned recommendation that demonstrates principal-engineer-level thinking.

The Database Selection Framework

Effective database selection requires a structured methodology. The DARTS Framework provides a systematic approach that interviewers respect because it demonstrates both technical depth and engineering judgment:

Data Model Analysis
Access Pattern Evaluation
Requirements Mapping
Trade-off Assessment
Selection Justification

Each phase of this framework builds upon the previous, creating a logical chain of reasoning that leads to a defensible database choice. Let's examine each phase in detail.

Interview Strategy

When asked 'What database would you use?', resist the urge to immediately name a technology. Instead, say: 'Let me walk through my analysis systematically.' This signals maturity and thoroughness—qualities interviewers value highly in senior candidates.

Phase 1: Data Model Analysis

Before selecting a database, you must deeply understand the shape and nature of your data:

Entities and Relationships: What are the core entities? How are they related? Are relationships simple (1:N) or complex (M:N with attributes)?
Schema Flexibility: Is the schema rigidly defined upfront, or will it evolve frequently? Do different records of the same 'type' have varying attributes?
Data Size Estimates: How large is each record? How many records total? What's the expected growth rate?
Temporal Characteristics: Is data time-series in nature? Does it have a natural time dimension for partitioning?
Hierarchical/Graph Structures: Does data exhibit deep hierarchies or interconnected graph-like relationships?
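
The data size questions above lend themselves to a quick back-of-envelope calculation. The sketch below is illustrative only: the record count, record size, growth rate, and replication factor are made-up inputs, and `estimate_storage_gb` is a hypothetical helper rather than part of any library.

```python
def estimate_storage_gb(records: int, bytes_per_record: int,
                        monthly_growth: float, months: int,
                        replication_factor: int = 3) -> float:
    """Project raw storage after `months` of compounding growth.

    Every input is an assumption you would state out loud in an interview.
    Index and write-ahead-log overhead are ignored to keep the math simple.
    """
    projected_records = records * (1 + monthly_growth) ** months
    total_bytes = projected_records * bytes_per_record * replication_factor
    return total_bytes / 1e9  # decimal gigabytes


if __name__ == "__main__":
    # Hypothetical workload: 50M records of 2 KB each, 5% monthly growth, 3 replicas.
    today = estimate_storage_gb(50_000_000, 2_000, 0.05, months=0)
    later = estimate_storage_gb(50_000_000, 2_000, 0.05, months=24)
    print(f"today: {today:,.0f} GB, after 24 months: {later:,.0f} GB")
```

An estimate like this mainly tells you whether the data plausibly fits on one node or whether partitioning has to be part of the conversation from the start.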

Data Model to Database Type Mapping
Data Model Characteristic	Favorable Database Type	Rationale
Highly relational with complex joins	Relational (PostgreSQL, MySQL)	ACID compliance, foreign keys, SQL joins are fundamental
Flexible schema, document-like records	Document (MongoDB, CouchDB)	Schema-less design accommodates varying structures
Simple key-value with high throughput	Key-Value (Redis, DynamoDB)	Minimal overhead, optimized for direct access
Time-stamped metrics/events	Time-Series (InfluxDB, TimescaleDB)	Optimized for temporal queries, efficient compression
Deeply interconnected entities	Graph (Neo4j, Amazon Neptune)	Native traversal operations, path-finding algorithms
Wide-column data with very high write volume	Wide-Column / Column-Family (Cassandra, HBase)	Optimized for write-heavy, partition-key-driven workloads (distinct from columnar analytics engines such as ClickHouse)

Phase 2: Access Pattern Evaluation

The way your application accesses data is often more important than the data model itself. Different databases optimize for radically different access patterns:

Query Types: Are queries simple lookups by primary key? Range scans? Complex aggregations? Full-text searches? Graph traversals?
Read vs. Write Ratio: Is the workload read-heavy (100:1), balanced, or write-heavy?
Latency Requirements: What's the acceptable p50/p95/p99 latency? Are there hard real-time constraints?
Throughput Needs: How many operations per second at peak? What's the burst capacity requirement?
Consistency Needs per Operation: Do some operations require strong consistency while others can tolerate eventual consistency?
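
To make the latency and read/write questions concrete, here is a minimal sketch that computes nearest-rank percentiles and a read-to-write ratio from measurements. The sample values are invented for illustration, and `percentile` and `read_write_ratio` are hypothetical helpers.

```python
import math


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def read_write_ratio(reads: int, writes: int) -> float:
    """Reads per write; infinite for a read-only workload."""
    return float("inf") if writes == 0 else reads / writes


if __name__ == "__main__":
    # Hypothetical request latencies in milliseconds, with one slow outlier.
    latencies_ms = [12, 15, 11, 14, 13, 250, 16, 12, 18, 17]
    for p in (50, 95, 99):
        print(f"p{p}: {percentile(latencies_ms, p)} ms")
    print("reads per write:", read_write_ratio(reads=100_000, writes=1_000))
```

A single outlier barely moves the median but dominates the tail, which is why interviewers usually ask about p99 rather than the average.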

Access Pattern Red Flags

•Complex joins in a key-value store — You've chosen the wrong tool; relational would serve better
•Graph traversals in a relational database — Recursive CTEs work for shallow traversals but become expensive as depth and edge count grow; consider a graph database for traversal-heavy workloads
•High-volume time-series in a document store — Missing crucial time-based optimizations
•OLAP queries on an OLTP database — Will saturate resources; consider a dedicated analytics layer
•Random small reads in a column-store — Column stores excel at scans, not point lookups
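
To illustrate the graph-traversal red flag, the sketch below runs a bounded "who can I reach within N hops" query as a recursive CTE, using Python's built-in `sqlite3` module and an in-memory database. The `follows` table, its columns, and the sample users are assumptions made up for this example.

```python
import sqlite3


def build_demo_graph() -> sqlite3.Connection:
    """Create a tiny in-memory follow graph containing one cycle."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE follows (follower TEXT, followee TEXT, "
        "PRIMARY KEY (follower, followee))"
    )
    conn.executemany(
        "INSERT INTO follows VALUES (?, ?)",
        [("alice", "bob"), ("bob", "carol"), ("carol", "dave"), ("dave", "alice")],
    )
    return conn


def reachable_within(conn: sqlite3.Connection, start: str, max_hops: int) -> set[str]:
    """Return every node reachable from `start` in at most `max_hops` follow edges."""
    query = """
        WITH RECURSIVE reach(node, hops) AS (
            SELECT followee, 1 FROM follows WHERE follower = ?
            UNION
            SELECT f.followee, r.hops + 1
            FROM reach AS r
            JOIN follows AS f ON f.follower = r.node
            WHERE r.hops < ?          -- the hop bound also guarantees termination on cycles
        )
        SELECT DISTINCT node FROM reach
    """
    return {row[0] for row in conn.execute(query, (start, max_hops))}


if __name__ == "__main__":
    graph = build_demo_graph()
    for hops in (1, 2, 3):
        print(hops, sorted(reachable_within(graph, "alice", hops)))
```

Each additional hop is another join against the full edge table, so the query is fine for shallow traversals on modest graphs and gets costly as depth and fan-out increase.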

Database Categories: Strengths and Limitations

A principal-level understanding requires going beyond surface-level categorization into the architectural principles that give each database type its characteristics. Let's examine the major categories through this lens.

Relational Database Management Systems (RDBMS)

Relational databases have remained the backbone of enterprise applications for 50+ years because they solve the hardest problems in data management: maintaining consistency in the face of concurrent modifications and system failures.

Architectural Foundation:

Declarative Query Language (SQL): Express what you want, not how to get it; optimizer handles execution
ACID Transactions: Atomicity, Consistency, Isolation, Durability guarantee data integrity
Schema Enforcement: Strict typing prevents data quality issues at write time
Relational Model: Mathematical foundation enables powerful query optimization

When to Select:

Financial transactions requiring atomic operations across entities
Systems with complex, evolving query requirements
Data with strong referential integrity needs
Workloads requiring true ACID compliance
Applications where data consistency cannot be compromised
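
The atomicity requirement can be sketched with Python's built-in `sqlite3` module. The `accounts` table, the `CHECK` constraint, and the helper names are assumptions for illustration; a production system would use its own schema and a server-grade RDBMS.

```python
import sqlite3


def open_bank() -> sqlite3.Connection:
    """In-memory database whose schema forbids negative balances."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE accounts (id TEXT PRIMARY KEY, "
        "balance INTEGER NOT NULL CHECK (balance >= 0))"
    )
    conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
    conn.commit()
    return conn


def transfer(conn: sqlite3.Connection, src: str, dst: str, amount: int) -> bool:
    """Move `amount` from src to dst atomically; return False if the transfer was rolled back."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes the block
            # Credit first on purpose: if the debit below fails, this credit must be undone.
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE id = ?", (amount, dst))
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE id = ?", (amount, src))
        return True
    except sqlite3.IntegrityError:
        return False


def balances(conn: sqlite3.Connection) -> dict[str, int]:
    return dict(conn.execute("SELECT id, balance FROM accounts"))


if __name__ == "__main__":
    bank = open_bank()
    print(transfer(bank, "alice", "bob", 30), balances(bank))
    print(transfer(bank, "alice", "bob", 500), balances(bank))
```

The second transfer violates the balance constraint on the debit, so the already-applied credit is rolled back with it. That all-or-nothing behavior is what "atomic operations across entities" means in practice.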

Production Examples:

PostgreSQL: Complex applications, geospatial, JSON hybrid workloads
MySQL: Web applications, read-heavy with replication
SQL Server: Enterprise systems with Windows integration
Oracle: Mission-critical financial/ERP systems

Scaling Considerations

Relational databases scale vertically more easily than horizontally. Horizontal scaling (sharding) breaks joins across shards and complicates transactions. Consider this fundamental limitation when evaluating for massive scale.
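
A rough sketch of why sharding complicates joins: rows are routed to shards by hashing a sharding key, so two tables are only co-located when they share that key. The hash-modulo routing below is one simple scheme chosen for illustration (real systems often use consistent hashing or range partitioning), and the key formats are hypothetical.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a sharding key to a shard index with a stable hash (not Python's salted hash())."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def shards_touched(keys: list[str], num_shards: int) -> set[int]:
    """Shards a query must contact to read every row identified by `keys`."""
    return {shard_for(key, num_shards) for key in keys}


if __name__ == "__main__":
    num_shards = 8
    # Orders sharded by their owner's user id: a user-to-orders join stays on one shard.
    by_user = shards_touched(["user:42"] * 5, num_shards)
    # The same five orders sharded by order id: the join may fan out across shards.
    by_order = shards_touched([f"order:{i}" for i in range(5)], num_shards)
    print("sharded by user id  ->", len(by_user), "shard(s)")
    print("sharded by order id ->", len(by_order), "shard(s)")
```

Choosing the sharding key is therefore a data-modeling decision: it determines which joins and transactions stay local and which become distributed.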

Multi-Database Architectures

Real-world systems rarely use a single database. Polyglot persistence—using different databases for different use cases within the same system—is the norm at scale. Understanding when and how to combine databases demonstrates senior-level architectural thinking.

Common Multi-Database Patterns

•Cache + Primary Store: Redis caches hot data from PostgreSQL, often cutting latency for repeated reads by an order of magnitude or more (the actual gain depends on the workload and hit rate)
•OLTP + OLAP Split: MySQL serves transactions while ClickHouse handles analytics via CDC
•Search + Source of Truth: Elasticsearch enables full-text search, PostgreSQL remains the authoritative store
•Session Store + User Store: Redis holds ephemeral sessions, DynamoDB stores user profiles
•Event Log + Materialized Views: Kafka stores events, various databases materialize different views
•Hot/Warm/Cold Tiering: Recent data in fast storage, historical data in cheaper archival systems
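
The Cache + Primary Store pattern is commonly implemented as cache-aside. The sketch below is a minimal in-process version: a dictionary stands in for Redis and a callable stands in for the PostgreSQL query, so the class and parameter names are all hypothetical.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class CacheAside:
    """Read-through-on-miss cache with a TTL, sitting in front of a slower primary store."""

    def __init__(self, load_from_db: Callable[[str], Optional[str]],
                 ttl_seconds: float = 60.0) -> None:
        self._load = load_from_db
        self._ttl = ttl_seconds
        self._entries: Dict[str, Tuple[float, Optional[str]]] = {}

    def get(self, key: str) -> Optional[str]:
        now = time.monotonic()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]                       # cache hit
        value = self._load(key)                   # cache miss: go to the primary store
        self._entries[key] = (now + self._ttl, value)
        return value

    def invalidate(self, key: str) -> None:
        """Call after writing to the primary store so readers do not see stale data."""
        self._entries.pop(key, None)


if __name__ == "__main__":
    db_calls = []

    def slow_lookup(user_id: str) -> str:
        db_calls.append(user_id)                  # record each trip to the "database"
        return f"profile-of-{user_id}"

    cache = CacheAside(slow_lookup, ttl_seconds=30)
    cache.get("u1")
    cache.get("u1")
    print("database hits for two reads:", len(db_calls))
```

The trade-off to call out is staleness: between a write and the invalidation (or TTL expiry), readers can see old data.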

The Command Query Responsibility Segregation (CQRS) Pattern

CQRS separates read and write models, often backed by different databases. This acknowledges that read and write workloads have different optimization requirements:

┌─────────────┐     ┌──────────────────┐     ┌─────────────────┐
│   Writes    │────►│  Event Store /   │────►│  Read Database  │
│  (Commands) │     │  Write Database  │     │  (Denormalized) │
└─────────────┘     └──────────────────┘     └─────────────────┘
                                                      │
                                                      ▼
                                               ┌─────────────┐
                                               │    Reads    │
                                               │  (Queries)  │
                                               └─────────────┘

Write-Optimized Store (e.g., PostgreSQL, Event Store):

Normalized for update efficiency
ACID transactions for write consistency
Captures intent as events/facts

Read-Optimized Store (e.g., Elasticsearch, Redis, MongoDB):

Denormalized for query performance
Multiple projections for different query patterns
Eventually consistent with write store
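
The split above can be sketched in a few lines. This is an in-memory toy, not a production design: a Python list stands in for the event store, a dictionary stands in for the read database, and the `PostCreated` event and class names are invented for the example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class PostCreated:
    """An immutable fact recorded by the write side."""
    post_id: int
    author: str
    text: str


class WriteModel:
    """Handles commands and appends events; never serves queries."""

    def __init__(self) -> None:
        self.events: list[PostCreated] = []

    def create_post(self, author: str, text: str) -> PostCreated:
        event = PostCreated(post_id=len(self.events) + 1, author=author, text=text)
        self.events.append(event)
        return event


class PostsByAuthorProjection:
    """Denormalized read model, rebuilt incrementally from the event log."""

    def __init__(self) -> None:
        self._posts: dict[str, list[str]] = defaultdict(list)
        self._applied = 0  # how many events have been folded in so far

    def catch_up(self, events: list[PostCreated]) -> None:
        for event in events[self._applied:]:
            self._posts[event.author].append(event.text)
        self._applied = len(events)

    def query(self, author: str) -> list[str]:
        return list(self._posts.get(author, []))


if __name__ == "__main__":
    writes, reads = WriteModel(), PostsByAuthorProjection()
    writes.create_post("alice", "hello")
    print("before catch-up:", reads.query("alice"))   # read side has not seen the event yet
    reads.catch_up(writes.events)
    print("after catch-up: ", reads.query("alice"))
```

The gap between the write and `catch_up` is the eventual-consistency window; additional projections can consume the same event list to serve other query shapes.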

Multi-Database Complexity

Every additional database increases operational burden: more monitoring, more backups, more failure modes, more expertise needed. Only introduce additional databases when the benefit clearly outweighs the complexity. Start simple, add databases when specific pain points emerge.

The Selection Decision Process

When presenting database selection in an interview, structure your reasoning as a decision tree that progressively narrows options based on requirements.

Database Selection Decision Tree

Decision Flow

START: What is the primary data model?
├── Highly Relational (complex joins, referential integrity)
│   └── RDBMS: PostgreSQL, MySQL, SQL Server
│       ├── Need global distribution? → CockroachDB, Spanner
│       └── Need extreme write scale? → Consider sharding or Vitess
│
├── Documents with Varying Schema
│   └── Document DB: MongoDB, CouchDB
│       ├── Need search? → Add Elasticsearch
│       └── Need transactions? → MongoDB 4.0+
│
├── Simple Key-Value Access
│   └── Key-Value: Redis, DynamoDB
│       ├── Need persistence? → Redis with AOF/RDB, DynamoDB
│       └── Pure cache? → Memcached, Redis
│
├── Connected/Graph Data
│   └── Graph DB: Neo4j, Neptune
│       ├── Need scale? → JanusGraph, Neptune
│       └── ACID required? → Neo4j
│
├── Time-Series / Metrics
│   └── Time-Series: InfluxDB, TimescaleDB
│       ├── Need SQL? → TimescaleDB
│       └── High cardinality? → QuestDB, ClickHouse
│
└── Wide-Column / Analytics
    └── Column-Family: Cassandra, HBase, ClickHouse
        ├── Real-time analytics? → ClickHouse, Druid
        └── Write-heavy time-series? → Cassandra
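
For practice, the decision flow can be encoded as data plus a small lookup. The sketch below contains only the branches listed in the tree; the dictionary keys (`"relational"`, `"global_distribution"`, and so on) and the `shortlist` function are hypothetical names introduced for the example.

```python
# Each entry: data model -> (default candidates, {follow-up need: refined candidates}).
DECISION_TREE = {
    "relational": (["PostgreSQL", "MySQL", "SQL Server"],
                   {"global_distribution": ["CockroachDB", "Spanner"],
                    "extreme_write_scale": ["sharding", "Vitess"]}),
    "document": (["MongoDB", "CouchDB"],
                 {"search": ["Elasticsearch (added alongside)"],
                  "transactions": ["MongoDB 4.0+"]}),
    "key_value": (["Redis", "DynamoDB"],
                  {"persistence": ["Redis with AOF/RDB", "DynamoDB"],
                   "pure_cache": ["Memcached", "Redis"]}),
    "graph": (["Neo4j", "Neptune"],
              {"scale": ["JanusGraph", "Neptune"],
               "acid": ["Neo4j"]}),
    "time_series": (["InfluxDB", "TimescaleDB"],
                    {"sql": ["TimescaleDB"],
                     "high_cardinality": ["QuestDB", "ClickHouse"]}),
    "wide_column": (["Cassandra", "HBase", "ClickHouse"],
                    {"real_time_analytics": ["ClickHouse", "Druid"],
                     "write_heavy_time_series": ["Cassandra"]}),
}


def shortlist(data_model: str, needs: tuple[str, ...] = ()) -> list[str]:
    """Return refined candidates for the stated needs, or the defaults if none apply."""
    defaults, refinements = DECISION_TREE[data_model]
    picks: list[str] = []
    for need in needs:
        for name in refinements.get(need, []):
            if name not in picks:
                picks.append(name)
    return picks or list(defaults)


if __name__ == "__main__":
    print(shortlist("relational"))
    print(shortlist("time_series", ("sql",)))
    print(shortlist("graph", ("scale", "acid")))
```

A lookup like this is a memory aid, not a substitute for the reasoning: in an interview, the value lies in explaining why each branch applies.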

Requirement Prioritization Matrix

Not all requirements are equal. Prioritize them and map to databases:

Priority	Requirement Type	Example	Database Implication
P0	Must Have	ACID for payments	Eliminate non-ACID options
P1	Should Have	Sub-100ms reads	Influences caching layer
P2	Nice to Have	Full-text search	May add secondary database
P3	Future Consideration	ML integration	Impacts long-term choice

Start by identifying P0 requirements—these are non-negotiable filters. Then evaluate remaining candidates against P1 and P2 requirements to differentiate.
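
The filter-then-differentiate step can be sketched as follows. Candidate names (`db_a`, `db_b`, `db_c`), capability labels, and weights are deliberately placeholders, since real capability sets depend on versions and configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    capabilities: frozenset[str]


def rank_candidates(candidates: list[Candidate], p0: set[str],
                    weighted: dict[str, int]) -> list[tuple[str, int]]:
    """Drop candidates missing any P0 capability, then score the rest on P1/P2 weights."""
    survivors = [c for c in candidates if p0 <= c.capabilities]
    scored = [
        (c.name, sum(weight for cap, weight in weighted.items() if cap in c.capabilities))
        for c in survivors
    ]
    # Highest score first; ties broken alphabetically for a stable result.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


if __name__ == "__main__":
    options = [
        Candidate("db_a", frozenset({"acid", "full_text_search"})),
        Candidate("db_b", frozenset({"acid", "sub_100ms_reads", "full_text_search"})),
        Candidate("db_c", frozenset({"sub_100ms_reads"})),  # lacks the P0 capability
    ]
    ranking = rank_candidates(
        options,
        p0={"acid"},
        weighted={"sub_100ms_reads": 2, "full_text_search": 1},  # P1 weighs more than P2
    )
    print(ranking)
```

Treating P0 as a hard filter rather than a heavy weight keeps a candidate with many nice-to-have features from outscoring one that actually meets the non-negotiables.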

The 'It Depends' Answer

Interviewers often appreciate answers that start with 'it depends' followed by a clear framework. This shows you understand that context matters. Say: 'My recommendation depends on X, Y, and Z. Let me walk through how each affects the choice.'

Evaluation Criteria Deep Dive

Beyond matching data models and access patterns, several operational and strategic factors influence database selection. Senior engineers consider these carefully.

Technical Factors

•Consistency Model: Strong consistency vs. eventual consistency and their implications
•Durability Guarantees: fsync behavior, replication before acknowledgment
•Replication Strategy: Synchronous vs. async, quorum-based vs. leader-based
•Schemaless vs. Schema-enforced: Flexibility to change record shape freely vs. validation by the database at write time
•Query Capabilities: Joins, aggregations, full-text search, geospatial
•Transaction Support: ACID, BASE, saga patterns

Operational Factors

•Team Expertise: Does your team know this technology, or is there learning overhead?
•Managed vs. Self-Hosted: RDS/DocumentDB vs. self-managed, cost vs. control
•Monitoring & Tooling: Available observability, alerting, debugging tools
•Community & Support: Stack Overflow presence, vendor support quality
•License Model: Open source vs. proprietary, cost at scale
•Vendor Lock-in: Portability to alternatives if needed

Scaling Characteristics Comparison

Understanding how different databases scale is crucial for system design:

Database Scaling Characteristics
Database	Vertical Scaling	Horizontal Scaling	Scaling Complexity
PostgreSQL	Excellent	Difficult (Citus, sharding)	Moderate-High
MySQL	Excellent	Moderate (Vitess, ProxySQL)	Moderate
MongoDB	Good	Built-in sharding	Low-Moderate
Cassandra	Limited	Excellent (designed for it)	Low
DynamoDB	N/A (managed)	Automatic	Very Low
Redis	Excellent (large instances)	Redis Cluster	Moderate
Neo4j	Good	Limited (Enterprise)	High
ClickHouse	Excellent	Good (sharding)	Moderate

Interview Question Walkthrough

Let's apply the DARTS framework to a realistic interview scenario.

Sample Question

Design a system for a social media platform that allows users to post updates, follow other users, and view a personalized feed. The system should handle 100 million daily active users and prioritize feed delivery latency.

Step 1: Data Model Analysis

Core entities:

Users: Profile information, credentials, preferences (fixed schema, ~5KB per user)
Posts: Text + media references, timestamps, author (semi-structured, variable size)
Follows: User A follows User B (relationship data, billions of edges)
Feed: Materialized view of posts from followed users (denormalized for read performance)

Relationship complexity:

User-Post: 1:N (simple)
User-User (follows): M:N (high volume, traversal-heavy for feed generation)
Post-Interactions: 1:N (likes, comments)

Step 2: Access Pattern Evaluation

Post Creation: Write-heavy, needs durability, moderate latency tolerance
Feed Retrieval: Read-heavy, latency-critical (p99 < 200ms), massive fanout
Profile Lookup: Read-heavy, point lookups by user ID
Follow Operations: Write moderate, needs consistency (prevent duplicate follows)
Search: Full-text search on posts, real-time or near-real-time

Step 3: Requirements Mapping

Requirement	Priority	Implication
Feed latency < 200ms	P0	Pre-computed or cached feeds
100M DAU scale	P0	Horizontal scaling required
Durability (no lost posts)	P0	Persistent storage with replication
Real-time feed updates	P1	Async updates acceptable
Search functionality	P2	Can be eventually consistent

Step 4: Trade-off Assessment

Given the requirements, we need:

High write throughput for posts → rules out single-node RDBMS
Graph-like access for follows → but feed generation only needs one-hop lookups, and a general-purpose graph DB may struggle to serve billions of edges at feed-level latency
Pre-computed feeds for latency → requires denormalization and caching
ACID for critical operations → user accounts, financial (ads) transactions

Step 5: Selection Justification

I recommend a polyglot architecture:

PostgreSQL for user accounts, authentication, and ad billing
- ACID transactions for sensitive operations
- Proven at scale with read replicas
Cassandra for posts and feeds
- Horizontal scaling for write throughput
- Time-series-like access (recent posts)
- Partition by user_id for feed retrieval
Redis for hot data caching
- Pre-computed feeds for active users
- Session storage
- Real-time counters (likes, views)
Elasticsearch for search
- Full-text indexing of posts
- Eventually consistent is acceptable

This combination balances consistency, availability, and partition tolerance according to the specific needs of each data type.
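
One common way to get pre-computed feeds is fan-out on write: when a user posts, the post ID is pushed into each follower's stored feed so reads become a single lookup. The sketch below is an in-memory stand-in for the feed store and cache in the recommendation above; `FeedService` and its methods are hypothetical, and it ignores the usual special handling for accounts with very large follower counts.

```python
from collections import defaultdict, deque


class FeedService:
    """Fan-out-on-write feeds held in memory, newest post first."""

    def __init__(self, max_feed_len: int = 100) -> None:
        self._followers: dict[str, set[str]] = defaultdict(set)   # author -> followers
        self._feeds: dict[str, deque] = defaultdict(lambda: deque(maxlen=max_feed_len))

    def follow(self, follower: str, author: str) -> None:
        self._followers[author].add(follower)

    def publish(self, author: str, post_id: int) -> int:
        """Push the post to every follower's feed; return how many feeds were written."""
        followers = self._followers[author]
        for follower in followers:
            self._feeds[follower].appendleft(post_id)   # oldest entries fall off the end
        return len(followers)

    def read_feed(self, user: str, limit: int = 10) -> list[int]:
        """A feed read is one lookup with no joins or scatter-gather across authors."""
        return list(self._feeds[user])[:limit]


if __name__ == "__main__":
    service = FeedService(max_feed_len=3)
    service.follow("bob", "alice")
    service.follow("carol", "alice")
    for post_id in (1, 2, 3, 4):
        service.publish("alice", post_id)
    print(service.read_feed("bob"))
```

The cost moves from read time to write time: each post triggers one write per follower, which is the trade-off to discuss when the interviewer asks about users with millions of followers.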

Common Selection Pitfalls

Even experienced engineers make database selection mistakes. Awareness of common pitfalls helps you avoid them in interviews and production.

Selection Anti-Patterns

•Resume-Driven Selection — Choosing a database because you want to learn it, not because it fits the problem. In interviews, this signals poor judgment.
•Premature Optimization — Selecting a complex distributed database when a single PostgreSQL instance would handle the load for years. Start simple.
•Ignoring Operational Burden — Choosing a database your team can't operate. A database you can't monitor, back up, and recover is a liability.
•Over-Engineering for Scale — Designing for 100M users when you have 1,000. Scale decisions should match realistic growth projections.
•Underestimating CAP Trade-offs — Not understanding that choosing availability over consistency has real user-facing consequences.
•Single-Technology Mindset — Forcing one database to serve all use cases when polyglot persistence would be simpler.
•Ignoring Query Evolution — Choosing based on current queries without considering how query patterns might change.

The Mature Response

When uncertain, it's mature to say: 'Given what I know, I'd start with X because it handles our P0 requirements. But I'd monitor closely and be prepared to introduce Y if query pattern Z emerges.' This shows pragmatism and adaptability.

Summary: Database Selection Mastery

Database selection is both science and art. Let's consolidate the key takeaways:

Key Takeaways

•Use a structured framework (DARTS) — Data model, Access patterns, Requirements, Trade-offs, Selection justification
•Match data models to database strengths — Relational for joins, document for flexibility, graph for traversals, time-series for temporal data
•Consider access patterns as primary — How you query data matters more than how it's structured
•Embrace polyglot persistence — Different databases for different use cases is the norm at scale
•Evaluate operational factors — Team expertise, managed services, monitoring capabilities
•Articulate trade-offs clearly — Show you understand what you're giving up with each choice
•Start simple, scale as needed — Premature optimization is the root of all database evil
•Be prepared to evolve — Initial choices may need revisiting as requirements change

What's Next:

With database selection mastered, the next page dives into Schema Design—how to structure your data within the chosen database to optimize for performance, maintainability, and evolution. You'll learn to design schemas that support both current needs and future growth.

Page Complete

You now have a comprehensive framework for database selection in system design interviews. Remember: interviewers care less about which database you choose and more about how systematically and thoughtfully you arrive at your recommendation.
