We've now explored five distinct data models: relational, document, key-value, graph, and column-family. Each was designed by engineers solving specific problems under specific constraints. None is universally superior—each excels in its domain and falters when misapplied.
The challenge facing every software architect is model selection: given a problem domain, which data model (or combination of models) best serves the application's requirements? This decision is among the most consequential in system design. Choosing incorrectly means either living with constant friction as you force data into an ill-fitting model, or undertaking a costly, risky migration once the mismatch becomes untenable.
This page synthesizes everything we've learned into a practical decision framework—not a rigid flowchart, but a structured approach to evaluating options and making informed choices.
You'll learn to analyze data characteristics, evaluate query patterns, assess scalability requirements, weigh consistency needs, and factor in operational and organizational constraints. By the end, you'll have a systematic approach to database selection decisions.
Model selection isn't a single decision—it's a structured analysis across multiple dimensions. Consider each dimension as a lens through which to evaluate candidate models:
The Five Dimensions of Model Selection:
| Dimension | Key Questions | Impact on Choice |
|---|---|---|
| Data Structure | Is data tabular, hierarchical, connected, or key-addressed? | Determines natural model fit |
| Query Patterns | CRUD? Analytics? Traversals? Full-text search? | Determines performance characteristics |
| Scale Requirements | GB or PB? 100 QPS or 1M QPS? Growth trajectory? | Filters viable options by capacity |
| Consistency Needs | ACID required? Eventual consistency acceptable? | Eliminates models that can't guarantee requirements |
| Operational Context | Team skills? Cloud vs on-prem? Budget? | Practical feasibility of operation |
The Process:

1. Characterize your data's structure and relationships.
2. Enumerate the queries the application must serve.
3. Estimate current and projected scale.
4. Determine your consistency requirements.
5. Assess operational feasibility, eliminate models that fail any hard requirement, and pick the simplest option that remains.
Relational databases are the most versatile and well-understood option. When uncertain, start with PostgreSQL or MySQL. Only move to specialized models when you can clearly articulate why relational databases cannot meet your requirements. The burden of proof should be on the alternative.
The first dimension examines the natural shape of your data. Different data models represent different structures efficiently.
Structural Characteristics to Evaluate:
| Data Characteristic | Best Model Fit | Example Domain |
|---|---|---|
| Highly regular, tabular data | Relational | Financial transactions, ERP systems |
| Nested, varying structure, self-contained entities | Document | Product catalogs, CMS, user profiles |
| Simple key-addressed, opaque values | Key-Value | Sessions, caches, feature flags |
| Densely connected, relationship-centric | Graph | Social networks, fraud detection, knowledge graphs |
| Wide, sparse, time-oriented | Column-Family | IoT telemetry, event logs, analytics |
```
# Structure Analysis: E-commerce Platform

## Order Management (Core Transaction)
- Fixed structure: order_id, customer_id, date, status, total
- Strong relationships: order -> customer, order -> items -> products
- Referential integrity critical
→ RELATIONAL: Orders, OrderItems, Customers, Products

## Product Catalog (Varying Attributes)
- Electronics have specifications; clothing has sizes/colors
- Nested: product -> variants -> images
- Attributes vary wildly by category
→ DOCUMENT: Each product type has different fields

## User Sessions (Ephemeral, Key-Based)
- Lookup by session_id only
- Expires after timeout
- No relationships or queries on content
→ KEY-VALUE: Redis with TTL

## Recommendation Engine (Connection-Centric)
- "Users who bought X also bought Y"
- "Friends of friends who like Z"
- Multi-hop relationship traversals
→ GRAPH: User-Product purchase graph

## Analytics Events (High Volume, Time-Series)
- Millions of events per second
- Queries by time range, aggregated
- Retention policy (delete after 90 days)
→ COLUMN-FAMILY: Cassandra with time-bucketed partitions
```
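To make the key-value mapping concrete, here is a minimal sketch of the session store using the redis-py client. The key scheme and TTL value are illustrative assumptions, not prescriptions:

```python
import json
import uuid

import redis  # redis-py; assumes a Redis server on localhost:6379

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

SESSION_TTL_SECONDS = 30 * 60  # hypothetical policy: expire idle sessions after 30 minutes

def create_session(user_id: str) -> str:
    """Store the session as an opaque JSON value, addressed only by its key."""
    session_id = str(uuid.uuid4())
    payload = json.dumps({"user_id": user_id, "cart": []})
    # SETEX writes the value and attaches the TTL in a single command,
    # so expiration never needs application-side cleanup.
    r.setex(f"session:{session_id}", SESSION_TTL_SECONDS, payload)
    return session_id

def get_session(session_id: str):
    """Return the session dict, or None if it has expired or never existed."""
    raw = r.get(f"session:{session_id}")
    return json.loads(raw) if raw is not None else None
```

Note how nothing here queries on the session's contents; that is exactly the access pattern that makes key-value the right fit.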
Hybrid Structures:

Real applications rarely fit perfectly into one model. The e-commerce example above touches five different models, a realistic outcome for complex systems. Rather than forcing all data into one model, consider polyglot persistence: use each database where it excels.
The challenge then becomes coordination: maintaining consistency across databases, managing operational complexity, and handling cross-database queries. We'll address polyglot architectures later in this page.
Modern relational databases (PostgreSQL, MySQL) support JSON columns with indexing and querying. This enables document-like flexibility within a relational database for portions of your schema. Consider this before adding a separate document database.
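As a minimal sketch of that hybrid approach, the snippet below uses PostgreSQL's JSONB type through the psycopg2 driver. The table, column names, and connection string are hypothetical:

```python
import psycopg2
from psycopg2.extras import Json  # adapts Python dicts to JSON parameters

conn = psycopg2.connect("dbname=shop user=shop")  # hypothetical DSN
cur = conn.cursor()

# Fixed fields stay relational; category-specific attributes live in JSONB.
cur.execute("""
    CREATE TABLE IF NOT EXISTS products (
        id          serial PRIMARY KEY,
        name        text    NOT NULL,
        price_cents integer NOT NULL,
        attrs       jsonb   NOT NULL DEFAULT '{}'
    )
""")
# A GIN index makes containment queries on the JSONB column efficient.
cur.execute("CREATE INDEX IF NOT EXISTS products_attrs_gin ON products USING GIN (attrs)")

cur.execute(
    "INSERT INTO products (name, price_cents, attrs) VALUES (%s, %s, %s)",
    ("T-shirt", 1999, Json({"sizes": ["S", "M", "L"], "color": "red"})),
)
conn.commit()

# @> is PostgreSQL's JSONB containment operator: find all red products.
cur.execute("""SELECT name FROM products WHERE attrs @> '{"color": "red"}'""")
print(cur.fetchall())
```

You get document-style flexibility for the varying attributes while keeping ACID transactions and joins for everything else.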
Data structure matters, but query patterns—how data is accessed—often dominate the selection decision. A model that elegantly represents your data but cannot efficiently answer your queries is worse than useless.
Query Pattern Categories:
| Query Pattern | Relational | Document | Key-Value | Graph | Column-Family |
|---|---|---|---|---|---|
| Point Query | ✅ Excellent | ✅ Excellent | ✅ Excellent | ✅ Excellent | ✅ Excellent |
| Range Query | ✅ Excellent | ✅ Good | ⚠️ Limited* | ⚠️ Limited | ✅ Excellent** |
| Secondary Lookup | ✅ Excellent | ✅ Good | ❌ Not supported | ✅ Good | ⚠️ Limited*** |
| Aggregations | ✅ Excellent | ✅ Good | ❌ Not supported | ⚠️ Limited | ⚠️ Limited |
| Joins | ✅ Excellent | ⚠️ Limited**** | ❌ Not supported | ✅ Native | ❌ Not supported |
| Traversals | ⚠️ Expensive***** | ❌ Not supported | ❌ Not supported | ✅ Excellent | ❌ Not supported |
| Full-Text Search | ⚠️ Basic | ⚠️ Basic | ❌ Not supported | ❌ Not supported | ❌ Not supported |
Legend:

- (*) Key-value range scans require a specially ordered structure (e.g., Redis sorted sets); plain keys are unordered.
- (**) Column-family range queries are excellent within a partition, via clustering columns; cross-partition ranges are expensive.
- (***) Column-family secondary indexes (e.g., in Cassandra) exist but perform poorly at scale; prefer query-specific tables.
- (****) Document databases offer limited join support (e.g., MongoDB's $lookup), without the optimizer sophistication of SQL joins.
- (*****) Relational traversals rely on recursive CTEs or repeated self-joins, whose cost grows quickly with depth.
The 80/20 Rule:
Identify the queries that will constitute 80% of your traffic. Optimize your model for these. Less frequent queries can tolerate suboptimal performance or be handled by secondary systems (search engines, analytics databases).
If you don't know your query patterns, be cautious about NoSQL. Relational databases handle ad-hoc queries gracefully. NoSQL models require designing tables around known queries. Discovering a new critical query pattern after deployment may require significant schema changes in NoSQL systems.
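To illustrate what "designing tables around known queries" means in practice, here is a sketch using the DataStax cassandra-driver. The keyspace, table, and bucketing scheme are illustrative assumptions:

```python
from cassandra.cluster import Cluster  # DataStax cassandra-driver

session = Cluster(["127.0.0.1"]).connect("analytics")  # hypothetical keyspace

# The table exists to answer exactly one known query: "events for a device
# within a time range". A per-day bucket in the partition key bounds
# partition size; the clustering column orders rows for range scans.
session.execute("""
    CREATE TABLE IF NOT EXISTS events_by_device_day (
        device_id text,
        day       date,
        ts        timestamp,
        payload   text,
        PRIMARY KEY ((device_id, day), ts)
    ) WITH CLUSTERING ORDER BY (ts DESC)
""")

# Efficient: the query touches a single partition.
rows = session.execute(
    "SELECT ts, payload FROM events_by_device_day "
    "WHERE device_id = %s AND day = %s AND ts >= %s",
    ("sensor-7", "2024-01-15", "2024-01-15 00:00:00+0000"),
)
```

A query this table was not designed for, say "all events with a given payload", would require a full cluster scan or a second, purpose-built table.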
Scale requirements filter out models that simply cannot handle your projected load. This dimension includes data volume, throughput, and latency requirements.
Data Volume and Throughput (Operations per Second):
| Model | Vertical Scaling | Horizontal Scaling | Max Practical Scale |
|---|---|---|---|
| Relational (single) | Excellent | Limited (read replicas) | Low TB / 10K QPS |
| Relational (sharded) | Good | Moderate (manual) | High TB / 100K QPS |
| Document (MongoDB) | Good | Good (auto-sharding) | PB / 1M+ QPS |
| Key-Value (Redis) | Excellent | Good (cluster mode) | TB / 1M+ QPS* |
| Graph (Neo4j) | Excellent | Limited (read replicas) | Billions of nodes with care |
| Column-Family | Good | Excellent (native) | PB / millions of QPS |
(*) Redis is memory-bound: practical data volume is limited by the RAM available across the cluster.

Latency Requirements:

- Sub-millisecond: in-memory stores (Redis) on point lookups
- Low single-digit milliseconds: well-indexed point queries on disk-backed stores
- Tens of milliseconds and up: multi-hop traversals, cross-partition aggregations, analytical scans
The "Google Scale" Trap:
Many teams choose complex distributed databases because they might grow to Google's scale. This is premature optimization. PostgreSQL on a well-provisioned server handles more than most applications will ever need. Choose for your actual scale trajectory, not hypothetical exponential growth.
```
# Scale Estimation Checklist

## Current State
- Total data volume: _____ GB/TB
- Annual growth rate: _____ %
- Peak read QPS: _____
- Peak write QPS: _____
- p99 latency requirement: _____ ms

## 3-Year Projection
- Projected data volume: _____ TB
- Projected peak read QPS: _____
- Projected peak write QPS: _____

## Example: SaaS Application

Current:
- 50 GB data
- 20% YoY growth
- 500 read QPS peak
- 50 write QPS peak
- 50ms p99 latency required

3-Year Projection:
- ~85 GB data (50 * 1.2^3)
- ~850 read QPS
- ~85 write QPS

Analysis:
- A single PostgreSQL instance handles this easily
- No need for distributed databases
- Focus on proper indexing and connection pooling

→ Recommendation: PostgreSQL with read replica for redundancy
```

Scalability isn't just about handling data volume. Consider: Can you efficiently shard your workload? Are there hot spots (celebrity accounts, popular products)? Is your access pattern read-heavy or write-heavy? A system that handles 1 PB of cold data differs vastly from one handling 1M writes/second.
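The 3-year projections in the checklist are plain compound growth; a minimal sketch of the arithmetic:

```python
def project(value: float, annual_growth_pct: float, years: int) -> float:
    """Compound a current metric forward: value * (1 + g)^years."""
    return value * (1 + annual_growth_pct / 100) ** years

# The SaaS example above: 20% year-over-year growth over three years.
print(round(project(50, 20, 3)))    # data volume: ~86 GB
print(round(project(500, 20, 3)))   # peak read QPS: ~864
print(round(project(50, 20, 3)))    # peak write QPS: ~86
```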
Consistency guarantees determine what applications can assume about data state after operations. Different models offer different consistency capabilities—this is often a deal-breaker for certain use cases.
Consistency Spectrum:
| Database Type | Consistency Model | ACID Transactions | Notes |
|---|---|---|---|
| Relational (single node) | Strong | Full ACID | Industry-leading transactional guarantees |
| Relational (replicated) | Varies* | Partial** | Depends on replication mode (sync vs async) |
| Document (MongoDB) | Tunable | Multi-document (4.0+) | Read/write concern configurable |
| Key-Value (Redis) | Varies*** | Single-key only | Cluster mode: eventually consistent |
| Graph (Neo4j) | Strong | Full ACID | One of Neo4j's key advantages |
| Column-Family | Tunable (typically eventual) | Row-level**** | Cassandra: tunable consistency levels |
Legend:

- (*) Consistency in replicated relational setups depends on replication mode: synchronous replicas stay current, while asynchronous replicas can serve stale reads.
- (**) Transactions remain fully ACID on the primary; guarantees weaken once reads are routed to lagging replicas.
- (***) A single Redis node is strongly consistent for single-key operations; cluster mode replicates asynchronously, so acknowledged writes can be lost on failover.
- (****) Writes to a single row/partition are atomic, but there are no multi-row ACID transactions (Cassandra's lightweight transactions cover only narrow cases).
Use Case Requirements:

- Demand strong consistency: payments, account balances, inventory reservations, anything where acting on a stale read causes real harm.
- Tolerate eventual consistency: social feeds, view counters, analytics pipelines, recommendation data.
Eventual consistency forces application developers to handle race conditions, stale reads, and conflict resolution. This complexity is often underestimated. Unless you have specific scale requirements that demand eventual consistency, prefer strongly consistent systems and simpler application code.
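The "tunable" entries in the table above are concrete API choices, not abstractions. As a minimal sketch, here is how MongoDB's read and write concerns are set per collection with pymongo (the deployment URI and collection names are hypothetical):

```python
from pymongo import MongoClient, WriteConcern
from pymongo.read_concern import ReadConcern

client = MongoClient("mongodb://localhost:27017")  # hypothetical deployment
db = client.shop

# Strong end of the spectrum: acknowledge a write only after a majority of
# replica-set members have it, and read only majority-committed data.
orders = db.get_collection(
    "orders",
    write_concern=WriteConcern(w="majority"),
    read_concern=ReadConcern("majority"),
)

# Relaxed end: acknowledge after the primary alone, for high-volume,
# low-value data where losing a write on failover is acceptable.
events = db.get_collection("click_events", write_concern=WriteConcern(w=1))

orders.insert_one({"order_id": 1, "total_cents": 4999})
```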
Technical capabilities are only half the story. Operational context—your team's skills, infrastructure, budget, and organizational constraints—significantly impacts which databases are viable choices.
Team Skills:
A database your team cannot operate reliably is worse than a suboptimal one they can. Consider:
| Database Type | Ops Complexity | Managed Options | Skills Required |
|---|---|---|---|
| Relational (single) | Low | Excellent (RDS, Cloud SQL, etc.) | Standard DBA skills |
| Relational (sharded) | High | Moderate (Vitess, Citus) | Sharding expertise, custom tooling |
| Document (MongoDB) | Moderate | Excellent (Atlas) | Understanding of document modeling |
| Key-Value (Redis) | Low-Moderate | Excellent (ElastiCache, etc.) | Memory management, cluster ops |
| Graph (Neo4j) | Moderate | Good (Aura) | Graph modeling, Cypher proficiency |
| Column-Family | High | Moderate (Astra; DynamoDB, see note below) | Deep understanding of data modeling, compaction, consistency |
Cloud vs Self-Hosted:

Managed services (RDS, Atlas, ElastiCache, Aura) trade a higher per-unit price for dramatically lower operational burden: automated backups, patching, failover, and scaling. Self-hosting makes sense when you need fine-grained control, face strict compliance or data-residency requirements, or already employ the operational expertise.

Budget Considerations:

Compare total cost of ownership, not sticker price: managed-service fees versus the engineering time needed to operate a self-hosted cluster, plus licensing for commercial options (e.g., Neo4j Enterprise) and the redundancy your availability target demands.
AWS DynamoDB deserves special mention: it's a fully managed key-value/document database with essentially zero operations overhead. If you're on AWS and can model your data for DynamoDB's access patterns, it eliminates operational concerns entirely. But its pricing and data modeling constraints require careful evaluation.
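As a minimal sketch of the key-based access pattern DynamoDB rewards, here is the session example with boto3. The table name, key schema, and region are assumptions, and the table is presumed to already exist:

```python
import boto3  # assumes AWS credentials are configured in the environment

dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
table = dynamodb.Table("sessions")  # hypothetical table with partition key "session_id"

# Every operation is addressed by key. DynamoDB performs best when the
# table is designed around the exact lookups the application will make.
table.put_item(Item={"session_id": "abc123", "user_id": "u42", "cart": []})

resp = table.get_item(Key={"session_id": "abc123"})
print(resp.get("Item"))
```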
Polyglot persistence is the practice of using multiple databases within a single application, each chosen for specific workloads. Rather than forcing all data into one model, you leverage the strengths of specialized systems.
A Typical Polyglot Architecture:
```
# E-Commerce Platform: Polyglot Architecture

┌─────────────────────────────────────────────────────────────────┐
│                        Application Layer                        │
└────────────┬──────────────┬──────────────┬──────────────┬──────┘
             │              │              │              │
             ▼              ▼              ▼              ▼
      ┌────────────┐ ┌────────────┐ ┌────────────┐ ┌─────────────┐
      │ PostgreSQL │ │  MongoDB   │ │   Redis    │ │Elasticsearch│
      │ (Primary)  │ │ (Catalog)  │ │  (Cache)   │ │  (Search)   │
      └────────────┘ └────────────┘ └────────────┘ └─────────────┘
      • Orders        • Products    • Sessions     • Product Search
      • Customers     • Reviews     • Cart         • Autocomplete
      • Inventory     • Content     • Rate Limits  • Filters
      • Transactions  • Promotions  • Real-time    • Faceted Nav
                                      Counters

┌─────────────────────────────────────────────────────────────────┐
│                         Analytics Layer                         │
└────────────┬──────────────┬─────────────────────────────────────┘
             │              │
             ▼              ▼
      ┌────────────┐ ┌────────────┐
      │ ClickHouse │ │   Neo4j    │
      │ (Analytics)│ │ (RecEngine)│
      └────────────┘ └────────────┘
      • Event Logs    • User Graph
      • Aggregations  • "Customers also
      • Dashboards      bought"
```

Managing Data Synchronization:
The central challenge in polyglot architectures is keeping data consistent across databases. Common patterns:

- Change data capture (CDC): tail the primary database's log and replay changes into downstream stores
- Event-driven synchronization: the application publishes domain events that each store's consumer applies
- Dual writes: the application writes to multiple stores directly (simple, but prone to partial failure)
- Batch ETL: periodically extract and reload, acceptable when staleness is tolerable
Tools like Debezium (CDC), Apache Kafka (event streaming), and platform-specific solutions simplify synchronization.
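As a minimal sketch of the event-driven pattern, the snippet below publishes a change event with the confluent-kafka client after the primary database commits; the topic name and payload shape are illustrative:

```python
import json

from confluent_kafka import Producer  # assumes a reachable Kafka broker

producer = Producer({"bootstrap.servers": "localhost:9092"})  # hypothetical broker

def publish_product_changed(product: dict) -> None:
    """Emit a change event once the primary database has committed.

    Downstream consumers (search indexer, cache invalidator, analytics
    loader) each apply the change to their own store.
    """
    producer.produce(
        topic="product-changed",  # hypothetical topic
        key=str(product["id"]),   # keying by product id preserves per-product ordering
        value=json.dumps(product),
    )
    producer.flush()

publish_product_changed({"id": 42, "name": "T-shirt", "price_cents": 1999})
```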
Don't adopt polyglot persistence prematurely. Start with a single database (usually relational). Add specialized databases only when clear requirements emerge that the primary database cannot satisfy. Each additional database multiplies operational and consistency complexity.
While no flowchart can capture all nuances, the following provides a starting point for model selection. Use it as a conversation starter, not an absolute prescription.
Primary Decision Points:
```
# Data Model Selection Flowchart

                        START
                          │
                          ▼
┌───────────────────────────────────────────────────────┐
│ Q1: Is your primary access pattern key-based lookup   │
│     with no need to query on value contents?          │
└───────────────────────────────────────────────────────┘
        │                           │
       YES                         NO
        │                           │
        ▼                           ▼
┌───────────────┐   ┌─────────────────────────────────┐
│   KEY-VALUE   │   │ Q2: Is relationship traversal   │
│   (Redis,     │   │     central to your queries?    │
│   DynamoDB)   │   │     (3+ hops common)            │
└───────────────┘   └─────────────────────────────────┘
                            │                │
                           YES              NO
                            │                │
                            ▼                ▼
                ┌───────────────┐   ┌───────────────────────────┐
                │     GRAPH     │   │ Q3: Do entities have      │
                │    (Neo4j,    │   │     varying structure?    │
                │    Neptune)   │   │     (product types differ)│
                └───────────────┘   └───────────────────────────┘
                                            │               │
                                           YES             NO
                                            │               │
                                            ▼               ▼
                                ┌───────────────┐   ┌─────────────────────────┐
                                │   DOCUMENT    │   │ Q4: Is write volume     │
                                │   (MongoDB,   │   │     extreme (>100K/sec) │
                                │   Couchbase)  │   │     or data >10TB?      │
                                └───────────────┘   └─────────────────────────┘
                                                        │              │
                                                       YES            NO
                                                        │              │
                                                        ▼              ▼
                                                ┌─────────────┐  ┌──────────────┐
                                                │   COLUMN-   │  │  RELATIONAL  │
                                                │   FAMILY    │  │ (PostgreSQL, │
                                                │ (Cassandra, │  │    MySQL)    │
                                                │   HBase)    │  │              │
                                                └─────────────┘  └──────────────┘
```

Important Caveats:
This flowchart oversimplifies complex decisions. Real considerations include:

- Query patterns that cut across the structural questions (analytics, full-text search)
- Consistency requirements that eliminate otherwise good fits
- Team skills, budget, and the availability of managed services
- Combinations: most real systems end up polyglot rather than single-model
Use the flowchart as a starting hypothesis, then validate through the full framework analysis.
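For readers who think in code, the flowchart's logic reduces to a few guarded questions; a minimal sketch:

```python
def suggest_model(
    key_lookup_only: bool,    # Q1: key-based access, no queries on values?
    traversal_centric: bool,  # Q2: multi-hop relationship traversals?
    varying_structure: bool,  # Q3: entity shapes differ widely?
    extreme_scale: bool,      # Q4: >100K writes/sec or >10TB of data?
) -> str:
    """The selection flowchart as code: each question gates the next."""
    if key_lookup_only:
        return "KEY-VALUE (Redis, DynamoDB)"
    if traversal_centric:
        return "GRAPH (Neo4j, Neptune)"
    if varying_structure:
        return "DOCUMENT (MongoDB, Couchbase)"
    if extreme_scale:
        return "COLUMN-FAMILY (Cassandra, HBase)"
    return "RELATIONAL (PostgreSQL, MySQL)"

# The e-commerce order workload from earlier: structured, relational, moderate scale.
print(suggest_model(False, False, False, False))  # -> RELATIONAL (PostgreSQL, MySQL)
```

Treat the output as the starting hypothesis the caveats above warn about, not a verdict.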
Notice how the flowchart reaches 'Relational' only after ruling out specialized needs. This is intentional. Relational databases are the safest default: well-understood, widely supported, ACID-compliant, and increasingly capable (JSON, full-text search, etc.). Choose alternatives when you have specific requirements relational cannot satisfy.
Data model selection is one of the most consequential architectural decisions in software development. The right choice enables natural data expression, efficient queries, and smooth scaling. The wrong choice creates constant friction, technical debt, and eventual costly migrations.
Let's consolidate the framework we've developed:
| If your priority is... | Consider... |
|---|---|
| Strong transactions, complex queries, structured data | Relational (PostgreSQL, MySQL) |
| Flexible schemas, rapid iteration, nested objects | Document (MongoDB, Couchbase) |
| Maximum speed, simple lookups, caching | Key-Value (Redis, DynamoDB) |
| Deep relationships, recommendations, fraud detection | Graph (Neo4j, Neptune) |
| Massive write throughput, time-series, IoT | Column-Family (Cassandra, HBase) |
Module Complete:
You've now completed a comprehensive exploration of modern data models beyond the relational paradigm. From documents to key-values, graphs to column-families, you understand not just how each model works, but when to apply each one.
This knowledge transforms you from a developer who uses whatever database is familiar to an architect who deliberately selects the right tool for each job. As you encounter new projects, apply this framework: analyze systematically, justify alternatives explicitly, and always keep operational reality in mind.
Congratulations! You've mastered the Document and Other Data Models module. You understand five major data paradigms—their strengths, limitations, and ideal use cases. More importantly, you possess a systematic framework for selecting the right model for any application. This architectural judgment distinguishes senior engineers from those who simply follow convention.