We've now explored five distinct data models: relational, document, key-value, graph, and column-family. Each was designed by engineers solving specific problems under specific constraints. None is universally superior—each excels in its domain and falters when misapplied.
The challenge facing every software architect is model selection: given a problem domain, which data model (or combination of models) best serves the application's requirements? This decision is among the most consequential in system design. Choosing incorrectly means either living with constant friction as you force data into an ill-fitting model, or undertaking a costly, risky migration once the mismatch becomes untenable.
This page synthesizes everything we've learned into a practical decision framework—not a rigid flowchart, but a structured approach to evaluating options and making informed choices.
You'll learn to analyze data characteristics, evaluate query patterns, assess scalability requirements, weigh consistency needs, and factor in operational and organizational constraints. By the end, you'll have a systematic approach to database selection decisions.
Model selection isn't a single decision—it's a structured analysis across multiple dimensions. Consider each dimension as a lens through which to evaluate candidate models:
The Five Dimensions of Model Selection:
| Dimension | Key Questions | Impact on Choice |
|---|---|---|
| Data Structure | Is data tabular, hierarchical, connected, or key-addressed? | Determines natural model fit |
| Query Patterns | CRUD? Analytics? Traversals? Full-text search? | Determines performance characteristics |
| Scale Requirements | GB or PB? 100 QPS or 1M QPS? Growth trajectory? | Filters viable options by capacity |
| Consistency Needs | ACID required? Eventual consistency acceptable? | Eliminates models that can't guarantee requirements |
| Operational Context | Team skills? Cloud vs on-prem? Budget? | Practical feasibility of operation |
The Process:

1. Characterize your data's structure and relationships.
2. Enumerate the queries the application must serve.
3. Estimate current and projected scale.
4. Determine your consistency requirements.
5. Assess operational feasibility, eliminate models that fail any hard requirement, and pick the simplest option that remains.
Relational databases are the most versatile and well-understood option. When uncertain, start with PostgreSQL or MySQL. Only move to specialized models when you can clearly articulate why relational databases cannot meet your requirements. The burden of proof should be on the alternative.
The first dimension examines the natural shape of your data. Different data models represent different structures efficiently.
Structural Characteristics to Evaluate:
| Data Characteristic | Best Model Fit | Example Domain |
|---|---|---|
| Highly regular, tabular data | Relational | Financial transactions, ERP systems |
| Nested, varying structure, self-contained entities | Document | Product catalogs, CMS, user profiles |
| Simple key-addressed, opaque values | Key-Value | Sessions, caches, feature flags |
| Densely connected, relationship-centric | Graph | Social networks, fraud detection, knowledge graphs |
| Wide, sparse, time-oriented | Column-Family | IoT telemetry, event logs, analytics |
```
# Structure Analysis: E-commerce Platform

## Order Management (Core Transaction)
- Fixed structure: order_id, customer_id, date, status, total
- Strong relationships: order -> customer, order -> items -> products
- Referential integrity critical
→ RELATIONAL: Orders, OrderItems, Customers, Products

## Product Catalog (Varying Attributes)
- Electronics have specifications; clothing has sizes/colors
- Nested: product -> variants -> images
- Attributes vary wildly by category
→ DOCUMENT: Each product type has different fields

## User Sessions (Ephemeral, Key-Based)
- Lookup by session_id only
- Expires after timeout
- No relationships or queries on content
→ KEY-VALUE: Redis with TTL

## Recommendation Engine (Connection-Centric)
- "Users who bought X also bought Y"
- "Friends of friends who like Z"
- Multi-hop relationship traversals
→ GRAPH: User-Product purchase graph

## Analytics Events (High Volume, Time-Series)
- Millions of events per second
- Queries by time range, aggregated
- Retention policy (delete after 90 days)
→ COLUMN-FAMILY: Cassandra with time-bucketed partitions
```
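To make the key-value mapping concrete, here is a minimal sketch of the session store using the redis-py client. The key scheme and TTL value are illustrative assumptions, not prescriptions:

```python
import json
import uuid

import redis  # redis-py; assumes a Redis server on localhost:6379

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

SESSION_TTL_SECONDS = 30 * 60  # hypothetical policy: expire idle sessions after 30 minutes

def create_session(user_id: str) -> str:
    """Store the session as an opaque JSON value, addressed only by its key."""
    session_id = str(uuid.uuid4())
    payload = json.dumps({"user_id": user_id, "cart": []})
    # SETEX writes the value and attaches the TTL in a single command,
    # so expiration never needs application-side cleanup.
    r.setex(f"session:{session_id}", SESSION_TTL_SECONDS, payload)
    return session_id

def get_session(session_id: str):
    """Return the session dict, or None if it has expired or never existed."""
    raw = r.get(f"session:{session_id}")
    return json.loads(raw) if raw is not None else None
```

Note how nothing here queries on the session's contents; that is exactly the access pattern that makes key-value the right fit.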
Hybrid Structures:

Real applications rarely fit perfectly into one model. The e-commerce example above touches five different models, a realistic outcome for complex systems. Rather than forcing all data into one model, consider polyglot persistence: use each database where it excels.
The challenge then becomes coordination: maintaining consistency across databases, managing operational complexity, and handling cross-database queries. We'll address polyglot architectures later in this page.
Modern relational databases (PostgreSQL, MySQL) support JSON columns with indexing and querying. This enables document-like flexibility within a relational database for portions of your schema. Consider this before adding a separate document database.
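As a minimal sketch of that hybrid approach, the snippet below uses PostgreSQL's JSONB type through the psycopg2 driver. The table, column names, and connection string are hypothetical:

```python
import psycopg2
from psycopg2.extras import Json  # adapts Python dicts to JSON parameters

conn = psycopg2.connect("dbname=shop user=shop")  # hypothetical DSN
cur = conn.cursor()

# Fixed fields stay relational; category-specific attributes live in JSONB.
cur.execute("""
    CREATE TABLE IF NOT EXISTS products (
        id          serial PRIMARY KEY,
        name        text    NOT NULL,
        price_cents integer NOT NULL,
        attrs       jsonb   NOT NULL DEFAULT '{}'
    )
""")
# A GIN index makes containment queries on the JSONB column efficient.
cur.execute("CREATE INDEX IF NOT EXISTS products_attrs_gin ON products USING GIN (attrs)")

cur.execute(
    "INSERT INTO products (name, price_cents, attrs) VALUES (%s, %s, %s)",
    ("T-shirt", 1999, Json({"sizes": ["S", "M", "L"], "color": "red"})),
)
conn.commit()

# @> is PostgreSQL's JSONB containment operator: find all red products.
cur.execute("""SELECT name FROM products WHERE attrs @> '{"color": "red"}'""")
print(cur.fetchall())
```

You get document-style flexibility for the varying attributes while keeping ACID transactions and joins for everything else.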
Data structure matters, but query patterns—how data is accessed—often dominate the selection decision. A model that elegantly represents your data but cannot efficiently answer your queries is worse than useless.
Query Pattern Categories:
| Query Pattern | Relational | Document | Key-Value | Graph | Column-Family |
|---|---|---|---|---|---|
| Point Query | ✅ Excellent | ✅ Excellent | ✅ Excellent | ✅ Excellent | ✅ Excellent |
| Range Query | ✅ Excellent | ✅ Good | ⚠️ Limited* | ⚠️ Limited | ✅ Excellent** |
| Secondary Lookup | ✅ Excellent | ✅ Good | ❌ Not supported | ✅ Good | ⚠️ Limited*** |
| Aggregations | ✅ Excellent | ✅ Good | ❌ Not supported | ⚠️ Limited | ⚠️ Limited |
| Joins | ✅ Excellent | ⚠️ Limited**** | ❌ Not supported | ✅ Native | ❌ Not supported |
| Traversals | ⚠️ Expensive***** | ❌ Not supported | ❌ Not supported | ✅ Excellent | ❌ Not supported |
| Full-Text Search | ⚠️ Basic | ⚠️ Basic | ❌ Not supported | ❌ Not supported | ❌ Not supported |
Legend:

- (*) Key-value range scans require a specially ordered structure (e.g., Redis sorted sets); plain keys are unordered.
- (**) Column-family range queries are excellent within a partition, via clustering columns; cross-partition ranges are expensive.
- (***) Column-family secondary indexes (e.g., in Cassandra) exist but perform poorly at scale; prefer query-specific tables.
- (****) Document databases offer limited join support (e.g., MongoDB's $lookup), without the optimizer sophistication of SQL joins.
- (*****) Relational traversals rely on recursive CTEs or repeated self-joins, whose cost grows quickly with depth.
The 80/20 Rule:
Identify the queries that will constitute 80% of your traffic. Optimize your model for these. Less frequent queries can tolerate suboptimal performance or be handled by secondary systems (search engines, analytics databases).
If you don't know your query patterns, be cautious about NoSQL. Relational databases handle ad-hoc queries gracefully. NoSQL models require designing tables around known queries. Discovering a new critical query pattern after deployment may require significant schema changes in NoSQL systems.
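To illustrate what "designing tables around known queries" means in practice, here is a sketch using the DataStax cassandra-driver. The keyspace, table, and bucketing scheme are illustrative assumptions:

```python
from cassandra.cluster import Cluster  # DataStax cassandra-driver

session = Cluster(["127.0.0.1"]).connect("analytics")  # hypothetical keyspace

# The table exists to answer exactly one known query: "events for a device
# within a time range". A per-day bucket in the partition key bounds
# partition size; the clustering column orders rows for range scans.
session.execute("""
    CREATE TABLE IF NOT EXISTS events_by_device_day (
        device_id text,
        day       date,
        ts        timestamp,
        payload   text,
        PRIMARY KEY ((device_id, day), ts)
    ) WITH CLUSTERING ORDER BY (ts DESC)
""")

# Efficient: the query touches a single partition.
rows = session.execute(
    "SELECT ts, payload FROM events_by_device_day "
    "WHERE device_id = %s AND day = %s AND ts >= %s",
    ("sensor-7", "2024-01-15", "2024-01-15 00:00:00+0000"),
)
```

A query this table was not designed for, say "all events with a given payload", would require a full cluster scan or a second, purpose-built table.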
Scale requirements filter out models that simply cannot handle your projected load. This dimension includes data volume, throughput, and latency requirements.
Data Volume and Throughput (Operations per Second):
| Model | Vertical Scaling | Horizontal Scaling | Max Practical Scale |
|---|---|---|---|
| Relational (single) | Excellent | Limited (read replicas) | Low TB / 10K QPS |
| Relational (sharded) | Good | Moderate (manual) | High TB / 100K QPS |
| Document (MongoDB) | Good | Good (auto-sharding) | PB / 1M+ QPS |
| Key-Value (Redis) | Excellent | Good (cluster mode) | TB / 1M+ QPS* |
| Graph (Neo4j) | Excellent | Limited (read replicas) | Billions of nodes with care |
| Column-Family | Good | Excellent (native) | PB / millions of QPS |
(*) Redis is memory-bound: practical data volume is limited by the RAM available across the cluster.

Latency Requirements:

- Sub-millisecond: in-memory stores (Redis) on point lookups
- Low single-digit milliseconds: well-indexed point queries on disk-backed stores
- Tens of milliseconds and up: multi-hop traversals, cross-partition aggregations, analytical scans
The "Google Scale" Trap:
Many teams choose complex distributed databases because they might grow to Google's scale. This is premature optimization. PostgreSQL on a well-provisioned server handles more than most applications will ever need. Choose for your actual scale trajectory, not hypothetical exponential growth.
```
# Scale Estimation Checklist

## Current State
- Total data volume: _____ GB/TB
- Annual growth rate: _____ %
- Peak read QPS: _____
- Peak write QPS: _____
- p99 latency requirement: _____ ms

## 3-Year Projection
- Projected data volume: _____ TB
- Projected peak read QPS: _____
- Projected peak write QPS: _____

## Example: SaaS Application

Current:
- 50 GB data
- 20% YoY growth
- 500 read QPS peak
- 50 write QPS peak
- 50ms p99 latency required

3-Year Projection:
- ~85 GB data (50 * 1.2^3)
- ~850 read QPS
- ~85 write QPS

Analysis:
- A single PostgreSQL instance handles this easily
- No need for distributed databases
- Focus on proper indexing and connection pooling

→ Recommendation: PostgreSQL with read replica for redundancy
```

Scalability isn't just about handling data volume. Consider: Can you efficiently shard your workload? Are there hot spots (celebrity accounts, popular products)? Is your access pattern read-heavy or write-heavy? A system that handles 1 PB of cold data differs vastly from one handling 1M writes/second.
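The 3-year projections in the checklist are plain compound growth; a minimal sketch of the arithmetic:

```python
def project(value: float, annual_growth_pct: float, years: int) -> float:
    """Compound a current metric forward: value * (1 + g)^years."""
    return value * (1 + annual_growth_pct / 100) ** years

# The SaaS example above: 20% year-over-year growth over three years.
print(round(project(50, 20, 3)))    # data volume: ~86 GB
print(round(project(500, 20, 3)))   # peak read QPS: ~864
print(round(project(50, 20, 3)))    # peak write QPS: ~86
```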
Consistency guarantees determine what applications can assume about data state after operations. Different models offer different consistency capabilities—this is often a deal-breaker for certain use cases.
Consistency Spectrum:
| Database Type | Consistency Model | ACID Transactions | Notes |
|---|---|---|---|
| Relational (single node) | Strong | Full ACID | Industry-leading transactional guarantees |
| Relational (replicated) | Varies* | Partial** | Depends on replication mode (sync vs async) |
| Document (MongoDB) | Tunable | Multi-document (4.0+) | Read/write concern configurable |
| Key-Value (Redis) | Varies*** | Single-key only | Cluster mode: eventually consistent |
| Graph (Neo4j) | Strong | Full ACID | One of Neo4j's key advantages |
| Column-Family | Tunable (typically eventual) | Row-level**** | Cassandra: tunable consistency levels |
Legend:

- (*) Consistency in replicated relational setups depends on replication mode: synchronous replicas stay current, while asynchronous replicas can serve stale reads.
- (**) Transactions remain fully ACID on the primary; guarantees weaken once reads are routed to lagging replicas.
- (***) A single Redis node is strongly consistent for single-key operations; cluster mode replicates asynchronously, so acknowledged writes can be lost on failover.
- (****) Writes to a single row/partition are atomic, but there are no multi-row ACID transactions (Cassandra's lightweight transactions cover only narrow cases).
Use Case Requirements:

- Demand strong consistency: payments, account balances, inventory reservations, anything where acting on a stale read causes real harm.
- Tolerate eventual consistency: social feeds, view counters, analytics pipelines, recommendation data.
Eventual consistency forces application developers to handle race conditions, stale reads, and conflict resolution. This complexity is often underestimated. Unless you have specific scale requirements that demand eventual consistency, prefer strongly consistent systems and simpler application code.
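The "tunable" entries in the table above are concrete API choices, not abstractions. As a minimal sketch, here is how MongoDB's read and write concerns are set per collection with pymongo (the deployment URI and collection names are hypothetical):

```python
from pymongo import MongoClient, WriteConcern
from pymongo.read_concern import ReadConcern

client = MongoClient("mongodb://localhost:27017")  # hypothetical deployment
db = client.shop

# Strong end of the spectrum: acknowledge a write only after a majority of
# replica-set members have it, and read only majority-committed data.
orders = db.get_collection(
    "orders",
    write_concern=WriteConcern(w="majority"),
    read_concern=ReadConcern("majority"),
)

# Relaxed end: acknowledge after the primary alone, for high-volume,
# low-value data where losing a write on failover is acceptable.
events = db.get_collection("click_events", write_concern=WriteConcern(w=1))

orders.insert_one({"order_id": 1, "total_cents": 4999})
```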
Technical capabilities are only half the story. Operational context—your team's skills, infrastructure, budget, and organizational constraints—significantly impacts which databases are viable choices.
Team Skills:
A database your team cannot operate reliably is worse than a suboptimal one they can. Consider:
| Database Type | Ops Complexity | Managed Options | Skills Required |
|---|---|---|---|
| Relational (single) | Low | Excellent (RDS, Cloud SQL, etc.) | Standard DBA skills |
| Relational (sharded) | High | Moderate (Vitess, Citus) | Sharding expertise, custom tooling |
| Document (MongoDB) | Moderate | Excellent (Atlas) | Understanding of document modeling |
| Key-Value (Redis) | Low-Moderate | Excellent (ElastiCache, etc.) | Memory management, cluster ops |
| Graph (Neo4j) | Moderate | Good (Aura) | Graph modeling, Cypher proficiency |
| Column-Family | High | Moderate (Astra; DynamoDB, see note below) | Deep understanding of data modeling, compaction, consistency |
Cloud vs Self-Hosted:

Managed services (RDS, Atlas, ElastiCache, Aura) trade a higher per-unit price for dramatically lower operational burden: automated backups, patching, failover, and scaling. Self-hosting makes sense when you need fine-grained control, face strict compliance or data-residency requirements, or already employ the operational expertise.

Budget Considerations:

Compare total cost of ownership, not sticker price: managed-service fees versus the engineering time needed to operate a self-hosted cluster, plus licensing for commercial options (e.g., Neo4j Enterprise) and the redundancy your availability target demands.
AWS DynamoDB deserves special mention: it's a fully managed key-value/document database with essentially zero operations overhead. If you're on AWS and can model your data for DynamoDB's access patterns, it eliminates operational concerns entirely. But its pricing and data modeling constraints require careful evaluation.
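As a minimal sketch of the key-based access pattern DynamoDB rewards, here is the session example with boto3. The table name, key schema, and region are assumptions, and the table is presumed to already exist:

```python
import boto3  # assumes AWS credentials are configured in the environment

dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
table = dynamodb.Table("sessions")  # hypothetical table with partition key "session_id"

# Every operation is addressed by key. DynamoDB performs best when the
# table is designed around the exact lookups the application will make.
table.put_item(Item={"session_id": "abc123", "user_id": "u42", "cart": []})

resp = table.get_item(Key={"session_id": "abc123"})
print(resp.get("Item"))
```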
Polyglot persistence is the practice of using multiple databases within a single application, each chosen for specific workloads. Rather than forcing all data into one model, you leverage the strengths of specialized systems.
A Typical Polyglot Architecture:
```
# E-Commerce Platform: Polyglot Architecture

┌─────────────────────────────────────────────────────────────────┐
│                        Application Layer                        │
└────────────┬──────────────┬──────────────┬──────────────┬──────┘
             │              │              │              │
             ▼              ▼              ▼              ▼
      ┌────────────┐ ┌────────────┐ ┌────────────┐ ┌─────────────┐
      │ PostgreSQL │ │  MongoDB   │ │   Redis    │ │Elasticsearch│
      │ (Primary)  │ │ (Catalog)  │ │  (Cache)   │ │  (Search)   │
      └────────────┘ └────────────┘ └────────────┘ └─────────────┘
      • Orders        • Products    • Sessions     • Product Search
      • Customers     • Reviews     • Cart         • Autocomplete
      • Inventory     • Content     • Rate Limits  • Filters
      • Transactions  • Promotions  • Real-time    • Faceted Nav
                                      Counters

┌─────────────────────────────────────────────────────────────────┐
│                         Analytics Layer                         │
└────────────┬──────────────┬─────────────────────────────────────┘
             │              │
             ▼              ▼
      ┌────────────┐ ┌────────────┐
      │ ClickHouse │ │   Neo4j    │
      │ (Analytics)│ │ (RecEngine)│
      └────────────┘ └────────────┘
      • Event Logs    • User Graph
      • Aggregations  • "Customers also
      • Dashboards      bought"
```

Managing Data Synchronization:
The central challenge in polyglot architectures is keeping data consistent across databases. Common patterns:

- Change data capture (CDC): tail the primary database's log and replay changes into downstream stores
- Event-driven synchronization: the application publishes domain events that each store's consumer applies
- Dual writes: the application writes to multiple stores directly (simple, but prone to partial failure)
- Batch ETL: periodically extract and reload, acceptable when staleness is tolerable
Tools like Debezium (CDC), Apache Kafka (event streaming), and platform-specific solutions simplify synchronization.
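As a minimal sketch of the event-driven pattern, the snippet below publishes a change event with the confluent-kafka client after the primary database commits; the topic name and payload shape are illustrative:

```python
import json

from confluent_kafka import Producer  # assumes a reachable Kafka broker

producer = Producer({"bootstrap.servers": "localhost:9092"})  # hypothetical broker

def publish_product_changed(product: dict) -> None:
    """Emit a change event once the primary database has committed.

    Downstream consumers (search indexer, cache invalidator, analytics
    loader) each apply the change to their own store.
    """
    producer.produce(
        topic="product-changed",  # hypothetical topic
        key=str(product["id"]),   # keying by product id preserves per-product ordering
        value=json.dumps(product),
    )
    producer.flush()

publish_product_changed({"id": 42, "name": "T-shirt", "price_cents": 1999})
```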
Don't adopt polyglot persistence prematurely. Start with a single database (usually relational). Add specialized databases only when clear requirements emerge that the primary database cannot satisfy. Each additional database multiplies operational and consistency complexity.
While no flowchart can capture all nuances, the following provides a starting point for model selection. Use it as a conversation starter, not an absolute prescription.
Primary Decision Points:
```
# Data Model Selection Flowchart

                        START
                          │
                          ▼
┌───────────────────────────────────────────────────────┐
│ Q1: Is your primary access pattern key-based lookup   │
│     with no need to query on value contents?          │
└───────────────────────────────────────────────────────┘
        │                           │
       YES                         NO
        │                           │
        ▼                           ▼
┌───────────────┐   ┌─────────────────────────────────┐
│   KEY-VALUE   │   │ Q2: Is relationship traversal   │
│   (Redis,     │   │     central to your queries?    │
│   DynamoDB)   │   │     (3+ hops common)            │
└───────────────┘   └─────────────────────────────────┘
                            │                │
                           YES              NO
                            │                │
                            ▼                ▼
                ┌───────────────┐   ┌───────────────────────────┐
                │     GRAPH     │   │ Q3: Do entities have      │
                │    (Neo4j,    │   │     varying structure?    │
                │    Neptune)   │   │     (product types differ)│
                └───────────────┘   └───────────────────────────┘
                                            │               │
                                           YES             NO
                                            │               │
                                            ▼               ▼
                                ┌───────────────┐   ┌─────────────────────────┐
                                │   DOCUMENT    │   │ Q4: Is write volume     │
                                │   (MongoDB,   │   │     extreme (>100K/sec) │
                                │   Couchbase)  │   │     or data >10TB?      │
                                └───────────────┘   └─────────────────────────┘
                                                        │              │
                                                       YES            NO
                                                        │              │
                                                        ▼              ▼
                                                ┌─────────────┐  ┌──────────────┐
                                                │   COLUMN-   │  │  RELATIONAL  │
                                                │   FAMILY    │  │ (PostgreSQL, │
                                                │ (Cassandra, │  │    MySQL)    │
                                                │   HBase)    │  │              │
                                                └─────────────┘  └──────────────┘
```

Important Caveats:
This flowchart oversimplifies complex decisions. Real considerations include:

- Query patterns that cut across the structural questions (analytics, full-text search)
- Consistency requirements that eliminate otherwise good fits
- Team skills, budget, and the availability of managed services
- Combinations: most real systems end up polyglot rather than single-model
Use the flowchart as a starting hypothesis, then validate through the full framework analysis.
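For readers who think in code, the flowchart's logic reduces to a few guarded questions; a minimal sketch:

```python
def suggest_model(
    key_lookup_only: bool,    # Q1: key-based access, no queries on values?
    traversal_centric: bool,  # Q2: multi-hop relationship traversals?
    varying_structure: bool,  # Q3: entity shapes differ widely?
    extreme_scale: bool,      # Q4: >100K writes/sec or >10TB of data?
) -> str:
    """The selection flowchart as code: each question gates the next."""
    if key_lookup_only:
        return "KEY-VALUE (Redis, DynamoDB)"
    if traversal_centric:
        return "GRAPH (Neo4j, Neptune)"
    if varying_structure:
        return "DOCUMENT (MongoDB, Couchbase)"
    if extreme_scale:
        return "COLUMN-FAMILY (Cassandra, HBase)"
    return "RELATIONAL (PostgreSQL, MySQL)"

# The e-commerce order workload from earlier: structured, relational, moderate scale.
print(suggest_model(False, False, False, False))  # -> RELATIONAL (PostgreSQL, MySQL)
```

Treat the output as the starting hypothesis the caveats above warn about, not a verdict.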
Notice how the flowchart reaches 'Relational' only after ruling out specialized needs. This is intentional. Relational databases are the safest default: well-understood, widely supported, ACID-compliant, and increasingly capable (JSON, full-text search, etc.). Choose alternatives when you have specific requirements relational cannot satisfy.
Data model selection is one of the most consequential architectural decisions in software development. The right choice enables natural data expression, efficient queries, and smooth scaling. The wrong choice creates constant friction, technical debt, and eventual costly migrations.
Let's consolidate the framework we've developed:
| If your priority is... | Consider... |
|---|---|
| Strong transactions, complex queries, structured data | Relational (PostgreSQL, MySQL) |
| Flexible schemas, rapid iteration, nested objects | Document (MongoDB, Couchbase) |
| Maximum speed, simple lookups, caching | Key-Value (Redis, DynamoDB) |
| Deep relationships, recommendations, fraud detection | Graph (Neo4j, Neptune) |
| Massive write throughput, time-series, IoT | Column-Family (Cassandra, HBase) |
Module Complete:
You've now completed a comprehensive exploration of modern data models beyond the relational paradigm. From documents to key-values, graphs to column-families, you understand not just how each model works, but when to apply each one.
This knowledge transforms you from a developer who uses whatever database is familiar to an architect who deliberately selects the right tool for each job. As you encounter new projects, apply this framework: analyze systematically, justify alternatives explicitly, and always keep operational reality in mind.
Congratulations! You've mastered the Document and Other Data Models module. You understand five major data paradigms—their strengths, limitations, and ideal use cases. More importantly, you possess a systematic framework for selecting the right model for any application. This architectural judgment distinguishes senior engineers from those who simply follow convention.