Every engineer wants to build the perfect system—infinitely scalable, always available, blazing fast, completely consistent, effortlessly maintainable, and cheap to operate.
This system does not exist. It cannot exist.
System design is fundamentally the art of choosing which imperfections to accept. Every architectural decision involves trading one desirable property for another. Faster often means more expensive. More consistent often means less available. Simpler often means less flexible.
The skill of system design is not finding the magical solution that avoids all tradeoffs—it's understanding tradeoffs deeply enough to choose wisely for your specific context.
This page teaches you how to think about, evaluate, and communicate the tradeoffs that define every system you'll ever design.
By the end of this page, you will understand why tradeoffs are inescapable, know the major tradeoff categories in system design, learn frameworks for evaluating tradeoffs, and develop the communication skills to defend your choices. Tradeoff thinking is the lens through which all system design should be viewed.
Tradeoffs in system design aren't a failure of imagination—they're a consequence of physics, economics, and mathematics. Understanding why tradeoffs exist helps you accept them and navigate them wisely.
The tradeoff mindset:
Once you accept that tradeoffs are inevitable, your approach to design changes. Instead of seeking the "best" solution, you seek the "most appropriate" solution. Instead of asking "What's the right answer?", you ask "What are we optimizing for, and what are we willing to sacrifice?"
This shift—from absolutist thinking to tradeoff thinking—is perhaps the most important mental transition in becoming a system designer.
When a candidate presents a design without discussing tradeoffs, it's a red flag. Either they don't understand the tradeoffs (concerning), or they think there's a perfect solution (naive). Strong candidates proactively surface tradeoffs and explain their reasoning.
While there are countless specific tradeoffs, they cluster into recognizable categories. Understanding these categories gives you a framework for analyzing any architectural decision.
| Tradeoff Pair | What You're Choosing | Example |
|---|---|---|
| Consistency vs Availability | Data accuracy vs system responsiveness under failures | Banking (consistency) vs social media likes (availability) |
| Latency vs Throughput | How fast vs how many | Real-time gaming (latency) vs batch processing (throughput) |
| Performance vs Cost | Speed/capacity vs budget | Premium SSDs vs cheaper HDDs |
| Simplicity vs Flexibility | Ease of understanding vs capability | Monolith (simple) vs microservices (flexible) |
| Speed vs Accuracy | Quick responses vs precise results | Approximate counting vs exact counting at scale |
| Read Performance vs Write Performance | Fast reads vs fast writes | Heavily indexed tables (read) vs minimal indexes (write) |
| Durability vs Performance | Data safety vs speed | Fsync on every write (durable) vs batched writes (fast) |
| Isolation vs Integration | Independence vs coordination | Separate databases per service vs shared database |
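One row of the table — read performance vs write performance — can be made concrete with a toy store. This is an illustrative sketch (the class and its fields are invented for this example): maintaining an index makes lookups O(1), but every write pays the extra cost of updating it.

```python
class UserStore:
    """Toy store illustrating the read-vs-write tradeoff of indexing."""

    def __init__(self, indexed: bool):
        self.rows = []            # list of (user_id, email) tuples
        self.indexed = indexed
        self.by_email = {}        # email -> user_id, maintained only if indexed

    def insert(self, user_id: str, email: str) -> None:
        self.rows.append((user_id, email))
        if self.indexed:
            self.by_email[email] = user_id   # extra work on every write

    def find_by_email(self, email: str):
        if self.indexed:
            return self.by_email.get(email)      # O(1) indexed lookup
        for user_id, row_email in self.rows:     # O(n) full scan
            if row_email == email:
                return user_id
        return None
```

The same logic plays out in real databases: each secondary index speeds up a class of queries while slowing down every insert and update that must keep it current.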
Tradeoffs are not binary:
Note that these aren't simple either/or choices. Real systems exist on a spectrum. You might choose to be "mostly consistent with occasional staleness" or "high availability with eventual consistency after 5 seconds." The art is in finding the right point on the spectrum for your requirements.
The 'right' position on any tradeoff depends entirely on context. A banking system and a social media platform make opposite tradeoff choices—and both are correct for their domains. Always ask: 'What does our specific use case require?'
This is the most famous tradeoff in distributed systems, codified in the CAP Theorem. Let's understand it deeply.
CAP Theorem states: in a distributed system, when a network partition occurs, you must choose between consistency (every read sees the most recent write) and availability (every request receives a response, even if it may be stale).
You cannot have both during a partition. Network partitions (P) are not optional—they happen. So you're really choosing between CP (consistency over availability) and AP (availability over consistency).
Beyond CAP—PACELC:
The PACELC theorem extends CAP: If there is a Partition, choose Availability or Consistency. Else (normal operation), choose Latency or Consistency.
This captures that even without partitions, there's a tradeoff. Synchronous replication (consistent) adds latency. Asynchronous replication (fast) risks stale reads.
Example: Google Spanner chooses consistency at the cost of latency, using atomic clocks to enable globally consistent reads. Amazon DynamoDB defaults to eventual consistency for lower latency, with optional strong consistency at higher latency cost.
Real systems often mix strategies. You might have consistent writes with eventually consistent reads. You might be consistent within a region but eventually consistent across regions. 'Choose consistency or availability' is a simplification—production systems are more nuanced.
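The mixed strategies above can be sketched with a toy primary/replica pair. This is a hypothetical model, not any real database's API: writes land on the primary immediately, replication is applied later, and the read mode decides which copy you see.

```python
class ReplicatedKV:
    """Toy primary/replica store: strong reads hit the primary,
    eventual reads may return stale replica data."""

    def __init__(self):
        self.primary = {}
        self.replica = {}
        self.pending = []   # writes queued for async replication

    def write(self, key, value):
        self.primary[key] = value
        self.pending.append((key, value))   # queued, not yet applied

    def replicate(self):
        """Simulate replication lag catching up."""
        for key, value in self.pending:
            self.replica[key] = value
        self.pending.clear()

    def read(self, key, strong=False):
        # Strong read: always current, but in a real system it costs a
        # round trip to the primary (higher latency).
        # Eventual read: cheap local read, but possibly stale.
        store = self.primary if strong else self.replica
        return store.get(key)
```

In this model, a strong read immediately after a write returns the new value, while an eventual read returns nothing until `replicate()` runs — exactly the staleness window the PACELC latency/consistency choice is about.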
Latency (how fast) and throughput (how many) are often in tension. Optimizing for one can hurt the other.
Latency: Time from request start to response received. Usually measured in milliseconds. Matters for user experience and real-time systems.
Throughput: Number of requests processed per unit time. Usually measured in requests per second (RPS) or transactions per second (TPS). Matters for capacity and cost efficiency.
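The two metrics are linked by Little's Law: requests in flight = throughput × latency. A quick back-of-the-envelope sketch (the numbers are illustrative) shows how latency determines the concurrency needed to hit a throughput target:

```python
def required_concurrency(target_rps: float, latency_s: float) -> float:
    """Little's Law: in-flight requests = throughput * latency.
    Tells you how many concurrent requests (and thus workers or
    connections) you need to sustain a target throughput."""
    return target_rps * latency_s

# 10,000 RPS at 50 ms per request needs ~500 requests in flight.
print(required_concurrency(10_000, 0.050))  # → 500.0
# Cut latency to 10 ms and the same throughput needs 5x less concurrency.
print(required_concurrency(10_000, 0.010))  # → 100.0
```

This is why latency improvements often pay for themselves: faster responses mean fewer concurrent requests, which means fewer servers for the same throughput.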
Practical example — Message Queues:
Consider a message queue like Kafka:
Low latency configuration: Small batch sizes, immediate flush, minimal buffering → Each message is processed quickly, but you can't handle as many messages per second.
High throughput configuration: Large batches, buffered writes, background compression → You process millions of messages per second, but each individual message waits longer.
Which to choose depends on the use case: interactive and real-time workloads favor the low-latency configuration, while analytics pipelines and batch workloads favor the high-throughput one.
Don't just measure average latency—measure p99 and p99.9. At scale, even if 99% of requests are fast, the 1% that are slow affect enough users to matter. Throughput optimizations often hurt tail latency more than average latency.
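A minimal sketch of why batching skews the tail: in this toy model (invented for illustration), a queue flushes every `batch_size` messages, so early arrivals in each batch wait for the batch to fill. Average latency grows modestly, but p99 grows with the full batch size.

```python
def batch_latencies(n_msgs, batch_size, arrival_interval_ms=1.0):
    """Latency of each message when the queue flushes every `batch_size`
    messages. A message's latency is the time it spends waiting for its
    batch to fill (flush/processing time is ignored)."""
    latencies = []
    for i in range(n_msgs):
        pos_in_batch = i % batch_size
        wait_msgs = batch_size - 1 - pos_in_batch  # arrivals still needed
        latencies.append(wait_msgs * arrival_interval_ms)
    return latencies

def percentile(values, pct):
    ordered = sorted(values)
    idx = min(len(ordered) - 1, int(len(ordered) * pct / 100))
    return ordered[idx]

small = batch_latencies(10_000, batch_size=1)
large = batch_latencies(10_000, batch_size=100)
print(percentile(small, 99), percentile(large, 99))  # → 0.0 99.0
```

With batches of 100, the average wait is 49.5 ms but p99 is 99 ms — the tail absorbs nearly the entire batching cost, which is exactly why throughput tuning must be checked against tail-latency budgets, not averages.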
This is often the most uncomfortable tradeoff for engineers—we want the best performance, but budgets are real constraints. Understanding this tradeoff is essential for practical system design.
| Decision Point | High Performance (Higher Cost) | Lower Performance (Lower Cost) |
|---|---|---|
| Storage type | NVMe SSDs, provisioned IOPS | Standard SSDs, HDDs |
| Compute | Dedicated instances, more cores | Shared instances, smaller sizes |
| Replication | Synchronous, multi-AZ/region | Asynchronous, single region |
| Caching | Large Redis clusters, more memory | Smaller caches, more cache misses |
| CDN | Premium tiers, more edge locations | Basic tiers, fewer locations |
| Database | Global tables, instant failover | Single master, manual failover |
The economic reality:
Every system has a cost/performance curve. Initially, more spending brings proportional performance gains. But eventually, you hit diminishing returns—each additional dollar buys less improvement.
Finding the sweet spot: spend until the marginal performance gain per dollar drops below what your users or SLAs actually require. Past that point, additional budget is better invested elsewhere.
Hidden costs to consider: infrastructure spend is only part of the picture. A system that saves $10K/month in infrastructure but requires a full-time engineer to maintain might not be worth it. Consider total cost of ownership, including engineering time, operational burden, and opportunity cost.
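The diminishing-returns curve can be made concrete: given a hypothetical table of monthly spend vs achieved throughput (the tiers below are invented for illustration), compute the marginal gain per extra dollar at each upgrade step and stop where it falls below your threshold.

```python
def marginal_gains(tiers):
    """tiers: list of (monthly_cost_usd, rps) points on the
    cost/performance curve, sorted by increasing cost.
    Returns the extra RPS gained per extra dollar at each step."""
    gains = []
    for (cost_a, rps_a), (cost_b, rps_b) in zip(tiers, tiers[1:]):
        gains.append((rps_b - rps_a) / (cost_b - cost_a))
    return gains

# Hypothetical pricing tiers: each doubling of spend buys less.
tiers = [(1_000, 5_000), (2_000, 9_000), (4_000, 11_000), (8_000, 11_500)]
print(marginal_gains(tiers))  # → [4.0, 1.0, 0.125]
```

Here the first upgrade buys 4 extra RPS per dollar, the last only 0.125 — a 32x drop. Where on that curve you stop is a business decision as much as a technical one.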
This tradeoff is often underappreciated but has profound long-term impact. Simple systems are easier to understand, maintain, and debug—but less capable of handling diverse requirements. Flexible systems can handle almost anything—but become difficult to understand and maintain.
The monolith vs microservices example:
Monoliths are simpler: one codebase, one deployment, no network calls between components. But they're less flexible: harder to scale components independently, all teams work in one repo, one technology stack.
Microservices are more flexible: independent scaling, polyglot development, small focused teams. But they're more complex: distributed transactions, network failures, operational overhead, complex debugging.
When to choose:
Lean toward simplicity when requirements are stable, team is small, speed of development matters most, and you're uncertain about future needs.
Lean toward flexibility when requirements are diverse, teams are large and need independence, you have proven scaling needs, and you can afford the operational investment.
The common mistake: Building for flexibility before you need it. Many systems become unnecessarily complex because architects anticipated needs that never materialized. Start simple; add flexibility when demonstrated needs arise.
The flexibility you're adding 'just in case' often creates maintenance burden, introduces bugs, and never gets used. Build for today's requirements. Add flexibility when you have concrete evidence you'll need it.
Understanding tradeoffs conceptually is one thing; evaluating them for specific decisions is another. Here's a practical framework: identify what you're optimizing for, enumerate what each option sacrifices, quantify the impact where you can, and choose the point on the spectrum your requirements actually demand.
Practical application:
Let's say you're deciding between synchronous replication (strong consistency, higher write latency) and asynchronous replication (eventual consistency, lower latency).
Apply the framework: How much staleness can your use case tolerate? How much latency does each option add? What does each cost to operate and maintain?
The "right" answer depends entirely on these factors. There's no universal correct choice.
When you make tradeoff decisions, document why. 'We chose eventual consistency because our use case tolerates stale data and strong consistency would add 100ms latency.' This helps future maintainers (including yourself) understand and re-evaluate when context changes.
Understanding tradeoffs is only half the battle. Communicating them to stakeholders—peers, managers, product teams—is equally important. Many good designs fail because they weren't communicated effectively.
The tradeoff table:
A useful communication tool is a simple table comparing options:
| Criterion | Option A | Option B | Option C |
|---|---|---|---|
| Latency | 50ms | 100ms | 30ms |
| Cost | $5K/mo | $3K/mo | $12K/mo |
| Complexity | Low | Medium | High |
| Availability | 99.9% | 99.5% | 99.99% |
| Recommendation | ✓ Best for most cases | If budget-constrained | If SLA requires it |
This format makes tradeoffs visible and facilitates productive discussion.
By clearly presenting tradeoffs, you share the decision with stakeholders. It's not 'I chose X'; it's 'Given requirements, I recommend X because of these tradeoffs. Does the team agree?' This creates shared ownership and better decisions.
Even experienced engineers make tradeoff mistakes. Common patterns include optimizing for problems you don't yet have, copying the choices of companies operating at a very different scale, ignoring operational and maintenance costs, and never revisiting past decisions as conditions change.
Tradeoff decisions that were correct a year ago might be wrong today. Requirements change, traffic grows, costs shift. Periodically review major architectural tradeoffs: are they still appropriate for current conditions?
Tradeoff thinking is the core intellectual skill of system design. To consolidate: tradeoffs are inevitable consequences of physics, economics, and mathematics; they cluster into recognizable categories; the right position on any spectrum depends entirely on context; and good decisions are quantified where possible, documented, and communicated to stakeholders.
What's next:
We've defined system design, distinguished it from coding, and explored the art of tradeoffs. Now we're ready for the final piece of Module 1's foundation: understanding system design as problem-solving at scale. The next page explores how scale transforms simple problems into complex systems challenges—and how to think about problems at the scale of millions of users.
You now understand why tradeoffs are inevitable, know the major categories, and have frameworks for evaluating and communicating them. This tradeoff-centric thinking will inform every design decision you make. Next, we'll explore what happens when problems scale to millions of users.