Every engineer wants to build the perfect system—infinitely scalable, always available, blazing fast, completely consistent, effortlessly maintainable, and cheap to operate.
This system does not exist. It cannot exist.
System design is fundamentally the art of choosing which imperfections to accept. Every architectural decision involves trading one desirable property for another. Faster often means more expensive. More consistent often means less available. Simpler often means less flexible.
The skill of system design is not finding the magical solution that avoids all tradeoffs—it's understanding tradeoffs deeply enough to choose wisely for your specific context.
This page teaches you how to think about, evaluate, and communicate the tradeoffs that define every system you'll ever design.
By the end of this page, you will understand why tradeoffs are inescapable, know the major tradeoff categories in system design, learn frameworks for evaluating tradeoffs, and develop the communication skills to defend your choices. Tradeoff thinking is the lens through which all system design should be viewed.
Tradeoffs in system design aren't a failure of imagination—they're a consequence of physics, economics, and mathematics. Understanding why tradeoffs exist helps you accept them and navigate them wisely.
The tradeoff mindset:
Once you accept that tradeoffs are inevitable, your approach to design changes. Instead of seeking the "best" solution, you seek the "most appropriate" solution. Instead of asking "What's the right answer?", you ask "What are we optimizing for, and what are we willing to sacrifice?"
This shift—from absolutist thinking to tradeoff thinking—is perhaps the most important mental transition in becoming a system designer.
When a candidate presents a design without discussing tradeoffs, it's a red flag. Either they don't understand the tradeoffs (concerning), or they think there's a perfect solution (naive). Strong candidates proactively surface tradeoffs and explain their reasoning.
While there are countless specific tradeoffs, they cluster into recognizable categories. Understanding these categories gives you a framework for analyzing any architectural decision.
| Tradeoff Pair | What You're Choosing | Example |
|---|---|---|
| Consistency vs Availability | Data accuracy vs system responsiveness under failures | Banking (consistency) vs social media likes (availability) |
| Latency vs Throughput | How fast vs how many | Real-time gaming (latency) vs batch processing (throughput) |
| Performance vs Cost | Speed/capacity vs budget | Premium SSDs vs cheaper HDDs |
| Simplicity vs Flexibility | Ease of understanding vs capability | Monolith (simple) vs microservices (flexible) |
| Speed vs Accuracy | Quick responses vs precise results | Approximate counting vs exact counting at scale |
| Read Performance vs Write Performance | Fast reads vs fast writes | Heavily indexed tables (read) vs minimal indexes (write) |
| Durability vs Performance | Data safety vs speed | Fsync on every write (durable) vs batched writes (fast) |
| Isolation vs Integration | Independence vs coordination | Separate databases per service vs shared database |
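One row of the table — read performance vs write performance — can be made concrete with a toy store. This is an illustrative sketch (the class and its fields are invented for this example): maintaining an index makes lookups O(1), but every write pays the extra cost of updating it.

```python
class UserStore:
    """Toy store illustrating the read-vs-write tradeoff of indexing."""

    def __init__(self, indexed: bool):
        self.rows = []            # list of (user_id, email) tuples
        self.indexed = indexed
        self.by_email = {}        # email -> user_id, maintained only if indexed

    def insert(self, user_id: str, email: str) -> None:
        self.rows.append((user_id, email))
        if self.indexed:
            self.by_email[email] = user_id   # extra work on every write

    def find_by_email(self, email: str):
        if self.indexed:
            return self.by_email.get(email)      # O(1) indexed lookup
        for user_id, row_email in self.rows:     # O(n) full scan
            if row_email == email:
                return user_id
        return None
```

The same logic plays out in real databases: each secondary index speeds up a class of queries while slowing down every insert and update that must keep it current.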
Tradeoffs are not binary:
Note that these aren't simple either/or choices. Real systems exist on a spectrum. You might choose to be "mostly consistent with occasional staleness" or "high availability with eventual consistency after 5 seconds." The art is in finding the right point on the spectrum for your requirements.
The 'right' position on any tradeoff depends entirely on context. A banking system and a social media platform make opposite tradeoff choices—and both are correct for their domains. Always ask: 'What does our specific use case require?'
This is the most famous tradeoff in distributed systems, codified in the CAP Theorem. Let's understand it deeply.
CAP Theorem states: in a distributed system, when a network partition occurs, you must choose between consistency (every read sees the most recent write) and availability (every request receives a response, even if it may be stale).
You cannot have both during a partition. Network partitions (P) are not optional—they happen. So you're really choosing between CP (consistency over availability) and AP (availability over consistency).
Beyond CAP—PACELC:
The PACELC theorem extends CAP: If there is a Partition, choose Availability or Consistency. Else (normal operation), choose Latency or Consistency.
This captures that even without partitions, there's a tradeoff. Synchronous replication (consistent) adds latency. Asynchronous replication (fast) risks stale reads.
Example: Google Spanner chooses consistency at the cost of latency, using atomic clocks to enable globally consistent reads. Amazon DynamoDB defaults to eventual consistency for lower latency, with optional strong consistency at higher latency cost.
Real systems often mix strategies. You might have consistent writes with eventually consistent reads. You might be consistent within a region but eventually consistent across regions. 'Choose consistency or availability' is a simplification—production systems are more nuanced.
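The mixed strategies above can be sketched with a toy primary/replica pair. This is a hypothetical model, not any real database's API: writes land on the primary immediately, replication is applied later, and the read mode decides which copy you see.

```python
class ReplicatedKV:
    """Toy primary/replica store: strong reads hit the primary,
    eventual reads may return stale replica data."""

    def __init__(self):
        self.primary = {}
        self.replica = {}
        self.pending = []   # writes queued for async replication

    def write(self, key, value):
        self.primary[key] = value
        self.pending.append((key, value))   # queued, not yet applied

    def replicate(self):
        """Simulate replication lag catching up."""
        for key, value in self.pending:
            self.replica[key] = value
        self.pending.clear()

    def read(self, key, strong=False):
        # Strong read: always current, but in a real system it costs a
        # round trip to the primary (higher latency).
        # Eventual read: cheap local read, but possibly stale.
        store = self.primary if strong else self.replica
        return store.get(key)
```

In this model, a strong read immediately after a write returns the new value, while an eventual read returns nothing until `replicate()` runs — exactly the staleness window the PACELC latency/consistency choice is about.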
Latency (how fast) and throughput (how many) are often in tension. Optimizing for one can hurt the other.
Latency: Time from request start to response received. Usually measured in milliseconds. Matters for user experience and real-time systems.
Throughput: Number of requests processed per unit time. Usually measured in requests per second (RPS) or transactions per second (TPS). Matters for capacity and cost efficiency.
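The two metrics are linked by Little's Law: requests in flight = throughput × latency. A quick back-of-the-envelope sketch (the numbers are illustrative) shows how latency determines the concurrency needed to hit a throughput target:

```python
def required_concurrency(target_rps: float, latency_s: float) -> float:
    """Little's Law: in-flight requests = throughput * latency.
    Tells you how many concurrent requests (and thus workers or
    connections) you need to sustain a target throughput."""
    return target_rps * latency_s

# 10,000 RPS at 50 ms per request needs ~500 requests in flight.
print(required_concurrency(10_000, 0.050))  # → 500.0
# Cut latency to 10 ms and the same throughput needs 5x less concurrency.
print(required_concurrency(10_000, 0.010))  # → 100.0
```

This is why latency improvements often pay for themselves: faster responses mean fewer concurrent requests, which means fewer servers for the same throughput.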
Practical example — Message Queues:
Consider a message queue like Kafka:
Low latency configuration: Small batch sizes, immediate flush, minimal buffering → Each message is processed quickly, but you can't handle as many messages per second.
High throughput configuration: Large batches, buffered writes, background compression → You process millions of messages per second, but each individual message waits longer.
Which to choose depends on the use case: interactive and real-time workloads favor the low-latency configuration, while analytics pipelines and batch workloads favor the high-throughput one.
Don't just measure average latency—measure p99 and p99.9. At scale, even if 99% of requests are fast, the 1% that are slow affect enough users to matter. Throughput optimizations often hurt tail latency more than average latency.
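A minimal sketch of why batching skews the tail: in this toy model (invented for illustration), a queue flushes every `batch_size` messages, so early arrivals in each batch wait for the batch to fill. Average latency grows modestly, but p99 grows with the full batch size.

```python
def batch_latencies(n_msgs, batch_size, arrival_interval_ms=1.0):
    """Latency of each message when the queue flushes every `batch_size`
    messages. A message's latency is the time it spends waiting for its
    batch to fill (flush/processing time is ignored)."""
    latencies = []
    for i in range(n_msgs):
        pos_in_batch = i % batch_size
        wait_msgs = batch_size - 1 - pos_in_batch  # arrivals still needed
        latencies.append(wait_msgs * arrival_interval_ms)
    return latencies

def percentile(values, pct):
    ordered = sorted(values)
    idx = min(len(ordered) - 1, int(len(ordered) * pct / 100))
    return ordered[idx]

small = batch_latencies(10_000, batch_size=1)
large = batch_latencies(10_000, batch_size=100)
print(percentile(small, 99), percentile(large, 99))  # → 0.0 99.0
```

With batches of 100, the average wait is 49.5 ms but p99 is 99 ms — the tail absorbs nearly the entire batching cost, which is exactly why throughput tuning must be checked against tail-latency budgets, not averages.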
This is often the most uncomfortable tradeoff for engineers—we want the best performance, but budgets are real constraints. Understanding this tradeoff is essential for practical system design.
| Decision Point | High Performance (Higher Cost) | Lower Performance (Lower Cost) |
|---|---|---|
| Storage type | NVMe SSDs, provisioned IOPS | Standard SSDs, HDDs |
| Compute | Dedicated instances, more cores | Shared instances, smaller sizes |
| Replication | Synchronous, multi-AZ/region | Asynchronous, single region |
| Caching | Large Redis clusters, more memory | Smaller caches, more cache misses |
| CDN | Premium tiers, more edge locations | Basic tiers, fewer locations |
| Database | Global tables, instant failover | Single master, manual failover |
The economic reality:
Every system has a cost/performance curve. Initially, more spending brings proportional performance gains. But eventually, you hit diminishing returns—each additional dollar buys less improvement.
Finding the sweet spot: spend until the marginal performance gain per dollar drops below what your users or SLAs actually require. Past that point, additional budget is better invested elsewhere.
Hidden costs to consider: infrastructure spend is only part of the picture. A system that saves $10K/month in infrastructure but requires a full-time engineer to maintain might not be worth it. Consider total cost of ownership, including engineering time, operational burden, and opportunity cost.
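The diminishing-returns curve can be made concrete: given a hypothetical table of monthly spend vs achieved throughput (the tiers below are invented for illustration), compute the marginal gain per extra dollar at each upgrade step and stop where it falls below your threshold.

```python
def marginal_gains(tiers):
    """tiers: list of (monthly_cost_usd, rps) points on the
    cost/performance curve, sorted by increasing cost.
    Returns the extra RPS gained per extra dollar at each step."""
    gains = []
    for (cost_a, rps_a), (cost_b, rps_b) in zip(tiers, tiers[1:]):
        gains.append((rps_b - rps_a) / (cost_b - cost_a))
    return gains

# Hypothetical pricing tiers: each doubling of spend buys less.
tiers = [(1_000, 5_000), (2_000, 9_000), (4_000, 11_000), (8_000, 11_500)]
print(marginal_gains(tiers))  # → [4.0, 1.0, 0.125]
```

Here the first upgrade buys 4 extra RPS per dollar, the last only 0.125 — a 32x drop. Where on that curve you stop is a business decision as much as a technical one.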
This tradeoff is often underappreciated but has profound long-term impact. Simple systems are easier to understand, maintain, and debug—but less capable of handling diverse requirements. Flexible systems can handle almost anything—but become difficult to understand and maintain.
The monolith vs microservices example:
Monoliths are simpler: one codebase, one deployment, no network calls between components. But they're less flexible: harder to scale components independently, all teams work in one repo, one technology stack.
Microservices are more flexible: independent scaling, polyglot development, small focused teams. But they're more complex: distributed transactions, network failures, operational overhead, complex debugging.
When to choose:
Lean toward simplicity when requirements are stable, team is small, speed of development matters most, and you're uncertain about future needs.
Lean toward flexibility when requirements are diverse, teams are large and need independence, you have proven scaling needs, and you can afford the operational investment.
The common mistake: Building for flexibility before you need it. Many systems become unnecessarily complex because architects anticipated needs that never materialized. Start simple; add flexibility when demonstrated needs arise.
The flexibility you're adding 'just in case' often creates maintenance burden, introduces bugs, and never gets used. Build for today's requirements. Add flexibility when you have concrete evidence you'll need it.
Understanding tradeoffs conceptually is one thing; evaluating them for specific decisions is another. Here's a practical framework: identify what you're optimizing for, enumerate what each option sacrifices, quantify the impact where you can, and choose the point on the spectrum your requirements actually demand.
Practical application:
Let's say you're deciding between synchronous replication (strong consistency, higher write latency) and asynchronous replication (eventual consistency, lower latency).
Apply the framework: How much staleness can your use case tolerate? How much latency does each option add? What does each cost to operate and maintain?
The "right" answer depends entirely on these factors. There's no universal correct choice.
When you make tradeoff decisions, document why. 'We chose eventual consistency because our use case tolerates stale data and strong consistency would add 100ms latency.' This helps future maintainers (including yourself) understand and re-evaluate when context changes.
Understanding tradeoffs is only half the battle. Communicating them to stakeholders—peers, managers, product teams—is equally important. Many good designs fail because they weren't communicated effectively.
The tradeoff table:
A useful communication tool is a simple table comparing options:
| Criterion | Option A | Option B | Option C |
|---|---|---|---|
| Latency | 50ms | 100ms | 30ms |
| Cost | $5K/mo | $3K/mo | $12K/mo |
| Complexity | Low | Medium | High |
| Availability | 99.9% | 99.5% | 99.99% |
| Recommendation | ✓ Best for most cases | If budget-constrained | If SLA requires it |
This format makes tradeoffs visible and facilitates productive discussion.
By clearly presenting tradeoffs, you share the decision with stakeholders. It's not 'I chose X'; it's 'Given requirements, I recommend X because of these tradeoffs. Does the team agree?' This creates shared ownership and better decisions.
Even experienced engineers make tradeoff mistakes. Common patterns include optimizing for problems you don't yet have, copying the choices of companies operating at a very different scale, ignoring operational and maintenance costs, and never revisiting past decisions as conditions change.
Tradeoff decisions that were correct a year ago might be wrong today. Requirements change, traffic grows, costs shift. Periodically review major architectural tradeoffs: are they still appropriate for current conditions?
Tradeoff thinking is the core intellectual skill of system design. To consolidate: tradeoffs are inevitable consequences of physics, economics, and mathematics; they cluster into recognizable categories; the right position on any spectrum depends entirely on context; and good decisions are quantified where possible, documented, and communicated to stakeholders.
What's next:
We've defined system design, distinguished it from coding, and explored the art of tradeoffs. Now we're ready for the final piece of Module 1's foundation: understanding system design as problem-solving at scale. The next page explores how scale transforms simple problems into complex systems challenges—and how to think about problems at the scale of millions of users.
You now understand why tradeoffs are inevitable, know the major categories, and have frameworks for evaluating and communicating them. This tradeoff-centric thinking will inform every design decision you make. Next, we'll explore what happens when problems scale to millions of users.