Problem-solving ability provides the framework for thinking through system design challenges. But frameworks need content—you can't design what you don't understand. System design knowledge is the raw material from which designs are built.
When interviewers evaluate your system design knowledge, they're not checking whether you've memorized CAP theorem or can recite the properties of consistent hashing. They're assessing whether you possess working knowledge—the kind of knowledge that activates when you encounter a real design decision, that connects to other concepts, and that you can apply in novel contexts.
This distinction matters profoundly. Candidates with surface-level knowledge can define terms but falter when asked to apply concepts. Candidates with working knowledge seamlessly integrate their understanding into design decisions, explaining not just what a concept is but why it matters for the problem at hand.
By the end of this page, you will understand the major knowledge domains that interviewers expect, how depth of understanding is evaluated, the difference between memorization and working knowledge, and how to demonstrate mastery authentically. You'll know precisely what technical breadth and depth to develop for system design interviews.
System design knowledge spans an enormous territory. No one masters everything—but interviewers expect fluency across a set of foundational domains. These domains form the building blocks from which any system can be constructed.
The core knowledge domains are:
Distributed Systems Fundamentals — The theoretical underpinnings: CAP theorem, consistency models, consensus protocols, network partitions, distributed clocks.
Storage Systems — Relational databases, NoSQL datastores, caching layers, object storage, database internals (indexing, transactions, replication, sharding).
Compute and Processing — Stateless services, workers, batch processing, stream processing, serverless architectures, container orchestration.
Networking and Communication — Load balancing, API gateways, DNS, CDNs, RPC vs messaging, synchronous vs asynchronous communication.
Reliability Engineering — Redundancy, failover, health checking, circuit breakers, rate limiting, graceful degradation, chaos engineering principles.
Scalability Patterns — Horizontal vs vertical scaling, caching strategies, partitioning, replication, database scaling patterns, eventual consistency implications.
Security Fundamentals — Authentication, authorization, encryption at rest and in transit, secrets management, threat modeling basics.
While you need breadth across these domains, you don't need expert-level depth in all of them. Interviewers expect you to have solid working knowledge across the board, with deeper expertise in a few areas. A Staff Engineer might have exceptional depth in distributed storage but only working knowledge of ML systems—and that's appropriate.
| Domain | Core Concepts | Why It Matters in Interviews |
|---|---|---|
| Distributed Systems | CAP theorem, Consistency models, Consensus (Paxos/Raft), Clock synchronization | Every non-trivial system is distributed; understanding trade-offs here is foundational |
| Storage Systems | ACID, Indexing, Replication, Sharding, LSM trees vs B-trees | Data is central to most systems; storage decisions cascade through the design |
| Compute/Processing | Stateless design, Worker queues, Stream processing, Container orchestration | Computing at scale requires specific patterns; naive approaches don't scale |
| Networking | Load balancing algorithms, DNS resolution, CDN edge caching, Protocol trade-offs | Users connect through networks; networking decisions affect latency and availability |
| Reliability | Redundancy patterns, Circuit breakers, Health checks, Graceful degradation | Systems must handle failures; reliability knowledge prevents naive designs |
| Scalability | Horizontal scaling, Caching tiers, Partitioning strategies, Read replicas | Scale is the defining challenge; scalability patterns are essential vocabulary |
| Security | AuthN/AuthZ, Encryption, Token validation, Threat models | Security constraints shape architecture; basic security knowledge is expected |
The distinction between working knowledge and memorization is perhaps the most important concept on this page. Interviewers can instantly tell the difference, and it dramatically affects their evaluation.
Memorization looks like: reciting definitions on request, listing properties without context, and naming technologies without being able to explain when or why you would use them.
Working knowledge looks like: invoking a concept because the design calls for it, explaining why it matters for the problem at hand, and connecting it to alternatives and their trade-offs.
Working knowledge comes from applying concepts, not just reading about them. For each concept you study, ask: 'In what situations would I use this? What would go wrong if I didn't? What are the alternatives, and when would I prefer them?' Better yet, review real-world system architectures (company engineering blogs, conference talks) and identify how these concepts manifest in practice.
Every interesting system design problem involves distribution. Users are distributed globally. Data must be replicated for reliability. Processing must be parallelized for scale. Distributed systems knowledge is non-negotiable.
The core concepts interviewers expect you to understand deeply include the CAP theorem and its practical implications, consistency models (strong, eventual, causal), consensus protocols such as Paxos and Raft, the realities of network partitions and partial failure, and the limits of clock synchronization across machines.
Candidates often stumble by: (1) Treating the network as reliable when it isn't; (2) Assuming clocks are synchronized when they diverge; (3) Ignoring partial failure scenarios where some components work and others don't; (4) Underestimating the cost of coordination across nodes. These assumptions lead to designs that fail under real-world conditions.
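Treating the network as unreliable has a concrete shape in code. Below is a minimal Python sketch of calling an unreliable dependency with a bounded timeout per attempt and exponential backoff with jitter between retries; `flaky_op` and its failure pattern are illustrative stand-ins, not a real client library:

```python
import random
import time

def call_with_retries(op, attempts=4, base_delay=0.05, timeout=1.0):
    """Call an unreliable operation, retrying with exponential backoff.

    Assumes the network can fail or hang: each attempt gets a deadline,
    and we back off between retries instead of hammering the dependency.
    """
    for attempt in range(attempts):
        try:
            return op(timeout=timeout)
        except (TimeoutError, ConnectionError):
            if attempt == attempts - 1:
                raise
            # Exponential backoff with jitter to avoid synchronized retry storms.
            delay = base_delay * (2 ** attempt) * (0.5 + random.random() / 2)
            time.sleep(delay)

# A fake remote call that fails twice, then succeeds.
calls = {"n": 0}
def flaky_op(timeout):
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("simulated network failure")
    return "ok"

result = call_with_retries(flaky_op)
```

The same pattern generalizes: every cross-node call in a design should have an explicit timeout and a retry policy, because the alternative is hanging or cascading failure.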
In an interview, you demonstrate distributed systems mastery by naming the trade-offs your design makes (consistency versus availability under partition), stating your failure assumptions explicitly, and explaining how the system behaves when nodes, networks, or clocks misbehave.
Data is central to every system, and storage decisions profoundly impact every other aspect of the design. Interviewers pay particular attention to your storage knowledge because poor storage choices cascade into performance problems, scaling limitations, and operational headaches.
The storage knowledge you need:
| Storage Type | Characteristics | Ideal Use Cases |
|---|---|---|
| Relational (PostgreSQL, MySQL) | ACID transactions, structured schema, complex queries, mature tooling | Transactional systems, complex relationships, reporting needs |
| Document (MongoDB, DynamoDB) | Flexible schema, nested documents, horizontal scaling | Evolving data models, read-heavy workloads, hierarchical data |
| Key-Value (Redis, Memcached) | Extreme speed, simple operations, in-memory option | Caching, session storage, real-time counters |
| Wide-Column (Cassandra, HBase) | Massive write throughput, time-series optimization | Time-series data, IoT events, high-write workloads |
| Graph (Neo4j, Neptune) | Relationship-first model, traversal queries | Social networks, recommendation engines, knowledge graphs |
| Object Storage (S3, GCS) | Infinite scale, blob storage, cheap at rest | Media files, backups, data lake storage |
| Search Engines (Elasticsearch) | Full-text search, aggregations, log analysis | Search features, logging, analytics |
Beyond choosing databases—understanding internals:
Interviewers value candidates who understand why storage systems behave as they do:
B-trees vs LSM trees: B-trees (used in PostgreSQL, traditional RDBMS) optimize for read-heavy workloads with random writes. LSM trees (used in Cassandra, RocksDB) optimize for write-heavy workloads by sequentializing writes. Knowing this helps you choose databases appropriately.
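To make the LSM write path concrete, here is a toy Python sketch (not a real storage engine): writes land in an in-memory memtable, and when it fills, the whole memtable is flushed as a sorted, immutable snapshot, which is why LSM writes are sequential and fast:

```python
# Toy LSM-style store: writes go to an in-memory memtable; when it fills,
# it is flushed as a sorted, immutable "SSTable" (here, a sorted dict).
# Reads check the memtable first, then SSTables newest-to-oldest.

class TinyLSM:
    def __init__(self, memtable_limit=2):
        self.memtable = {}
        self.sstables = []          # newest last; each is an immutable snapshot
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value  # random writes land in memory (cheap)
        if len(self.memtable) >= self.memtable_limit:
            # Flush: one sequential write of sorted data (the LSM advantage).
            self.sstables.append(dict(sorted(self.memtable.items())))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):  # newest table wins
            if key in table:
                return table[key]
        return None

db = TinyLSM()
db.put("a", 1)
db.put("b", 2)   # second write triggers a flush
db.put("a", 3)   # newer value shadows the flushed one
```

Real engines add write-ahead logs, bloom filters, and background compaction on top of this core idea; the read path touching multiple tables is exactly why LSM reads are slower than B-tree reads.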
Indexing fundamentals: Indexes speed up reads but slow down writes. Composite indexes have ordering implications. Full-text indexes use inverted structures. Understanding indexes helps you make schema decisions.
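The ordering implication of composite indexes can be shown with a small sketch: a sorted list of `(city, age, name)` tuples supports efficient lookups by the leading column via binary search, while lookups by `age` alone would require a full scan. The data and column names are purely illustrative:

```python
import bisect

# A composite index on (city, age): entries sorted by city first, then age.
# Prefix lookups (city only) are efficient; lookups by age alone are not,
# because the ordering is determined by the leading column.
rows = [
    ("paris", 30, "alice"),
    ("tokyo", 25, "bob"),
    ("paris", 22, "carol"),
    ("tokyo", 41, "dave"),
]
index = sorted((city, age, name) for city, age, name in rows)

def find_by_city(city):
    # Binary search for the contiguous range of keys starting with `city`.
    lo = bisect.bisect_left(index, (city,))
    hi = bisect.bisect_left(index, (city + "\x00",))
    return [name for _, _, name in index[lo:hi]]
```

This is the intuition behind the rule that a composite index on `(a, b)` serves queries filtering on `a`, or on `a` and `b`, but not on `b` alone.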
Transaction isolation levels: From READ UNCOMMITTED through SERIALIZABLE, each level offers different trade-offs between consistency and performance. Know what anomalies each level allows and when to use stricter vs. relaxed isolation.
Replication and consistency: Synchronous replication guarantees consistency but adds latency. Asynchronous replication is faster but risks data loss. Read replicas introduce replication lag that affects query semantics.
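A toy simulation makes replication lag tangible. The `Primary`/`Replica` classes below are illustrative, not a real database: the primary appends writes to a log that the replica applies later, so a read against the replica in between sees stale data:

```python
# Toy primary/replica pair with asynchronous replication: writes go to the
# primary and are applied to the replica later, so replica reads can be stale.

class Primary:
    def __init__(self):
        self.data = {}
        self.log = []          # replication log, shipped asynchronously

    def write(self, key, value):
        self.data[key] = value
        self.log.append((key, value))

class Replica:
    def __init__(self):
        self.data = {}
        self.applied = 0       # position in the primary's log

    def catch_up(self, log):
        # Apply pending log entries (simulating delayed replication).
        for key, value in log[self.applied:]:
            self.data[key] = value
        self.applied = len(log)

primary, replica = Primary(), Replica()
primary.write("balance", 100)
stale = replica.data.get("balance")   # None: the write has not replicated yet
replica.catch_up(primary.log)
fresh = replica.data.get("balance")   # 100 once the replica catches up
```

This is the window in which a user can write to the primary, then read from a replica and not see their own write, which is why read-your-writes guarantees require routing or session tricks.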
Don't just say 'we'll use PostgreSQL.' Explain why: 'We need ACID transactions for the payment flow, complex queries for the admin dashboard, and the data model is highly relational. PostgreSQL fits these requirements. For the activity feed, which is write-heavy and can tolerate eventual consistency, I'd consider Cassandra or a similar wide-column store.'
Scalability is often the central challenge in system design interviews. The question isn't 'can you build a system that works?' but 'can you build a system that works at scale?'
Core scalability concepts include horizontal versus vertical scaling, caching strategies and tiers, partitioning (sharding) data across nodes, replication and read replicas, and the eventual-consistency implications that distribution introduces.
Performance characteristics you should know:
Latency hierarchy: L1 cache (~1ns) → L2 cache (~4ns) → RAM (~100ns) → SSD (~100μs) → HDD (~10ms) → Network round-trip (~1-100ms depending on distance). This hierarchy explains why caching matters and why network calls are expensive.
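A quick back-of-envelope calculation using rough figures from this hierarchy (the constants are order-of-magnitude approximations, not measurements) shows why cache hit rate dominates average request latency:

```python
# Back-of-envelope latency math with assumed rough figures, in seconds.
RAM = 100e-9          # ~100 ns memory read
SSD = 100e-6          # ~100 us SSD read
NET_RTT = 1e-3        # ~1 ms network round-trip within a region

def request_latency(cache_hit_rate):
    hit = NET_RTT + RAM                 # one hop to the cache, read from memory
    miss = NET_RTT + NET_RTT + SSD      # cache miss, then a DB hop + SSD read
    return cache_hit_rate * hit + (1 - cache_hit_rate) * miss

no_cache = request_latency(0.0)         # ~2.1 ms average
with_cache = request_latency(0.9)       # roughly halved at a 90% hit rate
```

The exact numbers matter less than the structure: network round-trips are thousands of times more expensive than memory reads, so every hop you remove from the common path pays off.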
Throughput vs Latency: Systems can be optimized for high throughput (many requests per second) or low latency (fast individual responses). Sometimes these conflict—batching improves throughput but may add latency.
Amdahl's Law implications: The speedup from parallelization is limited by the serial portion of the workload. If 10% of your work is inherently sequential, you can never achieve more than 10x speedup regardless of parallelization.
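The law is easy to apply directly; a two-line sketch shows how quickly returns diminish:

```python
# Amdahl's Law: with serial fraction s, speedup on n workers is
# 1 / (s + (1 - s) / n); as n grows without bound, the limit is 1 / s.

def amdahl_speedup(serial_fraction, workers):
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / workers)

cap = 1.0 / 0.10                      # 10% serial work: ceiling of 10x
s_100 = amdahl_speedup(0.10, 100)     # ~9.17x even with 100 workers
```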
Little's Law: For a stable system, L = λW (L = items in system, λ = arrival rate, W = time in system). This helps reason about queue sizes, service times, and capacity planning.
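Applied directly, Little's Law sizes concurrency: if requests arrive at 200 per second and each spends 50 ms in the system, about 10 are in flight on average, which bounds thread pools, connection pools, and queue depths:

```python
# Little's Law: L = lambda * W
# L = average items in the system, lambda = arrival rate, W = time in system.

def in_flight(arrival_rate_per_s, time_in_system_s):
    return arrival_rate_per_s * time_in_system_s

concurrent = in_flight(200, 0.050)    # ~10 requests in flight on average
```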
Common interview mistakes: (1) Adding more servers without explaining how they coordinate; (2) Ignoring database bottlenecks while only scaling compute; (3) Using synchronous processing for high-volume ingest; (4) Assuming caching solves all problems without addressing invalidation; (5) Not considering the cost of horizontal distribution (network latency, data movement).
Scalability matters, but a system that scales and then crashes is useless. Reliability and availability are equally critical dimensions that interviewers evaluate.
Key reliability concepts:
| Availability | Annual Downtime | Monthly Downtime | Practical Meaning |
|---|---|---|---|
| 99.9% (three nines) | 8.76 hours | 43.8 minutes | Acceptable for non-critical internal tools |
| 99.95% | 4.38 hours | 21.9 minutes | Standard for most consumer applications |
| 99.99% (four nines) | 52.6 minutes | 4.38 minutes | High-availability requirements, financial services |
| 99.999% (five nines) | 5.26 minutes | 26.3 seconds | Critical infrastructure, telecom, life-safety systems |
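The downtime budgets in this table follow from simple arithmetic, which is worth being able to do on the spot:

```python
# Downtime budget: downtime = (1 - availability) * period.
MIN_PER_YEAR = 365 * 24 * 60

def annual_downtime_minutes(availability):
    return (1 - availability) * MIN_PER_YEAR

three_nines = annual_downtime_minutes(0.999)    # ~525.6 min (~8.76 hours)
four_nines = annual_downtime_minutes(0.9999)    # ~52.6 min
```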
Patterns for achieving reliability:
Redundancy at every layer: Single points of failure (SPOFs) are the enemy of availability. Every critical component should have redundant instances—load balancers, application servers, database replicas, even multiple data centers.
Health checking and automatic failover: Systems must detect when components fail and route around them. Load balancers perform health checks; database clusters promote replicas when leaders fail; circuit breakers stop calling unhealthy services.
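A minimal circuit-breaker sketch in Python shows the core state machine; this is deliberately simplified relative to production resilience libraries (no half-open probe limits, no per-error classification):

```python
import time

# Minimal circuit breaker: after `threshold` consecutive failures the circuit
# "opens" and calls fail fast; after `reset_after` seconds one trial call is
# let through ("half-open") to probe whether the dependency has recovered.

class CircuitBreaker:
    def __init__(self, threshold=3, reset_after=30.0):
        self.threshold = threshold
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, op):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None        # half-open: allow one probe call
        try:
            result = op()
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0                # any success resets the count
        return result

breaker = CircuitBreaker(threshold=2, reset_after=60)

def failing():
    raise ConnectionError("downstream unhealthy")

for _ in range(2):                       # two failures trip the breaker
    try:
        breaker.call(failing)
    except ConnectionError:
        pass

tripped = breaker.opened_at is not None  # True: further calls now fail fast
```

Failing fast is the point: instead of tying up threads waiting on a dead dependency, callers get an immediate error they can handle with a fallback.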
Graceful degradation: When parts of a system fail or become overloaded, the system should degrade gracefully rather than fail completely. Disable non-essential features to preserve core functionality.
Blast radius containment: Failures should be isolated to prevent cascade. Use bulkheads to separate critical paths, implement timeouts to prevent hanging, and shed load before resources exhaust completely.
Chaos engineering mindset: Assume failures will happen and design for them. Proactively inject failures in non-production environments to discover weaknesses before they cause incidents.
In interviews, proactively address failure scenarios: 'What happens if the database goes down? We have a read replica that can be promoted. What if the entire region fails? We replicate to a secondary region with a recovery time objective of 15 minutes.' Interviewers notice when candidates think about reliability without prompting—it signals production experience.
Systems are composed of many components—but those components must communicate. Networking and communication patterns form another essential knowledge domain.
Communication paradigms: synchronous request/response (RPC, REST) versus asynchronous messaging (queues, pub/sub). Synchronous calls are simple and give immediate answers but couple caller to callee; asynchronous messaging decouples services and absorbs bursts at the cost of added latency and complexity.
Networking components you should understand:
Load balancers: Distribute traffic across instances. L4 (TCP) vs L7 (HTTP) balancing. Algorithms: round-robin, least connections, consistent hashing, weighted.
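Of the algorithms named above, consistent hashing is the least obvious, so here is a brief sketch; the virtual-node count and hash function are illustrative choices:

```python
import bisect
import hashlib

# Consistent hashing sketch: each server is placed at many points on a hash
# ring (virtual nodes); a key belongs to the first server clockwise from its
# hash. Adding or removing a server remaps only a small slice of keys.

def _hash(s):
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, nodes, vnodes=50):
        self.ring = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self.points = [point for point, _ in self.ring]

    def node_for(self, key):
        # Wrap around the ring with the modulo when we pass the last point.
        idx = bisect.bisect(self.points, _hash(key)) % len(self.ring)
        return self.ring[idx][1]

ring = HashRing(["server-a", "server-b", "server-c"])
owner = ring.node_for("user:42")          # deterministic: same key, same node
stable = ring.node_for("user:42") == owner
```

This is why consistent hashing appears in both load balancing and sharded caches: naive `hash(key) % n` remaps almost every key when `n` changes, while the ring remaps roughly `1/n` of them.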
API gateways: Entry point for external traffic. Handle authentication, rate limiting, routing, protocol translation. Important for microservices architectures.
CDNs: Cache static (and increasingly dynamic) content at edge locations. Reduce latency for global users, offload origin traffic. Understand TTLs, cache invalidation, edge compute.
Service discovery: How services find each other in dynamic environments. DNS-based, registry-based (Consul, Eureka), Kubernetes built-in. Essential for container orchestration.
DNS: The internet's address book. Understand TTL implications, DNS-based failover, latency-based routing, GeoDNS for multi-region.
Use synchronous communication when: the user needs an immediate response; operations must be transactional; the dependency is highly reliable. Use asynchronous communication when: the operation can be completed later; you need to decouple services; you're dealing with high-volume events; you need guaranteed delivery even if consumers are temporarily down.
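The asynchronous side of this guidance can be sketched with a producer handing work to a background worker through a queue; the in-process thread and sentinel-based shutdown are simplifications of what a real message broker provides:

```python
import queue
import threading

# Asynchronous processing sketch: the producer enqueues work and returns
# immediately; a background worker drains the queue at its own pace, so the
# caller is decoupled from the consumer's speed and availability.

tasks = queue.Queue()
processed = []

def worker():
    while True:
        item = tasks.get()
        if item is None:                 # sentinel: shut down cleanly
            break
        processed.append(f"handled {item}")
        tasks.task_done()

t = threading.Thread(target=worker)
t.start()

for event in ["signup", "purchase", "logout"]:
    tasks.put(event)                     # returns immediately (async handoff)

tasks.put(None)
t.join()                                 # a real worker would run indefinitely
```

In a distributed system, the queue would be a durable broker (e.g., Kafka or SQS-style), which adds the guaranteed-delivery property mentioned above even when consumers are temporarily down.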
Interviewers use several techniques to evaluate whether your knowledge is superficial or deep:
Probing questions: After you make a design choice, expect follow-up questions that test the depth of your understanding, such as 'Why this database and not an alternative?', 'What happens when this component fails?', and 'How does this behave at ten times the load?'
Increasing precision requests: The interviewer may ask you to be more specific: which consistency level you would choose, how long cache entries live, how many shards you would start with, or what exactly happens during failover.
Red flags include: using buzzwords without explanation; inability to answer 'why' questions; defensiveness when probed; mixing up related concepts (e.g., confusing replication with sharding); overconfidence about things you don't actually understand. It's far better to say 'I don't know the specifics here' than to bluff and get caught.
We've explored the second dimension of what interviewers evaluate: system design knowledge. The key insights: you need breadth across the seven core domains with deeper expertise in a few; working knowledge that you can apply beats memorized definitions; interviewers probe depth through follow-up and precision questions; and admitting a gap honestly is better than bluffing.
What's next:
Problem-solving ability determines how you approach challenges; system design knowledge determines what tools you have available. The third dimension—Trade-off Analysis—is where these combine. Next, we'll explore how interviewers evaluate your ability to navigate competing concerns, make defensible choices, and articulate why one design is better than another for a given context.
You now understand the knowledge landscape that interviewers expect, the difference between memorization and working knowledge, and how depth is assessed. Building this knowledge base is a long-term investment—continue deepening your understanding through study, practice, and real-world experience.