Every time you send a message on WhatsApp, stream a video on Netflix, make a payment through Stripe, or search on Google, you're interacting with a distributed system. These systems are so woven into the fabric of modern digital life that we rarely notice them—until something goes wrong. A single tweet can trigger cascading failures across continents. A network partition can split a database, causing your bank to show two different balances. A clock drift of a few milliseconds can corrupt years of transaction history.
Distributed systems are simultaneously the most powerful and most treacherous constructs in software engineering. They enable feats impossible with single machines—serving billions of users, processing petabytes of data, achieving five-nines availability—but they also introduce failure modes that seem to defy logic.
By the end of this page, you will possess a rigorous understanding of what distributed systems are, their formal definition from multiple perspectives, the core characteristics that define them, and the fundamental properties that distinguish them from monolithic systems. This foundation is essential for every concept that follows in system design.
Before we can reason about distributed systems, we need a precise definition. Computer scientists have offered several, each emphasizing different aspects:
Andrew Tanenbaum's Definition (1995):
"A distributed system is a collection of independent computers that appears to its users as a single coherent system."
This definition emphasizes the transparency goal—the idea that distributed complexity should be hidden from users, who perceive a unified service.
Leslie Lamport's Definition (1987):
"A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable."
This sardonic but profound definition emphasizes failure dependency—the interconnected nature of distributed components and their cascading failure modes.
Modern Working Definition:
A distributed system consists of multiple autonomous computational nodes that communicate through a network to coordinate their actions and share state while appearing to external observers as a unified, coherent system.
The tension at the heart of distributed systems is this: We want the power of multiple machines while presenting the simplicity of one. This impossible goal—multiple becoming one—drives every trade-off, every algorithm, and every architectural decision in the field.
Breaking down the definition:
Let's examine each component of our working definition to understand its implications:
1. Multiple Autonomous Nodes: each node has its own processor, memory, clock, and failure modes; no node can directly read another's state.
2. Network Communication: nodes interact only by exchanging messages over a network that can delay, reorder, duplicate, or drop them.
3. Coordination and Shared State: despite acting concurrently, nodes must agree on ordering and on the contents of replicated or partitioned data.
4. Unified External Appearance: clients should be able to treat the collection as one logical service rather than a set of individual machines.
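To ground these four components, here is a deliberately simplified sketch in Python (all names are illustrative, the "network" is simulated in-process, and failures are ignored): three autonomous nodes coordinate a replicated counter through message passing, and a client reads the same value from whichever node it asks.

```python
# Illustrative-only sketch of the working definition: autonomous nodes,
# message passing, coordination of shared state, and a unified appearance.

class Node:
    """An autonomous node holding its own replica of the shared state."""

    def __init__(self, name):
        self.name = name
        self.counter = 0      # local replica of the shared counter
        self.peers = []       # other nodes, reachable only via messages

    def receive(self, message):
        # (2) Network communication: state changes arrive as messages.
        if message["type"] == "increment":
            self.counter += 1

    def increment(self):
        # (3) Coordination and shared state: apply locally, then replicate.
        self.counter += 1
        for peer in self.peers:
            peer.receive({"type": "increment"})

    def read(self):
        return self.counter


# (1) Multiple autonomous nodes.
nodes = [Node("a"), Node("b"), Node("c")]
for node in nodes:
    node.peers = [n for n in nodes if n is not node]

# (4) Unified external appearance: a client may contact any node
# and observes the same value.
nodes[0].increment()
nodes[1].increment()
print([n.read() for n in nodes])  # [2, 2, 2]
```

In a real system the peer.receive call would cross an unreliable network that can drop, delay, or duplicate that message, which is where every difficulty discussed in the rest of this page originates.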
| Aspect | Centralized System | Distributed System |
|---|---|---|
| Failure Domain | Single point: whole system fails together | Partial failures: components fail independently |
| State Location | One authoritative source of truth | State replicated or partitioned across nodes |
| Time Model | Single clock, total ordering trivial | Multiple clocks, ordering is a fundamental challenge |
| Concurrency | Thread-level, shared memory | Node-level, message passing |
| Scaling | Vertical: bigger machine | Horizontal: more machines |
| Debugging | Stack traces, memory inspection | Distributed traces, log aggregation, non-reproducible bugs |
| Consistency | Trivially consistent (ACID) | Consistency requires explicit design (CAP/PACELC) |
What makes a distributed system categorically different from a program running on one machine? Eight fundamental characteristics define the distributed paradigm. Understanding these deeply is essential for sound system design.
These eight characteristics combine to create emergent complexity. A bug that would never manifest on a single machine—because timing, ordering, and failures would be deterministic—can occur in production with distributed systems. This is why distributed systems engineering is considered among the most challenging disciplines in software.
Deep Dive: Why These Matter
Let's explore why several of these characteristics create unique challenges:
Concurrency is not new—multi-threaded programs have it. But distributed concurrency is different because:
- There is no shared memory and there are no process-wide locks; coordination happens only through messages.
- Interleavings span machines, so you can never pause 'the whole program' to inspect a single consistent state.

Lack of Global Clock seems trivial until you realize:
- Every node's clock drifts at its own rate, so timestamps from different machines cannot be compared naively.
- 'Which event happened first?' has no reliable answer from wall clocks alone (a logical-clock sketch follows this list).

Independent Failures mean:
- Any subset of nodes can fail while the rest keep running, so partial failure is the normal operating condition, not an exception.
- A slow node is indistinguishable from a dead one until some timeout expires.

Message Passing implies:
- Messages can be lost, delayed, duplicated, or delivered out of order.
- A sender that receives no reply cannot tell whether its request was never delivered, is still being processed, or was processed and the reply was lost.
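Because there is no global clock, the standard remedy is a logical clock. The sketch below is a minimal Lamport clock (Lamport, 1978) under simplifying assumptions: processes exchange timestamps directly in-process and nothing fails. It orders events consistently with causality without any shared wall clock.

```python
# Minimal Lamport logical clock sketch: each process keeps an integer counter,
# increments it on every local event and send, and on receipt advances to
# max(local, received) + 1. The resulting timestamps respect causality even
# though no process knows the "real" time.

class Process:
    def __init__(self, name):
        self.name = name
        self.clock = 0

    def local_event(self):
        self.clock += 1
        return self.clock

    def send(self):
        self.clock += 1
        return self.clock              # timestamp attached to the outgoing message

    def receive(self, msg_timestamp):
        self.clock = max(self.clock, msg_timestamp) + 1
        return self.clock


p1, p2 = Process("p1"), Process("p2")
p1.local_event()           # p1 clock: 1
ts = p1.send()             # p1 clock: 2; the message carries timestamp 2
p2.local_event()           # p2 clock: 1 (concurrent with p1's events)
p2.receive(ts)             # p2 clock: max(1, 2) + 1 = 3, ordered after the send
print(p1.clock, p2.clock)  # 2 3
```

Lamport clocks give a partial, causality-respecting order; they cannot distinguish concurrent events, which is what vector clocks and related structures address.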
One of the most dangerous aspects of distributed systems is how simple they can appear. Modern frameworks, cloud platforms, and managed services abstract away the complexity—until they don't.
Consider a "simple" microservice:
Service A → calls → Service B → queries → Database
This looks like a straightforward three-step process. But let's enumerate what can go wrong:
Between A and B:
- The request can be lost in the network, or time out with no indication of whether B ever received it.
- B may be down, overloaded, or mid-deployment; A's retry may cause B to process the same request twice.
- DNS resolution, connection establishment, or TLS handshakes can fail intermittently.

Between B and Database:
- The connection pool can be exhausted, queries can time out, or the database can fail over mid-transaction.
- B can crash after the database commits the write but before B replies to A, leaving A unsure whether the operation happened.

Timing Issues:
- A's timeout can fire while B is still succeeding, so A retries an operation that already completed.
- Retries can interleave with the original request, and responses can arrive out of order.
What appears to be 3 steps is actually dozens of potential failure points, each requiring explicit handling.
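As a taste of what that explicit handling looks like for just the A-to-B hop, here is a hedged sketch (the service URL and numbers are hypothetical): a bounded timeout, a capped number of retries, and exponential backoff. Real services layer idempotency keys, circuit breakers, and fallbacks on top of this.

```python
import time
import urllib.error
import urllib.request

# Hypothetical address for Service B; illustrative only.
SERVICE_B_URL = "http://service-b.internal/orders/42"

def call_service_b(retries: int = 3, timeout_seconds: float = 2.0) -> bytes:
    """Call Service B with a timeout, capped retries, and exponential backoff.

    Even this only covers the A->B hop: it does not handle duplicate side
    effects caused by retries (idempotency), protect an overloaded B
    (circuit breaking), or cope with B succeeding after A has given up.
    """
    for attempt in range(retries):
        try:
            # A bounded timeout: without one, a hung connection blocks A indefinitely.
            with urllib.request.urlopen(SERVICE_B_URL, timeout=timeout_seconds) as response:
                return response.read()
        except (urllib.error.URLError, TimeoutError):
            if attempt == retries - 1:
                raise                      # retries exhausted: surface the failure
            # Back off exponentially so retries do not pile onto a struggling service.
            time.sleep(0.1 * 2 ** attempt)
```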
These failure modes trace back to the 'Eight Fallacies of Distributed Computing' identified by Peter Deutsch and others at Sun Microsystems: comforting assumptions such as 'the network is reliable', 'latency is zero', and 'bandwidth is infinite'. Every engineer who builds distributed systems eventually learns these lessons—ideally through study rather than production incidents at 3 AM.
Not all distributed systems are alike. They can be categorized along several dimensions, each with different characteristics, challenges, and design patterns.
| Type | Description | Examples | Key Challenge |
|---|---|---|---|
| Tightly Coupled | Nodes share high-bandwidth, low-latency connections; often homogeneous | HPC clusters, supercomputers, shared-memory multiprocessors | Synchronization overhead, scalability limits |
| Loosely Coupled | Nodes connected via commodity networks; heterogeneous | Web services, microservices, cloud applications | Partial failures, eventual consistency |
| Peer-to-Peer | No central coordination; all nodes are equal | BitTorrent, blockchain, IPFS | Discovery, trust, free riders |
Classification by Purpose:
Computing-Oriented Systems focus on processing power: cluster and grid computing, batch-processing frameworks, and scientific simulation.
Data-Oriented Systems focus on storing and retrieving data: distributed databases, distributed file systems, and caching layers.
Pervasive/Ubiquitous Systems embed into physical environments: IoT deployments, sensor networks, and mobile or edge devices that continuously join and leave the system.
| Architecture | Description | Trade-offs |
|---|---|---|
| Client-Server | Clear distinction between service requesters and providers | Simple mental model, but the server is a bottleneck and a single point of failure (SPOF) |
| Master-Slave | One node coordinates, others follow instructions | Simpler consistency, but master is a bottleneck and SPOF |
| Multi-Master | Multiple nodes can accept writes | Higher availability, but conflict resolution required |
| Hierarchical | Tree structure of nodes with cascading responsibilities | Reduces cross-level communication, but tree depth adds latency |
| Peer-to-Peer | All nodes are equal; no designated servers | Highly fault-tolerant, but coordination is complex |
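To make the Multi-Master row's "conflict resolution required" concrete, here is a last-write-wins (LWW) sketch, the simplest strategy (names are illustrative; many production systems prefer version vectors or CRDTs precisely because LWW silently discards one of two concurrent writes).

```python
# Last-write-wins conflict resolution for a multi-master register. Each master
# tags a write with a (timestamp, node_id) pair; when masters sync, the higher
# pair wins. Deterministic and simple, but concurrent writes are silently lost,
# which is the trade-off noted in the table above.

from dataclasses import dataclass

@dataclass
class VersionedValue:
    value: str
    timestamp: float    # wall-clock or hybrid-logical-clock reading at write time
    node_id: str        # tie-breaker so every replica picks the same winner

def merge(a: VersionedValue, b: VersionedValue) -> VersionedValue:
    """Return the winning write under last-write-wins."""
    return a if (a.timestamp, a.node_id) >= (b.timestamp, b.node_id) else b

# Two masters accept conflicting writes for the same key while partitioned.
write_on_m1 = VersionedValue("alice@example.com", timestamp=100.0, node_id="m1")
write_on_m2 = VersionedValue("alice@example.org", timestamp=100.5, node_id="m2")

# When the partition heals, both masters converge on the same winner...
print(merge(write_on_m1, write_on_m2).value)   # alice@example.org
# ...but m1's write disappears, even though a real user may have intended it.
```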
Most production distributed systems combine multiple architectures. For example, a social network might use client-server for API access, master-slave for databases, peer-to-peer for CDN edge caching, and hierarchical for geographic distribution. Understanding pure archetypes helps reason about hybrid designs.
When designing distributed systems, we pursue several key properties. These goals often conflict, forcing trade-offs that define the character of the system.
Transparency Goals:
Distributed systems aim to hide their distributed nature through various forms of transparency:
Access Transparency: Local and remote resources accessed identically
Location Transparency: Resource location is hidden from users
Replication Transparency: Multiple copies behave as one
Failure Transparency: Failures are masked when possible
Migration Transparency: Resources can move without user awareness
Concurrency Transparency: Multiple users share resources without interference
Complete transparency is impossible. Network latency, partial failures, and consistency anomalies will leak through any abstraction. The goal is not perfect transparency but 'good enough' transparency—hiding routine complexity while exposing what applications must handle explicitly.
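A minimal sketch of access transparency under stated assumptions (the class names and the in-process stand-in for a remote call are illustrative): the caller uses one interface whether the store is local or remote, and the comments mark exactly where latency and new failure modes leak through.

```python
# Access transparency sketch: the same get/put interface hides whether the
# store is an in-process dictionary or a remote service. The abstraction is
# convenient, but it cannot hide latency or the remote path's failure modes.

from abc import ABC, abstractmethod
from typing import Optional

class KeyValueStore(ABC):
    @abstractmethod
    def get(self, key: str) -> Optional[str]: ...
    @abstractmethod
    def put(self, key: str, value: str) -> None: ...

class LocalStore(KeyValueStore):
    """In-process dictionary: sub-microsecond access, fails only with the process."""
    def __init__(self):
        self._data = {}
    def get(self, key):
        return self._data.get(key)
    def put(self, key, value):
        self._data[key] = value

class RemoteStore(KeyValueStore):
    """Same interface, but every call would be a network round trip that can
    time out, hit a stale replica, or fail while this process stays healthy."""
    def __init__(self, endpoint: str):
        self.endpoint = endpoint                 # hypothetical service address
    def get(self, key):
        raise NotImplementedError("would issue a network request here")
    def put(self, key, value):
        raise NotImplementedError("would issue a network request here")

def remember_user(store: KeyValueStore) -> Optional[str]:
    # Written once, unaware of which implementation it receives: that is
    # access transparency. What the caller can still observe is the difference
    # in latency and failures, which is why transparency is never complete.
    store.put("user:42", "Ada")
    return store.get("user:42")

print(remember_user(LocalStore()))   # Ada
```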
Not every application needs to be distributed. Understanding when and why systems transition from centralized to distributed architecture is crucial for making sound engineering decisions.
The Transition Triggers:
1. Scale Exceeds Single Machine Limits: When a single server cannot handle the load—whether CPU, memory, network, or storage—you must distribute. Modern servers are powerful (96+ cores, 1 TB+ RAM), but there are limits.
2. Reliability Requirements Demand Redundancy: A single machine is a single point of failure. For high-availability targets (99.99% and beyond), you need multiple machines in different failure domains.
3. Geographic Distribution Required: Users around the world expect low latency, and the speed of light constrains how fast you can serve them from a single data center; round trips on the order of 100 ms to Tokyo or 150 ms to Europe are unacceptable for real-time applications (a back-of-the-envelope calculation follows this list).
4. Regulatory or Data Sovereignty Requirements: Laws such as the GDPR restrict where data may be stored, transferred, and processed, which forces geographic distribution of data and processing.
5. Organizational Scaling: As teams grow, a single codebase and database become a coordination bottleneck. Distributed systems let teams deploy and scale independently.
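The back-of-the-envelope calculation promised under trigger 3 (distances and the assumed US West Coast origin are illustrative; light travels through optical fiber at roughly 200,000 km/s, about two-thirds of c):

```python
# Physics-imposed lower bound on round-trip time (RTT). Assumptions are
# illustrative, not measurements: straight-line distances from a US West Coast
# data center and signal propagation in fiber at ~200,000 km/s.

SPEED_IN_FIBER_KM_PER_S = 200_000

approx_distance_km = {
    "Tokyo": 8_300,
    "Frankfurt": 9_100,
}

for city, km in approx_distance_km.items():
    rtt_ms = 2 * km / SPEED_IN_FIBER_KM_PER_S * 1000
    print(f"{city}: at least {rtt_ms:.0f} ms round trip before any processing")
# Tokyo: at least 83 ms; Frankfurt: at least 91 ms. Real routes are longer and
# add queueing and processing delay, which is why a single region cannot meet
# tight global latency targets.
```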
| Dimension | Single-Machine Comfort Zone | Distribution Trigger |
|---|---|---|
| Requests/second | 1,000 - 50,000 | >50K (varies by complexity) |
| Data Size | Up to ~10 TB (SSD) | Storage > single machine capacity |
| Concurrent Connections | 10,000 - 100,000 | >100K (C10K problem and beyond) |
| Latency Requirements | Any latency acceptable | <50ms globally |
| Availability Requirement | 99.9% (~8.7 hours/year downtime) | 99.99%+ requires redundancy |
| Team Size | 1-10 engineers | >10 engineers (coordination overhead) |
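To back the availability row with numbers, here is a small worked calculation (a sketch that assumes replica failures are independent, which correlated real-world failures violate): the downtime budget for each 'nines' target and the effect of adding redundant machines.

```python
# Availability arithmetic behind the table above.

HOURS_PER_YEAR = 365 * 24   # 8,760

# Downtime budget per year for each availability target.
for availability in (0.999, 0.9999, 0.99999):
    downtime_minutes = (1 - availability) * HOURS_PER_YEAR * 60
    print(f"{availability:.3%} -> {downtime_minutes:.1f} minutes of downtime per year")
# 99.900% -> 525.6 (about 8.76 hours), 99.990% -> 52.6, 99.999% -> 5.3

# Redundancy: if one machine is up 99.9% of the time and failures were truly
# independent, the probability that all n replicas are down at once shrinks
# geometrically, so overall availability is 1 - (1 - a)^n.
single_machine = 0.999
for n in (1, 2, 3):
    combined = 1 - (1 - single_machine) ** n
    print(f"{n} replica(s): {combined:.6f}")
# 1 -> 0.999000, 2 -> 0.999999, 3 -> 1.000000 (to six decimal places)
```

The jump from 99.9% to 99.99% shrinks the yearly downtime budget from hours to under an hour, which a single machine cannot honor through ordinary maintenance and hardware failures alone.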
Distributed systems are expensive—in complexity, engineering time, operational overhead, and actual dollars. Start with the simplest architecture that works. Move to distributed designs when you have concrete evidence that you need them, not based on anticipated future scale that may never materialize.
Understanding where distributed systems came from helps contextualize current challenges and solutions.
1960s-1970s: The Mainframe Era
1980s: The Rise of Workstations
1990s: The Web and Client-Server
2000s: The Scale-Out Revolution
2010s: Microservices and Cloud-Native
2020s: Edge Computing and Global Distribution
Many 'modern' problems—consensus, ordering, replication—were first solved in the 1970s-1990s. Paxos (1989), logical clocks (1978), and state machine replication (1990) remain foundational. What changed is scale: algorithms designed for 5 nodes must now work with 5,000. Current research extends classic solutions to modern contexts.
We've established the foundational understanding of what distributed systems are. Let's consolidate the key insights:
- A distributed system is a collection of autonomous nodes that communicate over a network yet appears to its users as a single coherent system.
- Partial failures, the lack of a global clock, and unreliable message passing make distributed behavior categorically different from single-machine programs.
- Distributed systems come in many shapes, from tightly coupled clusters to loosely coupled microservices to peer-to-peer networks, and production systems are usually hybrids.
- Transparency is the guiding design goal but can never be complete, and distribution should be adopted only when scale, reliability, geography, regulation, or organizational growth demands it.
What's Next:
Now that we understand what distributed systems are, the next page explores why we build them despite their complexity. We'll examine the compelling benefits that make distributed systems worth the engineering investment—scalability that reaches billions of users, fault tolerance that survives data center failures, and performance that spans the globe.
You now possess a rigorous understanding of distributed systems fundamentals. You can define what they are, identify their characteristics, recognize their taxonomy, and articulate their goals. This vocabulary and mental model will underpin every concept in distributed systems design.