A system that works for 100 users may crumble under 10,000. One that handles 10,000 might collapse at a million. Scalability is the ability of a system to handle increased load by adding resources—and it's perhaps the most critical architectural concern for any successful service.
The client-server model inherently creates scaling challenges: one server, many clients. As clients multiply, the server becomes a bottleneck. Solving this requires understanding the dimensions of scale, the techniques available, and the tradeoffs each involves. Great engineers don't just build systems that work—they build systems that continue working as demand grows.
By the end of this page, you will understand scalability fundamentals: measuring and defining scalability, vertical vs. horizontal scaling, load balancing techniques, caching strategies, database scaling patterns, and the principles that guide the design of systems that can grow from hundreds to millions of users.
Scalability is the capability of a system to handle a growing amount of work, or its potential to accommodate growth. A scalable system can increase its capacity by adding resources (hardware, servers, instances) in a way that maintains or improves its performance characteristics.
Dimensions of Scalability:
| Dimension | Description | Example |
|---|---|---|
| Load Scalability | Handle more concurrent requests | From 100 to 100,000 requests/second |
| Data Scalability | Handle more data storage/processing | From 1GB to 1PB of data |
| Geographic Scalability | Serve more geographic regions | From single region to global deployment |
| Administrative Scalability | More teams/services with independence | From one team to hundreds of services |
Scalability vs. Performance:
These related concepts are often confused:
A system can be performant but not scalable (fast for low load but degrades severely under high load), or scalable but not performant (handles load gracefully but each request is slow).
| Indicator | Scalable System | Non-Scalable System |
|---|---|---|
| Throughput vs. load | Linear or near-linear increase | Plateaus or decreases at higher load |
| Latency vs. load | Relatively constant until saturation | Increases rapidly with load |
| Resource addition | Adding resources increases capacity proportionally | Diminishing returns quickly |
| Bottleneck behavior | Multiple components can be the limiter | Single component always limits |
Scalability is limited by the sequential (non-parallelizable) parts of your system, a constraint known as Amdahl's law. If 10% of processing is inherently sequential, you cannot scale beyond 10x no matter how many resources you add. Identify and minimize sequential bottlenecks—they set the ultimate limit on your scalability.
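To make the ceiling concrete, here is a small sketch of Amdahl's law (the 10% sequential fraction mirrors the example above; the numbers are illustrative):

// Amdahl's law: overall speedup with n parallel workers when a fraction
// `sequential` of the work cannot be parallelized.
function speedup(sequential, n) {
  return 1 / (sequential + (1 - sequential) / n);
}

console.log(speedup(0.10, 10));    // ~5.3x with 10 servers
console.log(speedup(0.10, 100));   // ~9.2x with 100 servers
console.log(speedup(0.10, 10000)); // ~10x: the hard ceiling is 1 / 0.10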
Vertical scaling means adding more power to an existing machine: more CPU, more RAM, faster storage. It's the conceptually simplest approach—make the server bigger.
When Vertical Scaling Works Well:
Early-stage systems where traffic is modest and operational simplicity matters most
Stateful components, such as relational databases, that are difficult to distribute
Workloads with predictable, moderate growth that fits within a single machine's capacity
Practical Limits:
As of 2024, the largest cloud instances offer hundreds of vCPUs and several terabytes of RAM.
This seems enormous, but traffic for popular services can easily exceed what any single machine can handle. Twitter processes ~500 million tweets/day. Netflix serves ~15% of global internet bandwidth. No single server can handle these loads.
Vertical Scaling in Practice:
Load increases 5x:
→ Upgrade from 8 CPU to 32 CPU: $500/mo → $2,000/mo
→ Still one server; works immediately
→ If load increases 10x more, hit ceiling
Compare to horizontal scaling:
→ Add 5 servers at original spec: 5 × $500/mo = $2,500/mo
→ More complex setup, but no ceiling
→ Built-in redundancy
Horizontal scaling means adding more machines to distribute the load. Instead of one big server, you have many smaller servers working together.
Requirements for Horizontal Scaling:
To scale horizontally, your application must meet certain requirements:
Stateless or Externalized State — Servers can't rely on local state if any server might handle any request. Either design stateless, or store state in shared systems (database, cache).
Shared-Nothing Architecture — Each server operates independently; doesn't assume what's in other servers' memory or storage.
Idempotent Operations — Requests may be retried on different servers; operations must handle this safely.
External Session Storage — If sessions are needed, store them in Redis or a database accessible to all servers (see the sketch after this list).
Centralized or Distributed Data Store — Servers need somewhere to read/write persistent data.
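For example, here is a minimal sketch of externalized session state, assuming a hypothetical shared cache client (cache) that every server can reach:

// Any server can handle any request because session state lives in a shared
// store (e.g., Redis) rather than in one server's memory.
// `cache` is a hypothetical client exposing get/set with a TTL option.
async function handleRequest(req) {
  const sessionId = req.cookies.sessionId;

  // Load the session from the shared store, not from local memory
  let session = await cache.get(`session:${sessionId}`);
  if (!session) {
    session = { createdAt: Date.now(), cart: [] };
  }

  // ...process the request, possibly mutating the session...
  session.lastSeen = Date.now();

  // Persist the updated session so the next request can land on any server
  await cache.set(`session:${sessionId}`, session, { ttl: 1800 });
  return { status: 200 };
}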
| Aspect | Vertical Scaling | Horizontal Scaling |
|---|---|---|
| Approach | Bigger machine | More machines |
| Limit | Hardware maximum | Theoretically unlimited |
| Failure tolerance | Single point of failure | Redundancy built-in |
| Complexity | Low (single server) | High (distributed system) |
| Cost curve | Exponential (premium pricing) | Linear (commodity) |
| Downtime for scaling | Often required | Usually zero-downtime |
| Application changes | None required | Must be designed for it |
Load balancing is the distribution of incoming traffic across multiple servers. It's the essential mechanism that enables horizontal scaling—without it, clients would still go to a single server.
Load Balancing Algorithms:
| Algorithm | How It Works | Best For | Drawbacks |
|---|---|---|---|
| Round Robin | Rotate through servers in order | Equal-capacity servers, similar request costs | Ignores server load and request complexity |
| Weighted Round Robin | Round robin with weights for server capacity | Servers with different capacities | Weights must be manually configured |
| Least Connections | Send to server with fewest active connections | Varying request duration | Doesn't account for connection cost variation |
| Weighted Least Connections | Least connections adjusted by weight | Different capacities + varying request duration | More complex configuration |
| IP Hash | Hash client IP to determine server | Session persistence without sticky sessions | Uneven distribution if IP distribution skewed |
| Least Response Time | Send to server with fastest recent responses | When response time is the key metric | Requires continuous monitoring |
| Random | Choose server randomly | Simplicity; large number of servers | Can be uneven with small server count |
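To make the first two rows concrete, here is a minimal sketch of round-robin and least-connections selection (the server list and its fields are illustrative):

const servers = [
  { host: 'app-1', activeConnections: 0 },
  { host: 'app-2', activeConnections: 0 },
  { host: 'app-3', activeConnections: 0 },
];

// Round robin: rotate through servers regardless of their current load
let rrIndex = 0;
function pickRoundRobin() {
  const server = servers[rrIndex % servers.length];
  rrIndex++;
  return server;
}

// Least connections: pick the server with the fewest in-flight requests
function pickLeastConnections() {
  return servers.reduce((best, s) =>
    s.activeConnections < best.activeConnections ? s : best
  );
}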
Health Checks:
Load balancers must detect when servers are unhealthy and stop sending traffic to them:
# Health check configuration example
healthCheck:
  path: /health            # Endpoint to check
  interval: 10s            # How often to check
  timeout: 5s              # How long to wait for a response
  healthyThreshold: 2      # Consecutive successes to mark healthy
  unhealthyThreshold: 3    # Consecutive failures to mark unhealthy
Types of Health Checks:
| Type | What It Checks | Reliability |
|---|---|---|
| TCP | Can connect to port | Low (process may be up but broken) |
| HTTP | Returns 2xx on health endpoint | Medium (endpoint may be healthy but app broken) |
| Deep/Synthetic | Performs actual operation (DB query, etc.) | High (verifies full stack) |
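A deep health check might look like the following sketch, assuming an Express-style app object and hypothetical db and cache clients:

// Deep health check: verifies the instance can actually reach its
// dependencies, not just that the process is listening.
app.get('/health', async (req, res) => {
  try {
    await db.query('SELECT 1');   // database reachable?
    await cache.ping();           // cache reachable?
    res.status(200).json({ status: 'ok' });
  } catch (err) {
    // Any failure marks this instance unhealthy, so the load balancer
    // stops routing traffic to it until it recovers.
    res.status(503).json({ status: 'unhealthy', error: err.message });
  }
});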
Load Balancer Types:
Layer 4 (Transport) — Balances on TCP/UDP connection information without inspecting content; fast and protocol-agnostic
Layer 7 (Application) — Inspects HTTP (or other protocol) content, enabling routing by path, host, header, or cookie
DNS Round Robin:
The simplest load balancing: DNS returns multiple IPs for a domain, and clients choose one. This is global-scale load balancing used by major services. However, it's slow to update (DNS caching), and clients may not behave correctly with multiple IPs. It's usually combined with other load balancing at each data center.
Caching stores copies of frequently accessed data closer to where it's needed, reducing load on origin servers and improving response times. It's one of the most powerful scalability techniques.
Caching Layers:
| Layer | What's Cached | Cache Location | TTL Range |
|---|---|---|---|
| Browser Cache | Static assets, API responses | Client device | Seconds to months |
| CDN Cache | Static content, sometimes dynamic | Edge servers globally | Minutes to days |
| Reverse Proxy Cache | HTTP responses | Before app servers | Seconds to hours |
| Application Cache | Computed results, session data | In-memory (Redis, Memcached) | Seconds to hours |
| Database Cache | Query results, indexes | Database server memory | Automatic/managed |
| Object Cache | Serialized objects | Distributed cache | Minutes to days |
Caching Patterns:
Cache-Aside (Lazy Loading):
async function getUser(userId) {
  // 1. Check cache first
  let user = await cache.get(`user:${userId}`);
  if (user) return user; // Cache hit

  // 2. Cache miss: fetch from database
  user = await db.query('SELECT * FROM users WHERE id = ?', [userId]);

  // 3. Populate cache for next time
  await cache.set(`user:${userId}`, user, {ttl: 3600});
  return user;
}
Write-Through:
async function updateUser(userId, data) {
  // 1. Write to the database first (the authoritative store)
  await db.query('UPDATE users SET ? WHERE id = ?', [data, userId]);

  // 2. Immediately refresh the cache with the updated row
  const updated = await db.query('SELECT * FROM users WHERE id = ?', [userId]);
  await cache.set(`user:${userId}`, updated, {ttl: 3600});
}
Write-Behind (Write-Back):
async function updateUser(userId, data) {
  // 1. Write to cache only (fast return)
  await cache.set(`user:${userId}`, data);

  // 2. Asynchronously persist to database
  writeQueue.enqueue({table: 'users', id: userId, data});
}
Cache Invalidation Strategies:
| Strategy | How It Works | Consistency | Complexity |
|---|---|---|---|
| TTL (Time-To-Live) | Cache entries expire after fixed time | Eventually consistent | Simplest |
| Invalidate on Write | Delete/update cache when data changes | Strongly consistent | Moderate |
| Event-Based Invalidation | Publish events when data changes; consumers invalidate | Eventually consistent | Complex |
| Version Tags (ETag) | Cache includes version; check if current | Can be strongly consistent | Moderate |
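A sketch of the invalidate-on-write strategy, reusing the same hypothetical cache and db clients as the examples above (key names are illustrative):

async function updateEmail(userId, email) {
  // 1. Update the authoritative store
  await db.query('UPDATE users SET email = ? WHERE id = ?', [email, userId]);

  // 2. Drop the cached copy; the next read repopulates it via cache-aside
  await cache.del(`user:${userId}`);
}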
"There are only two hard things in Computer Science: cache invalidation and naming things." — Phil Karlton
Serving stale data is often worse than serving no data. Design your invalidation strategy carefully. When in doubt, use shorter TTLs and accept cache misses over serving stale data for critical information.
Databases are often the hardest component to scale because they maintain authoritative state and must preserve consistency. While application servers can be stateless and easily replicated, databases require more sophisticated approaches.
Read Replicas:
Most applications have far more reads than writes. Read replicas scale read capacity while maintaining a single write source.
              ┌──────────────┐
Writes ──────▶│   Primary    │──── Replication ────┐
              │   Database   │                      │
              └──────────────┘                      ▼
                                            ┌──────────────┐
Reads ─────────────────────────────────────▶│  Replica 1   │
                                            └──────────────┘
                                                   │
                                                   ▼
                                            ┌──────────────┐
Reads ─────────────────────────────────────▶│  Replica 2   │
                                            └──────────────┘
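A sketch of read/write splitting at the application layer (the primary and replicas connection pools are hypothetical; note that replicas lag slightly behind the primary, so reads that must see a just-written value may need to go to the primary):

// Route writes to the primary and spread reads across the replicas.
function getConnection({ forWrite = false } = {}) {
  if (forWrite) return primary;
  const i = Math.floor(Math.random() * replicas.length);
  return replicas[i];
}

async function createOrder(order) {
  return getConnection({ forWrite: true })
    .query('INSERT INTO orders SET ?', [order]);
}

async function listOrders(userId) {
  return getConnection()
    .query('SELECT * FROM orders WHERE user_id = ?', [userId]);
}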
Sharding (Horizontal Partitioning):
Divide data across multiple databases, each holding a subset.
                 ┌──────────────┐
user_id 1-1M  ──▶│   Shard 1    │
                 └──────────────┘
                 ┌──────────────┐
user_id 1M-2M ──▶│   Shard 2    │
                 └──────────────┘
                 ┌──────────────┐
user_id 2M-3M ──▶│   Shard 3    │
                 └──────────────┘
Sharding Strategies:
| Strategy | Example | Pros | Cons |
|---|---|---|---|
| Range-Based | user_id 1-1M → shard 1 | Simple; range queries easy | Hot spots if ranges unequal |
| Hash-Based | hash(user_id) % N → shard | Even distribution | Cross-shard queries hard |
| Directory-Based | Lookup table maps key → shard | Flexible | Lookup table is bottleneck |
| Geographic | US users → US shard | Data locality | Cross-region queries slow |
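A minimal sketch of hash-based shard routing (the shard names and count are illustrative; production systems often use consistent hashing so that changing the shard count does not remap most keys):

const crypto = require('crypto');

const SHARD_COUNT = 4;
const shards = ['db-shard-0', 'db-shard-1', 'db-shard-2', 'db-shard-3'];

// Hash the key so that logically adjacent ids spread evenly across shards
function shardFor(userId) {
  const digest = crypto.createHash('md5').update(String(userId)).digest();
  const bucket = digest.readUInt32BE(0) % SHARD_COUNT;
  return shards[bucket];
}

console.log(shardFor(12345)); // e.g. 'db-shard-2', stable for a given id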
Modern Database Options:
| Category | Examples | Scaling Model | Best For |
|---|---|---|---|
| Traditional RDBMS | PostgreSQL, MySQL | Vertical + Read replicas | Complex queries, ACID |
| NewSQL | CockroachDB, TiDB, Spanner | Horizontal with SQL | SQL + horizontal scale |
| NoSQL Document | MongoDB, Couchbase | Built-in sharding | Flexible schemas, scale |
| NoSQL Wide-Column | Cassandra, ScyllaDB | Masterless horizontal | Write-heavy, time-series |
| NoSQL Key-Value | Redis, DynamoDB | Horizontal partitioning | Simple access patterns |
Beyond specific techniques, certain patterns and principles guide the design of scalable systems.
Asynchronous Processing Pattern:
Synchronous (doesn't scale well):
[User Request] → [Process Immediately] → [Wait] → [Return Result]
Asynchronous (scales much better):
[User Request] → [Enqueue Job] → [Return Accepted]
                       ↓
    [Worker Pool Processes When Available]
                       ↓
            [Notify User When Done]
Use asynchronous processing when (see the sketch after this list):
The work is slow or long-running (video encoding, report generation, bulk email)
The client does not need the result immediately and can poll or be notified later
Load is bursty and a queue can smooth spikes while workers drain it at a steady rate
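Here is a minimal sketch of the pattern, assuming a hypothetical queue client plus runExport and notifyUser helpers:

const crypto = require('crypto');

// Producer: accept the request, enqueue the work, return immediately.
async function handleExportRequest(userId) {
  const jobId = crypto.randomUUID();
  await queue.enqueue({ jobId, type: 'export', userId });
  return { status: 202, jobId };        // 202 Accepted: result comes later
}

// Consumer: a pool of workers drains the queue at its own pace.
async function worker() {
  while (true) {
    const job = await queue.dequeue();  // blocks until a job is available
    try {
      await runExport(job);             // the slow part happens here
      await notifyUser(job.userId, job.jobId);
    } catch (err) {
      await queue.enqueue(job);         // naive retry; real systems track attempts
    }
  }
}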
The CQRS Pattern (Command Query Responsibility Segregation):
Separate read and write paths, allowing each to scale independently:
┌───────────────┐      ┌─────────────────┐      ┌─────────────────┐
│  Write Path   │─────▶│  Write Database │─────▶│   Event Stream  │
│  (Commands)   │      │   (Normalized)  │      │    (Updates)    │
└───────────────┘      └─────────────────┘      └────────┬────────┘
                                                         │
┌───────────────┐      ┌─────────────────┐               │
│   Read Path   │◀─────│    Read Store   │◀──────────────┘
│   (Queries)   │      │  (Denormalized) │
└───────────────┘      └─────────────────┘
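A compact sketch of the CQRS flow, with hypothetical writeDb, eventStream, and readStore clients:

const crypto = require('crypto');

// Command side: validate, write to the normalized store, publish an event.
async function placeOrder(command) {
  const order = { id: crypto.randomUUID(), ...command, status: 'placed' };
  await writeDb.query('INSERT INTO orders SET ?', [order]);
  await eventStream.publish('order.placed', order);
  return order.id;
}

// Query side: a projector consumes events and maintains a denormalized view
// that can be read cheaply and scaled independently of the write path.
eventStream.subscribe('order.placed', async (order) => {
  const key = `orders:by-user:${order.userId}`;
  const existing = (await readStore.get(key)) || [];
  await readStore.set(key, [...existing, order]);
});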
Design your system to handle 10x current load without fundamental architecture changes. This gives headroom for growth while keeping the system practical for current needs. If you expect 100K users, design for 1M. If handling 1M would require a fundamentally different architecture, don't build it yet, but have a plan for when you need it.
Auto-scaling automatically adjusts the number of running server instances based on demand. It's the operational realization of horizontal scaling—you don't add servers manually; the system does it automatically.
Auto-Scaling Strategies:
| Strategy | Trigger | Example | Pros/Cons |
|---|---|---|---|
| Reactive (Metric-Based) | CPU > 70% for 5 min | Add 2 instances | Simple; can be slow to react |
| Predictive | ML predicts traffic increase | Scale before demand | Handles spikes; complex to set up |
| Scheduled | Time-based (9 AM weekdays) | Scale up for business hours | Predictable patterns; inflexible |
| Target Tracking | Maintain 1000 req/instance | Add/remove to maintain target | Self-adjusting; requires good metrics |
Key Metrics for Scaling Decisions:
CPU and memory utilization (the most common reactive triggers)
Request rate and request latency (closer to what users actually experience)
Queue depth or backlog (for worker pools processing asynchronous jobs)
Custom business metrics (e.g., active sessions or concurrent streams)
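A simplified sketch of a reactive, metric-based scaling loop (the metrics and fleet clients are hypothetical; in practice you would use your cloud provider's auto-scaling service rather than a hand-rolled loop):

const MIN_INSTANCES = 2;
const MAX_INSTANCES = 20;
const TARGET_CPU = 0.70;            // scale out above 70% average CPU
const COOLDOWN_MS = 5 * 60 * 1000;  // wait between actions to avoid oscillation

let lastAction = 0;

async function evaluate() {
  if (Date.now() - lastAction < COOLDOWN_MS) return;

  const avgCpu = await metrics.averageCpu();   // 0.0 - 1.0 across the fleet
  const count = await fleet.instanceCount();

  if (avgCpu > TARGET_CPU && count < MAX_INSTANCES) {
    await fleet.addInstances(2);               // scale out
    lastAction = Date.now();
  } else if (avgCpu < TARGET_CPU / 2 && count > MIN_INSTANCES) {
    await fleet.removeInstances(1);            // scale in gently
    lastAction = Date.now();
  }
}

setInterval(evaluate, 60 * 1000);              // evaluate once a minute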
Auto-scaling isn't magic. Watch for cold-start delays (new instances take time to warm up), scaling oscillation (rapid scale up and down), cost surprises (runaway scaling), and cascading failures (scaling up one tier when a downstream dependency is the real bottleneck). Test auto-scaling behavior before relying on it in production.
We've comprehensively explored scalability in client-server systems. Let's consolidate the key insights:
Scalability is the ability to handle growing load by adding resources; it is related to, but distinct from, raw performance
Vertical scaling is simple but hits a hardware ceiling; horizontal scaling is effectively unlimited but requires stateless, shared-nothing design
Load balancing distributes traffic across servers and relies on health checks to route around failures
Caching at multiple layers absorbs read load, with invalidation as the hard part
Databases scale through read replicas, sharding, and purpose-built distributed databases
Asynchronous processing, CQRS, and auto-scaling keep systems responsive as demand grows
Module Complete:
You've now completed the Client-Server Model module. You understand:
The roles and responsibilities of clients and servers
Request-response communication and its common patterns
The major types of servers and what they do
Scalability strategies, from vertical and horizontal scaling to load balancing, caching, and database scaling
This foundational knowledge underlies virtually all networked computing and prepares you for understanding specific application-layer protocols in subsequent modules.
Congratulations on completing the Client-Server Model module! You now have a comprehensive understanding of this fundamental paradigm: from the roles of clients and servers, through request-response communication, to server types and scalability strategies. This knowledge forms the foundation for understanding all application-layer protocols and distributed systems.