In the world of distributed systems, availability isn't a nice-to-have—it's an existential requirement. When Amazon's website goes down for even a minute, the company loses an estimated $220,000 in sales. When banking systems become unavailable, financial transactions halt, businesses suffer, and regulators take notice. When social media platforms become unreachable, users flee to competitors.
Yet achieving availability in distributed systems presents a fundamental challenge: the more machines you add to handle load and provide redundancy, the more likely it becomes that something will fail at any given moment. A system with 100 servers, each with 99.9% uptime, will experience failures multiple times per day.
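The arithmetic behind that claim is easy to check. The sketch below uses the numbers from the paragraph above (100 servers, 99.9% uptime each); it is a back-of-the-envelope illustration, not a measurement:

```typescript
// Probability that at least one of N independent servers is down at any moment.
const perServerUptime = 0.999; // 99.9%
const servers = 100;

const allUp = Math.pow(perServerUptime, servers); // ≈ 0.905
const atLeastOneDown = 1 - allUp;                 // ≈ 0.095

console.log(atLeastOneDown); // ~9.5% of the time, something in the fleet is failing
```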
This is where the BASE consistency model enters the picture—and its first principle, Basically Available, represents a radical rethinking of how distributed systems should behave when things go wrong.
By the end of this page, you will understand what 'Basically Available' means in the context of distributed databases, how it differs from traditional availability concepts, the theoretical foundations from the CAP theorem, and the architectural patterns that enable systems to remain available even during partial failures. You'll also learn how to design systems that gracefully degrade rather than catastrophically fail.
The term "Basically Available" might seem like a vague or weak guarantee, but it represents a carefully considered engineering philosophy. To understand it, we must first distinguish it from absolute availability and strong consistency availability.
Absolute Availability promises that every request will receive a response—guaranteed. This is theoretically impossible in a distributed system subject to network partitions and node failures.
Strong Consistency Availability promises that every request will receive the correct response—the most up-to-date data. This requires coordination between nodes, which becomes impossible when nodes can't communicate.
Basic Availability takes a different approach: the system will always attempt to provide a response, even if that response might be slightly stale or incomplete. The system remains functional even when it can't be perfect.
The word 'Basically' in 'Basically Available' is intentional. It acknowledges that perfect availability is impossible in distributed systems. Instead, the system guarantees to remain functional 'to the greatest extent possible'—always responding to requests, always serving data, even if that data isn't perfectly consistent across all nodes. This pragmatic approach prioritizes user experience over theoretical purity.
The Fundamental Insight:
Basic availability is built on a crucial observation: for most applications, an available but slightly stale response is far more valuable than no response at all. A product page that shows a price from a few seconds ago, a social feed that is missing the very latest post, or an inventory count that lags a replica by a moment are all perfectly usable.
The alternative—refusing to serve requests until perfect consistency is achieved—often means error pages, timed-out transactions, and users abandoning the service for a competitor.
Basic availability chooses functionality over perfection.
Basic availability's theoretical underpinning comes from the CAP theorem, one of the most important results in distributed systems theory. Proposed by Eric Brewer in 2000 and formally proven by Seth Gilbert and Nancy Lynch in 2002, the CAP theorem states:
A distributed data store cannot simultaneously provide more than two of the following three guarantees: Consistency (every read sees the most recent write or returns an error), Availability (every request receives a non-error response, though not necessarily the latest data), and Partition tolerance (the system keeps operating even when messages between nodes are lost or delayed).
In a distributed system, network partitions are not a question of 'if' but 'when.' Networks fail. Cables get cut. Routers malfunction. Data centers lose connectivity. This means partition tolerance (P) is not optional—it's a requirement. The real choice in modern distributed systems is between Consistency (CP) and Availability (AP) during network partitions.
Why Partitions Are Inevitable:
Consider what happens in a real distributed system spanning multiple data centers when one of these failures severs the link between regions.
When a partition occurs, nodes on either side of the partition can't communicate. At this moment, the system must make a choice:
Option 1 (Choose Consistency): Nodes refuse to serve requests until the partition heals and they can verify data consistency. Users see errors or timeouts.
Option 2 (Choose Availability): Nodes continue serving requests using their local data, accepting that different nodes might have temporarily different views of the data.
BASE systems choose Option 2. They prioritize availability over strong consistency during partitions.
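A minimal sketch of the two options, assuming a hypothetical per-node `localStore` and a count of currently reachable replicas (both invented for illustration): in CP mode a read fails when a quorum can't be contacted, while in AP mode the node answers from its local, possibly stale, copy.

```typescript
// Illustrative only: how a single node might answer a read during a partition.
type Mode = 'CP' | 'AP';

interface Versioned<T> {
  value: T;
  version: number; // last-writer timestamp or vector-clock stand-in
}

function readDuringPartition<T>(
  key: string,
  mode: Mode,
  localStore: Map<string, Versioned<T>>,
  reachableReplicas: number,
  readQuorum: number
): Versioned<T> {
  if (mode === 'CP' && reachableReplicas < readQuorum) {
    // Option 1: refuse to answer rather than risk returning stale data
    throw new Error('Unavailable: cannot reach a read quorum during partition');
  }

  const local = localStore.get(key);
  if (local === undefined) {
    throw new Error(`Key not found locally: ${key}`);
  }

  // Option 2 (AP / BASE): answer from local state, possibly stale,
  // and let anti-entropy reconcile replicas after the partition heals.
  return local;
}
```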
| System Type | During Normal Operation | During Network Partition | Examples |
|---|---|---|---|
| CP (Consistent + Partition-Tolerant) | Strong consistency, full availability | Consistency maintained, availability sacrificed | Google Spanner, etcd, ZooKeeper |
| AP (Available + Partition-Tolerant) | Eventual consistency, full availability | Availability maintained, consistency sacrificed | Cassandra, DynamoDB, CouchDB |
| CA (Consistent + Available) | Theoretically impossible in distributed systems | Cannot exist—partitions will occur | Single-node databases only |
Understanding basic availability requires understanding how availability is measured and what different availability targets mean in practice. The industry standard for measuring availability is the number of nines—expressed as a percentage of uptime over a given period.
| Availability % | Nines | Downtime per Year | Downtime per Month | Downtime per Day |
|---|---|---|---|---|
| 99% | Two nines | 3.65 days | 7.31 hours | 14.40 minutes |
| 99.9% | Three nines | 8.77 hours | 43.83 minutes | 1.44 minutes |
| 99.99% | Four nines | 52.60 minutes | 4.38 minutes | 8.64 seconds |
| 99.999% | Five nines | 5.26 minutes | 26.30 seconds | 864 milliseconds |
| 99.9999% | Six nines | 31.56 seconds | 2.63 seconds | 86.4 milliseconds |
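The downtime figures above follow directly from the fraction of time the system is allowed to be down: downtime = (1 − availability) × period. A quick sketch of the conversion (the helper name is ours, not an industry tool; the table uses 365.25-day years and 30.44-day months):

```typescript
// Convert an availability target into an allowed-downtime budget.
const MS_PER_DAY = 24 * 60 * 60 * 1000;

function downtimeBudgetMs(availabilityPercent: number, periodDays: number): number {
  const unavailableFraction = 1 - availabilityPercent / 100;
  return unavailableFraction * periodDays * MS_PER_DAY;
}

// Examples (match the table above):
console.log(downtimeBudgetMs(99.9, 365.25) / (60 * 60 * 1000)); // ≈ 8.77 hours per year
console.log(downtimeBudgetMs(99.99, 30.44) / (60 * 1000));      // ≈ 4.38 minutes per month
```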
Each additional nine of availability typically requires an order of magnitude more engineering effort and infrastructure cost. Going from 99% to 99.9% might require redundancy. Going from 99.99% to 99.999% might require multi-region deployments, sophisticated failover mechanisms, and extensive monitoring. The decision of how many nines to target should be driven by business requirements, not engineering ambition.
How Basic Availability Affects SLAs:
When we say a system is 'Basically Available,' we're making a specific claim: the system will respond to requests, but the response might not reflect the absolute latest state. This distinction affects how we measure availability:
Traditional Availability Measurement: a request counts as successful only if the system returns a correct, fully up-to-date response within its latency target; anything else counts as downtime.
Basic Availability Measurement: a request counts as successful if the system returns any usable response, even one served from a stale replica or a degraded code path.
This relaxed definition allows basically available systems to achieve higher availability numbers by counting 'stale but served' as successful responses.
Practical Example:
Consider a global e-commerce system with data centers in the US, Europe, and Asia: a customer in Europe updates their shipping address, and moments later a request routed to the Asia data center, which has not yet received the update, returns the old address.
Under traditional availability, this stale read might be counted as a failure. Under basic availability, it's counted as a success—the user got a response, and the data will eventually be consistent.
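To make the measurement difference concrete, here is a toy calculation with made-up request counts that scores the same traffic under both definitions:

```typescript
// Toy example: the same traffic scored under two definitions of "successful".
interface RequestTally {
  total: number;          // all requests received
  errors: number;         // requests that returned an error or timed out
  staleButServed: number; // requests answered with stale (not yet replicated) data
}

// Traditional: stale responses count against availability.
function traditionalAvailability(t: RequestTally): number {
  return (t.total - t.errors - t.staleButServed) / t.total;
}

// Basic availability: any served response counts, stale or not.
function basicAvailability(t: RequestTally): number {
  return (t.total - t.errors) / t.total;
}

const tally: RequestTally = { total: 1_000_000, errors: 500, staleButServed: 4_500 };
console.log(traditionalAvailability(tally)); // 0.995  (99.5%)
console.log(basicAvailability(tally));       // 0.9995 (99.95%)
```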
Achieving basic availability in distributed systems requires a combination of architectural patterns that work together to ensure the system can respond to requests even when components fail. These patterns form the foundation of the world's most reliable systems.
Deep Dive: Replication for Availability
Replication is the cornerstone of availability. By maintaining multiple copies of data, we ensure that no single failure can make data inaccessible. However, replication introduces a fundamental challenge: keeping replicas synchronized.
There are three primary replication strategies:
Synchronous Replication: every write is confirmed by all replicas before the client receives an acknowledgment. No single failure loses data, but one slow or unreachable replica stalls every write, so availability suffers.
Asynchronous Replication: the primary acknowledges the write immediately and propagates it to replicas in the background. Writes stay fast and available even when replicas are down, at the cost of a window in which replicas are stale and a failed primary can lose recent writes.
Quorum-Based Replication: a write is acknowledged once a configurable subset of replicas (the write quorum, W, out of N) confirms it, and reads consult a read quorum, R. This lets operators tune the balance between consistency and availability per workload.
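The usual rule of thumb is that the read and write quorums overlap whenever R + W > N, so every read intersects at least one replica that saw the latest write. A tiny sketch of that check (the function is illustrative, not taken from any particular database):

```typescript
// R + W > N guarantees that read and write quorums overlap in at least one replica.
function quorumsOverlap(n: number, writeQuorum: number, readQuorum: number): boolean {
  return readQuorum + writeQuorum > n;
}

console.log(quorumsOverlap(3, 2, 2)); // true  - classic N=3, W=2, R=2
console.log(quorumsOverlap(3, 1, 1)); // false - fast, but reads may miss the latest write
```

The quorum-write sketch below applies the same idea: the write is acknowledged as soon as W replicas confirm it, and the remaining replicas catch up asynchronously.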
```typescript
// Example: Quorum-based write in a distributed system
// Assuming N = 3 replicas, W = 2 (write quorum)
// `Replica` and `withTimeout` are minimal helpers defined here for the example.

interface Replica {
  write(key: string, value: any, timestamp: number): Promise<void>;
}

interface WriteResult {
  success: boolean;
  confirmedReplicas: number;
  errors: Error[];
}

// Reject a promise that doesn't settle within `ms` milliseconds
function withTimeout<T>(promise: Promise<T>, ms: number): Promise<T> {
  return Promise.race([
    promise,
    new Promise<T>((_, reject) =>
      setTimeout(() => reject(new Error(`Timed out after ${ms} ms`)), ms)
    ),
  ]);
}

async function quorumWrite(
  key: string,
  value: any,
  replicas: Replica[],
  writeQuorum: number
): Promise<WriteResult> {
  const W = writeQuorum; // need W of replicas.length confirmations

  // Send write to all replicas in parallel
  const writePromises = replicas.map(async (replica) => {
    try {
      await replica.write(key, value, Date.now());
      return { success: true, replica };
    } catch (error) {
      return { success: false, replica, error };
    }
  });

  // Wait for responses with timeout
  const results = await Promise.allSettled(
    writePromises.map(p => withTimeout(p, 5000))
  );

  const successes = results.filter(
    r => r.status === 'fulfilled' && r.value.success
  ).length;

  // Quorum achieved?
  if (successes >= W) {
    // Write is considered successful - available!
    // Remaining replicas will eventually receive the write
    return { success: true, confirmedReplicas: successes, errors: [] };
  } else {
    // Quorum not achieved - write failed
    // In AP systems, might still succeed with warning
    return {
      success: false,
      confirmedReplicas: successes,
      errors: results
        .filter((r): r is PromiseRejectedResult => r.status === 'rejected')
        .map(r => r.reason)
    };
  }
}
```

Graceful degradation is a key strategy for maintaining basic availability. Rather than failing completely when resources are constrained or components fail, the system reduces functionality incrementally, prioritizing the most critical features.
Implementing Graceful Degradation:
Effective graceful degradation requires planning. You must identify which features are essential to the core user journey, which can fall back to cached or default data, and which can be switched off entirely when the system is under stress.
Feature Priority Tiers: rank features by business criticality so the system knows what to sacrifice first. Core flows such as checkout or login are never shed, while lower-tier features like recommendations or review widgets can be disabled when capacity runs short; a configuration sketch follows below.
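A simple way to encode such tiers is a static configuration that maps features to the conditions under which they may be shed. The feature names and thresholds here are invented for illustration:

```typescript
// Hypothetical feature-tier configuration for load shedding.
type Tier = 'critical' | 'important' | 'optional';

interface FeatureConfig {
  tier: Tier;
  // Shed the feature when system load (0..1) rises above this threshold.
  shedAboveLoad: number;
}

const featureTiers: Record<string, FeatureConfig> = {
  checkout:            { tier: 'critical',  shedAboveLoad: 1.0 }, // never shed
  productSearch:       { tier: 'critical',  shedAboveLoad: 1.0 },
  personalizedRecs:    { tier: 'important', shedAboveLoad: 0.8 },
  reviewsAndRatings:   { tier: 'optional',  shedAboveLoad: 0.6 },
  recentlyViewedItems: { tier: 'optional',  shedAboveLoad: 0.6 },
};

function isFeatureEnabled(feature: string, currentLoad: number): boolean {
  const config = featureTiers[feature];
  if (!config) return true; // unknown features default to enabled
  return currentLoad <= config.shedAboveLoad;
}
```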
Fallback Strategies: define in advance what each feature does when its primary dependency is unavailable, such as serving cached results, returning sensible defaults, or queueing work for later processing instead of rejecting the request.
Circuit Breakers: stop calling a dependency that is repeatedly failing, route requests straight to a fallback, and periodically probe the dependency to detect recovery. This prevents a struggling component from dragging down every caller that depends on it.
```typescript
enum CircuitState {
  CLOSED,    // Normal operation - requests flow through
  OPEN,      // Failure detected - requests blocked, fallback used
  HALF_OPEN  // Testing recovery - limited requests allowed
}

class CircuitBreaker {
  private state: CircuitState = CircuitState.CLOSED;
  private failureCount: number = 0;
  private lastFailureTime: number = 0;

  constructor(
    private failureThreshold: number = 5,
    private resetTimeout: number = 30000 // 30 seconds
  ) {}

  async execute<T>(
    primaryFn: () => Promise<T>,
    fallbackFn: () => Promise<T>
  ): Promise<T> {
    // Check if circuit should transition from OPEN to HALF_OPEN
    if (this.state === CircuitState.OPEN) {
      if (Date.now() - this.lastFailureTime > this.resetTimeout) {
        this.state = CircuitState.HALF_OPEN;
      } else {
        // Circuit is open - use fallback for availability
        return fallbackFn();
      }
    }

    try {
      const result = await primaryFn();
      // Success - reset circuit
      if (this.state === CircuitState.HALF_OPEN) {
        this.state = CircuitState.CLOSED;
      }
      this.failureCount = 0;
      return result;
    } catch (error) {
      this.failureCount++;
      this.lastFailureTime = Date.now();

      if (this.failureCount >= this.failureThreshold) {
        this.state = CircuitState.OPEN;
        console.log('Circuit opened - switching to fallback');
      }

      // Provide graceful degradation via fallback
      return fallbackFn();
    }
  }
}

// Usage example - `recommendationService` and `cache` stand in for external clients
declare const recommendationService: { getPersonalized(userId: string): Promise<unknown> };
declare const cache: { get(key: string): Promise<unknown> };

const recommendationBreaker = new CircuitBreaker(5, 30000);

async function getProductRecommendations(userId: string) {
  return recommendationBreaker.execute(
    // Primary: real-time personalized recommendations
    async () => {
      return await recommendationService.getPersonalized(userId);
    },
    // Fallback: cached popular items (always available)
    async () => {
      return await cache.get('popular-products');
    }
  );
}
```

Let's examine how major companies implement basic availability in their systems. These patterns have been battle-tested at scales of millions of requests per second.
Amazon calculated that every 100ms of latency costs them 1% in sales. But complete unavailability costs 100% of sales. This makes the basic availability trade-off clear: slightly stale data that's always available is vastly more valuable than perfect data that's sometimes unavailable. The business always chooses availability.
We've explored the first and arguably most important pillar of the BASE consistency model: Basically Available. Let's consolidate the key takeaways:

- 'Basically Available' means the system always attempts to respond, even if the response is stale, partial, or degraded.
- The CAP theorem makes partition tolerance mandatory, so the real choice during partitions is between consistency (CP) and availability (AP); BASE systems choose availability.
- Availability is measured in nines, and each additional nine costs roughly an order of magnitude more engineering effort; the target should come from business requirements.
- Replication, quorum writes, graceful degradation, and circuit breakers are the core patterns that keep a system responding through partial failures.
What's Next:
Now that we understand basic availability, we'll explore the second pillar of BASE: Soft State. Soft state describes how data in a basically available system isn't permanent—it can change over time even without explicit user input, as the system works to reconcile differences between replicas. This concept fundamentally changes how we think about data management in distributed systems.
You now understand what 'Basically Available' means in the context of distributed databases. This pillar of BASE represents a deliberate trade-off: by relaxing consistency guarantees, distributed systems can remain available even during partial failures. Next, we'll explore how 'Soft State' enables this availability through flexible data management.