Every computer you've ever used lies to you about time. The clock displayed on your screen, the timestamps in your log files, the 'current time' returned by Date.now() or time.time()—none of these represent true, absolute time. They represent the output of imperfect physical oscillators, processed through layers of hardware and software that introduce errors at every stage.
Understanding physical clocks—how they work, why they fail, and what 'clock drift' really means—is essential for building robust distributed systems. Engineers who treat System.currentTimeMillis() as a source of truth will inevitably build systems that fail in subtle, hard-to-debug ways. This page provides the deep technical foundation needed to reason correctly about physical time.
By the end of this page, you will understand how computer clocks physically function, what causes drift (and how to quantify it), the difference between wall-clock and monotonic time, and why even high-quality clocks introduce uncertainty that impacts distributed system design. This knowledge is foundational for understanding NTP, logical clocks, and hybrid approaches.
At the heart of every computer's timekeeping is a crystal oscillator—typically a small piece of quartz that vibrates at a precise frequency when electric current passes through it. This vibration generates electrical pulses that form the fundamental 'tick' of the computer's clock.
The Timekeeping Chain:
1. Crystal oscillator: a quartz crystal generates a stable stream of electrical pulses.
2. Hardware counters: timer circuits (RTC, TSC, HPET) count those pulses into a running tally.
3. Kernel timekeeping: the operating system converts counter readings into wall-clock and monotonic time, applying NTP corrections.
4. Application APIs: the time(), gettimeofday(), clock_gettime(), or System.currentTimeMillis() calls that applications use, which wrap kernel time services.

The Physical Layer: Quartz Crystal Oscillators
Quartz oscillators work through the piezoelectric effect: applying voltage to a quartz crystal causes it to deform, and deforming it generates voltage. When correctly cut and mounted, a quartz crystal will oscillate at a very stable frequency determined by its physical dimensions.
For computer RTCs, the standard frequency is 32.768 kHz (32,768 Hz), chosen because 32,768 = 2^15, making it trivial to divide down to 1 Hz (one tick per second) using a 15-stage binary counter. This simplifies circuit design but comes at the cost of precision—cheaper crystals at this frequency have significant drift.
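As a quick illustration (a trivial sketch, not anything from real firmware), halving 32,768 Hz through fifteen divider stages lands exactly on 1 Hz:

```python
# 32,768 Hz halved through 15 binary divider stages yields exactly 1 Hz
freq_hz = 32_768
for _ in range(15):
    freq_hz //= 2
print(freq_hz)  # 1
```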
| Clock Type | Typical Frequency | Accuracy (ppm) | Drift per Day | Use Case |
|---|---|---|---|---|
| Cheap quartz (RTC) | 32.768 kHz | ±50-100 ppm | 4-9 seconds | Battery-backed RTC, consumer devices |
| Standard quartz (RTC) | 32.768 kHz | ±20-50 ppm | 2-4 seconds | Server RTC, moderate environments |
| Temperature-compensated (TCXO) | Various | ±1-10 ppm | 0.1-0.9 seconds | Telecom, financial systems |
| Oven-controlled (OCXO) | Various | ±0.01-0.1 ppm | 1-9 milliseconds | Precision instrumentation |
| Rubidium atomic | Various | ±0.0001 ppm | ~0.01 ms | Telecom holdover, secondary frequency standards |
| Cesium atomic | 9.192... GHz | ±10^-12 | ~nanoseconds | Definition of the second |
| GPS-disciplined | Various | ~±10^-9 | ~microseconds | Data center time (Google TrueTime) |
Clock accuracy is measured in parts per million (ppm). A clock with 50 ppm accuracy runs fast or slow by up to 50 microseconds per second. That's 50 × 10^-6 × 86,400 seconds/day = 4.32 seconds of drift per day. Over a month, a 50 ppm clock can drift by over 2 minutes. This error accumulates continuously between synchronization events.
Clock drift refers to the gradual deviation of a clock from true time. Unlike a broken clock that gives random readings, a drifting clock appears to work correctly but runs slightly fast or slow relative to a reference time source. This drift is continuous and cumulative—small per-second errors compound into significant errors over hours, days, and weeks.
Why Clocks Drift:
Quartz crystal oscillators drift because their resonant frequency depends on physical conditions:
- Temperature: the dominant factor; frequency shifts as ambient and board temperature move away from the crystal's turnover point (covered in depth later on this page).
- Aging: crystals slowly change frequency over months and years as the crystal and its mounting relax.
- Manufacturing tolerance: no two crystals are cut identically, so every unit starts with its own baseline offset.
- Mechanical stress: shock and vibration can cause temporary frequency shifts.
- Supply voltage and load capacitance: variations in the drive circuitry shift the operating frequency slightly.
Quantifying Drift: The Math That Matters
For distributed systems design, you need to calculate drift bounds. The formula is straightforward:
```python
# Clock Drift Calculations for Distributed Systems

def calculate_max_drift(ppm: float, duration_seconds: float) -> float:
    """
    Calculate maximum clock drift over a time period.

    Args:
        ppm: Clock accuracy in parts per million (e.g., 50.0 for 50 ppm)
        duration_seconds: Time in seconds since last synchronization

    Returns:
        Maximum drift in seconds
    """
    return (ppm / 1_000_000) * duration_seconds


def calculate_max_skew(ppm_a: float, ppm_b: float, duration_seconds: float) -> float:
    """
    Calculate maximum skew between two clocks.
    Both clocks could drift in opposite directions (worst case).

    Args:
        ppm_a: Clock A accuracy in ppm
        ppm_b: Clock B accuracy in ppm
        duration_seconds: Time since both were synchronized

    Returns:
        Maximum skew in seconds (difference between the two clocks)
    """
    # Worst case: one clock runs fast, the other runs slow
    max_drift_a = calculate_max_drift(ppm_a, duration_seconds)
    max_drift_b = calculate_max_drift(ppm_b, duration_seconds)
    return max_drift_a + max_drift_b


# Example: Server farm with 50 ppm clocks, NTP sync every 10 minutes
ppm_typical_server = 50.0
sync_interval_seconds = 10 * 60  # 10 minutes = 600 seconds

max_drift = calculate_max_drift(ppm_typical_server, sync_interval_seconds)
print(f"Max drift per clock: {max_drift * 1000:.1f} ms")  # 30.0 ms

max_skew = calculate_max_skew(ppm_typical_server, ppm_typical_server, sync_interval_seconds)
print(f"Max skew between any two servers: {max_skew * 1000:.1f} ms")  # 60.0 ms

# Extended example: Worst-case scenarios
scenarios = [
    ("Same rack, NTP every 1 min", 50, 60),
    ("Same datacenter, NTP every 10 min", 50, 600),
    ("Cross-datacenter, NTP every 1 hour", 50, 3600),
    ("Poor NTP, sync every 24 hours", 50, 86400),
    ("Server with TCXO (5 ppm), 10 min sync", 5, 600),
    ("Google TrueTime scenario", 0.001, 600),  # GPS-disciplined
]

print("\nMaximum clock skew scenarios:")
print("-" * 55)
for scenario, ppm, interval in scenarios:
    skew = calculate_max_skew(ppm, ppm, interval)
    print(f"{scenario}:")
    print(f"  Max skew: {skew * 1000:.2f} ms ({skew:.4f} s)")
```

Drift is not a one-time error—it accumulates continuously. Between NTP synchronizations, clocks diverge. If your NTP sync fails for an hour, a 50 ppm clock could drift by 180 milliseconds. In 24 hours, that's 4.3 seconds. Systems must be designed to handle both normal drift and extended periods without synchronization.
Operating systems provide two fundamentally different types of clocks, and confusing them is one of the most common sources of time-related bugs. Understanding the distinction is critical:
Wall-Clock Time (Real Time):
Wall-clock time attempts to reflect the actual time of day—what a clock on your wall would show. It's synchronized to external references (NTP servers) and corresponds to human-meaningful time (e.g., 'January 7, 2026 at 3:42 PM').
Monotonic Time:
Monotonic time is a counter that starts at some arbitrary point (often system boot) and only moves forward. It's not synchronized to external sources and doesn't correspond to real-world time, but it never jumps backward and advances at a steady rate.
| Characteristic | Wall-Clock Time | Monotonic Time |
|---|---|---|
| Meaning | Time of day (UTC, local timezone) | Duration since arbitrary epoch |
| Can jump forward | Yes (NTP step adjustment, DST) | No (only during extreme conditions) |
| Can jump backward | Yes (NTP correction, leap seconds) | No (never under normal operation) |
| Synchronized to external source | Yes (NTP) | No (local hardware only) |
| Affected by DST changes | Yes (in local time) | No |
| Affected by timezone changes | Yes (in local time) | No |
| Suitable for timeouts | No (jumps cause issues) | Yes (monotonically increasing) |
| Suitable for scheduling | Yes (calendar-based events) | No (no real-world meaning) |
| Suitable for logging | Yes (human-readable) | Partial (needs context) |
| Suitable for distributed ordering | Limited (uncertainty bounds) | No (not synchronized) |
The Danger of Using Wall-Clock for Durations
Consider this common anti-pattern:
```python
import time


def process_next_item():
    """Placeholder for the real unit of work (illustration only)."""
    time.sleep(0.1)


# DANGEROUS: Using wall-clock time for timeout/duration
# This code WILL break when NTP adjusts the clock

def dangerous_timeout():
    """This function has a subtle bug that causes production failures."""
    start = time.time()  # Wall-clock time
    timeout_seconds = 30

    while True:
        # Process some work...
        process_next_item()

        elapsed = time.time() - start
        if elapsed > timeout_seconds:
            print("Timeout reached")
            break

    # BUG: If NTP jumps time backward by 1 minute during this loop,
    # 'elapsed' becomes negative, and this loop runs for 1 minute
    # longer than intended.
    #
    # If NTP jumps time forward by 1 hour, the loop exits immediately
    # even if it just started, potentially before processing any items.


def safe_timeout():
    """Correct implementation using the monotonic clock."""
    start = time.monotonic()  # Monotonic time - never moves backward
    timeout_seconds = 30

    while True:
        # Process some work...
        process_next_item()

        elapsed = time.monotonic() - start
        if elapsed > timeout_seconds:
            print("Timeout reached")
            break

    # SAFE: time.monotonic() is guaranteed to never move backward.
    # NTP adjustments don't affect it. The timeout works correctly
    # regardless of wall-clock adjustments.


# CRITICAL: Different languages have different APIs
#
# Python:
#   time.time()                 -> Wall-clock (AVOID for durations)
#   time.monotonic()            -> Monotonic  (USE for durations)
#
# Java:
#   System.currentTimeMillis()  -> Wall-clock (AVOID for durations)
#   System.nanoTime()           -> Monotonic  (USE for durations)
#
# C (Linux):
#   clock_gettime(CLOCK_REALTIME, ...)   -> Wall-clock
#   clock_gettime(CLOCK_MONOTONIC, ...)  -> Monotonic
#
# Go:
#   time.Now() -> Wall-clock reading, but since Go 1.9 the returned
#   time.Time also carries a monotonic component that time.Since()
#   and Sub() use for durations
#
# Rust:
#   std::time::SystemTime::now() -> Wall-clock
#   std::time::Instant::now()    -> Monotonic
```

Use wall-clock time when you need to know 'what time is it?' (logging, scheduling, user display). Use monotonic time when you need to measure 'how much time has passed?' (timeouts, rate limiting, performance measurement). Mixing these up is a guaranteed source of production bugs.
Wall-clock time doesn't just drift—it gets adjusted. These adjustments happen when the OS reconciles local time with an authoritative source (typically NTP). Understanding the types of adjustments is crucial because each has different implications for system behavior:
Types of Clock Adjustments:
- Slewing: for small offsets, the kernel gradually speeds up or slows down the clock (e.g., via adjtime/ntp_adjtime) until it converges. There is no discontinuity, but for a while the clock advances at a rate reported by time() that doesn't correspond to elapsed real-world time.
- Stepping: for large offsets, the clock is set directly to the correct value, producing a visible forward or backward jump.
- Leap seconds: occasional one-second insertions into UTC to keep it aligned with the Earth's rotation, handled by stepping, freezing, or smearing the clock.

The Leap Second Problem:
Leap seconds have caused numerous production incidents. The issue is that 23:59:60 is not a valid time in most software libraries, and systems handle it inconsistently:
| Strategy | How It Works | Pros | Cons |
|---|---|---|---|
| Step (POSIX) | The clock steps back one second at the insertion, so 23:59:59 effectively repeats | Simple, immediate | Time appears to go backward; breaks monotonicity assumptions |
| Stop (freeze) | Clock stops for one second at 23:59:59 | Maintains monotonicity | All timers and timeouts affected; scheduling chaos |
| Smear (Google/AWS) | Spread leap second over hours (12 or 24) | No discontinuity, gradual | Time technically 'wrong' during smear; incompatible across smear strategies |
| Ignore | Pretend it didn't happen | Simple | Clock drifts by 1 second; eventually NTP corrects with step |
On June 30, 2012, the leap second insertion crashed or degraded numerous systems worldwide. Reddit, Mozilla, Gawker, LinkedIn, and others experienced outages. The Linux kernel's leap second handling had a bug that caused high CPU usage. Java applications using Thread.sleep() were affected. After this incident, Google developed 'leap smear' and many organizations moved to smearing strategies.
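To make the smear concrete, here is a minimal sketch of a linear 24-hour smear; the window length, start time, and function name are illustrative assumptions, not any provider's actual implementation:

```python
# Sketch: linear leap-second smear. SMEAR_WINDOW_S and the start time are
# assumptions for illustration; real deployments differ in window and shape.
SMEAR_WINDOW_S = 86_400   # assumed: the extra second is spread over 24 hours
LEAP_OFFSET_S = 1.0       # one inserted leap second


def smeared_time(true_s: float, smear_start_s: float) -> float:
    """Return the smeared clock reading for a given unsmeared time (seconds)."""
    if true_s <= smear_start_s:
        return true_s                                # before the window: unchanged
    elapsed = min(true_s - smear_start_s, SMEAR_WINDOW_S)
    absorbed = LEAP_OFFSET_S * (elapsed / SMEAR_WINDOW_S)  # fraction absorbed so far
    return true_s - absorbed                         # smeared clock runs slightly slow


# Halfway through the window the smeared clock is 0.5 s behind an unsmeared
# clock; by the end it is exactly 1 s behind, matching UTC after the insertion.
print(smeared_time(43_200.0, 0.0))  # 43199.5
```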
Detecting and Handling Adjustments:
Robust distributed systems should detect and handle clock adjustments gracefully:
```python
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class ClockHealth:
    """Tracks the health and stability of the system clock."""
    last_wall_time: float
    last_mono_time: float
    detected_jumps: int = 0
    max_jump_detected: float = 0.0


def detect_clock_adjustment(previous: ClockHealth) -> tuple[ClockHealth, Optional[float]]:
    """
    Detect if a clock adjustment has occurred since last check.

    Returns:
        Tuple of (updated ClockHealth, adjustment amount if detected else None)
    """
    current_wall = time.time()
    current_mono = time.monotonic()

    # Calculate elapsed time according to each clock
    wall_elapsed = current_wall - previous.last_wall_time
    mono_elapsed = current_mono - previous.last_mono_time

    # If these differ significantly, wall-clock was adjusted
    # Small differences are normal due to timer resolution
    discrepancy = wall_elapsed - mono_elapsed

    jump_detected = None
    if abs(discrepancy) > 0.1:  # More than 100ms discrepancy
        jump_detected = discrepancy
        previous.detected_jumps += 1
        if abs(discrepancy) > abs(previous.max_jump_detected):
            previous.max_jump_detected = discrepancy

        if discrepancy > 0:
            print(f"⚠️ Clock jumped FORWARD by {discrepancy:.3f} seconds")
        else:
            print(f"🚨 Clock jumped BACKWARD by {-discrepancy:.3f} seconds")

    # Update tracking
    previous.last_wall_time = current_wall
    previous.last_mono_time = current_mono

    return previous, jump_detected


def monitor_clock(check_interval_seconds: float = 1.0):
    """
    Continuously monitor for clock adjustments.
    In production, this would emit metrics and potentially alerts.
    """
    health = ClockHealth(
        last_wall_time=time.time(),
        last_mono_time=time.monotonic()
    )

    print("Starting clock monitoring...")
    while True:
        time.sleep(check_interval_seconds)
        health, jump = detect_clock_adjustment(health)

        if jump is not None:
            # In production, emit metrics or alerts here
            # metrics.emit("clock.adjustment", jump)
            # if abs(jump) > THRESHOLD:
            #     alert_oncall("Large clock adjustment detected")
            pass


# Design pattern: Clock-safe timestamp generation
class SafeTimestamp:
    """
    Generates timestamps that are safe for ordering within a node.
    Handles clock jumps by using monotonic time to ensure monotonicity.
    """

    def __init__(self):
        self._last_timestamp = 0
        self._wall_offset = time.time() - time.monotonic()

    def now(self) -> int:
        """
        Returns monotonically increasing timestamp in microseconds.

        Uses wall-clock for approximate absolute time, but ensures
        strict monotonicity using monotonic clock as backstop.
        """
        # Use wall clock for absolute time
        wall_time_us = int(time.time() * 1_000_000)

        # Ensure monotonicity: never return a timestamp <= previous
        if wall_time_us <= self._last_timestamp:
            # Wall clock went backward or didn't advance
            # Use last timestamp + 1 (microsecond increment)
            self._last_timestamp += 1
        else:
            self._last_timestamp = wall_time_us

        return self._last_timestamp
```

In real-world data center and cloud environments, clock accuracy is heavily influenced by physical conditions. Engineers often underestimate these effects because lab conditions differ dramatically from production.
Temperature is the Dominant Factor:
Standard quartz crystals have a parabolic frequency-temperature relationship. At the 'turnover temperature' (typically 20-30°C, depending on the crystal cut), the crystal is at its nominal frequency. Moving away from this temperature causes the frequency to decrease, following an approximately parabolic curve.
For a typical 32.768 kHz 'tuning fork' crystal, the frequency-temperature curve falls off at roughly −0.03 to −0.04 ppm per °C² from the turnover point: operating 20°C away from turnover costs on the order of 12-16 ppm, and 30°C away costs roughly 27-36 ppm, in addition to the crystal's baseline tolerance.
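A small sketch of that parabolic model, assuming a turnover temperature of 25°C and a coefficient of −0.034 ppm/°C² (typical datasheet values, not measurements of any particular part):

```python
# Sketch of the parabolic frequency-temperature model for a 32.768 kHz
# tuning-fork crystal. TURNOVER_C and K_PPM_PER_C2 are assumed typical
# datasheet values, not measurements of a specific part.
TURNOVER_C = 25.0        # assumed turnover temperature (°C)
K_PPM_PER_C2 = -0.034    # assumed parabolic coefficient (ppm/°C²)


def temp_error_ppm(temp_c: float) -> float:
    """Frequency error in ppm at a given temperature."""
    return K_PPM_PER_C2 * (temp_c - TURNOVER_C) ** 2


def extra_drift_per_day_s(temp_c: float) -> float:
    """Additional drift per day (seconds) caused by temperature alone."""
    return abs(temp_error_ppm(temp_c)) / 1_000_000 * 86_400


for t in (25, 35, 45, 60):
    print(f"{t:>3} °C: {temp_error_ppm(t):7.1f} ppm, "
          f"{extra_drift_per_day_s(t):5.2f} s/day extra drift")
```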
| Factor | Typical Effect | Mitigation | Real-World Scenario |
|---|---|---|---|
| Temperature variation | 1-100+ ppm | Temperature-compensated (TCXO) or oven-controlled (OCXO) oscillators | Servers in regions with HVAC failures; outdoor edge nodes |
| CPU thermal cycling | 1-10 ppm | Isolate RTC from CPU heat; use dedicated clock chip | Burst compute loads cause temperature swings |
| Altitude/pressure | Negligible for quartz | N/A | Only matters for mechanical clocks or extreme altitudes |
| Humidity | <1 ppm (unless condensation) | Hermetically sealed crystals | Data centers with humidity control issues |
| Aging | 1-5 ppm/year | Periodic calibration or replacement | Long-lived servers with original RTCs |
| Shock/vibration | Temporary shift (g-sensitivity) | Mounting design, shock isolation | Mobile devices, vehicles, industrial settings |
Edge and IoT deployments face extreme clock challenges. Devices may operate in uncontrolled temperatures (-40°C to +85°C), experience intermittent network connectivity (preventing NTP sync), and use cheap oscillators to save cost. A 100 ppm clock that syncs only daily can drift by 8.6 seconds per day. Systems designed for these environments need fundamentally different time strategies than data center systems.
Case Study: Data Center Temperature Event
Consider a scenario where a data center cooling system fails: intake temperatures climb from a controlled ~22°C toward 40°C or more within an hour, and temperatures at the boards climb further. Crystals that were sitting near their turnover temperature are suddenly tens of degrees away from it, so drift rates that were a few ppm rise sharply. Clocks across the affected racks diverge from their NTP references faster than the polling interval was designed for, and unevenly, since hot spots drift more than well-ventilated racks, so inter-node skew grows as well.
For systems relying on tight clock bounds for correctness (e.g., Spanner's TrueTime), such temperature excursions must be detected and handled—typically by widening uncertainty bounds during the event.
Understanding the actual hardware used for timekeeping helps predict behavior and failure modes. Different systems use different approaches, with vastly different accuracy characteristics.
Common Hardware Clock Architectures:
- RTC (Real-Time Clock): a battery-backed 32.768 kHz clock chip that keeps approximate wall-clock time while the machine is powered off.
- TSC (Time Stamp Counter): a per-core cycle counter readable with a single instruction; very high resolution, and invariant across frequency scaling on modern CPUs.
- HPET (High Precision Event Timer): a platform timer (10+ MHz) that is slower to read than the TSC but independent of CPU state.
- Legacy timers (PIT, local APIC timer): primarily used for scheduling interrupts rather than timekeeping on modern systems.
How Operating Systems Use These Clocks:
| Phase | Clock Source Used | Purpose | Accuracy |
|---|---|---|---|
| Boot (early) | RTC | Initialize kernel time to approximate wall-clock | ±seconds (until NTP) |
| Boot (late) | TSC or HPET | High-resolution timekeeping begins | Depends on NTP sync |
| Runtime | TSC (preferred) | Monotonic and wall-clock time | ns-μs resolution; drifts between NTP corrections |
| NTP adjustment | External NTP servers | Correct wall-clock; discipline local oscillator | Typically 1-10ms to sources |
| Suspend/resume | RTC | Restore approximate time after power state | May have jumped; NTP re-syncs |
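On Linux you can inspect several of these clock sources directly through Python's time.clock_gettime; a short sketch (which clock IDs exist varies by platform):

```python
import time

# Sketch: reading the kernel's different clocks on Linux. CLOCK_REALTIME
# follows NTP adjustments; CLOCK_MONOTONIC never jumps but is rate-slewed;
# CLOCK_MONOTONIC_RAW ignores NTP entirely; CLOCK_BOOTTIME also counts
# time spent suspended. Which IDs exist depends on the platform.
for name in ("CLOCK_REALTIME", "CLOCK_MONOTONIC",
             "CLOCK_MONOTONIC_RAW", "CLOCK_BOOTTIME"):
    clock_id = getattr(time, name, None)
    if clock_id is None:
        continue  # not available on this OS/Python build
    value = time.clock_gettime(clock_id)
    resolution = time.clock_getres(clock_id)
    print(f"{name:20s} {value:18.6f}  resolution {resolution:.9f} s")
```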
Virtual machines add complexity. The hypervisor may expose a virtualized TSC or HPET that doesn't directly correspond to physical hardware. VM migration can cause time discontinuities. Cloud providers have varying quality of time synchronization—Amazon Time Sync Service, Google NTP, and Azure NTP provide tight synchronization within their networks, but cross-cloud time coordination is still challenging.
Understanding physical clock behavior directly informs distributed systems design. Here's how to translate clock characteristics into engineering decisions:
Specific Design Patterns:
Pattern 1: Bounded Clock Uncertainty
For systems that need to order events using physical time (like Spanner), explicitly track uncertainty bounds. Never assume clock is exactly correct; always operate with [earliest_possible, latest_possible] intervals.
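A minimal sketch of the interval idea, assuming you know the oscillator's ppm rating and can estimate the residual error at the last synchronization (the concrete numbers below are placeholders):

```python
import time
from dataclasses import dataclass


@dataclass
class TimeInterval:
    """A timestamp reported as the interval [earliest, latest] of possible true times (seconds)."""
    earliest: float
    latest: float

    def definitely_before(self, other: "TimeInterval") -> bool:
        """True only when the two intervals cannot overlap."""
        return self.latest < other.earliest


def now_interval(last_sync: float, sync_error_s: float, ppm: float) -> TimeInterval:
    """Current time as an uncertainty interval: residual sync error plus drift since the last sync."""
    wall = time.time()
    drift = (ppm / 1_000_000) * max(0.0, wall - last_sync)
    eps = sync_error_s + drift
    return TimeInterval(wall - eps, wall + eps)


# Placeholder values: synced 300 s ago with 5 ms residual error, 50 ppm crystal
interval = now_interval(time.time() - 300, 0.005, 50.0)
print(f"uncertainty: ±{(interval.latest - interval.earliest) / 2 * 1000:.1f} ms")  # ~±20 ms
```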
Pattern 2: Monotonic-Safe Timestamps
Generate timestamps that are guaranteed monotonic locally by using max(wall_clock, last_timestamp + 1). This prevents backward jumps from causing ordering inversions within a single node.
Pattern 3: Hybrid Logical Clocks
Combine physical timestamps with logical clock extensions. Use physical component for rough ordering and efficiency, but fall back to logical component when physical times are within uncertainty window.
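A compact sketch of the send and receive rules, following the standard HLC formulation; the millisecond granularity and class shape are assumptions for illustration:

```python
import time


class HybridLogicalClock:
    """Minimal HLC sketch: timestamps are (physical_ms, logical) pairs."""

    def __init__(self):
        self.physical = 0   # largest physical time seen, in milliseconds
        self.logical = 0    # tie-breaking counter

    def now(self) -> tuple[int, int]:
        """Timestamp for a local event or an outgoing message."""
        wall = int(time.time() * 1000)
        if wall > self.physical:
            self.physical, self.logical = wall, 0
        else:
            self.logical += 1   # wall clock hasn't advanced (or went backward)
        return (self.physical, self.logical)

    def update(self, remote: tuple[int, int]) -> tuple[int, int]:
        """Merge a timestamp received from another node."""
        wall = int(time.time() * 1000)
        r_phys, r_log = remote
        new_phys = max(wall, self.physical, r_phys)
        if new_phys == self.physical and new_phys == r_phys:
            new_log = max(self.logical, r_log) + 1
        elif new_phys == self.physical:
            new_log = self.logical + 1
        elif new_phys == r_phys:
            new_log = r_log + 1
        else:
            new_log = 0
        self.physical, self.logical = new_phys, new_log
        return (self.physical, self.logical)
```

Timestamps compare lexicographically as (physical, logical) pairs, so events remain ordered even when two nodes' wall clocks read the same millisecond.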
Pattern 4: Clock Health Monitoring
Continuously monitor clock health: detect jumps, track NTP sync status, measure drift. If clock health degrades, adjust behavior (widen uncertainty bounds, increase safety margins, or alert operators).
Pattern 5: Grace Periods
For time-based operations (TTL, lease expiry), add grace periods that exceed expected clock skew. If max skew is 10ms, a 100ms grace period provides margin for extreme cases.
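A tiny sketch of the lease case; the skew and grace-period constants are placeholders you would size from measured skew in your own environment:

```python
import time

# Assumed constants, sized from measured skew in your environment
MAX_EXPECTED_SKEW_S = 0.010                  # believed worst-case inter-node skew
GRACE_PERIOD_S = 10 * MAX_EXPECTED_SKEW_S    # 100 ms, an order of magnitude above expected skew


def lease_definitely_expired(lease_expiry_wall_s: float) -> bool:
    """
    Treat a lease as expired only once we are past expiry plus the grace
    period, so a node whose clock runs slightly ahead cannot revoke a lease
    that the holder (with a slower clock) still believes is valid.
    """
    return time.time() > lease_expiry_wall_s + GRACE_PERIOD_S
```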
When designing time-sensitive distributed systems, multiply your expected clock error by 3-5x for safety margin. If you think clocks are synchronized to 10ms, design assuming 50ms. This accounts for NTP delays, outliers, partial failures, and conditions you haven't anticipated. Systems that work precisely at expected bounds fail spectacularly at real-world bounds.
We've explored the physical foundations of timekeeping in computer systems. Let's consolidate the key insights:
- Every computer clock is a physical oscillator with finite accuracy, usually specified in ppm; a typical 50 ppm crystal can drift by about 4.3 seconds per day.
- Drift accumulates continuously between synchronizations, so the worst-case skew between two nodes grows with the sync interval.
- Wall-clock and monotonic time serve different purposes: wall-clock answers 'what time is it?', monotonic answers 'how much time has passed?'.
- Wall-clock time gets adjusted (slewed, stepped, and occasionally disturbed by leap seconds), so it can jump forward or backward.
- Temperature is the dominant environmental cause of drift; aging, vibration, and oscillator quality also matter.
- Robust designs treat timestamps as uncertain: track uncertainty bounds, enforce local monotonicity, monitor clock health, and add generous safety margins.
What's Next:
Now that we understand physical clock limitations, the next page explores NTP (Network Time Protocol) and time synchronization. We'll examine how NTP achieves the best practical synchronization, its limitations, and advanced alternatives like PTP and Google's TrueTime. Understanding synchronization protocols is essential before we can reason about clock uncertainty bounds in distributed systems.
You now have a deep understanding of physical clocks: how they work, why they drift, and what causes time discontinuities. This knowledge is foundational for all time-related distributed systems design. Remember: every timestamp you see is the output of imperfect hardware through layers of software—treat it with appropriate skepticism.