Of all the failure modes in distributed systems, clock-related failures are perhaps the most insidious. Unlike network failures (which cause visible errors) or resource exhaustion (which causes observable slowdowns), clock skew can cause silent correctness violations—your system continues running, returns no errors, yet produces wrong results.
Distributed systems make countless implicit assumptions about time:

- that a token issued now expires exactly when its TTL says
- that the record with the highest timestamp is the newest
- that a lock lease measured on one node means the same duration on another
- that a scheduled job fires once, at the scheduled moment
Every one of these assumptions can be violated when clocks disagree. And in distributed systems, clocks always disagree. The question is only by how much, and whether your system handles that disagreement correctly.
By the end of this page, you will understand how to inject and analyze clock-related failures: clock skew between nodes, sudden time jumps, NTP synchronization failures, and timezone anomalies. You'll learn to identify code that makes unsafe time assumptions, test distributed coordination mechanisms, and implement clock-resilient patterns.
Before injecting clock failures, we must understand how computers track time and why perfect synchronization is impossible.
Types of Clocks:
| Clock Type | What It Measures | Monotonicity | Use Case |
|---|---|---|---|
| Wall clock (time-of-day) | Current date/time | Can jump backward | Timestamps, scheduling |
| Monotonic clock | Duration since arbitrary point | Always increases | Measuring elapsed time |
| Logical clock (Lamport) | Causal ordering | Always increases | Distributed ordering |
| Vector clock | Per-node logical time | Always increases | Causality tracking |
| Hybrid logical clock (HLC) | Wall time + logical | Always increases | Spanner-style ordering |
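The logical-clock rows above can be made concrete with a minimal Lamport clock sketch in Python (the class and method names here are illustrative, not from any particular library):

```python
class LamportClock:
    """Minimal Lamport logical clock: orders events causally, uses no wall time."""

    def __init__(self) -> None:
        self.counter = 0

    def tick(self) -> int:
        """Local event: advance the clock."""
        self.counter += 1
        return self.counter

    def send(self) -> int:
        """Timestamp attached to an outgoing message."""
        return self.tick()

    def receive(self, message_timestamp: int) -> int:
        """Merge on receive: jump past the sender's timestamp."""
        self.counter = max(self.counter, message_timestamp) + 1
        return self.counter


# Two nodes whose wall clocks could disagree arbitrarily:
a, b = LamportClock(), LamportClock()
a.tick()           # a = 1
ts = a.send()      # a = 2, message carries timestamp 2
b.receive(ts)      # b = max(0, 2) + 1 = 3
print(b.counter)   # → 3: b's clock now reflects the causal dependency
```

Because the counter only ever increases, Lamport timestamps give a consistent causal order no matter how badly the nodes' wall clocks disagree.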
Why Perfect Synchronization Is Impossible:
Every machine keeps time with its own crystal oscillator, which drifts with temperature and age, and synchronization protocols can only correct that drift over a network whose latency is variable and asymmetric. A node can therefore know its clock is close to its peers'—never exactly equal.
Typical Clock Accuracy:
| Environment | Expected Accuracy |
|---|---|
| NTP over internet | 10-100 ms |
| NTP in datacenter | 1-10 ms |
| PTP (Precision Time Protocol) | < 1 μs |
| GPS synchronization | < 100 ns |
| Processes on one machine | Share a single clock (no skew) |
| No synchronization | Drifts roughly 1-10 s per day (10-100 ppm) |
For most distributed systems using NTP within a datacenter, you can expect clocks to be within a few milliseconds of each other—but this bound is not guaranteed, and during NTP failures or network issues, skew can grow significantly.
Many developers assume clocks are 'close enough' and write code that breaks with even millisecond skew. Operations like 'find the newest record' or 'check if the token expired' can produce wrong results when clocks disagree. Test what actually happens with realistic skew levels.
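The 'find the newest record' failure can be simulated without touching any system clock. In this sketch the skew is injected as an offset on the timestamps one node produces (the values and record shapes are illustrative):

```python
# A last-write-wins merge that trusts wall-clock timestamps, fed by two
# simulated nodes whose clocks disagree by 30 seconds.

def lww_merge(a: dict, b: dict) -> dict:
    """Return the record with the larger timestamp."""
    return a if a["ts"] > b["ts"] else b

BASE = 1_700_000_000.0   # some wall-clock instant, in epoch seconds

# Node A's clock runs 30s fast; node B's clock is accurate.
skew_a = 30.0

# B writes *after* A in real time, but A's skewed clock stamps higher:
write_a = {"value": "old", "ts": BASE + 10 + skew_a}   # real time: +10s
write_b = {"value": "new", "ts": BASE + 20}            # real time: +20s

winner = lww_merge(write_a, write_b)
print(winner["value"])   # → old: the genuinely newer write is silently lost
```

No error is raised anywhere—the system simply keeps the wrong value, which is exactly the silent-correctness-violation pattern this page is about.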
Clock skew injection makes different nodes in your distributed system have different notions of the current time. This tests whether your system relies on clock agreement for correctness.
Skew Scenarios to Test:
| Scenario | How To Inject | What It Tests |
|---|---|---|
| Small constant skew | One node +/- 100ms | Subtle ordering issues |
| Large constant skew | One node +/- 10s | Timeout and expiration logic |
| Clock ahead | One node +5 minutes | Token expiration, cache TTL |
| Clock behind | One node -5 minutes | Timestamp ordering, 'newest' logic |
| Variable skew | Skew changes over time | Sensitivity to drift rate |
| Asymmetric communication | A→B normal, B→A skewed | Request/response timing |
What Skew Breaks:

- 'Newest record wins' logic picks a stale value stamped by a fast clock
- Tokens and cache entries expire early on fast nodes and late on slow ones
- Lock leases look expired to one node while the holder still believes they are valid
- Cross-node latency measurements (receive time minus send time) inflate or go negative
```bash
# WARNING: These commands modify system time
# Only run in isolated test environments, never in production

# Check current time and NTP status
timedatectl status
ntpq -p   # or: chronyc sources

# Set time manually (requires root, disables NTP)
sudo timedatectl set-ntp false
sudo date -s "$(date -d '+5 minutes')"   # Set clock 5 minutes ahead
sudo date -s "$(date -d '-5 minutes')"   # Set clock 5 minutes behind

# Temporary time shift (returns to normal after test)
# Uses libfaketime to intercept time calls
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/faketime/libfaketime.so.1 \
  FAKETIME="+5m" \
  ./my-service

# Container-level time manipulation using Chaos Mesh
cat <<EOF | kubectl apply -f -
apiVersion: chaos-mesh.org/v1alpha1
kind: TimeChaos
metadata:
  name: clock-skew-ahead
spec:
  mode: one
  selector:
    labelSelectors:
      app: service-a
  timeOffset: "5m"   # 5 minutes ahead
  duration: "10m"
EOF

# Skew clock backward
cat <<EOF | kubectl apply -f -
apiVersion: chaos-mesh.org/v1alpha1
kind: TimeChaos
metadata:
  name: clock-skew-behind
spec:
  mode: one
  selector:
    labelSelectors:
      app: service-b
  timeOffset: "-5m"  # 5 minutes behind
  duration: "10m"
EOF

# Using Docker with faketime (libfaketime must be installed in the image)
docker run --rm \
  -e LD_PRELOAD=/usr/lib/x86_64-linux-gnu/faketime/libfaketime.so.1 \
  -e FAKETIME="+1h" \
  my-service:latest
```

Clock ahead and clock behind cause different failures. A clock running ahead causes premature expiration; a clock behind causes stale data to appear valid. Test both directions for each component.
Time jumps are sudden, large changes in the system clock—as opposed to gradual drift. They occur when:

- NTP steps (rather than slews) the clock after detecting a large offset
- a virtual machine resumes from suspend or is live-migrated
- an operator sets the time manually
- a host or container restores from a snapshot carrying a stale clock
Why Time Jumps Are Dangerous:
Unlike gradual skew (which might add or remove a few milliseconds per second), time jumps change the clock by seconds, minutes, or even hours instantaneously. Code that assumes time moves forward smoothly can:

- compute negative or wildly inflated durations
- skip scheduled work, or run it twice
- expire every token, lease, and cache entry at once
- emit logs and metrics with non-monotonic timestamps
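The 'negative duration' failure is easy to reproduce without root access by injecting a fake wall clock that steps backward partway through an operation (the `FakeWallClock` test double here is illustrative):

```python
# Simulating a backward time jump with a fake wall clock — no system
# time changes needed. FakeWallClock returns scripted readings.
import itertools

class FakeWallClock:
    """Wall clock that steps backward 30s after the second reading."""

    def __init__(self) -> None:
        self._readings = itertools.chain(
            [100.0, 105.0, 75.0], itertools.repeat(75.0)
        )

    def time(self) -> float:
        return next(self._readings)

clock = FakeWallClock()
start = clock.time()   # 100.0
_ = clock.time()       # 105.0 — normal progress
end = clock.time()     # 75.0  — the clock stepped back
elapsed = end - start
print(elapsed)         # → -25.0: a 'duration' that never happened
```

Any timeout, retry backoff, or metrics code fed this value misbehaves; code reading a monotonic clock in the same scenario would have reported a small positive duration.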
```bash
# Forward time jump
sudo date -s "$(date -d '+1 hour')"

# Backward time jump (DANGEROUS - can cause issues)
sudo date -s "$(date -d '-30 seconds')"

# Kubernetes TimeChaos for jumps
cat <<EOF | kubectl apply -f -
apiVersion: chaos-mesh.org/v1alpha1
kind: TimeChaos
metadata:
  name: time-jump-forward
spec:
  mode: all
  selector:
    labelSelectors:
      app: scheduler-service
  timeOffset: "1h"        # Jump 1 hour ahead
  clockIds:
    - CLOCK_REALTIME      # Affects wall clock, not monotonic
  duration: "30m"
EOF

# Test time jump during operation (script example)
#!/bin/bash
echo "Starting long operation at $(date)"
./start-long-operation &
OP_PID=$!
sleep 5
echo "Jumping time forward 10 minutes"
sudo date -s "$(date -d '+10 minutes')"
wait $OP_PID
echo "Operation completed at $(date)"

# Using faketime for time jumps (libfaketime must be preloaded)
# Start at a fixed time
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/faketime/libfaketime.so.1 \
  FAKETIME="@2024-01-01 00:00:00" ./my-service &
sleep 10
# Kill and restart with jumped time
kill $!
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/faketime/libfaketime.so.1 \
  FAKETIME="@2024-01-01 01:00:00" ./my-service &
```

Observations During Time Jump Testing:
| Component | Forward Jump Effect | Backward Jump Effect |
|---|---|---|
| Scheduled jobs | Jobs may be skipped | Jobs may run twice |
| Cache entries | Mass expiration | Entries appear fresh |
| Session tokens | Mass invalidation | Expired tokens appear valid |
| Rate limiters | Quotas reset | Negative available tokens |
| Distributed locks | Leases expire | Holder thinks lease is valid |
| Metrics | Gaps in time series | Duplicate data points |
| Logs | Timestamps jump | Non-monotonic timestamps |
Many systems don't handle backward time jumps at all. NTP typically 'slews' the clock (adjusts rate) rather than stepping backward, but steps can occur after VM resume or manual correction. If your logs show 'negative elapsed time' errors, your code isn't using monotonic clocks where it should.
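The rate-limiter row in the table above is a good illustration of the fix. This sketch (class name and parameters are illustrative) refills a token bucket from `time.monotonic()`, so neither a forward nor a backward wall-clock step can reset quotas or drive the token count negative:

```python
# Token-bucket rate limiter driven by the monotonic clock: immune to
# date(1) changes, NTP steps, and VM-resume jumps.
import time

class TokenBucket:
    def __init__(self, rate_per_s: float, capacity: float) -> None:
        self.rate = rate_per_s
        self.capacity = capacity
        self.tokens = capacity
        self.last_refill = time.monotonic()   # never jumps backward

    def _refill(self) -> None:
        now = time.monotonic()
        elapsed = now - self.last_refill      # always >= 0
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last_refill = now

    def allow(self) -> bool:
        self._refill()
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

bucket = TokenBucket(rate_per_s=10.0, capacity=5.0)
print([bucket.allow() for _ in range(6)])   # → five True, then False
```

Had `last_refill` come from `time.time()`, a backward jump would make `elapsed` negative and silently drain the bucket; a forward jump would instantly refill it to capacity.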
NTP (Network Time Protocol) keeps distributed system clocks synchronized. When NTP fails or becomes unavailable, clocks drift apart at the rate of their hardware clock inaccuracy—typically 10-100 parts per million (ppm). This means:

- at 10 ppm, a clock gains or loses about 0.9 seconds per day
- at 100 ppm, about 8.6 seconds per day
- two clocks drifting in opposite directions diverge at up to twice those rates
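The ppm figure translates into concrete skew with simple arithmetic, which is worth sketching because it tells you how long an NTP outage your system can tolerate:

```python
# Back-of-envelope: how much one-sided skew accumulates at a given
# hardware drift rate (parts per million) over a given interval.

def drift_seconds(ppm: float, elapsed_seconds: float) -> float:
    """Worst-case drift of one unsynchronized clock over elapsed_seconds."""
    return ppm * 1e-6 * elapsed_seconds

HOUR, DAY = 3600, 86400
print(drift_seconds(10, DAY))    # ≈ 0.864 s/day at 10 ppm
print(drift_seconds(100, DAY))   # ≈ 8.64 s/day at 100 ppm
print(drift_seconds(100, HOUR))  # ≈ 0.36 s in one hour
```

So if, say, your lock leases assume clocks agree to within one second, a 100 ppm node violates that assumption under three hours after losing NTP—and two nodes drifting in opposite directions violate it in half that time.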
NTP Failure Modes:
| Failure Mode | Effect | Time to Significant Skew |
|---|---|---|
| NTP servers unreachable | Clocks drift at hardware rate | Hours to days |
| NTP returning wrong time | Clocks steered toward wrong value | Minutes |
| NTP latency variations | Reduced accuracy | Gradual |
| NTP server compromise | Malicious time values | Immediate |
| Leap second insertion | 61-second minute | Once per insertion |
| Clock stepping vs slewing | Sudden vs gradual correction | Immediate |
```bash
# Check current NTP status
timedatectl status
systemctl status chronyd   # or: systemctl status ntpd

# Check NTP server connectivity
chronyc sources   # for chronyd
ntpq -p           # for ntpd

# Block NTP traffic (UDP port 123)
sudo iptables -A OUTPUT -p udp --dport 123 -j DROP
sudo iptables -A INPUT -p udp --sport 123 -j DROP

# Monitor clock drift during NTP outage (coarse: compares the local
# clock to the Date header of an HTTPS response, second resolution)
while true; do
  REMOTE=$(curl -sI https://www.google.com | tr -d '\r' | awk -F': ' 'tolower($1)=="date" {print $2}')
  echo "local=$(date -u +%s) remote=$(date -u -d "$REMOTE" +%s)"
  sleep 60
done

# Add latency to NTP packets
sudo tc qdisc add dev eth0 root handle 1: prio
sudo tc qdisc add dev eth0 parent 1:3 handle 30: netem delay 500ms
sudo tc filter add dev eth0 protocol ip parent 1:0 prio 3 u32 \
  match ip dport 123 0xffff flowid 1:3

# Disable NTP synchronization temporarily
sudo timedatectl set-ntp false
# ... run experiment ...
sudo timedatectl set-ntp true

# Force NTP resync after experiment
sudo chronyc makestep          # for chronyd
sudo ntpdate -u pool.ntp.org   # for ntpd

# Kubernetes: Block NTP at pod level
cat <<EOF | kubectl apply -f -
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: block-ntp
spec:
  podSelector:
    matchLabels:
      app: test-service
  policyTypes:
    - Egress
  egress:
    - ports:
        - protocol: UDP
          port: 53         # Allow DNS
    - to:
        - podSelector: {}  # Allow pod-to-pod
  # NTP (UDP 123) is not listed, so it's blocked
EOF
```

The rate of clock drift without NTP depends on hardware quality and temperature. Virtual machines often have higher drift rates than physical hardware. Measure your actual drift rate during NTP outage testing—it affects how quickly skew becomes problematic.
Leap seconds are occasional one-second adjustments to UTC to account for irregularities in Earth's rotation. While seemingly minor, leap seconds have caused numerous production incidents because they create situations that 'normal' time handling doesn't anticipate.
Famous Leap Second Incidents:
| Year | Affected System | What Happened |
|---|---|---|
| 2012 | Linux kernel, Java | High CPU from spinning processes, crashes |
| 2012 | Airline reservation systems | Multi-hour outage |
| 2012 | Mozilla | Bug in code handling 61-second minute |
| 2015 | Multiple | Similar issues, better preparation |
| 2017 | Cloudflare | Negative duration calculation, DNS issues (end-of-2016 leap second) |
How Operating Systems Handle Leap Seconds:
Approaches differ: some systems insert a literal 23:59:60 second, some step the clock back one second (effectively repeating 23:59:59), and many deployments now 'smear' the extra second across several hours so that no single minute ever has 61 seconds.
```python
# Testing leap second handling in code
import datetime
import time

from dateutil import parser
import pytz

# Test parsing a leap second timestamp
try:
    # This is technically a valid UTC timestamp during a leap second
    leap_second = "2016-12-31T23:59:60Z"
    parsed = parser.parse(leap_second)
    print(f"Parsed leap second: {parsed}")
except Exception as e:
    print(f"Failed to parse leap second: {e}")

# Test duration calculation across the leap second
t1 = datetime.datetime(2016, 12, 31, 23, 59, 59)
t2 = datetime.datetime(2017, 1, 1, 0, 0, 0)
duration = (t2 - t1).total_seconds()
print(f"Duration: {duration}s (naive math says 1s; the real gap was 2s)")

# Test for negative duration after a time jump
def measure_operation():
    start = datetime.datetime.now()
    # ... operation ...
    end = datetime.datetime.now()
    duration = (end - start).total_seconds()
    if duration < 0:
        raise RuntimeError(f"Negative duration: {duration}")
    return duration

# Better: use the monotonic clock for durations
def measure_operation_safe():
    start = time.monotonic()
    # ... operation ...
    end = time.monotonic()
    return end - start  # Always non-negative

# Test DST transitions (is_dst=None makes pytz raise instead of guessing)
tz = pytz.timezone('US/Eastern')

# This time happens twice during fall-back
ambiguous = datetime.datetime(2024, 11, 3, 1, 30, 0)
try:
    localized = tz.localize(ambiguous, is_dst=None)
except pytz.exceptions.AmbiguousTimeError:
    print("Ambiguous time during DST transition")

# This time doesn't exist during spring-forward
missing = datetime.datetime(2024, 3, 10, 2, 30, 0)
try:
    localized = tz.localize(missing, is_dst=None)
except pytz.exceptions.NonExistentTimeError:
    print("Non-existent time during DST transition")
```

If your cloud provider offers leap second smearing (Google, AWS), use it. Smearing makes the leap second invisible to applications by gradually adjusting clocks, eliminating the need for applications to handle 61-second minutes. Verify your provider's approach and configure accordingly.
Certain distributed system patterns are inherently sensitive to clock accuracy. Understanding which patterns are at risk helps focus testing efforts.
| Pattern | Clock Dependency | Risk of Failure | Mitigation |
|---|---|---|---|
| Distributed locks with TTL | Lease expiration timing | High | Use fencing tokens, conservative TTLs |
| Token-based authentication | Token expiry validation | High | Include server time in response, use generous buffers |
| Last-write-wins conflict resolution | Timestamp comparison | High | Use vector clocks or HLC instead |
| Event sourcing timestamps | Event ordering | Medium | Use logical clocks for ordering |
| Time-windowed rate limiting | Window boundary detection | Medium | Use sliding windows, be conservative |
| Cache TTL | Expiration timing | Medium | Use monotonic time for durations |
| Scheduled jobs | Trigger timing | Medium | Use dedicated scheduler, handle missed runs |
| Audit logging | Timestamp accuracy | Low | Include source and sync status |
For measuring durations, timeouts, and lease lifetimes, use the monotonic clock: clock_gettime(CLOCK_MONOTONIC) or its language equivalents (performance.now() in JavaScript, time.monotonic() in Python, System.nanoTime() in Java).
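The hybrid logical clock (HLC) row in the pattern table above can be sketched in a few lines. This is a simplified model with an injectable wall-clock source so it is testable; production implementations track additional bounds, but the core merge rule is this:

```python
# Simplified hybrid logical clock: timestamps are (wall, logical) pairs
# that strictly increase even if the wall clock stalls or steps back.
# physical_clock is injected (a callable returning wall time) so the
# sketch can be exercised deterministically.

class HLC:
    def __init__(self, physical_clock) -> None:
        self.now = physical_clock
        self.wall = 0.0
        self.logical = 0

    def send(self) -> tuple:
        """Timestamp for a local event or outgoing message."""
        pt = self.now()
        if pt > self.wall:
            self.wall, self.logical = pt, 0
        else:
            self.logical += 1   # wall clock didn't advance: count logically
        return (self.wall, self.logical)

    def receive(self, remote: tuple) -> tuple:
        """Merge a remote timestamp, preserving causality."""
        pt = self.now()
        m = max(self.wall, remote[0], pt)
        if m == self.wall == remote[0]:
            self.logical = max(self.logical, remote[1]) + 1
        elif m == self.wall:
            self.logical += 1
        elif m == remote[0]:
            self.logical = remote[1] + 1
        else:
            self.logical = 0
        self.wall = m
        return (self.wall, self.logical)

# A frozen wall clock still yields strictly increasing timestamps:
clock = HLC(lambda: 100.0)
print(clock.send())   # → (100.0, 0)
print(clock.send())   # → (100.0, 1)
```

The pair compares lexicographically, so HLC timestamps stay close to wall time (useful for humans and TTLs) while the logical component guarantees the always-increasing, causality-respecting order that pure wall clocks cannot.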
```typescript
// Examples of clock-resilient coding patterns
// (assumes doOperation, redis, nodeId, and database exist in scope)

// BAD: Using wall clock for elapsed time
function measureOperationBad(): number {
  const start = Date.now();
  doOperation();
  const end = Date.now();
  return end - start; // Can be negative if the clock jumps backward!
}

// GOOD: Using monotonic time (performance.now() in JS)
function measureOperationGood(): number {
  const start = performance.now();
  doOperation();
  const end = performance.now();
  return end - start; // Always non-negative
}

// BAD: Last-write-wins using wall clock
interface Record {
  value: string;
  timestamp: number;
}

function mergeRecordsBad(a: Record, b: Record): Record {
  return a.timestamp > b.timestamp ? a : b; // Sensitive to clock skew!
}

// GOOD: Using vector clocks for conflict detection
interface VectorClock {
  [nodeId: string]: number;
}

interface RecordWithVector {
  value: string;
  vectorClock: VectorClock;
}

// a happens-before b iff every component of a <= b and at least one is strictly less
function happensBefore(a: VectorClock, b: VectorClock): boolean {
  const nodes = new Set([...Object.keys(a), ...Object.keys(b)]);
  let strictlyLess = false;
  for (const n of nodes) {
    const av = a[n] ?? 0;
    const bv = b[n] ?? 0;
    if (av > bv) return false;
    if (av < bv) strictlyLess = true;
  }
  return strictlyLess;
}

function mergeRecordsGood(a: RecordWithVector, b: RecordWithVector): RecordWithVector[] {
  if (happensBefore(a.vectorClock, b.vectorClock)) return [b];
  if (happensBefore(b.vectorClock, a.vectorClock)) return [a];
  return [a, b]; // Concurrent - return both for conflict resolution
}

// BAD: Distributed lock with only TTL
async function acquireLockBad(key: string, ttlMs: number): Promise<boolean> {
  const acquired = await redis.set(key, 'locked', 'NX', 'PX', ttlMs);
  return acquired === 'OK';
}

// GOOD: Distributed lock with fencing token
async function acquireLockGood(key: string, ttlMs: number): Promise<number | null> {
  const token = await redis.incr('lock:token:' + key);
  const acquired = await redis.set(
    key,
    JSON.stringify({ token, holder: nodeId }),
    'NX', 'PX', ttlMs
  );
  return acquired === 'OK' ? token : null;
}

// When using the lock, pass the fencing token downstream:
async function writeWithFencingToken(value: string, fencingToken: number) {
  // The storage system rejects writes carrying a lower token
  await database.write(value, { fencingToken });
}
```

Search your codebase for Date.now(), System.currentTimeMillis(), time.time(), and similar wall-clock calls. Any call that feeds a duration calculation or an ordering decision is a potential clock skew vulnerability and should use monotonic time or a logical clock instead; wall-clock reads used purely for display timestamps are fine.
Clock-related failures are notoriously difficult to observe because they often don't cause visible errors—just incorrect behavior. Special instrumentation is needed to detect and diagnose these issues.
```bash
# Export NTP metrics to Prometheus

# For chronyd, use chrony_exporter
# https://github.com/superq/chrony_exporter

# Key metrics to monitor:
# - chrony_tracking_stratum: NTP stratum level
# - chrony_tracking_root_delay_seconds: delay to root time source
# - chrony_tracking_root_dispersion_seconds: dispersion from root
# - chrony_tracking_system_time: current system time adjustment

# Alert on clock skew (quoted heredoc so the shell leaves {{ $labels }} alone)
cat <<'EOF' | kubectl apply -f -
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: clock-skew-alerts
spec:
  groups:
    - name: clock
      rules:
        - alert: ClockSkewHigh
          expr: abs(chrony_tracking_system_time) > 0.1
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Clock skew > 100ms on {{ $labels.instance }}"
        - alert: NTPNotSynced
          expr: chrony_tracking_stratum == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "NTP not synced on {{ $labels.instance }}"
EOF

# Log cluster clock comparison (run on each node)
#!/bin/bash
while true; do
  TIMESTAMP=$(date +%s.%N)
  NTP_OFFSET=$(chronyc tracking | grep "System time" | awk '{print $4}')
  echo "{\"node\": \"$(hostname)\", \"time\": $TIMESTAMP, \"offset\": $NTP_OFFSET}"
  sleep 10
done
```

| Symptom | Possible Clock Issue | Diagnostic Steps |
|---|---|---|
| Tokens expire immediately | Local clock is ahead | Compare local time to token issue time |
| Expired tokens accepted | Local clock is behind | Check NTP sync status |
| Cache entries evict too soon | Clock skew in cache cluster | Compare time across cache nodes |
| Events appear out of order in logs | Clock skew between producers | Add source timestamp to log entries |
| Last-write-wins gives wrong result | Writer clock is skewed | Use vector clocks instead |
| Distributed lock contention | Lease timing incorrect | Add fencing tokens to locks |
| Scheduled jobs run early/late | Scheduler clock drifted | Check NTP offset on scheduler nodes |
Clock skew testing reveals the hidden time dependencies in distributed systems. Unlike other failure modes that cause visible errors, clock issues often cause silent correctness violations—making them particularly dangerous and worthy of proactive testing.
Module Complete: Failure Injection Mastery
You have now completed the comprehensive study of failure injection in chaos engineering. You understand how to inject and analyze network failures, service-level disruptions, resource exhaustion, and clock and timing failures.
Together, these failure categories cover the full taxonomy of what can go wrong in distributed systems. Armed with this knowledge, you can systematically probe your systems for weaknesses and build confidence in their resilience.
Congratulations! You now understand the complete landscape of failure injection for chaos engineering. From network failures through service disruptions through resource exhaustion to clock skew, you have the knowledge to systematically test your distributed systems' resilience. The next module will cover GameDays—organized exercises that put these failure injection techniques into practice.