You've just discovered that your financial application is losing money. Not much—a fraction of a cent per transaction—but across millions of transactions, it adds up to thousands of dollars. Or perhaps your physics simulation is producing impossible results: objects drifting through walls, energy appearing from nowhere, simulations becoming unstable after running for a few hours.
The culprit? Floating-point precision errors.
These errors aren't bugs in the traditional sense—they're fundamental consequences of representing infinite mathematical precision in finite bits. Every programmer who works with numerical computing eventually encounters them, often in the form of mysterious bugs that seem to violate basic mathematics.
This page demystifies floating-point precision errors. You'll learn why they occur, how they manifest, when they matter, and most importantly—how to manage them like a professional engineer.
By the end of this page, you will understand the fundamental sources of precision error (representation, operation, and accumulation), recognize the most dangerous operations that magnify error, learn professional techniques for managing precision in critical applications, and develop the 'precision awareness' that characterizes experienced numerical programmers.
Precision errors in floating-point arithmetic arise from three distinct sources, each with different characteristics and mitigation strategies.
Type 1: Representation Error
This is the most fundamental source of error: many decimal numbers simply cannot be represented exactly in binary floating-point.
Consider 0.1 in decimal. In binary, 0.1 requires an infinite repeating fraction:
0.1 (decimal) = 0.0001100110011001100110011... (binary, repeating forever)
Since we can only store a finite number of bits, we must truncate this infinite sequence, introducing an immediate error before any computation even happens.
This isn't a flaw in IEEE 754—it's an inherent limitation of binary representation. Just as 1/3 cannot be written exactly in decimal (0.333...), 1/10 cannot be written exactly in binary.
The representation error for 0.1 in double precision is approximately:
True value: 0.1
Stored value: 0.1000000000000000055511151231257827021181583404541015625
Absolute error: ~5.55 × 10⁻¹⁸ (relative error: ~5.55 × 10⁻¹⁷)
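You can inspect the stored value yourself; here is a minimal Python sketch using the standard decimal module:

```python
from decimal import Decimal

# Converting the float 0.1 to Decimal exposes the exact binary value that was stored
print(Decimal(0.1))
# 0.1000000000000000055511151231257827021181583404541015625

# The representation error is the gap between the stored value and the true 1/10
print(Decimal(0.1) - Decimal("0.1"))   # ≈ 5.55E-18
```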
Type 2: Operation Error (Rounding)
Even when operands are exactly representable, the result often isn't. Consider multiplying two 53-bit significands—the exact product is 106 bits, but we can only store 53. The extra bits must be rounded away.
IEEE 754's "correctly rounded" guarantee means the returned result is the closest representable value to the true result. But "closest" still means imprecise.
For basic operations (+, −, ×, ÷, √), the error is at most half an "ulp" (unit in the last place). This is extremely precise—the relative error is roughly 10⁻¹⁶ for double precision. But over many operations, these tiny errors add up.
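To get a feel for these magnitudes, here is a small sketch using math.ulp (available in Python 3.9+):

```python
import math

# Spacing between 1.0 and the next representable double: one "ulp" at 1.0
print(math.ulp(1.0))                        # 2.220446049250313e-16

# Half an ulp gets rounded away; a full ulp reaches the next representable float
print(1.0 + 0.5 * math.ulp(1.0) == 1.0)     # True
print(1.0 + math.ulp(1.0) == 1.0)           # False
```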
Type 3: Accumulation Error
The most dangerous source of error: small rounding errors from individual operations compound through computation chains.
Consider summing 10 million values of 0.1. Each addition introduces a small rounding error. After millions of operations, the accumulated error can become significant—potentially affecting the first few significant digits of the result.
| Type | When It Occurs | Magnitude | Example |
|---|---|---|---|
| Representation | Storing a value | ~10⁻¹⁶ relative | float x = 0.1; |
| Operation (Rounding) | Each arithmetic op | ~10⁻¹⁶ relative | z = x * y; |
| Accumulation | Chains of operations | Can grow arbitrarily | sum += values[i]; (in loop) |
A single operation's error of 10⁻¹⁶ seems negligible. But 10 million operations can produce 10⁻¹⁶ × 10⁷ = 10⁻⁹ cumulative relative error—and that is just from errors adding up. Certain operation patterns (like catastrophic cancellation) can amplify errors by factors of billions.
Catastrophic cancellation is the most dangerous source of precision loss in numerical computing. It occurs when subtracting two nearly equal numbers, causing dramatic loss of significant digits.
Understanding the Mechanism
Consider subtracting 1.23456788 from 1.23456789 with only 5 significant digits of precision:
1.2346 (rounded from 1.23456789)
- 1.2346 (rounded from 1.23456788)
--------
0.0000
The true answer is 0.00000001, but our 5-digit arithmetic gives 0!
This is catastrophic cancellation: when the leading significant digits cancel out, the remaining digits are dominated by rounding errors introduced earlier. The relative error explodes from ~10⁻⁵ (normal) to ~1.0 (total loss).
A Realistic Example
Consider computing the roots of the quadratic equation x² − 10000.001x + 1 = 0 using the standard formula:
x = (−b ± √(b² − 4ac)) / (2a)
Here a = 1, b = −10000.001, and c = 1, so −b = 10000.001 and √(b² − 4ac) = √(100,000,016.000001) ≈ 10000.0008.

Now we need:

x₁ = (10000.001 + 10000.0008) / 2 ≈ 10000.0009 (terms reinforce; accurate)
x₂ = (10000.001 − 10000.0008) / 2 ≈ 0.0001 (nearly equal terms cancel)

The second root loses almost all significant digits because 10000.001 and 10000.0008 are nearly equal: the few digits that survive the subtraction are dominated by the rounding error already present in √(b² − 4ac).
Mitigating Catastrophic Cancellation
Skilled numerical programmers use algebraic reformulations to avoid dangerous subtractions:
Quadratic formula fix: Instead of computing both roots directly, compute the accurate root first, then use the relationship x₁ × x₂ = c/a:
if b > 0:
x1 = (−b − √(b² − 4ac)) / (2a) # Accurate, both terms same sign
x2 = c / (a × x1) # Computed from x1, avoiding cancellation
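Here is a runnable sketch of that reformulation (the function name is mine; it assumes a ≠ 0 and real roots, i.e. b² ≥ 4ac):

```python
import math

def stable_quadratic_roots(a, b, c):
    """Roots of a*x**2 + b*x + c = 0, assuming a != 0 and b*b >= 4*a*c."""
    sqrt_disc = math.sqrt(b * b - 4 * a * c)
    # Use the root formula whose two terms share the same sign (no cancellation)
    if b >= 0:
        x1 = (-b - sqrt_disc) / (2 * a)
    else:
        x1 = (-b + sqrt_disc) / (2 * a)
    x2 = c / (a * x1)          # Second root from the identity x1 * x2 = c / a
    return x1, x2

# The quadratic from the text: x**2 - 10000.001*x + 1 = 0
print(stable_quadratic_roots(1.0, -10000.001, 1.0))
```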
Numerical derivative fix: Use central differences instead of forward differences:
# Forward difference (catastrophic for small h):
derivative ≈ (f(x+h) − f(x)) / h
# Central difference (much more stable):
derivative ≈ (f(x+h) − f(x−h)) / (2h)
Although the central difference still subtracts two nearby function values, its truncation error is of order h² rather than h, so a much larger step h can be used; with a larger h, f(x+h) and f(x−h) differ by more, and far fewer significant digits are lost to cancellation.
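A quick sketch comparing the two schemes on a function with a known derivative (the function, point, and step size are arbitrary choices for illustration):

```python
import math

def forward_diff(f, x, h):
    return (f(x + h) - f(x)) / h

def central_diff(f, x, h):
    return (f(x + h) - f(x - h)) / (2 * h)

# d/dx sin(x) at x = 1.0 is cos(1.0); compare the error of each scheme
x, h = 1.0, 1e-5
exact = math.cos(x)
print(abs(forward_diff(math.sin, x, h) - exact))  # error on the order of h
print(abs(central_diff(math.sin, x, h) - exact))  # far smaller, roughly order h**2
```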
When you see a subtraction of similar-magnitude values in a numerical formula, immediately ask: 'Can this be reformulated?' Often, algebraically equivalent expressions have dramatically different numerical stability. This reformulation skill is what separates casual programmers from numerical experts.
Absorption (also called swamping) occurs when adding a very small number to a very large one. The small number simply disappears due to limited precision.
Understanding Absorption
In double precision, we have about 15-16 significant decimal digits. Consider adding:

10,000,000,000,000,000 + 1 = ?

True answer: 10,000,000,000,000,001

But 10,000,000,000,000,000 (10¹⁶) lies above 2⁵³ ≈ 9.0 × 10¹⁵, where adjacent doubles are spaced 2 apart, and the true answer would need 17 significant digits. The "1" we're adding falls below the representable precision—it gets absorbed (rounded away).

Result: 10,000,000,000,000,000 + 1 = 10,000,000,000,000,000 in floating-point!
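You can confirm this directly in Python:

```python
big = 1e16                    # above 2**53, where adjacent doubles are 2 apart

print(big + 1 == big)         # True: the 1 is absorbed entirely
print(big + 2 == big)         # False: 2 matches the spacing, so it survives
```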
When Absorption Becomes Dangerous
Absorption is especially problematic in accumulation:
total = 1e16 // Large initial value
for i = 1 to 1_000_000:
total = total + 0.001 // Each addition absorbed!
After 1 million iterations, total should be 1e16 + 1000 = 10,000,000,000,001,000. Instead, every single addition is absorbed, and total remains exactly 1e16.
The Rearrangement Solution
The fix is simple: accumulate small values first, large values later:
# BAD: Large + small + small + small + ...
sum = 1e16
for v in small_values:
    sum += v          # Absorbed!

# GOOD: small + small + ... + large
sum = 0
for v in small_values:
    sum += v          # Adds up correctly
sum += 1e16           # Add large value at the end
By adding similar-magnitude values together, we build up the small sum before combining with the large value.
Floating-point addition is not associative! (a + b) + c may differ from a + (b + c). This violates our mathematical intuition but is an unavoidable consequence of finite precision. Skilled numerical programmers exploit this by choosing summation orders that minimize error.
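A three-line demonstration, reusing the absorption example:

```python
a, b, c = 1e16, 1.0, 1.0

print((a + b) + c == a)              # True: each 1.0 is absorbed in turn
print(a + (b + c) == a)              # False: 1.0 + 1.0 = 2.0 survives the addition
print((a + b) + c == a + (b + c))    # False: addition is not associative here
```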
Understanding how errors accumulate through computation is essential for predicting and controlling precision loss in real applications.
Error Growth Patterns
Additive errors (best case): When errors are random and independent, they tend to cancel somewhat. Summing n values with random ±ε errors each produces accumulated error ≈ √n × ε (square root growth).
Linear error growth: In some algorithms, errors accumulate linearly. Each iteration adds its error to the total: n operations produce ~n × ε accumulated error.
Exponential error growth (worst case): In certain computations (like unstable differential equations or iterative maps), errors multiply rather than add. Each iteration amplifies the previous error. After n iterations: ~(1+ε)ⁿ × ε₀ ≈ exp(nε) × ε₀.
A Practical Example: Summing 10 Million Values
Suppose we sum 10 million values of 0.1 using naive summation:
sum = 0.0
for _ in range(10_000_000):
    sum += 0.1
Expected result: 1,000,000.0
Actual result: 999,999.9998389754 (or similar)
The absolute error of ~1.6 × 10⁻⁴ (a relative error of ~1.6 × 10⁻¹⁰) arises from 10 million rounding errors, each around 10⁻¹⁶ in relative size. Because the same value is added every time, the individual errors are biased rather than random, so they accumulate closer to linear growth than to the ideal √n behavior.
| Pattern | Error After n Ops | Example Scenario | Severity |
|---|---|---|---|
| Random cancellation | ~√n × ε | Summing independent values | Mild |
| Linear accumulation | ~n × ε | Biased rounding, sorted sums | Moderate |
| Exponential growth | ~exp(nε) × ε₀ | Unstable iterations, chaos | Severe |
Condition Number: Measuring Algorithm Sensitivity
Numerical analysts use condition number to quantify how sensitive a problem is to input perturbations.
A problem with condition number κ amplifies relative input errors by a factor of κ in the output: relative output error ≈ κ × relative input error.

For example, subtracting nearly equal numbers a and b has a condition number of roughly |a| / |a − b|. If a and b agree to 10 digits, the condition number is ~10¹⁰, and we expect to lose 10 digits of precision—leaving only 5-6 usable digits in double precision.
Key Insight: Sometimes precision loss is inherent to the problem itself, not the algorithm. No algorithm can solve an ill-conditioned problem more accurately than its condition number allows (without restructuring the problem).
Throwing more precision at an ill-conditioned problem only buys headroom, it does not remove the sensitivity: a condition number of 10¹⁰ still costs about 10 digits, wiping out single precision entirely and leaving only ~6 usable digits in double. The lasting solution is reformulating the problem to reduce its condition number.
When accumulation errors threaten precision, compensated summation (also known as Kahan summation) provides an elegant solution.
The Core Idea
Kahan's algorithm tracks the running sum plus a "compensation" term that captures the accumulated rounding error. Each addition computes not just the new sum, but also the error introduced, which is then corrected in the next iteration.
The Algorithm
sum = 0.0
compensation = 0.0 // Running error compensation
for each value in input:
    y = value - compensation       // Compensated value to add
    t = sum + y                    // Tentative new sum
    compensation = (t - sum) - y   // Recover lost low-order bits
    sum = t                        // Update sum

// Final result is in 'sum'
How It Works
The magic is in the line compensation = (t - sum) - y:

- (t - sum) gives us what was actually added to sum (might differ from y due to rounding)
- Subtracting y then isolates the rounding error: the low-order bits of y that were lost in this addition
- On the next iteration, y = value - compensation feeds those lost bits back into the sum

Dramatic Improvement
Kahan summation reduces accumulated error from O(n×ε) or O(√n×ε) to O(ε)—constant regardless of how many values are summed! The error is essentially that of a single floating-point operation.
Using our 10-million 0.1 example:
| Method | Result | Error |
|---|---|---|
| Naive sum | 999,999.9998389754 | ~1.6 × 10⁻⁴ |
| Kahan sum | 1,000,000.0 | ~0 (within ε) |
Kahan summation recovers essentially perfect precision at the cost of just 4 operations per addition instead of 1.
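For reference, here is a runnable Python version of the algorithm above, compared against naive summation and math.fsum on the 10-million-value example (exact digits may vary slightly by platform):

```python
import math

def kahan_sum(values):
    total = 0.0
    compensation = 0.0                    # running estimate of the lost low-order bits
    for value in values:
        y = value - compensation
        t = total + y
        compensation = (t - total) - y    # rounding error introduced by this addition
        total = t
    return total

data = [0.1] * 10_000_000
print(sum(data))          # naive sum: slightly below 1,000,000
print(kahan_sum(data))    # compensated sum: 1000000.0
print(math.fsum(data))    # correctly rounded sum: 1000000.0
```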
When to Use Kahan Summation

Reach for compensated summation when you are adding many values, when the summands span widely different magnitudes, or when the total feeds further precision-sensitive computation.
Many numerical libraries use Kahan summation or variants internally. Python's math.fsum() function, for example, uses an even more sophisticated algorithm that provides correctly-rounded sums.
Kahan summation has close relatives such as 'pairwise summation' (used by NumPy's sum) and 'cascaded accumulators.' Python's math.fsum() uses Shewchuk's algorithm, which tracks multiple partial sums and produces correctly-rounded results regardless of input. When precision matters, use your language's built-in precise summation function.
One of the most common floating-point bugs is using exact equality comparison (==) with computed values. Due to accumulated rounding errors, two computations that should mathematically produce identical results often differ by tiny amounts.
The Classic Gotcha
x = 0.1 + 0.1 + 0.1
y = 0.3
print(x == y) # False!
print(x) # 0.30000000000000004
print(y) # 0.3
Both x and y "should" be 0.3, but x accumulated rounding errors through addition while y was assigned directly. They differ in their least significant bits.
The Epsilon Comparison Pattern
The standard solution is to check if two values are "close enough":
def approximately_equal(a, b, epsilon=1e-9):
    return abs(a - b) < epsilon
But what should epsilon be? This seemingly simple question has a surprisingly complex answer.
Absolute vs. Relative Epsilon
Absolute tolerance (|a − b| < ε) works poorly for large values: near 10²⁰, adjacent doubles are already thousands apart, so even values that are as close as floating-point allows fail a test with ε = 10⁻⁹.

Relative tolerance (|a − b| < ε × max(|a|, |b|)) works poorly near zero: when comparing a tiny computed residue against 0, the allowed tolerance shrinks toward nothing and the test almost always fails.

Combined tolerance handles both cases:
def approximately_equal(a, b, rel_tol=1e-9, abs_tol=1e-12):
    return abs(a - b) <= max(rel_tol * max(abs(a), abs(b)), abs_tol)
Python's math.isclose() implements exactly this combined approach.
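For example, math.isclose handles both the classic gotcha and the near-zero case (note its default abs_tol is 0.0, so supply one when comparing against zero):

```python
import math

x = 0.1 + 0.1 + 0.1
y = 0.3

print(x == y)                                    # False: exact comparison fails
print(math.isclose(x, y))                        # True: default rel_tol=1e-09
print(math.isclose(1e-12, 0.0))                  # False: relative tolerance is useless at zero
print(math.isclose(1e-12, 0.0, abs_tol=1e-9))    # True: an absolute floor handles it
```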
If you also need the difference itself, compute it once (diff = x - expected; if abs(diff) < tol: ...) rather than subtracting twice.

Checking if a float equals zero is a special case. 'x == 0.0' is actually safe when x was never computed (e.g., explicitly assigned). But 'computed_value == 0.0' is dangerous. Use 'abs(computed_value) < small_epsilon' instead.
Beyond understanding the sources of error, professional numerical programmers employ a suite of techniques to manage precision in critical applications.
1. Use Higher Precision for Intermediate Calculations
Even if your final result needs only single precision, computing in double precision and rounding at the end often improves accuracy dramatically:
float result = (float)((double)a * (double)b + (double)c);
The intermediate computation in double precision accumulates less error, producing a more accurate final single-precision result.
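Here is a sketch of the same idea in Python using NumPy (assumed available), with single-precision inputs accumulated either in float32 or in float64:

```python
import numpy as np

values = np.full(1_000_000, 0.1, dtype=np.float32)   # single-precision inputs

# Accumulating in a float32 variable rounds every partial sum to single precision
acc32 = np.float32(0.0)
for v in values:
    acc32 = acc32 + v

# Accumulating in a double, then rounding to float32 once at the very end
acc64 = 0.0
for v in values:
    acc64 += float(v)
final = np.float32(acc64)

print(acc32, final)   # acc32 drifts visibly from 100000; final is 100000.0
```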
2. Algebraic Reformulation
Mathematically equivalent expressions can have vastly different numerical properties:
| Original | Reformulated | Why Better |
|---|---|---|
| x² − y² | (x+y)(x−y) | Avoids subtracting similar squares |
| (1 − cos(x))/x² | 2sin²(x/2)/x² | Avoids cancellation near x=0 |
| √(x²+1) − 1 | x²/(√(x²+1) + 1) | Avoids cancellation for small x |
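As a concrete illustration of the last row, for a small x the naive form cancels to zero while the reformulation keeps full precision:

```python
import math

x = 1e-8

# Naive: x*x + 1 rounds to exactly 1.0, so the subtraction cancels to zero
naive = math.sqrt(x * x + 1) - 1

# Reformulated: algebraically identical, but no subtraction of nearly equal values
reformulated = (x * x) / (math.sqrt(x * x + 1) + 1)

print(naive)          # 0.0
print(reformulated)   # 5e-17, matching the true value of about x**2 / 2
```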
3. Use Specialized Library Functions
Numerical libraries provide functions specifically designed to minimize precision loss:
- log1p(x) computes log(1+x) accurately for small x (where log(1+x) ≈ x)
- expm1(x) computes exp(x) − 1 accurately for small x
- hypot(x, y) computes √(x²+y²) avoiding overflow and maximizing precision
- fma(a, b, c) computes a×b+c in a single fused operation with no intermediate rounding

4. Scale to Similar Magnitudes
When possible, scale your data so operands have similar magnitudes:
# Instead of computing with tiny and huge values mixed:
scale = 1e10
scaled_result = compute_with_scaled_values(values * scale)
result = scaled_result / scale
5. Order Operations to Minimize Error

When the computation allows it, add values of similar magnitude together first (smallest to largest), or use compensated summation, so small contributions are not swallowed by a large running total; the sketch below reuses the absorption example.
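A minimal sketch of this ordering effect:

```python
import math

# One huge value plus a million tiny ones: the order of addition decides the result
values = [1e16] + [0.001] * 1_000_000

print(sum(values))                      # large-first: every 0.001 is absorbed
print(sum(sorted(values, key=abs)))     # small-first: the ~1000.0 contribution survives
print(math.fsum(values))                # correctly rounded regardless of order
```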
Most everyday code doesn't need careful precision management. Worry when: (1) comparing computed values for equality, (2) long computation chains, (3) values of vastly different magnitudes, (4) subtracting similar values, or (5) the application is financial, safety-critical, or scientific. For casual arithmetic, just use double precision and move on.
We've explored the sources, manifestations, and management of floating-point precision errors—knowledge that transforms floating-point from a mysterious bug source into a predictable, manageable tool.
What's Next
We've built a solid foundation of floating-point knowledge. The final page in this module tackles the most famous floating-point quirk of all: why 0.1 + 0.2 ≠ 0.3. This seemingly simple bug has confused millions of programmers and spawned countless Stack Overflow questions.
With everything you've learned, you'll finally be able to explain—and resolve—this notorious issue with confidence.
You now have professional-grade understanding of floating-point precision errors. You can identify dangerous patterns, apply mitigation techniques, and reason about accumulated error. This knowledge puts you ahead of most programmers who treat floating-point as a mysterious black box. Next, we'll put this knowledge to work explaining the most famous floating-point puzzle of all.