You've just discovered that your financial application is losing money. Not much—a fraction of a cent per transaction—but across millions of transactions, it adds up to thousands of dollars. Or perhaps your physics simulation is producing impossible results: objects drifting through walls, energy appearing from nowhere, simulations becoming unstable after running for a few hours.
The culprit? Floating-point precision errors.
These errors aren't bugs in the traditional sense—they're fundamental consequences of representing infinite mathematical precision in finite bits. Every programmer who works with numerical computing eventually encounters them, often in the form of mysterious bugs that seem to violate basic mathematics.
This page demystifies floating-point precision errors. You'll learn why they occur, how they manifest, when they matter, and most importantly—how to manage them like a professional engineer.
By the end of this page, you will understand the fundamental sources of precision error (representation, operation, and accumulation), recognize the most dangerous operations that magnify error, learn professional techniques for managing precision in critical applications, and develop the 'precision awareness' that characterizes experienced numerical programmers.
Precision errors in floating-point arithmetic arise from three distinct sources, each with different characteristics and mitigation strategies.
Type 1: Representation Error
This is the most fundamental source of error: many decimal numbers simply cannot be represented exactly in binary floating-point.
Consider 0.1 in decimal. In binary, 0.1 requires an infinite repeating fraction:
0.1 (decimal) = 0.0001100110011001100110011... (binary, repeating forever)
Since we can only store a finite number of bits, we must truncate this infinite sequence, introducing an immediate error before any computation even happens.
This isn't a flaw in IEEE 754—it's an inherent limitation of binary representation. Just as 1/3 cannot be written exactly in decimal (0.333...), 1/10 cannot be written exactly in binary.
The representation error for 0.1 in double precision is approximately:
True value: 0.1
Stored value: 0.1000000000000000055511151231257827021181583404541015625
Absolute error: ~5.55 × 10⁻¹⁸ (relative error: ~5.55 × 10⁻¹⁷)
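You can inspect the stored value yourself; here is a minimal Python sketch using the standard decimal module:

```python
from decimal import Decimal

# Converting the float 0.1 to Decimal exposes the exact binary value that was stored
print(Decimal(0.1))
# 0.1000000000000000055511151231257827021181583404541015625

# The representation error is the gap between the stored value and the true 1/10
print(Decimal(0.1) - Decimal("0.1"))   # ≈ 5.55E-18
```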
Type 2: Operation Error (Rounding)
Even when operands are exactly representable, the result often isn't. Consider multiplying two 53-bit significands—the exact product is 106 bits, but we can only store 53. The extra bits must be rounded away.
IEEE 754's "correctly rounded" guarantee means the returned result is the closest representable value to the true result. But "closest" still means imprecise.
For basic operations (+, −, ×, ÷, √), the error is at most half an "ulp" (unit in the last place). This is extremely precise—the relative error is roughly 10⁻¹⁶ for double precision. But over many operations, these tiny errors add up.
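To get a feel for these magnitudes, here is a small sketch using math.ulp (available in Python 3.9+):

```python
import math

# Spacing between 1.0 and the next representable double: one "ulp" at 1.0
print(math.ulp(1.0))                        # 2.220446049250313e-16

# Half an ulp gets rounded away; a full ulp reaches the next representable float
print(1.0 + 0.5 * math.ulp(1.0) == 1.0)     # True
print(1.0 + math.ulp(1.0) == 1.0)           # False
```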
Type 3: Accumulation Error
The most dangerous source of error: small rounding errors from individual operations compound through computation chains.
Consider summing 10 million values of 0.1. Each addition introduces a small rounding error. After millions of operations, the accumulated error can become significant—potentially affecting the first few significant digits of the result.
| Type | When It Occurs | Magnitude | Example |
|---|---|---|---|
| Representation | Storing a value | ~10⁻¹⁶ relative | float x = 0.1; |
| Operation (Rounding) | Each arithmetic op | ~10⁻¹⁶ relative | z = x * y; |
| Accumulation | Chains of operations | Can grow arbitrarily | sum += values[i]; (in loop) |
A single operation's error of 10⁻¹⁶ seems negligible. But 10 million operations can produce 10⁻¹⁶ × 10⁷ = 10⁻⁹ cumulative relative error—and that is just from errors adding up. Certain operation patterns (like catastrophic cancellation) can amplify errors by factors of billions.
Catastrophic cancellation is the most dangerous source of precision loss in numerical computing. It occurs when subtracting two nearly equal numbers, causing dramatic loss of significant digits.
Understanding the Mechanism
Consider subtracting 1.23456788 from 1.23456789 with only 5 significant digits of precision:
1.2346 (rounded from 1.23456789)
- 1.2346 (rounded from 1.23456788)
--------
0.0000
The true answer is 0.00000001, but our 5-digit arithmetic gives 0!
This is catastrophic cancellation: when the leading significant digits cancel out, the remaining digits are dominated by rounding errors introduced earlier. The relative error explodes from ~10⁻⁵ (normal) to ~1.0 (total loss).
A Realistic Example
Consider computing the roots of the quadratic equation x² − 10000.001x + 1 = 0 using the standard formula:
x = (−b ± √(b² − 4ac)) / (2a)
Here a = 1, b = −10000.001, and c = 1, so −b = 10000.001 and √(b² − 4ac) = √(100,000,016.000001) ≈ 10000.0008.

Now we need:

x₁ = (10000.001 + 10000.0008) / 2 ≈ 10000.0009 (terms reinforce; accurate)
x₂ = (10000.001 − 10000.0008) / 2 ≈ 0.0001 (nearly equal terms cancel)

The second root loses almost all significant digits because 10000.001 and 10000.0008 are nearly equal: the few digits that survive the subtraction are dominated by the rounding error already present in √(b² − 4ac).
Mitigating Catastrophic Cancellation
Skilled numerical programmers use algebraic reformulations to avoid dangerous subtractions:
Quadratic formula fix: Instead of computing both roots directly, compute the accurate root first, then use the relationship x₁ × x₂ = c/a:
if b > 0:
x1 = (−b − √(b² − 4ac)) / (2a) # Accurate, both terms same sign
x2 = c / (a × x1) # Computed from x1, avoiding cancellation
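Here is a runnable sketch of that reformulation (the function name is mine; it assumes a ≠ 0 and real roots, i.e. b² ≥ 4ac):

```python
import math

def stable_quadratic_roots(a, b, c):
    """Roots of a*x**2 + b*x + c = 0, assuming a != 0 and b*b >= 4*a*c."""
    sqrt_disc = math.sqrt(b * b - 4 * a * c)
    # Use the root formula whose two terms share the same sign (no cancellation)
    if b >= 0:
        x1 = (-b - sqrt_disc) / (2 * a)
    else:
        x1 = (-b + sqrt_disc) / (2 * a)
    x2 = c / (a * x1)          # Second root from the identity x1 * x2 = c / a
    return x1, x2

# The quadratic from the text: x**2 - 10000.001*x + 1 = 0
print(stable_quadratic_roots(1.0, -10000.001, 1.0))
```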
Numerical derivative fix: Use central differences instead of forward differences:
# Forward difference (catastrophic for small h):
derivative ≈ (f(x+h) − f(x)) / h
# Central difference (much more stable):
derivative ≈ (f(x+h) − f(x−h)) / (2h)
Although the central difference still subtracts two nearby function values, its truncation error is of order h² rather than h, so a much larger step h can be used; with a larger h, f(x+h) and f(x−h) differ by more, and far fewer significant digits are lost to cancellation.
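A quick sketch comparing the two schemes on a function with a known derivative (the function, point, and step size are arbitrary choices for illustration):

```python
import math

def forward_diff(f, x, h):
    return (f(x + h) - f(x)) / h

def central_diff(f, x, h):
    return (f(x + h) - f(x - h)) / (2 * h)

# d/dx sin(x) at x = 1.0 is cos(1.0); compare the error of each scheme
x, h = 1.0, 1e-5
exact = math.cos(x)
print(abs(forward_diff(math.sin, x, h) - exact))  # error on the order of h
print(abs(central_diff(math.sin, x, h) - exact))  # far smaller, roughly order h**2
```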
When you see a subtraction of similar-magnitude values in a numerical formula, immediately ask: 'Can this be reformulated?' Often, algebraically equivalent expressions have dramatically different numerical stability. This reformulation skill is what separates casual programmers from numerical experts.
Absorption (also called swamping) occurs when adding a very small number to a very large one. The small number simply disappears due to limited precision.
Understanding Absorption
In double precision, we have about 15-16 significant decimal digits. Consider adding:

10,000,000,000,000,000 + 1 = ?

True answer: 10,000,000,000,000,001

But 10,000,000,000,000,000 (10¹⁶) lies above 2⁵³ ≈ 9.0 × 10¹⁵, where adjacent doubles are spaced 2 apart, and the true answer would need 17 significant digits. The "1" we're adding falls below the representable precision—it gets absorbed (rounded away).

Result: 10,000,000,000,000,000 + 1 = 10,000,000,000,000,000 in floating-point!
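You can confirm this directly in Python:

```python
big = 1e16                    # above 2**53, where adjacent doubles are 2 apart

print(big + 1 == big)         # True: the 1 is absorbed entirely
print(big + 2 == big)         # False: 2 matches the spacing, so it survives
```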
When Absorption Becomes Dangerous
Absorption is especially problematic in accumulation:
total = 1e16 // Large initial value
for i = 1 to 1_000_000:
total = total + 0.001 // Each addition absorbed!
After 1 million iterations, total should be 1e16 + 1000 = 10,000,000,000,001,000. Instead, every single addition is absorbed, and total remains exactly 1e16.
The Rearrangement Solution
The fix is simple: accumulate small values first, large values later:
# BAD: Large + small + small + small + ...
sum = 1e16
for v in small_values:
    sum += v          # Absorbed!

# GOOD: small + small + ... + large
sum = 0
for v in small_values:
    sum += v          # Adds up correctly
sum += 1e16           # Add large value at the end
By adding similar-magnitude values together, we build up the small sum before combining with the large value.
Floating-point addition is not associative! (a + b) + c may differ from a + (b + c). This violates our mathematical intuition but is an unavoidable consequence of finite precision. Skilled numerical programmers exploit this by choosing summation orders that minimize error.
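A three-line demonstration, reusing the absorption example:

```python
a, b, c = 1e16, 1.0, 1.0

print((a + b) + c == a)              # True: each 1.0 is absorbed in turn
print(a + (b + c) == a)              # False: 1.0 + 1.0 = 2.0 survives the addition
print((a + b) + c == a + (b + c))    # False: addition is not associative here
```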
Understanding how errors accumulate through computation is essential for predicting and controlling precision loss in real applications.
Error Growth Patterns
Additive errors (best case): When errors are random and independent, they tend to cancel somewhat. Summing n values with random ±ε errors each produces accumulated error ≈ √n × ε (square root growth).
Linear error growth: In some algorithms, errors accumulate linearly. Each iteration adds its error to the total: n operations produce ~n × ε accumulated error.
Exponential error growth (worst case): In certain computations (like unstable differential equations or iterative maps), errors multiply rather than add. Each iteration amplifies the previous error. After n iterations: ~(1+ε)ⁿ × ε₀ ≈ exp(nε) × ε₀.
A Practical Example: Summing 10 Million Values
Suppose we sum 10 million values of 0.1 using naive summation:
sum = 0.0
for _ in range(10_000_000):
    sum += 0.1
Expected result: 1,000,000.0
Actual result: 999,999.9998389754 (or similar)
The absolute error of ~1.6 × 10⁻⁴ (a relative error of ~1.6 × 10⁻¹⁰) arises from 10 million rounding errors, each around 10⁻¹⁶ in relative size. Because the same value is added every time, the individual errors are biased rather than random, so they accumulate closer to linear growth than to the ideal √n behavior.
| Pattern | Error After n Ops | Example Scenario | Severity |
|---|---|---|---|
| Random cancellation | ~√n × ε | Summing independent values | Mild |
| Linear accumulation | ~n × ε | Biased rounding, sorted sums | Moderate |
| Exponential growth | ~exp(nε) × ε₀ | Unstable iterations, chaos | Severe |
Condition Number: Measuring Algorithm Sensitivity
Numerical analysts use condition number to quantify how sensitive a problem is to input perturbations.
A problem with condition number κ amplifies relative input errors by a factor of κ in the output: relative output error ≈ κ × relative input error.

For example, subtracting nearly equal numbers a and b has a condition number of roughly |a| / |a − b|. If a and b agree to 10 digits, the condition number is ~10¹⁰, and we expect to lose 10 digits of precision—leaving only 5-6 usable digits in double precision.
Key Insight: Sometimes precision loss is inherent to the problem itself, not the algorithm. No algorithm can solve an ill-conditioned problem more accurately than its condition number allows (without restructuring the problem).
Throwing more precision at an ill-conditioned problem only buys headroom, it does not remove the sensitivity: a condition number of 10¹⁰ still costs about 10 digits, wiping out single precision entirely and leaving only ~6 usable digits in double. The lasting solution is reformulating the problem to reduce its condition number.
When accumulation errors threaten precision, compensated summation (also known as Kahan summation) provides an elegant solution.
The Core Idea
Kahan's algorithm tracks the running sum plus a "compensation" term that captures the accumulated rounding error. Each addition computes not just the new sum, but also the error introduced, which is then corrected in the next iteration.
The Algorithm
sum = 0.0
compensation = 0.0 // Running error compensation
for each value in input:
    y = value - compensation       // Compensated value to add
    t = sum + y                    // Tentative new sum
    compensation = (t - sum) - y   // Recover lost low-order bits
    sum = t                        // Update sum

// Final result is in 'sum'
How It Works
The magic is in the line compensation = (t - sum) - y:

- (t - sum) gives us what was actually added to sum (might differ from y due to rounding)
- Subtracting y then isolates the rounding error: the low-order bits of y that were lost in this addition
- On the next iteration, y = value - compensation feeds those lost bits back into the sum

Dramatic Improvement
Kahan summation reduces accumulated error from O(n×ε) or O(√n×ε) to O(ε)—constant regardless of how many values are summed! The error is essentially that of a single floating-point operation.
Using our 10-million 0.1 example:
| Method | Result | Error |
|---|---|---|
| Naive sum | 999,999.9998389754 | ~1.6 × 10⁻⁴ |
| Kahan sum | 1,000,000.0 | ~0 (within ε) |
Kahan summation recovers essentially perfect precision at the cost of just 4 operations per addition instead of 1.
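For reference, here is a runnable Python version of the algorithm above, compared against naive summation and math.fsum on the 10-million-value example (exact digits may vary slightly by platform):

```python
import math

def kahan_sum(values):
    total = 0.0
    compensation = 0.0                    # running estimate of the lost low-order bits
    for value in values:
        y = value - compensation
        t = total + y
        compensation = (t - total) - y    # rounding error introduced by this addition
        total = t
    return total

data = [0.1] * 10_000_000
print(sum(data))          # naive sum: slightly below 1,000,000
print(kahan_sum(data))    # compensated sum: 1000000.0
print(math.fsum(data))    # correctly rounded sum: 1000000.0
```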
When to Use Kahan Summation

Reach for compensated summation when you are adding many values, when the summands span widely different magnitudes, or when the total feeds further precision-sensitive computation.
Many numerical libraries use Kahan summation or variants internally. Python's math.fsum() function, for example, uses an even more sophisticated algorithm that provides correctly-rounded sums.
Kahan summation has close relatives such as 'pairwise summation' (used by NumPy's sum) and 'cascaded accumulators.' Python's math.fsum() uses Shewchuk's algorithm, which tracks multiple partial sums and produces correctly-rounded results regardless of input. When precision matters, use your language's built-in precise summation function.
One of the most common floating-point bugs is using exact equality comparison (==) with computed values. Due to accumulated rounding errors, two computations that should mathematically produce identical results often differ by tiny amounts.
The Classic Gotcha
x = 0.1 + 0.1 + 0.1
y = 0.3
print(x == y) # False!
print(x) # 0.30000000000000004
print(y) # 0.3
Both x and y "should" be 0.3, but x accumulated rounding errors through addition while y was assigned directly. They differ in their least significant bits.
The Epsilon Comparison Pattern
The standard solution is to check if two values are "close enough":
def approximately_equal(a, b, epsilon=1e-9):
    return abs(a - b) < epsilon
But what should epsilon be? This seemingly simple question has a surprisingly complex answer.
Absolute vs. Relative Epsilon
Absolute tolerance (|a − b| < ε) works poorly for large values: near 10²⁰, adjacent doubles are already thousands apart, so even values that are as close as floating-point allows fail a test with ε = 10⁻⁹.

Relative tolerance (|a − b| < ε × max(|a|, |b|)) works poorly near zero: when comparing a tiny computed residue against 0, the allowed tolerance shrinks toward nothing and the test almost always fails.

Combined tolerance handles both cases:
def approximately_equal(a, b, rel_tol=1e-9, abs_tol=1e-12):
    return abs(a - b) <= max(rel_tol * max(abs(a), abs(b)), abs_tol)
Python's math.isclose() implements exactly this combined approach.
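For example, math.isclose handles both the classic gotcha and the near-zero case (note its default abs_tol is 0.0, so supply one when comparing against zero):

```python
import math

x = 0.1 + 0.1 + 0.1
y = 0.3

print(x == y)                                    # False: exact comparison fails
print(math.isclose(x, y))                        # True: default rel_tol=1e-09
print(math.isclose(1e-12, 0.0))                  # False: relative tolerance is useless at zero
print(math.isclose(1e-12, 0.0, abs_tol=1e-9))    # True: an absolute floor handles it
```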
If you also need the difference itself, compute it once (diff = x - expected; if abs(diff) < tol: ...) rather than subtracting twice.

Checking if a float equals zero is a special case. 'x == 0.0' is actually safe when x was never computed (e.g., explicitly assigned). But 'computed_value == 0.0' is dangerous. Use 'abs(computed_value) < small_epsilon' instead.
Beyond understanding the sources of error, professional numerical programmers employ a suite of techniques to manage precision in critical applications.
1. Use Higher Precision for Intermediate Calculations
Even if your final result needs only single precision, computing in double precision and rounding at the end often improves accuracy dramatically:
float result = (float)((double)a * (double)b + (double)c);
The intermediate computation in double precision accumulates less error, producing a more accurate final single-precision result.
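Here is a sketch of the same idea in Python using NumPy (assumed available), with single-precision inputs accumulated either in float32 or in float64:

```python
import numpy as np

values = np.full(1_000_000, 0.1, dtype=np.float32)   # single-precision inputs

# Accumulating in a float32 variable rounds every partial sum to single precision
acc32 = np.float32(0.0)
for v in values:
    acc32 = acc32 + v

# Accumulating in a double, then rounding to float32 once at the very end
acc64 = 0.0
for v in values:
    acc64 += float(v)
final = np.float32(acc64)

print(acc32, final)   # acc32 drifts visibly from 100000; final is 100000.0
```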
2. Algebraic Reformulation
Mathematically equivalent expressions can have vastly different numerical properties:
| Original | Reformulated | Why Better |
|---|---|---|
| x² − y² | (x+y)(x−y) | Avoids subtracting similar squares |
| (1 − cos(x))/x² | 2sin²(x/2)/x² | Avoids cancellation near x=0 |
| √(x²+1) − 1 | x²/(√(x²+1) + 1) | Avoids cancellation for small x |
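As a concrete illustration of the last row, for a small x the naive form cancels to zero while the reformulation keeps full precision:

```python
import math

x = 1e-8

# Naive: x*x + 1 rounds to exactly 1.0, so the subtraction cancels to zero
naive = math.sqrt(x * x + 1) - 1

# Reformulated: algebraically identical, but no subtraction of nearly equal values
reformulated = (x * x) / (math.sqrt(x * x + 1) + 1)

print(naive)          # 0.0
print(reformulated)   # 5e-17, matching the true value of about x**2 / 2
```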
3. Use Specialized Library Functions
Numerical libraries provide functions specifically designed to minimize precision loss:
- log1p(x) computes log(1+x) accurately for small x (where log(1+x) ≈ x)
- expm1(x) computes exp(x) − 1 accurately for small x
- hypot(x, y) computes √(x²+y²) avoiding overflow and maximizing precision
- fma(a, b, c) computes a×b+c in a single fused operation with no intermediate rounding

4. Scale to Similar Magnitudes
When possible, scale your data so operands have similar magnitudes:
# Instead of computing with tiny and huge values mixed:
scale = 1e10
scaled_result = compute_with_scaled_values(values * scale)
result = scaled_result / scale
5. Order Operations to Minimize Error

When the computation allows it, add values of similar magnitude together first (smallest to largest), or use compensated summation, so small contributions are not swallowed by a large running total; the sketch below reuses the absorption example.
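A minimal sketch of this ordering effect:

```python
import math

# One huge value plus a million tiny ones: the order of addition decides the result
values = [1e16] + [0.001] * 1_000_000

print(sum(values))                      # large-first: every 0.001 is absorbed
print(sum(sorted(values, key=abs)))     # small-first: the ~1000.0 contribution survives
print(math.fsum(values))                # correctly rounded regardless of order
```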
Most everyday code doesn't need careful precision management. Worry when: (1) comparing computed values for equality, (2) long computation chains, (3) values of vastly different magnitudes, (4) subtracting similar values, or (5) the application is financial, safety-critical, or scientific. For casual arithmetic, just use double precision and move on.
We've explored the sources, manifestations, and management of floating-point precision errors—knowledge that transforms floating-point from a mysterious bug source into a predictable, manageable tool.
What's Next
We've built a solid foundation of floating-point knowledge. The final page in this module tackles the most famous floating-point quirk of all: why 0.1 + 0.2 ≠ 0.3. This seemingly simple bug has confused millions of programmers and spawned countless Stack Overflow questions.
With everything you've learned, you'll finally be able to explain—and resolve—this notorious issue with confidence.
You now have professional-grade understanding of floating-point precision errors. You can identify dangerous patterns, apply mitigation techniques, and reason about accumulated error. This knowledge puts you ahead of most programmers who treat floating-point as a mysterious black box. Next, we'll put this knowledge to work explaining the most famous floating-point puzzle of all.