Before 1985, every computer manufacturer had its own floating-point format. An IBM mainframe produced different numerical results than a DEC VAX or a Cray supercomputer—even for identical mathematical operations. Scientists couldn't trust that their simulations would reproduce on different machines. Software portability was a nightmare.
Then came IEEE 754: The Standard for Binary Floating-Point Arithmetic.
This single document, published by the Institute of Electrical and Electronics Engineers, established a universal format that every major processor now implements. When you use float in C, double in Java, Number in JavaScript, or np.float64 in Python, you're using IEEE 754. It's arguably the most successful numerical standard in computing history.
This page provides a conceptual tour of IEEE 754—what it specifies, why it makes the choices it does, and what you need to understand as a developer. We'll avoid bit-level arithmetic, focusing instead on the mental model that will help you reason about floating-point behavior.
By the end of this page, you will understand the three components of an IEEE 754 floating-point number (sign, exponent, significand), know about single precision (32-bit) and double precision (64-bit) formats, comprehend special values like infinity, negative zero, and NaN, and appreciate the design philosophy that makes IEEE 754 work so elegantly.
Every IEEE 754 floating-point number consists of exactly three components, packed together in a fixed number of bits:
1. Sign Bit (1 bit)
The simplest component: a single bit that indicates whether the number is positive (0) or negative (1).
This is stored first (leftmost) in all IEEE 754 formats. Interestingly, this design means that comparing the sign of floats is trivial—just check the first bit.
2. Exponent Field (variable size)
The exponent determines the magnitude of the number—how far left or right the radix point floats. Think of it as the "power of 2" that scales the value.
The exponent field uses a "biased" encoding (we'll explain why shortly) where the stored value is offset from the actual exponent by a fixed amount.
3. Significand/Mantissa Field (variable size)
The significand (also called mantissa or fraction) contains the significant digits of the number. This is where precision lives.
A clever optimization called the "implicit leading bit" gives us one extra bit of precision for free—more on this later.
| Format | Total Bits | Sign | Exponent | Significand | ≈ Decimal Digits |
|---|---|---|---|---|---|
| Half (binary16) | 16 | 1 | 5 | 10 | ~3.3 |
| Single (binary32) | 32 | 1 | 8 | 23 | ~7.2 |
| Double (binary64) | 64 | 1 | 11 | 52 | ~15.9 |
| Quadruple (binary128) | 128 | 1 | 15 | 112 | ~34.0 |
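If you want to check these parameters on your own machine, NumPy can report them directly. The sketch below assumes NumPy is installed; note that np.finfo gives a conservative whole-digit count (6 for single, 15 for double) rather than the fractional figures in the table.

```python
import numpy as np

# Query IEEE 754 format parameters at runtime (assumes NumPy is available).
for dtype in (np.float16, np.float32, np.float64):
    info = np.finfo(dtype)
    print(f"{info.dtype}: {info.bits} bits total, "
          f"{info.nmant} stored significand bits, ~{info.precision} decimal digits")
# float16: 16 bits total, 10 stored significand bits, ~3 decimal digits
# float32: 32 bits total, 23 stored significand bits, ~6 decimal digits
# float64: 64 bits total, 52 stored significand bits, ~15 decimal digits
```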
The Formula
Conceptually, an IEEE 754 number represents the value:
value = (−1)^sign × significand × 2^exponent
Where:
- sign is the sign bit: 0 for positive, 1 for negative
- significand is a value in [1.0, 2.0) holding the significant digits (more on this below)
- exponent is an integer power of 2, which may be negative for small magnitudes
This should feel familiar—it's scientific notation in base 2 rather than base 10.
Just as 6.02 × 10²³ means "6.02 scaled by 10 to the 23rd power," a floating-point number like 1.5 × 2¹⁰ means "1.5 scaled by 2 to the 10th power" = 1536.
IEEE 754 uses base 2 (binary) rather than base 10 (decimal) because computer hardware natively operates on binary. Multiplication and division by powers of 2 are simple bit shifts. The tradeoff: numbers that are 'round' in decimal (like 0.1) often have infinite representations in binary, leading to the precision issues we'll explore later.
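If you'd like to peek under the hood despite our no-bit-arithmetic promise, here is a minimal Python sketch that splits a value into the three fields of the binary32 format. The helper name decode_float32 is just for illustration; only the standard library struct module is used.

```python
import struct

def decode_float32(x):
    """Unpack a Python float into the three IEEE 754 binary32 fields."""
    # Re-encode as a big-endian binary32 value and view the 32 bits as an integer.
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                      # 1 bit
    exponent_field = (bits >> 23) & 0xFF   # 8 bits, biased by 127
    fraction_field = bits & 0x7FFFFF       # 23 bits, fractional part of the significand
    return sign, exponent_field, fraction_field

print(decode_float32(1.5))    # (0, 127, 4194304)  -> +1.1000... x 2^0
print(decode_float32(-2.0))   # (1, 128, 0)        -> -1.0 x 2^1
```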
One of IEEE 754's clever design choices involves how it stores the significand.
Normalization: The Key Insight
In scientific notation, we always write numbers with exactly one non-zero digit before the decimal point: 6,020 becomes 6.02 × 10³, and 0.00052 becomes 5.2 × 10⁻⁴.
The same principle applies in binary. For any non-zero binary number, we can always shift it so that it starts with a 1 before the radix point: 1101.01 becomes 1.10101 × 2³, and 0.00101 becomes 1.01 × 2⁻³.
Notice: the leading digit is always 1.
Getting a Free Bit of Precision
Since normalized floating-point numbers always have 1 as their leading digit, why store it? IEEE 754 makes this bit implicit—it's assumed to be 1 and not stored in the significand bits.
Result: A 23-bit significand field actually stores a 24-bit value (1 + 23 fractional bits). A 52-bit significand field stores a 53-bit value. We get one extra bit of precision "for free."
When you see that single precision has "7 decimal digits of precision," that's because 2²⁴ ≈ 16.7 million ≈ 10^7.2. The implicit bit contributes to this precision.
The Significand Interpretation
Conceptually, the stored significand represents a binary fraction. If the stored bits are b₁b₂b₃...b₂₃, the actual significand value is:
1.b₁b₂b₃...b₂₃ (in binary)
Which equals:
1 + b₁×2⁻¹ + b₂×2⁻² + b₃×2⁻³ + ...
For example, if the stored bits are 10100000...0, the value is 1 + 1×2⁻¹ + 0×2⁻² + 1×2⁻³ = 1 + 0.5 + 0.125 = 1.625.
The implicit 1 is prepended, giving us values from 1.0 (all zeros) to just under 2.0 (all ones).
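A small Python sketch makes the interpretation concrete. The helper below (an illustrative function, not a library API) prepends the implicit 1 and adds bᵢ × 2⁻ⁱ for each stored bit:

```python
# Interpret a 23-bit stored significand as 1.b1 b2 b3 ... (the leading 1 is implicit).
def significand_value(stored_bits, width=23):
    value = 1.0  # the implicit leading bit
    for i in range(width):
        bit = (stored_bits >> (width - 1 - i)) & 1
        value += bit * 2.0 ** -(i + 1)
    return value

print(significand_value(0b101 << 20))    # 1.625  (bits 1010...0 -> 1 + 1/2 + 1/8)
print(significand_value(0))              # 1.0    (all zeros)
print(significand_value((1 << 23) - 1))  # 1.9999998807907104 (all ones, just under 2.0)
```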
Why This Range Matters
By ensuring the significand is always between 1.0 and 2.0, IEEE 754 maintains a consistent precision relationship. Every representable number can be written as:
value = ± 1.xxx × 2^exponent
This normalization ensures that the relative precision (spacing between adjacent values divided by the value itself) stays approximately constant across all magnitudes—the defining characteristic of floating-point's design.
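You can observe this constant relative precision directly. Using math.ulp (available in Python 3.9+), the spacing to the next representable double, divided by the value itself, hovers around 2⁻⁵² at every magnitude:

```python
import math

# The gap to the next representable double (one "ulp"), divided by the value itself,
# stays near 2^-52 no matter how large or small the value is.
for x in (1e-300, 1.0, 12345.678, 1e300):
    print(f"{x:>12g}  relative spacing = {math.ulp(x) / x:.3e}")
# Every line prints roughly 1e-16 to 2e-16, i.e. about 2^-52 ~ 2.2e-16.
```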
The implicit leading bit was controversial during IEEE 754's development. Some argued for explicitly storing it for simplicity. But the committee recognized that losing one bit of precision would halve the resolution of the significand—an unacceptable tradeoff. The implicit bit became standard, and every processor manufacturer optimized their hardware to handle it transparently.
The exponent field presents an interesting design challenge: we need to represent both positive exponents (for large numbers) and negative exponents (for small numbers). How should we encode negative exponents?
Option 1: Two's Complement (Rejected)
We could use two's complement encoding, the standard for signed integers. An 8-bit two's complement would give exponents from −128 to +127.
Problem: Comparing two floating-point numbers becomes expensive. Consider comparing 1.5 × 2⁵ with 1.5 × 2⁻⁵. With two's complement, we can't just compare the raw bit patterns, because a negative exponent is stored as a large unsigned value and a simple integer comparison gives the wrong order.
Option 2: Biased Encoding (Adopted)
IEEE 754 uses a "biased" encoding where a fixed value (the bias) is added to the actual exponent before storage:
stored_value = actual_exponent + bias
For single precision (8-bit exponent), the bias is 127; for double precision (11-bit exponent), the bias is 1023. The table below shows single-precision examples:
| Actual Exponent | Stored Value | Represents Powers of |
|---|---|---|
| +127 | 254 | 2¹²⁷ ≈ 1.7 × 10³⁸ |
| +10 | 137 | 2¹⁰ = 1024 |
| +1 | 128 | 2¹ = 2 |
| 0 | 127 | 2⁰ = 1 |
| −1 | 126 | 2⁻¹ = 0.5 |
| −10 | 117 | 2⁻¹⁰ ≈ 0.001 |
| −126 | 1 | 2⁻¹²⁶ ≈ 1.2 × 10⁻³⁸ |
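A short Python sketch (standard library only; the helper name is illustrative) confirms the table by reading the stored exponent field out of a binary32 value and subtracting the bias of 127:

```python
import struct

def stored_exponent(x):
    """Return the biased exponent field of x encoded as an IEEE 754 binary32 value."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    return (bits >> 23) & 0xFF

for value in (1024.0, 2.0, 1.0, 0.5, 2.0 ** -10):
    e = stored_exponent(value)
    print(f"{value:10g}  stored={e:3d}  actual exponent={e - 127:4d}")
# 1024 -> stored 137 (actual +10), 1.0 -> stored 127 (actual 0),
# 0.5 -> stored 126 (actual -1), 2^-10 -> stored 117 (actual -10)
```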
Why Biased Encoding Is Elegant
The brilliance of biased encoding is that it makes comparison trivial. For two positive floating-point numbers:
- A larger exponent always produces a larger stored exponent field, with no sign handling needed.
- If the exponent fields are equal, the larger significand field wins.
This means positive floats can be compared using ordinary unsigned integer comparison! The bit pattern of a larger positive float is numerically greater than the bit pattern of a smaller one.
For example, +2.0 (stored as 0x40000000 in hex) is greater than +1.5 (stored as 0x3FC00000) when both bit patterns are treated as unsigned integers. A single integer comparison determines ordering.
This property—called "lexicographic ordering"—dramatically speeds up floating-point comparison in hardware. What could require unpacking three fields and multiple comparisons becomes a single integer compare.
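Here is one way to see this in Python, reinterpreting two positive floats as unsigned 32-bit integers: the value ordering and the bit-pattern ordering agree. (The helper is illustrative; comparison hardware effectively exploits the same property.)

```python
import struct

def float32_bits(x):
    """Return the binary32 encoding of x as an unsigned 32-bit integer."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    return bits

a, b = 1.5, 2.0
print(hex(float32_bits(a)), hex(float32_bits(b)))      # 0x3fc00000 0x40000000
# For positive floats, comparing values and comparing raw bit patterns agree:
print((a < b) == (float32_bits(a) < float32_bits(b)))  # True
```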
Reserved Exponent Values
The stored exponent values 0 (all zeros) and 255/2047 (all ones) are reserved for special cases:
- All zeros signals zero and the denormalized (subnormal) numbers.
- All ones signals infinity (zero significand) or NaN (non-zero significand).
This leaves 254 (single) or 2046 (double) regular exponent values for normal numbers.
The bias is 2^(k−1) − 1, where k is the number of exponent bits. For 8 bits: 2⁷ − 1 = 127. For 11 bits: 2¹⁰ − 1 = 1023. This formula ensures roughly equal range for positive and negative exponents while reserving extreme values for special purposes.
One of IEEE 754's most important contributions is standardizing special values for edge cases. Before the standard, division by zero might crash programs, return undefined garbage, or behave unpredictably. IEEE 754 defines exactly what happens in every edge case.
Positive and Negative Infinity (±∞)
When a computation produces a result too large to represent, IEEE 754 returns infinity:
- 1.0 / 0.0 → +∞
- −1.0 / 0.0 → −∞
- 1e308 * 1e308 → +∞ (overflow)

Infinity behaves mathematically as you'd expect:
- ∞ + 1 → ∞
- ∞ + ∞ → ∞
- −∞ < 100 is true
- ∞ > −∞ is true

This allows computations to continue even when overflow occurs, with the result indicating "too large to represent."
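You can try these rules in Python, with one caveat: at the language level, Python raises ZeroDivisionError for 1.0 / 0.0 rather than returning infinity, so the sketch below reaches infinity through overflow and math.inf instead.

```python
import math

inf = math.inf

# Overflow produces infinity rather than an error:
print(1e308 * 10)        # inf
print(-1e308 * 10)       # -inf

# Infinity behaves as expected in arithmetic and comparisons:
print(inf + 1 == inf)    # True
print(inf + inf)         # inf
print(-inf < 100)        # True
print(inf > -inf)        # True
```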
Positive and Negative Zero (±0)
IEEE 754 defines both +0 and −0. While mathematically identical in value, they arise from different operations:
- +1.0 / +∞ → +0
- −1.0 / +∞ → −0

For most purposes, +0 and −0 are equal:
- +0 == −0 is true
- 1.0 / +0 → +∞
- 1.0 / −0 → −∞ (preserving sign information)

The signed zero is primarily useful in complex analysis and for maintaining sign information through limit operations. For everyday programming, you can mostly ignore the distinction.
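A quick Python check: the two zeros compare equal, but math.copysign (and even str) can still see the sign.

```python
import math

pos_zero = 1.0 / math.inf      # +0.0
neg_zero = -1.0 / math.inf     # -0.0

print(pos_zero == neg_zero)            # True  -- equal in value
print(math.copysign(1.0, pos_zero))    # 1.0
print(math.copysign(1.0, neg_zero))    # -1.0  -- the sign is still there
print(str(neg_zero))                   # -0.0
```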
Not a Number (NaN)
NaN represents undefined or unrepresentable results:
- 0.0 / 0.0 → NaN
- ∞ − ∞ → NaN
- √(−1) → NaN
- ∞ × 0 → NaN

NaN has unique properties that make it unmistakable:
- nan < x, nan > x, and nan == x are all false for any x (including NaN).
- x != x is true only for NaN.

Encoding of Special Values
Special values are encoded using reserved exponent patterns:
- Exponent all ones, significand all zeros: ±infinity (the sign bit distinguishes +∞ from −∞).
- Exponent all ones, significand non-zero: NaN.
- Exponent all zeros, significand all zeros: ±0.
- Exponent all zeros, significand non-zero: denormalized (subnormal) numbers.
This encoding scheme is elegant: the most extreme exponent values are dedicated to special cases, leaving the middle range for regular numbers.
IEEE 754 actually defines two kinds of NaN: 'signaling' NaN (sNaN) raises an exception when used; 'quiet' NaN (qNaN) propagates silently through computations. Most programming languages expose only quiet NaN. Signaling NaN can be used to detect use of uninitialized variables in debugging builds.
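These NaN rules are easy to verify in Python: every comparison involving NaN is false except !=, and math.isnan is the reliable test.

```python
import math

nan = float("nan")

print(nan == nan)             # False -- NaN is not equal to anything, including itself
print(nan < 1.0, nan > 1.0)   # False False
print(nan != nan)             # True  -- the classic self-inequality test
print(math.isnan(nan))        # True  -- the portable way to test for NaN

# Quiet NaN propagates silently through further arithmetic:
print(nan + 1.0, nan * 0.0)   # nan nan
```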
When numbers become extremely small—smaller than the smallest normal floating-point value—what should happen? Before IEEE 754, the typical answer was "underflow to zero." But this created problems.
The Underflow Gap Problem
With a minimum exponent of −126 (single precision), the smallest positive normal number is approximately:
1.0 × 2⁻¹²⁶ ≈ 1.175 × 10⁻³⁸
Without special handling, any result smaller than this would round to zero. This created a "gap" between zero and the smallest positive number—a gap that's proportionally enormous compared to the spacing between other adjacent floats.
This gap caused problems:
- x − y == 0 could be true even when x != y (whenever their true difference fell into the gap)

Denormalized Numbers Fill the Gap
IEEE 754 solves this with "denormalized" (or "subnormal") numbers. When the exponent field is all zeros, the implicit leading bit changes from 1 to 0 and the exponent is pinned at its minimum:
value = ± 0.xxx × 2⁻¹²⁶ (single precision)
This allows representation of values smaller than the normal minimum, filling the gap between zero and the smallest normal number with gradually decreasing values.
For single precision, the smallest positive denormal is 2⁻¹⁴⁹ ≈ 1.4 × 10⁻⁴⁵, far below the smallest normal value of about 1.2 × 10⁻³⁸.
The gap is now filled with ~2²³ (over 8 million) denormal values, providing gradual underflow to zero.
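Python floats are double precision, but the same gradual underflow is easy to observe there: halving the smallest normal double lands in the subnormal range rather than collapsing to zero, and only below the smallest subnormal (2⁻¹⁰⁷⁴) do we finally hit 0.0.

```python
import sys

smallest_normal = sys.float_info.min      # 2.2250738585072014e-308, i.e. 2**-1022
print(smallest_normal)

# Halving it does NOT flush to zero -- we drop into the subnormal range instead:
print(smallest_normal / 2)                # 1.1125369292536007e-308 (subnormal)

# The smallest positive subnormal double is 2**-1074:
print(5e-324 == 2.0 ** -1074)             # True
print(5e-324 / 2)                         # 0.0 -- only here do we finally underflow
```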
The Price of Gradual Underflow
Denormalized numbers have reduced precision because the implicit leading bit is 0, not 1. A denormal with only a few set bits in the significand has fewer significant figures than a normal number.
Also, denormal operations can be slower on some hardware—10x to 100x slower on certain older processors. Many modern CPUs handle denormals with little or no penalty, but this remains a consideration in performance-critical code.
For most programmers, denormalized numbers are invisible—they just work. The only time you might care is in performance-critical numerical code where you might consider 'flush-to-zero' mode, or when debugging numerical instabilities very close to zero. Understanding they exist helps explain why floating-point handles very small numbers gracefully.
Since floating-point numbers can only represent a finite subset of real numbers, almost every operation requires rounding the true mathematical result to the nearest representable value.
IEEE 754 defines five rounding modes, giving programmers control over how this approximation happens:
1. Round to Nearest, Ties to Even (Default)
The most common mode: round to the closest representable value. When exactly halfway between two representable values, round to the one with an even least significant bit.
The "ties to even" rule (also called "banker's rounding") prevents systematic bias that would occur if we always rounded up or always rounded down on ties.
2. Round to Nearest, Ties Away from Zero
This matches the rounding most people learned in school. It's less common in computing because it introduces slight upward bias for positive numbers.
3. Round Toward Zero (Truncation)
Simply discard the bits that don't fit. This is equivalent to integer truncation: it's fast but introduces bias toward zero.
4. Round Toward +∞ (Ceiling)
Always round "up" (toward positive infinity).
5. Round Toward −∞ (Floor)
Always round "down" (toward negative infinity).
| Value | Nearest Even | Nearest Away | Toward Zero | Toward +∞ | Toward −∞ |
|---|---|---|---|---|---|
| +2.3 | 2 | 2 | 2 | 3 | 2 |
| +2.5 | 2 | 3 | 2 | 3 | 2 |
| +2.7 | 3 | 3 | 2 | 3 | 2 |
| −2.3 | −2 | −2 | −2 | −2 | −3 |
| −2.5 | −2 | −3 | −2 | −2 | −3 |
| −2.7 | −3 | −3 | −2 | −2 | −3 |
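Python's integer-rounding helpers illustrate four of these decision rules (they round to whole numbers and do not change the hardware rounding mode used inside float arithmetic): round() is ties-to-even, math.trunc is toward zero, math.ceil is toward +∞, and math.floor is toward −∞.

```python
import math

values = [2.3, 2.5, 2.7, -2.3, -2.5, -2.7]

for v in values:
    print(f"{v:5}  nearest-even={round(v):3d}  toward-zero={math.trunc(v):3d}  "
          f"ceiling={math.ceil(v):3d}  floor={math.floor(v):3d}")
# round(2.5) -> 2 and round(-2.5) -> -2: Python's round() uses ties-to-even.
```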
Why Rounding Mode Matters
For most applications, the default "round to nearest, ties to even" is optimal. But specialized applications benefit from controlling rounding:
- Interval arithmetic rounds lower bounds toward −∞ and upper bounds toward +∞ so the true result is guaranteed to lie in between.
- Financial and regulatory rules sometimes mandate a specific rounding behavior.
- Deliberately re-running a computation under different rounding modes is a cheap way to probe its sensitivity to rounding error.
The Guarantee of Correct Rounding
IEEE 754 requires that basic operations (+, −, ×, ÷, √) return the correctly rounded result: the result as if computed with infinite precision and then rounded once.
This is a remarkably strong guarantee. It means that a + b gives you the best possible approximation to the true sum, with an error of at most half the spacing between adjacent representable values (half an "ulp"—unit in the last place).
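You can verify the half-ulp bound with exact rational arithmetic. The sketch below (Python 3.9+ for math.ulp) compares 0.1 + 0.2, as computed, against the exact sum of the two stored values:

```python
import math
from fractions import Fraction

a, b = 0.1, 0.2
computed = a + b                       # one correctly rounded addition
exact = Fraction(a) + Fraction(b)      # the true sum of the two stored values

error = abs(Fraction(computed) - exact)
half_ulp = Fraction(math.ulp(computed)) / 2
print(error <= half_ulp)               # True: the error is at most half an ulp
print(float(error))                    # ~2.8e-17, i.e. half an ulp at 0.3
```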
Unless you have specific requirements (interval arithmetic, regulatory compliance, etc.), stick with the default 'round to nearest, ties to even' mode. It minimizes average error, prevents systematic bias, and is what all standard libraries and hardware assume.
IEEE 754 defines multiple precision levels. The two most common in everyday programming are single precision (32-bit) and double precision (64-bit).
When to Use Single Precision (float)
Single precision uses 32 bits total: 1 sign bit, 8 exponent bits, and 23 significand bits, giving roughly 7 decimal digits of precision.
Single precision is ideal when:
- Memory or bandwidth is the bottleneck (large arrays, graphics, GPU work)
- Roughly 7 significant decimal digits are enough for the task
When to Use Double Precision (double)
Double precision uses 64 bits total: 1 sign bit, 11 exponent bits, and 52 significand bits, giving roughly 15-16 decimal digits of precision.
Double precision is ideal when:
- Rounding error accumulates over long chains of operations, or nearly equal values are subtracted
- You have no measured reason to use anything smaller (it is the safe default)
Language Defaults
Different languages make different default choices:
- C and C++: float is 32-bit, double is 64-bit. Neither is default; you choose.
- Java: float is 32-bit, double is 64-bit. double is preferred for most uses.
- Python: float is 64-bit (double precision). There's no built-in single precision.
- JavaScript: Number is 64-bit. All JS numbers are double precision.
- Go and NumPy: float32 and float64 explicitly state their size.

The trend in modern languages is toward double precision as the default, prioritizing correctness over micro-optimization. Use single precision when you have measured evidence that the performance benefit matters.
Single precision's ~7 decimal digits of precision can bite unexpectedly. Subtracting two similar large numbers (catastrophic cancellation) can leave only 1-2 significant digits. For serious numerical work, double precision's ~16 digits provide much more headroom for error accumulation.
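Here is single precision's limit in miniature, using NumPy (assumed available): near 100,000,000 the spacing between adjacent float32 values is 8, so adding 1 is simply lost, while float64's spacing there is about 1.5 × 10⁻⁸ and the 1 survives.

```python
import numpy as np

big = np.float32(1e8)      # adjacent float32 values near 1e8 are 8 apart
print((big + np.float32(1.0)) - big)    # 0.0 -- the +1 falls below float32 resolution

big64 = np.float64(1e8)    # adjacent float64 values near 1e8 are ~1.5e-8 apart
print((big64 + 1.0) - big64)            # 1.0 -- double precision keeps it
```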
We've taken a conceptual tour through IEEE 754—the standard that unified floating-point computing across the entire industry. Let's consolidate the essential insights:
- Every float has three parts: a sign bit, a biased exponent, and a significand with an implicit leading 1.
- The formats differ only in field widths: single precision gives ~7 decimal digits, double precision ~16.
- Special values (±∞, ±0, NaN) let computations continue predictably through overflow and undefined operations.
- Denormalized numbers provide gradual underflow instead of an abrupt gap above zero.
- Basic operations are correctly rounded, with "round to nearest, ties to even" as the default.
What's Next
With the IEEE 754 foundation in place, we're ready to explore what happens when this beautifully designed system meets mathematical reality. The next page dives into precision errors and rounding issues—why operations that seem exact can introduce tiny errors, how these errors accumulate, and what you can do about them.
This knowledge transforms floating-point from a mysterious source of bugs into a predictable, manageable tool.
You now have a solid conceptual understanding of IEEE 754—the standard governing floating-point arithmetic worldwide. You understand the three components, the clever optimizations, the special values, and the precision choices. Next, we'll explore the precision errors that arise when this finite representation meets infinite mathematical reality.