Before 1985, every computer manufacturer had its own floating-point format. An IBM mainframe produced different numerical results than a DEC VAX or a Cray supercomputer—even for identical mathematical operations. Scientists couldn't trust that their simulations would reproduce on different machines. Software portability was a nightmare.
Then came IEEE 754: The Standard for Binary Floating-Point Arithmetic.
This single document, published by the Institute of Electrical and Electronics Engineers, established a universal format that every major processor now implements. When you use float in C, double in Java, Number in JavaScript, or np.float64 in Python, you're using IEEE 754. It's arguably the most successful numerical standard in computing history.
This page provides a conceptual tour of IEEE 754—what it specifies, why it makes the choices it does, and what you need to understand as a developer. We'll avoid bit-level arithmetic, focusing instead on the mental model that will help you reason about floating-point behavior.
By the end of this page, you will understand the three components of an IEEE 754 floating-point number (sign, exponent, significand), know about single precision (32-bit) and double precision (64-bit) formats, comprehend special values like infinity, negative zero, and NaN, and appreciate the design philosophy that makes IEEE 754 work so elegantly.
Every IEEE 754 floating-point number consists of exactly three components, packed together in a fixed number of bits:
1. Sign Bit (1 bit)
The simplest component: a single bit that indicates whether the number is positive (0) or negative (1).
This is stored first (leftmost) in all IEEE 754 formats. Interestingly, this design means that comparing the sign of floats is trivial—just check the first bit.
2. Exponent Field (variable size)
The exponent determines the magnitude of the number—how far left or right the radix point floats. Think of it as the "power of 2" that scales the value.
The exponent field uses a "biased" encoding (we'll explain why shortly) where the stored value is offset from the actual exponent by a fixed amount.
3. Significand/Mantissa Field (variable size)
The significand (also called mantissa or fraction) contains the significant digits of the number. This is where precision lives.
A clever optimization called the "implicit leading bit" gives us one extra bit of precision for free—more on this later.
| Format | Total Bits | Sign | Exponent | Significand | ≈ Decimal Digits |
|---|---|---|---|---|---|
| Half (binary16) | 16 | 1 | 5 | 10 | ~3.3 |
| Single (binary32) | 32 | 1 | 8 | 23 | ~7.2 |
| Double (binary64) | 64 | 1 | 11 | 52 | ~15.9 |
| Quadruple (binary128) | 128 | 1 | 15 | 112 | ~34.0 |
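If you want to check these parameters on your own machine, NumPy can report them directly. The sketch below assumes NumPy is installed; note that np.finfo gives a conservative whole-digit count (6 for single, 15 for double) rather than the fractional figures in the table.

```python
import numpy as np

# Query IEEE 754 format parameters at runtime (assumes NumPy is available).
for dtype in (np.float16, np.float32, np.float64):
    info = np.finfo(dtype)
    print(f"{info.dtype}: {info.bits} bits total, "
          f"{info.nmant} stored significand bits, ~{info.precision} decimal digits")
# float16: 16 bits total, 10 stored significand bits, ~3 decimal digits
# float32: 32 bits total, 23 stored significand bits, ~6 decimal digits
# float64: 64 bits total, 52 stored significand bits, ~15 decimal digits
```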
The Formula
Conceptually, an IEEE 754 number represents the value:
value = (−1)^sign × significand × 2^exponent
Where:
- sign is the sign bit: 0 for positive, 1 for negative
- significand is a value in [1.0, 2.0) holding the significant digits (more on this below)
- exponent is an integer power of 2, which may be negative for small magnitudes
This should feel familiar—it's scientific notation in base 2 rather than base 10.
Just as 6.02 × 10²³ means "6.02 scaled by 10 to the 23rd power," a floating-point number like 1.5 × 2¹⁰ means "1.5 scaled by 2 to the 10th power" = 1536.
IEEE 754 uses base 2 (binary) rather than base 10 (decimal) because computer hardware natively operates on binary. Multiplication and division by powers of 2 are simple bit shifts. The tradeoff: numbers that are 'round' in decimal (like 0.1) often have infinite representations in binary, leading to the precision issues we'll explore later.
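If you'd like to peek under the hood despite our no-bit-arithmetic promise, here is a minimal Python sketch that splits a value into the three fields of the binary32 format. The helper name decode_float32 is just for illustration; only the standard library struct module is used.

```python
import struct

def decode_float32(x):
    """Unpack a Python float into the three IEEE 754 binary32 fields."""
    # Re-encode as a big-endian binary32 value and view the 32 bits as an integer.
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                      # 1 bit
    exponent_field = (bits >> 23) & 0xFF   # 8 bits, biased by 127
    fraction_field = bits & 0x7FFFFF       # 23 bits, fractional part of the significand
    return sign, exponent_field, fraction_field

print(decode_float32(1.5))    # (0, 127, 4194304)  -> +1.1000... x 2^0
print(decode_float32(-2.0))   # (1, 128, 0)        -> -1.0 x 2^1
```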
One of IEEE 754's clever design choices involves how it stores the significand.
Normalization: The Key Insight
In scientific notation, we always write numbers with exactly one non-zero digit before the decimal point: 6,020 becomes 6.02 × 10³, and 0.00052 becomes 5.2 × 10⁻⁴.
The same principle applies in binary. For any non-zero binary number, we can always shift it so that it starts with a 1 before the radix point: 1101.01 becomes 1.10101 × 2³, and 0.00101 becomes 1.01 × 2⁻³.
Notice: the leading digit is always 1.
Getting a Free Bit of Precision
Since normalized floating-point numbers always have 1 as their leading digit, why store it? IEEE 754 makes this bit implicit—it's assumed to be 1 and not stored in the significand bits.
Result: A 23-bit significand field actually stores a 24-bit value (1 + 23 fractional bits). A 52-bit significand field stores a 53-bit value. We get one extra bit of precision "for free."
When you see that single precision has "7 decimal digits of precision," that's because 2²⁴ ≈ 16.7 million ≈ 10^7.2. The implicit bit contributes to this precision.
The Significand Interpretation
Conceptually, the stored significand represents a binary fraction. If the stored bits are b₁b₂b₃...b₂₃, the actual significand value is:
1.b₁b₂b₃...b₂₃ (in binary)
Which equals:
1 + b₁×2⁻¹ + b₂×2⁻² + b₃×2⁻³ + ...
For example, if the stored bits are 10100000...0, the value is 1 + 1×2⁻¹ + 0×2⁻² + 1×2⁻³ = 1 + 0.5 + 0.125 = 1.625.
The implicit 1 is prepended, giving us values from 1.0 (all zeros) to just under 2.0 (all ones).
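A small Python sketch makes the interpretation concrete. The helper below (an illustrative function, not a library API) prepends the implicit 1 and adds bᵢ × 2⁻ⁱ for each stored bit:

```python
# Interpret a 23-bit stored significand as 1.b1 b2 b3 ... (the leading 1 is implicit).
def significand_value(stored_bits, width=23):
    value = 1.0  # the implicit leading bit
    for i in range(width):
        bit = (stored_bits >> (width - 1 - i)) & 1
        value += bit * 2.0 ** -(i + 1)
    return value

print(significand_value(0b101 << 20))    # 1.625  (bits 1010...0 -> 1 + 1/2 + 1/8)
print(significand_value(0))              # 1.0    (all zeros)
print(significand_value((1 << 23) - 1))  # 1.9999998807907104 (all ones, just under 2.0)
```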
Why This Range Matters
By ensuring the significand is always between 1.0 and 2.0, IEEE 754 maintains a consistent precision relationship. Every representable number can be written as:
value = ± 1.xxx × 2^exponent
This normalization ensures that the relative precision (spacing between adjacent values divided by the value itself) stays approximately constant across all magnitudes—the defining characteristic of floating-point's design.
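You can observe this constant relative precision directly. Using math.ulp (available in Python 3.9+), the spacing to the next representable double, divided by the value itself, hovers around 2⁻⁵² at every magnitude:

```python
import math

# The gap to the next representable double (one "ulp"), divided by the value itself,
# stays near 2^-52 no matter how large or small the value is.
for x in (1e-300, 1.0, 12345.678, 1e300):
    print(f"{x:>12g}  relative spacing = {math.ulp(x) / x:.3e}")
# Every line prints roughly 1e-16 to 2e-16, i.e. about 2^-52 ~ 2.2e-16.
```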
The implicit leading bit was controversial during IEEE 754's development. Some argued for explicitly storing it for simplicity. But the committee recognized that losing one bit of precision would halve the resolution of the significand—an unacceptable tradeoff. The implicit bit became standard, and every processor manufacturer optimized their hardware to handle it transparently.
The exponent field presents an interesting design challenge: we need to represent both positive exponents (for large numbers) and negative exponents (for small numbers). How should we encode negative exponents?
Option 1: Two's Complement (Rejected)
We could use two's complement encoding, the standard for signed integers. An 8-bit two's complement would give exponents from −128 to +127.
Problem: Comparing two floating-point numbers becomes expensive. Consider comparing 1.5 × 2⁵ with 1.5 × 2⁻⁵. With two's complement, we can't just compare the raw bit patterns, because a negative exponent is stored as a large unsigned value and a simple integer comparison gives the wrong order.
Option 2: Biased Encoding (Adopted)
IEEE 754 uses a "biased" encoding where a fixed value (the bias) is added to the actual exponent before storage:
stored_value = actual_exponent + bias
For single precision (8-bit exponent), the bias is 127; for double precision (11-bit exponent), the bias is 1023. The table below shows single-precision examples:
| Actual Exponent | Stored Value | Represents Powers of |
|---|---|---|
| +127 | 254 | 2¹²⁷ ≈ 1.7 × 10³⁸ |
| +10 | 137 | 2¹⁰ = 1024 |
| +1 | 128 | 2¹ = 2 |
| 0 | 127 | 2⁰ = 1 |
| −1 | 126 | 2⁻¹ = 0.5 |
| −10 | 117 | 2⁻¹⁰ ≈ 0.001 |
| −126 | 1 | 2⁻¹²⁶ ≈ 1.2 × 10⁻³⁸ |
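A short Python sketch (standard library only; the helper name is illustrative) confirms the table by reading the stored exponent field out of a binary32 value and subtracting the bias of 127:

```python
import struct

def stored_exponent(x):
    """Return the biased exponent field of x encoded as an IEEE 754 binary32 value."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    return (bits >> 23) & 0xFF

for value in (1024.0, 2.0, 1.0, 0.5, 2.0 ** -10):
    e = stored_exponent(value)
    print(f"{value:10g}  stored={e:3d}  actual exponent={e - 127:4d}")
# 1024 -> stored 137 (actual +10), 1.0 -> stored 127 (actual 0),
# 0.5 -> stored 126 (actual -1), 2^-10 -> stored 117 (actual -10)
```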
Why Biased Encoding Is Elegant
The brilliance of biased encoding is that it makes comparison trivial. For two positive floating-point numbers:
- A larger exponent always produces a larger stored exponent field, with no sign handling needed.
- If the exponent fields are equal, the larger significand field wins.
This means positive floats can be compared using ordinary unsigned integer comparison! The bit pattern of a larger positive float is numerically greater than the bit pattern of a smaller one.
For example, +2.0 (stored as 0x40000000 in hex) is greater than +1.5 (stored as 0x3FC00000) when both bit patterns are treated as unsigned integers. A single integer comparison determines ordering.
This property—called "lexicographic ordering"—dramatically speeds up floating-point comparison in hardware. What could require unpacking three fields and multiple comparisons becomes a single integer compare.
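Here is one way to see this in Python, reinterpreting two positive floats as unsigned 32-bit integers: the value ordering and the bit-pattern ordering agree. (The helper is illustrative; comparison hardware effectively exploits the same property.)

```python
import struct

def float32_bits(x):
    """Return the binary32 encoding of x as an unsigned 32-bit integer."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    return bits

a, b = 1.5, 2.0
print(hex(float32_bits(a)), hex(float32_bits(b)))      # 0x3fc00000 0x40000000
# For positive floats, comparing values and comparing raw bit patterns agree:
print((a < b) == (float32_bits(a) < float32_bits(b)))  # True
```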
Reserved Exponent Values
The stored exponent values 0 (all zeros) and 255/2047 (all ones) are reserved for special cases:
- All zeros signals zero and the denormalized (subnormal) numbers.
- All ones signals infinity (zero significand) or NaN (non-zero significand).
This leaves 254 (single) or 2046 (double) regular exponent values for normal numbers.
The bias is 2^(k−1) − 1, where k is the number of exponent bits. For 8 bits: 2⁷ − 1 = 127. For 11 bits: 2¹⁰ − 1 = 1023. This formula ensures roughly equal range for positive and negative exponents while reserving extreme values for special purposes.
One of IEEE 754's most important contributions is standardizing special values for edge cases. Before the standard, division by zero might crash programs, return undefined garbage, or behave unpredictably. IEEE 754 defines exactly what happens in every edge case.
Positive and Negative Infinity (±∞)
When a computation produces a result too large to represent, IEEE 754 returns infinity:
- 1.0 / 0.0 → +∞
- −1.0 / 0.0 → −∞
- 1e308 * 1e308 → +∞ (overflow)

Infinity behaves mathematically as you'd expect:
- ∞ + 1 → ∞
- ∞ + ∞ → ∞
- −∞ < 100 is true
- ∞ > −∞ is true

This allows computations to continue even when overflow occurs, with the result indicating "too large to represent."
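You can try these rules in Python, with one caveat: at the language level, Python raises ZeroDivisionError for 1.0 / 0.0 rather than returning infinity, so the sketch below reaches infinity through overflow and math.inf instead.

```python
import math

inf = math.inf

# Overflow produces infinity rather than an error:
print(1e308 * 10)        # inf
print(-1e308 * 10)       # -inf

# Infinity behaves as expected in arithmetic and comparisons:
print(inf + 1 == inf)    # True
print(inf + inf)         # inf
print(-inf < 100)        # True
print(inf > -inf)        # True
```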
Positive and Negative Zero (±0)
IEEE 754 defines both +0 and −0. While mathematically identical in value, they arise from different operations:
- +1.0 / +∞ → +0
- −1.0 / +∞ → −0

For most purposes, +0 and −0 are equal:
- +0 == −0 is true
- 1.0 / +0 → +∞
- 1.0 / −0 → −∞ (preserving sign information)

The signed zero is primarily useful in complex analysis and for maintaining sign information through limit operations. For everyday programming, you can mostly ignore the distinction.
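A quick Python check: the two zeros compare equal, but math.copysign (and even str) can still see the sign.

```python
import math

pos_zero = 1.0 / math.inf      # +0.0
neg_zero = -1.0 / math.inf     # -0.0

print(pos_zero == neg_zero)            # True  -- equal in value
print(math.copysign(1.0, pos_zero))    # 1.0
print(math.copysign(1.0, neg_zero))    # -1.0  -- the sign is still there
print(str(neg_zero))                   # -0.0
```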
Not a Number (NaN)
NaN represents undefined or unrepresentable results:
- 0.0 / 0.0 → NaN
- ∞ − ∞ → NaN
- √(−1) → NaN
- ∞ × 0 → NaN

NaN has unique properties that make it unmistakable:
- nan < x, nan > x, and nan == x are all false for any x (including NaN).
- x != x is true only for NaN.

Encoding of Special Values
Special values are encoded using reserved exponent patterns:
- Exponent all ones, significand all zeros: ±infinity (the sign bit distinguishes +∞ from −∞).
- Exponent all ones, significand non-zero: NaN.
- Exponent all zeros, significand all zeros: ±0.
- Exponent all zeros, significand non-zero: denormalized (subnormal) numbers.
This encoding scheme is elegant: the most extreme exponent values are dedicated to special cases, leaving the middle range for regular numbers.
IEEE 754 actually defines two kinds of NaN: 'signaling' NaN (sNaN) raises an exception when used; 'quiet' NaN (qNaN) propagates silently through computations. Most programming languages expose only quiet NaN. Signaling NaN can be used to detect use of uninitialized variables in debugging builds.
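These NaN rules are easy to verify in Python: every comparison involving NaN is false except !=, and math.isnan is the reliable test.

```python
import math

nan = float("nan")

print(nan == nan)             # False -- NaN is not equal to anything, including itself
print(nan < 1.0, nan > 1.0)   # False False
print(nan != nan)             # True  -- the classic self-inequality test
print(math.isnan(nan))        # True  -- the portable way to test for NaN

# Quiet NaN propagates silently through further arithmetic:
print(nan + 1.0, nan * 0.0)   # nan nan
```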
When numbers become extremely small—smaller than the smallest normal floating-point value—what should happen? Before IEEE 754, the typical answer was "underflow to zero." But this created problems.
The Underflow Gap Problem
With a minimum exponent of −126 (single precision), the smallest positive normal number is approximately:
1.0 × 2⁻¹²⁶ ≈ 1.175 × 10⁻³⁸
Without special handling, any result smaller than this would round to zero. This created a "gap" between zero and the smallest positive number—a gap that's proportionally enormous compared to the spacing between other adjacent floats.
This gap caused problems:
- x − y == 0 could be true even when x != y (whenever their true difference fell into the gap)

Denormalized Numbers Fill the Gap
IEEE 754 solves this with "denormalized" (or "subnormal") numbers. When the exponent field is all zeros, the implicit leading bit changes from 1 to 0 and the exponent is pinned at its minimum:
value = ± 0.xxx × 2⁻¹²⁶ (single precision)
This allows representation of values smaller than the normal minimum, filling the gap between zero and the smallest normal number with gradually decreasing values.
For single precision, the smallest positive denormal is 2⁻¹⁴⁹ ≈ 1.4 × 10⁻⁴⁵, far below the smallest normal value of about 1.2 × 10⁻³⁸.
The gap is now filled with ~2²³ (over 8 million) denormal values, providing gradual underflow to zero.
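Python floats are double precision, but the same gradual underflow is easy to observe there: halving the smallest normal double lands in the subnormal range rather than collapsing to zero, and only below the smallest subnormal (2⁻¹⁰⁷⁴) do we finally hit 0.0.

```python
import sys

smallest_normal = sys.float_info.min      # 2.2250738585072014e-308, i.e. 2**-1022
print(smallest_normal)

# Halving it does NOT flush to zero -- we drop into the subnormal range instead:
print(smallest_normal / 2)                # 1.1125369292536007e-308 (subnormal)

# The smallest positive subnormal double is 2**-1074:
print(5e-324 == 2.0 ** -1074)             # True
print(5e-324 / 2)                         # 0.0 -- only here do we finally underflow
```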
The Price of Gradual Underflow
Denormalized numbers have reduced precision because the implicit leading bit is 0, not 1. A denormal with only a few set bits in the significand has fewer significant figures than a normal number.
Also, denormal operations can be slower on some hardware—10x to 100x slower on certain older processors. Many modern CPUs handle denormals with little or no penalty, but this remains a consideration in performance-critical code.
For most programmers, denormalized numbers are invisible—they just work. The only time you might care is in performance-critical numerical code where you might consider 'flush-to-zero' mode, or when debugging numerical instabilities very close to zero. Understanding they exist helps explain why floating-point handles very small numbers gracefully.
Since floating-point numbers can only represent a finite subset of real numbers, almost every operation requires rounding the true mathematical result to the nearest representable value.
IEEE 754 defines five rounding modes, giving programmers control over how this approximation happens:
1. Round to Nearest, Ties to Even (Default)
The most common mode: round to the closest representable value. When exactly halfway between two representable values, round to the one with an even least significant bit.
The "ties to even" rule (also called "banker's rounding") prevents systematic bias that would occur if we always rounded up or always rounded down on ties.
2. Round to Nearest, Ties Away from Zero
This matches the rounding most people learned in school. It's less common in computing because it introduces slight upward bias for positive numbers.
3. Round Toward Zero (Truncation)
Simply discard the bits that don't fit. This is equivalent to integer truncation: it's fast but introduces bias toward zero.
4. Round Toward +∞ (Ceiling)
Always round "up" (toward positive infinity).
5. Round Toward −∞ (Floor)
Always round "down" (toward negative infinity).
| Value | Nearest Even | Nearest Away | Toward Zero | Toward +∞ | Toward −∞ |
|---|---|---|---|---|---|
| +2.3 | 2 | 2 | 2 | 3 | 2 |
| +2.5 | 2 | 3 | 2 | 3 | 2 |
| +2.7 | 3 | 3 | 2 | 3 | 2 |
| −2.3 | −2 | −2 | −2 | −2 | −3 |
| −2.5 | −2 | −3 | −2 | −2 | −3 |
| −2.7 | −3 | −3 | −2 | −2 | −3 |
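Python's integer-rounding helpers illustrate four of these decision rules (they round to whole numbers and do not change the hardware rounding mode used inside float arithmetic): round() is ties-to-even, math.trunc is toward zero, math.ceil is toward +∞, and math.floor is toward −∞.

```python
import math

values = [2.3, 2.5, 2.7, -2.3, -2.5, -2.7]

for v in values:
    print(f"{v:5}  nearest-even={round(v):3d}  toward-zero={math.trunc(v):3d}  "
          f"ceiling={math.ceil(v):3d}  floor={math.floor(v):3d}")
# round(2.5) -> 2 and round(-2.5) -> -2: Python's round() uses ties-to-even.
```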
Why Rounding Mode Matters
For most applications, the default "round to nearest, ties to even" is optimal. But specialized applications benefit from controlling rounding:
- Interval arithmetic rounds lower bounds toward −∞ and upper bounds toward +∞ so the true result is guaranteed to lie in between.
- Financial and regulatory rules sometimes mandate a specific rounding behavior.
- Deliberately re-running a computation under different rounding modes is a cheap way to probe its sensitivity to rounding error.
The Guarantee of Correct Rounding
IEEE 754 requires that basic operations (+, −, ×, ÷, √) return the correctly rounded result: the result as if computed with infinite precision and then rounded once.
This is a remarkably strong guarantee. It means that a + b gives you the best possible approximation to the true sum, with an error of at most half the spacing between adjacent representable values (half an "ulp"—unit in the last place).
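You can verify the half-ulp bound with exact rational arithmetic. The sketch below (Python 3.9+ for math.ulp) compares 0.1 + 0.2, as computed, against the exact sum of the two stored values:

```python
import math
from fractions import Fraction

a, b = 0.1, 0.2
computed = a + b                       # one correctly rounded addition
exact = Fraction(a) + Fraction(b)      # the true sum of the two stored values

error = abs(Fraction(computed) - exact)
half_ulp = Fraction(math.ulp(computed)) / 2
print(error <= half_ulp)               # True: the error is at most half an ulp
print(float(error))                    # ~2.8e-17, i.e. half an ulp at 0.3
```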
Unless you have specific requirements (interval arithmetic, regulatory compliance, etc.), stick with the default 'round to nearest, ties to even' mode. It minimizes average error, prevents systematic bias, and is what all standard libraries and hardware assume.
IEEE 754 defines multiple precision levels. The two most common in everyday programming are single precision (32-bit) and double precision (64-bit).
When to Use Single Precision (float)
Single precision uses 32 bits total: 1 sign bit, 8 exponent bits, and 23 significand bits, giving roughly 7 decimal digits of precision.
Single precision is ideal when:
- Memory or bandwidth is the bottleneck (large arrays, graphics, GPU work)
- Roughly 7 significant decimal digits are enough for the task
When to Use Double Precision (double)
Double precision uses 64 bits total: 1 sign bit, 11 exponent bits, and 52 significand bits, giving roughly 15-16 decimal digits of precision.
Double precision is ideal when:
- Rounding error accumulates over long chains of operations, or nearly equal values are subtracted
- You have no measured reason to use anything smaller (it is the safe default)
Language Defaults
Different languages make different default choices:
- C and C++: float is 32-bit, double is 64-bit. Neither is default; you choose.
- Java: float is 32-bit, double is 64-bit. double is preferred for most uses.
- Python: float is 64-bit (double precision). There's no built-in single precision.
- JavaScript: Number is 64-bit. All JS numbers are double precision.
- Go and NumPy: float32 and float64 explicitly state their size.

The trend in modern languages is toward double precision as the default, prioritizing correctness over micro-optimization. Use single precision when you have measured evidence that the performance benefit matters.
Single precision's ~7 decimal digits of precision can bite unexpectedly. Subtracting two similar large numbers (catastrophic cancellation) can leave only 1-2 significant digits. For serious numerical work, double precision's ~16 digits provide much more headroom for error accumulation.
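Here is single precision's limit in miniature, using NumPy (assumed available): near 100,000,000 the spacing between adjacent float32 values is 8, so adding 1 is simply lost, while float64's spacing there is about 1.5 × 10⁻⁸ and the 1 survives.

```python
import numpy as np

big = np.float32(1e8)      # adjacent float32 values near 1e8 are 8 apart
print((big + np.float32(1.0)) - big)    # 0.0 -- the +1 falls below float32 resolution

big64 = np.float64(1e8)    # adjacent float64 values near 1e8 are ~1.5e-8 apart
print((big64 + 1.0) - big64)            # 1.0 -- double precision keeps it
```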
We've taken a conceptual tour through IEEE 754—the standard that unified floating-point computing across the entire industry. Let's consolidate the essential insights:
- Every float has three parts: a sign bit, a biased exponent, and a significand with an implicit leading 1.
- The formats differ only in field widths: single precision gives ~7 decimal digits, double precision ~16.
- Special values (±∞, ±0, NaN) let computations continue predictably through overflow and undefined operations.
- Denormalized numbers provide gradual underflow instead of an abrupt gap above zero.
- Basic operations are correctly rounded, with "round to nearest, ties to even" as the default.
What's Next
With the IEEE 754 foundation in place, we're ready to explore what happens when this beautifully designed system meets mathematical reality. The next page dives into precision errors and rounding issues—why operations that seem exact can introduce tiny errors, how these errors accumulate, and what you can do about them.
This knowledge transforms floating-point from a mysterious source of bugs into a predictable, manageable tool.
You now have a solid conceptual understanding of IEEE 754—the standard governing floating-point arithmetic worldwide. You understand the three components, the clever optimizations, the special values, and the precision choices. Next, we'll explore the precision errors that arise when this finite representation meets infinite mathematical reality.