Imagine you're designing a system to represent non-integer numbers. You have a fixed budget of 32 bits—no more, no less. How do you allocate those bits to represent values like 3.14159, 0.0001, or 1,234,567.89?
This seemingly simple question has two fundamentally different answers, each with profound implications for what you can represent, how precisely you can represent it, and which applications your system serves well.
Fixed-point representation says: "Let's agree that the decimal point always sits in the same position. Some bits represent the integer part, other bits represent the fractional part, and this never changes."
Floating-point representation says: "Let's allow the decimal point to move. We'll use some bits to describe where the point sits, and the remaining bits to describe the significant digits."
This page explores these two philosophies in depth, developing the intuition that will serve you throughout your career as an engineer.
By the end of this page, you will understand the core difference between fixed-point and floating-point representation, be able to identify when each approach is appropriate, develop visual intuition for how the 'floating' radix point works, and recognize the engineering tradeoffs that led to floating-point dominance in general-purpose computing.
Fixed-point representation is the simpler and more intuitive approach. The core idea is straightforward: decide in advance where the radix point (decimal point) sits, and keep it there forever.
A Concrete Example: Q16.16 Format
Consider a 32-bit fixed-point format called "Q16.16":
With 16 bits for the integer part, we can represent integers from 0 to 65,535 (unsigned) or −32,768 to 32,767 (signed).
With 16 bits for the fractional part, our smallest representable fraction is 1/65,536 ≈ 0.0000153—about 4.5 decimal places of precision.
How Values Are Encoded
In Q16.16, to store the value 3.5, we multiply by the scale factor 2¹⁶ = 65,536 and keep the result as an integer: 3.5 × 65,536 = 229,376. The upper 16 bits hold the integer part (3) and the lower 16 bits hold the fraction (0.5 × 65,536 = 32,768).
Notice the key insight: fixed-point numbers are stored as integers, with an implicit scale factor. To convert a fixed-point value back to a real number, you divide by the scale factor (65,536 for Q16.16).
| Real Value | Integer Part | Fractional Part (in units of 1/65,536) | Stored Integer |
|---|---|---|---|
| 0.0 | 0 | 0 | 0 |
| 1.0 | 1 | 0 | 65,536 |
| 3.5 | 3 | 32,768 | 229,376 |
| −1.25 | −2 | 49,152 | −81,920 |
| 255.99998 | 255 | 65,535 | 16,777,215 |
| 0.00001526 | 0 | 1 | 1 |
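As a sanity check, here is a minimal Python sketch (an illustration for this page, not a library API) that encodes and decodes Q16.16 values using round-to-nearest, reproducing the stored integers in the table above.

```python
FRAC_BITS = 16
SCALE = 1 << FRAC_BITS  # 65,536 for Q16.16

def to_q16_16(x: float) -> int:
    """Encode a real number as a Q16.16 stored integer (round to nearest)."""
    return round(x * SCALE)

def from_q16_16(n: int) -> float:
    """Decode a Q16.16 stored integer back to a real number."""
    return n / SCALE

for value in (0.0, 1.0, 3.5, -1.25, 255.99998, 0.00001526):
    stored = to_q16_16(value)
    print(f"{value:>12} -> stored {stored:>10} -> decoded {from_q16_16(stored)}")
```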
Arithmetic in Fixed-Point
One of fixed-point's greatest strengths is that arithmetic uses ordinary integer operations:
Addition/Subtraction: Directly add or subtract the stored integers. If A = 3.5 (stored as 229,376) and B = 1.25 (stored as 81,920), then A + B = 229,376 + 81,920 = 311,296, which represents 4.75.
Multiplication: Multiply the integers, then shift right by the number of fractional bits (the raw product carries the scale factor twice, so one copy must be divided back out). If A × B = 229,376 × 81,920 = 18,790,481,920, then shifting right by 16 bits gives 18,790,481,920 >> 16 = 286,720, which represents 4.375 (= 3.5 × 1.25).
This integer-based arithmetic is why fixed-point thrived in early computing and remains popular in resource-constrained environments.
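A minimal sketch of these two operations in Python, assuming the same Q16.16 convention used above (a real implementation in C would do the same thing with 32- and 64-bit integers):

```python
FRAC_BITS = 16
SCALE = 1 << FRAC_BITS

def q_add(a: int, b: int) -> int:
    """Q16.16 addition is ordinary integer addition."""
    return a + b

def q_mul(a: int, b: int) -> int:
    """Q16.16 multiplication: the raw product is scaled by SCALE twice,
    so shift one factor of SCALE back out."""
    return (a * b) >> FRAC_BITS

A = 229_376  # 3.5  in Q16.16
B = 81_920   # 1.25 in Q16.16
print(q_add(A, B), q_add(A, B) / SCALE)  # 311296  4.75
print(q_mul(A, B), q_mul(A, B) / SCALE)  # 286720  4.375
```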
Fixed-point is often described as 'integers with an implicit decimal point.' This is accurate but incomplete. The more useful mental model is: 'integers with a hidden multiplication by 1/scale.' Every fixed-point value is conceptually multiplied by 2^(−fractional_bits) to yield its real value.
Fixed-point representation works beautifully when your values occupy a narrow, predictable range. But the real world often presents values spanning many orders of magnitude—and here, fixed-point struggles.
The Dynamic Range Problem
Consider our Q16.16 format again: the entire signed range is roughly −32,768 to +32,767, and the finest representable step is about 0.0000153.
This might seem adequate until you consider actual scientific and engineering requirements: astronomical distances run to around 10²⁶ meters, atomic radii sit near 10⁻¹⁰ meters, and neither extreme comes anywhere close to fitting in a ±32,768 range with fixed 0.0000153 steps.
The Format Proliferation Problem
Faced with varied requirements, fixed-point users must choose between:
Use different formats for different applications — Q8.24 for audio (more fractional precision), Q24.8 for graphics (more integer range), Q1.31 for normalized values, etc.
Accept precision loss — Use a compromise format that's suboptimal for all use cases.
Use wider representations — 64-bit or 128-bit fixed-point, sacrificing memory and performance.
None of these is satisfying. Different formats mean conversion overhead and complexity. Compromise means mediocrity. Wider representations mean cost.
The Precision Distribution Problem
Fixed-point allocates precision uniformly across its range. Near 65,535, the gap between adjacent values is 0.0000153—the smallest fractional step. Near 0.001, that same gap still applies.
But consider: do we really need 0.0000153 precision when representing 50,000? That's a relative precision of about three parts in ten billion (roughly 0.00000003%)—overkill for most applications.
Conversely, when representing 0.0001, a step of 0.0000153 is 15% of the value—far too coarse for many purposes.
This mismatch reveals fixed-point's fundamental limitation: it distributes precision uniformly in absolute terms, but real-world requirements are often relative to magnitude.
In fixed-point, once you choose the format, your range-precision tradeoff is fixed forever. Need more fractional precision? You lose integer range. Need larger integers? You sacrifice fractional bits. Floating-point breaks this constraint by making the tradeoff dynamic.
Floating-point representation solves fixed-point's limitations through a revolutionary idea: let the radix point move.
Instead of committing bits permanently to integer vs. fractional parts, floating-point stores three things: a sign, a set of significant digits (the significand), and an exponent that records where the radix point sits relative to those digits.
The exponent effectively says, "Take these significant digits and shift them left or right by this many positions." This "floating" of the radix point gives the representation its name.
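You can see this decomposition on real doubles with Python's math.frexp, which splits a float into a significand and a power-of-two exponent. (Its significand is normalized to [0.5, 1) rather than IEEE 754's [1, 2), a minor difference in convention; the "shift by the exponent" idea is the same.)

```python
import math

for x in (0.0001234, 1.234, 1234.0, 1.234e20):
    sig, exp = math.frexp(x)  # x == sig * 2**exp, with 0.5 <= sig < 1
    print(f"{x:>12g} = {sig:.10f} * 2**{exp}")
```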
How Floating-Point Adapts to Magnitude
Consider representing these values in a hypothetical 32-bit format:
| Value | Significand (approx) | Exponent | Precision Available |
|---|---|---|---|
| 0.000001234 | 1.234 | −6 (×10⁻⁶) | 7 significant digits |
| 1.234 | 1.234 | 0 (×10⁰) | 7 significant digits |
| 1234.0 | 1.234 | +3 (×10³) | 7 significant digits |
| 1234000.0 | 1.234 | +6 (×10⁶) | 7 significant digits |
| 1.234 × 10²⁰ | 1.234 | +20 (×10²⁰) | 7 significant digits |
The Key Insight: Constant Relative Precision
Notice that regardless of magnitude, we maintain ~7 significant digits of precision. The precision is relative to the value's magnitude, not absolute.
For 0.000001234: seven significant digits means adjacent representable values differ by roughly 10⁻¹², an absolute precision far finer than a general-purpose fixed-point format could afford.
For 1,234,000: the same seven significant digits means adjacent representable values differ by roughly 1, which is coarse in absolute terms but just as fine relative to the value.
This relative precision model matches how physical measurements and scientific quantities actually work. When measuring distances, ±1 meter error is acceptable for kilometer scales but catastrophic for millimeter scales. Relative precision naturally captures this.
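One way to observe constant relative precision on real hardware is to round values of different magnitudes to 32-bit floats and inspect the error: the absolute error grows with magnitude, while the relative error stays near 10⁻⁷. A rough sketch using only the standard library:

```python
import struct

def to_float32(x: float) -> float:
    """Round a Python float (64-bit) to the nearest 32-bit float and back."""
    return struct.unpack('f', struct.pack('f', x))[0]

for x in (0.0000012345678, 1.2345678, 1234.5678, 1234567.8):
    y = to_float32(x)
    abs_err = abs(y - x)
    print(f"{x:>16g}: absolute error {abs_err:.3g}, relative error {abs_err / x:.3g}")
```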
Visualizing the Floating Radix Point
Imagine a sliding window that reveals 24 binary digits (the significand in IEEE 754 single precision). The exponent controls where this window sits on the number line: a large positive exponent slides the window out toward huge magnitudes, while a large negative exponent slides it down toward tiny fractions.
This sliding window metaphor explains both the power and the limitations of floating-point. You always get the same precision in significant digits, but absolute precision varies with magnitude.
Floating-point can be understood as representing numbers on a logarithmic scale. The exponent captures the order of magnitude (like the logarithm), while the significand provides precision within that order of magnitude. This logarithmic nature is why floating-point spans such enormous ranges efficiently.
The difference between fixed-point and floating-point becomes strikingly clear when we visualize how they distribute representable values across the number line.
Fixed-Point: Uniform Distribution
In fixed-point representation, representable values are uniformly spaced along the number line. Using our Q16.16 example:
|          |          |          |          |          ...          |             |             |
0    0.0000153   0.0000305   0.0000458   0.0000610   ...    65535.99995   65535.99997   65535.99998
Every adjacent pair of values is separated by exactly 1/65,536 ≈ 0.0000153. This uniform spacing means absolute precision is identical everywhere in the range: generous near the top, but far too coarse, relative to the value, near zero.
Floating-Point: Logarithmic Distribution
In floating-point representation, representable values become denser near zero and sparser as magnitude increases:
||||||||||||| | | | | | | |
0 0.001 0.01 0.1 1 10 100 1000
(dense) (moderate) (sparse)
In IEEE 754 single precision, there are exactly 2²³ ≈ 8.4 million representable values between 1 and 2. Between 1,000,000 and 2,000,000 there are again exactly 2²³ representable values—but now spread over a million units instead of one.
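You can check this for 32-bit floats by reinterpreting their bit patterns as integers: adjacent representable floats have adjacent bit patterns, so subtracting the patterns counts the values between two endpoints. A sketch, assuming IEEE 754 single precision:

```python
import struct

def float32_bits(x: float) -> int:
    """Bit pattern of x rounded to a 32-bit float, viewed as an unsigned integer."""
    return struct.unpack('<I', struct.pack('<f', x))[0]

def count_between(lo: float, hi: float) -> int:
    """Number of representable float32 steps from lo up to hi (both positive)."""
    return float32_bits(hi) - float32_bits(lo)

print(count_between(1.0, 2.0))                  # 8388608 (2**23)
print(count_between(1_000_000.0, 2_000_000.0))  # 8388608 again
```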
The Density Around Zero
A fascinating consequence of floating-point's design: approximately half of all representable floating-point numbers lie between −1 and +1. The other half covers the rest of the range—from ±1 to ±10³⁰⁸.
This makes intuitive sense: values near zero need the finest gradations (to distinguish 0.0001 from 0.0002 is as important as distinguishing 1,000 from 2,000 in relative terms). By concentrating representable values near zero, floating-point provides this fine gradation where it matters most.
For IEEE 754 double precision, the gap between adjacent representable values near any number x is approximately x × 2⁻⁵² ≈ x × 2.2 × 10⁻¹⁶. This is the 'machine epsilon' relative to x. Near 1.0, the gap is about 2.2 × 10⁻¹⁶. Near 10¹⁵, the gap is 0.125—still on the order of 10⁻¹⁶ relative to the value.
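Python 3.9+ exposes this gap directly as math.ulp, so the figures above are easy to verify:

```python
import math

for x in (1.0, 1000.0, 1e15):
    gap = math.ulp(x)  # distance to the next representable double above x
    print(f"gap near {x:g}: {gap:.3g}  (relative: {gap / x:.3g})")
# gap near 1 is about 2.2e-16; near 1e15 it is 0.125, still ~1e-16 relative
```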
Despite floating-point's dominance in general-purpose computing, fixed-point remains the right choice for several important domains. Understanding when to use each representation is a mark of engineering maturity.
Financial Computing
Money is universally counted in fixed units. A dollar has exactly 100 cents. An Indian rupee has exactly 100 paise. Financial regulations often mandate exact, reproducible arithmetic—no rounding surprises allowed.
Fixed-point (often implemented as integers representing the smallest unit) is standard in financial systems: balances held as integer cents or paise, prices quoted in integer ticks, and interest, tax, and currency calculations performed with explicitly specified rounding rules.
Floating-point's rounding errors are unacceptable when auditors expect accounts to balance to the penny.
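The classic illustration: adding ten 10-cent amounts in binary floating-point does not land exactly on one dollar, while integer cents (fixed-point with a scale factor of 100) are exact. A small sketch:

```python
# Binary floating-point: 0.10 has no exact binary representation.
total_float = sum([0.10] * 10)
print(total_float)          # 0.9999999999999999
print(total_float == 1.0)   # False

# Fixed-point: store amounts as integer cents (scale factor 100).
total_cents = sum([10] * 10)
print(total_cents == 100)   # True: exactly one dollar
```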
Embedded Systems and DSP
Many microcontrollers and digital signal processors lack floating-point hardware. Operations that take 1 cycle with floating-point hardware might take 100+ cycles in software emulation.
Fixed-point enables: fast arithmetic on integer-only hardware, deterministic execution timing for real-time loops, lower power consumption, and bit-exact results that reproduce across platforms.
The Key Question: Is Your Range Predictable?
The fundamental criterion for choosing fixed-point is whether your values occupy a predictable, bounded range with uniform precision requirements.
If you can predict your range and precision needs at design time, fixed-point offers simplicity and exactness. If requirements are variable or unpredictable, floating-point's flexibility is essential.
Many languages provide decimal types (Python's decimal.Decimal, Java's BigDecimal, C#'s decimal) that offer exact decimal arithmetic for financial applications. Under the hood these store a scaled decimal integer (a coefficient plus a power-of-ten scale), combining fixed-point-style exactness for decimal fractions with flexible, configurable precision.
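For instance, with Python's decimal module the ten-cent sum from earlier is exact, and the working precision is configurable:

```python
from decimal import Decimal, getcontext

getcontext().prec = 28              # significant digits for this context (28 is the default)
total = sum([Decimal("0.10")] * 10)
print(total)                        # 1.00
print(total == Decimal("1.00"))     # True: balances to the penny
```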
Floating-point is the default choice for general-purpose computing, and for good reason. Its flexibility makes it suitable for an enormous range of applications that would be impractical with fixed-point.
Scientific and Engineering Computing
Science operates across scales that fixed-point cannot practically represent: the Planck length is about 1.6 × 10⁻³⁵ meters, the observable universe spans roughly 10²⁶ meters, and a single computation may need to combine quantities from wildly different points on that scale.
Floating-point handles all these scales with consistent relative precision, using the same 64-bit representation.
Machine Learning and Statistics
ML computation involves: gradients that can shrink toward 10⁻⁸ or vanish entirely, weights and activations spanning several orders of magnitude, probabilities crowded near 0 and 1, and loss values that start large and decay toward zero during training.
No single fixed-point format could reasonably handle this range.
Hardware Support Changes the Calculus
Modern CPUs have pipelined, vectorized floating-point units that can complete one or more floating-point operations per clock cycle. GPUs perform trillions of floating-point operations per second. This hardware support makes floating-point effectively "free" from a performance standpoint for most applications.
The performance advantage that once favored fixed-point has largely evaporated on general-purpose hardware. Only in resource-constrained embedded systems or specialized hardware (FPGAs, custom ASICs) does fixed-point offer significant performance benefits.
The Default Choice
When in doubt, use floating-point. It's supported by dedicated hardware on every mainstream CPU and GPU, standardized by IEEE 754 across languages and platforms, flexible enough to cover wildly different magnitudes, and precise enough (15–16 significant decimal digits in double precision) for the vast majority of applications.
Only reach for fixed-point when you have specific requirements (exact decimal arithmetic, no floating-point hardware, deterministic cross-platform behavior) that floating-point cannot satisfy.
The overwhelming majority of non-integer numerical computing uses floating-point. The remaining uses are dominated by financial systems, embedded real-time applications, and specialized domains. For a typical software engineer, floating-point is the representation you'll encounter daily.
As computing has evolved, the binary choice between fixed-point and floating-point has been supplemented by specialized formats that combine features of both or optimize for specific workloads.
Posit Numbers
A recent alternative called "posit" (or "Type III unum") aims to provide better precision-per-bit than IEEE 754 floating-point. Posits use a variable-length exponent encoding that provides: tapered precision (more accuracy for values near 1, gradually less toward the extremes), a wide dynamic range for a given bit width, and a single exception value (NaR) in place of IEEE 754's many NaN bit patterns.
Posits remain experimental but show promise for specialized applications.
Block Floating-Point
Used in some DSP and machine learning applications, block floating-point shares a single exponent across a group of values, with each value storing only a mantissa: the block gets floating-point-like range at nearly fixed-point storage cost, but precision within the block is dictated by its largest element (see the toy sketch below).
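A toy sketch of the idea in Python (not any particular hardware format): choose one power-of-two exponent for the whole block based on its largest element, then store each element as a small integer mantissa.

```python
import math

def block_quantize(values, mantissa_bits=8):
    """Share one power-of-two exponent across the block; store each value as a small integer."""
    max_mag = max(abs(v) for v in values)
    # Pick the exponent so the largest magnitude fits within the mantissa range.
    exponent = math.ceil(math.log2(max_mag)) - (mantissa_bits - 1)
    scale = 2.0 ** exponent
    return exponent, [round(v / scale) for v in values]

def block_dequantize(exponent, mantissas):
    return [m * (2.0 ** exponent) for m in mantissas]

exp, mants = block_quantize([0.5, 3.25, -1.75, 100.0])
print(exp, mants)                   # 0 [0, 3, -2, 100]
print(block_dequantize(exp, mants)) # small values lose precision; the large one survives
```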
Logarithmic Number Systems
These systems store the logarithm of the value directly: multiplication and division reduce to addition and subtraction of the stored logarithms, while ordinary addition and subtraction become the expensive operations (illustrated in the toy example below).
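A toy illustration of that trade in Python: with values stored as base-2 logarithms, multiplication is a single addition of logs, but addition forces a round trip out of the log domain.

```python
import math

# Store (positive) values as their base-2 logarithms; sign handling is omitted in this toy.
a, b = 12.5, 3.0
la, lb = math.log2(a), math.log2(b)

print(2.0 ** (la + lb))                     # multiplication: just add the logs -> ~37.5

# Addition has no shortcut: convert back, add, then take the log again.
log_sum = math.log2(2.0 ** la + 2.0 ** lb)
print(2.0 ** log_sum)                       # ~15.5
```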
| Representation | Strengths | Weaknesses | Best For |
|---|---|---|---|
| Fixed-Point | Exact, predictable, integer-like | Limited range, inflexible | Finance, embedded, DSP |
| Floating-Point | Huge range, relative precision | Rounding errors, complex | Science, graphics, ML |
| Posit | Better precision/bit | Not standardized, new | Research, specialized HW |
| Block Float | Compact, shared exponent | Group-level precision only | ML inference, compression |
| Logarithmic | Fast multiplication | Slow addition | Multiplication-heavy DSP |
Machine Learning Specialized Formats
The AI revolution has driven development of formats optimized for neural networks: half precision (FP16), bfloat16 (which keeps float32's 8-bit exponent range but only 7 bits of mantissa), emerging 8-bit floating-point variants, and integer quantization schemes such as INT8.
These formats sacrifice precision for throughput, recognizing that neural networks are typically robust to quantization noise. The reduced precision enables more operations per second on specialized hardware.
The proliferation of numerical formats reflects a mature understanding that different applications have different needs. A graphics shader, a financial ledger, and a climate model all process numbers—but their requirements for range, precision, speed, and exactness differ dramatically. Skilled engineers choose (or design) representations matched to their specific requirements.
We've developed a deep intuition for the fundamental difference between fixed-point and floating-point representation—a distinction that underlies all numerical computing.
What's Next
Now that we understand why floating-point exists and how it differs from fixed-point alternatives, we're ready to examine the actual standard that governs its implementation: IEEE 754.
The next page provides a high-level overview of IEEE 754—how bits are organized, what the special values mean, and how rounding works—without diving into bit-level arithmetic. This conceptual understanding will prepare you to reason about floating-point behavior in your own code.
You now have deep intuition for the fixed-point versus floating-point tradeoff. Fixed-point offers simplicity and exactness within bounded ranges; floating-point offers flexibility and relative precision across enormous ranges. Understanding when to use each is a fundamental engineering skill. Next, we'll explore IEEE 754—the standard that made floating-point universal.