We've explored binary representation and number systems in the abstract. Now it's time to see how this knowledge manifests in the physical reality of computer memory.
Every variable you declare — every integer, character, boolean, or floating-point number — exists as a specific pattern of bits at a specific location in memory. Understanding this isn't just theoretical knowledge; it explains why certain bugs occur, why some operations are fast and others slow, and how data structures actually work at the lowest level.
This page completes our journey through binary representation by showing you exactly how the primitive data types you use every day are encoded as bit patterns and stored in physical memory. With this knowledge, the "magic" of programming becomes explainable mechanism.
By the end of this page, you will understand how memory is organized, how different data types are encoded as bits, and why certain values and behaviors emerge from these representations. This is the foundation on which everything else in data structures and algorithms rests at the lowest level.
At the most fundamental level, computer memory is a gigantic linear array of bytes. Each byte has a unique numeric address, starting from 0 and going up to the maximum addressable memory.
The Memory Model:
Address Contents (8 bits each)
0x0000 [01001101]
0x0001 [11100010]
0x0002 [00000000]
0x0003 [11111111]
... ...
0xFFFFFFFF [10101010] (4 GB boundary in 32-bit systems)
Every piece of data in your program — every variable, every object, every function, every string — occupies some portion of this linear byte array. When you access data, you're really asking the hardware: "Retrieve the bytes starting at address X."
Key Properties of Memory:
- Byte-addressable: the byte is the smallest unit with its own address; individual bits cannot be addressed directly.
- Addresses are integers: they can be stored, compared, and computed with ordinary arithmetic.
- Random access: reading any address takes roughly the same time as any other (caching effects aside).
- Bits have no inherent meaning: the same byte can be an integer, a character, or part of a pointer; interpretation is entirely up to the program.
Modern systems use virtual memory, where each program sees its own private address space starting at 0. The OS and hardware translate these virtual addresses to physical locations. This abstraction isolates programs and enables memory to exceed physical RAM (using swap/paging). But the fundamental model remains: a linear array of bytes.
Integers are the simplest numeric data type, but their storage reveals important principles about binary encoding.
Unsigned Integers:
An unsigned integer simply stores the binary representation of a non-negative number. An n-bit unsigned integer can represent values from 0 to 2^n − 1.
Example: 8-bit unsigned integer storing 200
Decimal: 200
Binary: 11001000
Byte in memory: [11001000] = 0xC8
Signed Integers: Two's Complement
Signed integers need to represent negative numbers. The universal modern approach is two's complement:
Example: Storing -5 as an 8-bit signed integer
Start with 5: 00000101
Invert all bits: 11111010
Add 1: 11111011
-5 stored as: [11111011] = 0xFB
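A minimal C sketch that verifies this: reinterpreting the bits of −5 as an unsigned byte exposes the two's-complement pattern (0xFB = 251 = 256 − 5).

```c
#include <stdio.h>
#include <stdint.h>

int main(void) {
    int8_t x = -5;
    // Reinterpret the same 8 bits as unsigned to see the raw pattern.
    uint8_t bits = (uint8_t)x;
    printf("-5 as int8_t stores 0x%02X\n", (unsigned)bits); // prints 0xFB
    printf("same bits as unsigned: %u\n", bits);            // prints 251 (= 256 - 5)
    return 0;
}
```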
| Type | Bits | Unsigned Range | Signed Range |
|---|---|---|---|
| byte | 8 | 0 to 255 | -128 to 127 |
| short | 16 | 0 to 65,535 | -32,768 to 32,767 |
| int | 32 | 0 to 4,294,967,295 | -2,147,483,648 to 2,147,483,647 |
| long | 64 | 0 to 18.4 quintillion | -9.2 to +9.2 quintillion |
Why Two's Complement?
Two's complement has elegant properties:
- A single representation of zero (sign-magnitude encoding wastes a pattern on −0).
- The same addition and subtraction circuits work for signed and unsigned values.
- The most significant bit doubles as the sign bit: 1 means negative.
- Negation is mechanical: invert every bit, then add 1.
This is why two's complement became universal — it simplifies hardware design dramatically.
When arithmetic exceeds the representable range, the value 'wraps around.' For a signed 8-bit integer: 127 + 1 = -128 (not 128!). This is because 01111111 + 1 = 10000000, which is the bit pattern for -128. This silent wraparound causes subtle bugs in countless programs.
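A short C sketch of this wraparound. Note that overflowing a plain int is undefined behavior in C, so the example uses explicit 8-bit types and conversions; converting an out-of-range value to a signed narrow type is implementation-defined before C23, but wraps in two's complement on essentially all modern compilers.

```c
#include <stdio.h>
#include <stdint.h>

int main(void) {
    int8_t a = 127;              // INT8_MAX: 01111111
    // The sum is computed as int (128), then converted back to int8_t.
    // On two's-complement machines that conversion yields 10000000 = -128.
    int8_t b = (int8_t)(a + 1);
    printf("127 + 1 as int8_t  = %d\n", b);   // prints -128

    uint8_t u = 255;             // UINT8_MAX: 11111111
    u = (uint8_t)(u + 1);        // unsigned wraparound is fully defined
    printf("255 + 1 as uint8_t = %u\n", u);   // prints 0
    return 0;
}
```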
Characters represent text, but at the bit level, they're just integers mapped to symbols through an encoding scheme.
ASCII: The Original Encoding
ASCII (American Standard Code for Information Interchange) uses 7 bits to represent 128 characters: letters, digits, punctuation, and control characters. The 8th bit (in a byte) was historically used for parity checking.
Selected ASCII Values:
'A' = 65 = 0x41 = 01000001
'Z' = 90 = 0x5A = 01011010
'a' = 97 = 0x61 = 01100001
'z' = 122 = 0x7A = 01111010
'0' = 48 = 0x30 = 00110000
'9' = 57 = 0x39 = 00111001
Space = 32 = 0x20 = 00100000
Newline = 10 = 0x0A = 00001010
Key ASCII Patterns:
- Uppercase and lowercase letters differ by a single bit, bit 5 (value 0x20): 'A' is 01000001, 'a' is 01100001.
- Digits '0' through '9' are consecutive, so c − '0' converts a digit character to its numeric value.
- Codes 0 through 31 are non-printing control characters (newline, tab, carriage return, etc.).
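These patterns enable simple character tricks. A small C sketch, using only facts from the table above:

```c
#include <stdio.h>

int main(void) {
    // Upper- and lowercase letters differ only in bit 5 (0x20).
    char upper = 'A';                   // 01000001
    char lower = (char)(upper | 0x20);  // set bit 5   -> 01100001 = 'a'
    char back  = (char)(lower & ~0x20); // clear bit 5 -> 'A'
    printf("%c %c %c\n", upper, lower, back);

    // A digit character's numeric value is its code minus '0' (0x30).
    char d = '7';
    printf("'7' - '0' = %d\n", d - '0'); // prints 7
    return 0;
}
```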
Unicode: The Global Standard
ASCII only covers English. Unicode assigns a unique code point to every character in every language — plus symbols, emoji, and historical scripts. Over 140,000 characters are defined.
Unicode code points are written as U+XXXX (hex). Examples: U+0041 is 'A' (matching its ASCII value of 65), U+00E9 is 'é', U+4E2D is '中', and U+1F600 is the 😀 emoji.
Unicode Encodings:
Unicode defines code points, but how those are stored in memory varies:
- UTF-32: every code point in a fixed 4 bytes (simple, but wasteful for typical text).
- UTF-16: 2 bytes for common characters, 4 bytes (a "surrogate pair") for the rest.
- UTF-8: 1 to 4 bytes per character, variable-length, and the dominant encoding on the web.
UTF-8 Encoding Rules:
U+0000 to U+007F: 1 byte (0xxxxxxx) — ASCII compatible
U+0080 to U+07FF: 2 bytes (110xxxxx 10xxxxxx)
U+0800 to U+FFFF: 3 bytes (1110xxxx 10xxxxxx 10xxxxxx)
U+10000 to U+10FFFF: 4 bytes (11110xxx 10xxxxxx 10xxxxxx 10xxxxxx)
UTF-8's genius is backward compatibility: all valid ASCII is valid UTF-8. English text uses 1 byte per character (same as ASCII), so existing ASCII systems worked unchanged. For other languages, 2-4 bytes are used. This balance of efficiency and compatibility made UTF-8 the dominant encoding online.
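To make the encoding table concrete, here is a sketch of a UTF-8 encoder in C. The function name `utf8_encode` is ours, chosen for illustration, and a production encoder would also reject surrogate code points (U+D800 to U+DFFF); otherwise it follows the rules above exactly.

```c
#include <stdio.h>
#include <stdint.h>

// Encode one Unicode code point into UTF-8. Returns the byte count (0 if out of range).
static int utf8_encode(uint32_t cp, uint8_t out[4]) {
    if (cp <= 0x7F) {                    // 1 byte: 0xxxxxxx (ASCII)
        out[0] = (uint8_t)cp;
        return 1;
    } else if (cp <= 0x7FF) {            // 2 bytes: 110xxxxx 10xxxxxx
        out[0] = (uint8_t)(0xC0 | (cp >> 6));
        out[1] = (uint8_t)(0x80 | (cp & 0x3F));
        return 2;
    } else if (cp <= 0xFFFF) {           // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out[0] = (uint8_t)(0xE0 | (cp >> 12));
        out[1] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (uint8_t)(0x80 | (cp & 0x3F));
        return 3;
    } else if (cp <= 0x10FFFF) {         // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out[0] = (uint8_t)(0xF0 | (cp >> 18));
        out[1] = (uint8_t)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (uint8_t)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

int main(void) {
    uint8_t buf[4];
    int n = utf8_encode(0x20AC, buf);          // U+20AC, the euro sign
    for (int i = 0; i < n; i++) printf("%02X ", buf[i]); // prints: E2 82 AC
    printf("\n");
    return 0;
}
```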
A boolean represents TRUE or FALSE — logically, just 1 bit of information. So how much memory does a boolean actually use?
The Surprising Answer: Usually 1 Byte (or More)
Most programming languages store a boolean in a full byte (8 bits), not a single bit. Some use 4 bytes (32 bits). Why the waste?
Reasons for Byte-Sized Booleans:
- The byte is the smallest addressable unit; a lone bit has no address of its own.
- Writing a single bit would require a read-modify-write of its containing byte: extra instructions on every update.
- A byte-sized boolean can be updated without disturbing its neighbors, which also avoids data races between threads sharing a packed byte.
- Some platforms favor word-sized (4-byte) values for alignment and speed, which is why certain languages use 32 bits.
Boolean Representation:
Typically, 0 represents false and 1 (or any nonzero value) represents true; the other 7 bits of the byte simply go unused.
When Bit Packing Matters:
In specific scenarios, we do pack booleans into bits (see the sketch after the note below):
- Bit vectors and bitsets representing membership in large sets.
- Flag fields in file formats and network protocols, where every bit is specified.
- Language facilities like C bit-fields.
For these use cases, we sacrifice access simplicity for space efficiency — a classic tradeoff.
For a few booleans, wasting 7 bits per boolean is irrelevant. For millions of booleans (like a bit vector for a massive set), the difference between 1 bit and 1 byte is 8x memory usage — suddenly very significant. Choose representation based on your scale.
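As a concrete illustration, here is a minimal bit-vector sketch in C (names like `set_bit` and `NBITS` are ours): it packs 8 booleans per byte using divide-by-8 to find the byte and modulo-8 to find the bit.

```c
#include <stdio.h>
#include <stdint.h>
#include <string.h>

#define NBITS 1000

static uint8_t bits[(NBITS + 7) / 8];   // 125 bytes instead of 1000

static void set_bit(size_t i)   { bits[i / 8] |=  (uint8_t)(1u << (i % 8)); }
static void clear_bit(size_t i) { bits[i / 8] &= (uint8_t)~(1u << (i % 8)); }
static int  get_bit(size_t i)   { return (bits[i / 8] >> (i % 8)) & 1; }

int main(void) {
    memset(bits, 0, sizeof bits);
    set_bit(42);
    printf("bit 42 = %d, bit 43 = %d\n", get_bit(42), get_bit(43)); // 1, 0
    clear_bit(42);
    printf("bit 42 = %d\n", get_bit(42));                            // 0
    return 0;
}
```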
Floating-point numbers (like 3.14159 or -0.001) pose a challenge: how do you represent both very large and very small values with fractional parts?
The IEEE 754 Standard:
Virtually all modern systems use IEEE 754 floating-point representation. A floating-point number is stored as three components:
- Sign: 1 bit (0 for positive, 1 for negative).
- Exponent: stored with a bias (127 for floats, 1023 for doubles) so negative exponents need no sign bit of their own.
- Mantissa (fraction): the significant digits, with an implicit leading 1 that is not stored.
The value is: (-1)^sign × (1 + mantissa) × 2^(exponent - bias)
| Format | Total Bits | Sign | Exponent | Mantissa | Approx. Precision |
|---|---|---|---|---|---|
| Single (float) | 32 | 1 | 8 | 23 | ~7 decimal digits |
| Double | 64 | 1 | 11 | 52 | ~15 decimal digits |
| Half | 16 | 1 | 5 | 10 | ~3 decimal digits |
| Quad | 128 | 1 | 15 | 112 | ~34 decimal digits |
Example: Storing 3.14159 as a float
The exact bit pattern (in hex): 0x40490FD0
Sign: 0 (positive)
Exponent: 10000000 (128, meaning 2^(128-127) = 2^1)
Mantissa: 10010010000111111010000 (= 0x490FD0)
Value: +1 × 1.57079... × 2 ≈ 3.14159
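You can inspect these fields yourself. A small C sketch, using `memcpy` to view the float's bits (the portable way to reinterpret them):

```c
#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void) {
    float f = 3.14159f;
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);  // copy the 4 bytes, reinterpret as an integer

    printf("hex pattern: 0x%08X\n", bits);           // 0x40490FD0
    printf("sign:        %u\n", bits >> 31);         // 0 (positive)
    printf("exponent:    %u\n", (bits >> 23) & 0xFF);// 128, meaning 2^(128-127)
    printf("mantissa:    0x%06X\n", bits & 0x7FFFFF);// 0x490FD0
    return 0;
}
```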
Special Values:
IEEE 754 reserves certain bit patterns for special values:
- ±0: exponent and mantissa all zero (the sign bit distinguishes +0 from −0).
- ±Infinity: exponent all ones, mantissa zero; produced by overflow and division by zero.
- NaN ("Not a Number"): exponent all ones, mantissa nonzero; produced by invalid operations like 0/0.
- Subnormals: exponent all zeros, mantissa nonzero; tiny values near zero, with reduced precision.
Most decimal fractions CANNOT be exactly represented in binary. Just as 1/3 = 0.333... in decimal, 1/10 = 0.0001100110011... repeating in binary. This is why 0.1 + 0.2 ≠ 0.3 — it's not a bug, it's the fundamental limitation of binary representation.
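A two-line C demonstration of this; printing with 17 significant digits exposes the stored values:

```c
#include <stdio.h>

int main(void) {
    double a = 0.1, b = 0.2;
    printf("0.1 + 0.2 = %.17f\n", a + b);                        // 0.30000000000000004
    printf("equal to 0.3? %s\n", (a + b == 0.3) ? "yes" : "no"); // no
    return 0;
}
```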
When multiple values are stored in memory (arrays, structures, objects), their layout follows rules that optimize performance.
Memory Alignment:
Alignment means that a data item is stored at a memory address that's a multiple of its size: a 2-byte short at an even address, a 4-byte int at a multiple of 4, an 8-byte double at a multiple of 8.
Why Alignment Matters:
- CPUs fetch memory in aligned words; an aligned value arrives in a single access, while a misaligned one may straddle two.
- Some architectures fault outright on misaligned access (a "bus error"); others merely slow down.
- Atomic and SIMD instructions generally require properly aligned operands.
Struct Padding Example:
Consider this C structure:
struct Example {
char a; // 1 byte
int b; // 4 bytes
char c; // 1 byte
};
Naive expectation: 1 + 4 + 1 = 6 bytes
Actual size: 12 bytes (on most systems)
Why? The compiler inserts padding for alignment:
Offset 0: a (char, 1 byte)
Offset 1-3: <padding> (3 bytes, so b starts at multiple of 4)
Offset 4-7: b (int, 4 bytes)
Offset 8: c (char, 1 byte)
Offset 9-11: <padding> (3 bytes, so struct size is multiple of 4)
Optimizing Layout:
Reordering fields can reduce padding:
struct BetterExample {
int b; // 4 bytes (offset 0)
char a; // 1 byte (offset 4)
char c; // 1 byte (offset 5)
// 2 bytes padding (to make total a multiple of 4)
}; // Total: 8 bytes
Order struct members from largest to smallest. This minimizes internal padding and produces the smallest overall size. The difference matters when you have millions of these structures!
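You can verify sizes and offsets directly with `sizeof` and `offsetof`. A small C sketch; the exact numbers are compiler- and ABI-dependent, and the comments show typical 64-bit results:

```c
#include <stdio.h>
#include <stddef.h>   // offsetof

struct Example       { char a; int b; char c; };
struct BetterExample { int b; char a; char c; };

int main(void) {
    printf("Example:       size=%zu a@%zu b@%zu c@%zu\n",
           sizeof(struct Example),
           offsetof(struct Example, a),
           offsetof(struct Example, b),
           offsetof(struct Example, c));        // size=12 a@0 b@4 c@8
    printf("BetterExample: size=%zu b@%zu a@%zu c@%zu\n",
           sizeof(struct BetterExample),
           offsetof(struct BetterExample, b),
           offsetof(struct BetterExample, a),
           offsetof(struct BetterExample, c));  // size=8  b@0 a@4 c@5
    return 0;
}
```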
Arrays are among the simplest data structures: they store elements contiguously — one right after another with no gaps.
Memory Layout of an Array:
An array of n elements of size s bytes occupies n × s contiguous bytes.
Example: int array[4] = {10, 20, 30, 40}
Assuming 4-byte integers and starting at address 0x1000:
Address Contents (hex) Value
0x1000 0A 00 00 00 10 (little-endian)
0x1004 14 00 00 00 20
0x1008 1E 00 00 00 30
0x100C 28 00 00 00 40
Index-to-Address Calculation:
To access array[i]:
address = base_address + (index × element_size)
For array[2]: 0x1000 + (2 × 4) = 0x1008
This calculation is O(1) — constant time regardless of array size. This is why array access is so fast.
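A short C sketch makes the formula tangible: `&array[i]` is exactly the base address plus `i × sizeof(int)`.

```c
#include <stdio.h>

int main(void) {
    int array[4] = {10, 20, 30, 40};

    // The compiler emits this same multiply-and-add for every indexed access.
    for (int i = 0; i < 4; i++) {
        printf("array[%d] = %2d at %p (base + %zu)\n",
               i, array[i], (void *)&array[i],
               (size_t)((char *)&array[i] - (char *)array)); // i * sizeof(int)
    }
    return 0;
}
```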
When you access array[0], the CPU loads a cache line (typically 64 bytes) containing array[0] through array[15] (for 4-byte integers). Subsequent accesses to array[1], array[2], and so on are then 'free': the data is already in cache. This is why sequential array traversal is blazing fast compared to random access patterns.
A pointer is a variable that stores a memory address. Understanding pointer storage demystifies how data structures like linked lists, trees, and graphs work at the lowest level.
Pointer Size:
A pointer must be able to hold any address in the program's address space, so its size matches the machine's address width: 8 bytes on a 64-bit system, 4 bytes on a 32-bit system, regardless of the type it points to.
Example: A Pointer Variable
int value = 42; // Stored at address 0x1000
int *ptr = &value; // ptr stores 0x1000
In memory (64-bit system):
Address Contents Interpretation
0x1000 2A 00 00 00 value = 42 (4 bytes, integer)
0x1008 00 10 00 00 ptr = 0x1000 (8 bytes, pointer)
00 00 00 00 (little-endian, so low bytes first)
Dereferencing a Pointer:
When you write *ptr, the CPU:
1. Reads the contents of ptr, which is an address (here, 0x1000).
2. Issues a second memory access to fetch the bytes at that address.
3. Interprets those bytes according to the pointer's type (here, a 4-byte int).
This is why pointer dereferencing adds a memory access — an extra step compared to direct variable access.
Null Pointers:
A null pointer is a special value (typically 0) indicating "points to nothing." Accessing memory at address 0 is typically protected by the OS, causing a crash ("null pointer exception," "segmentation fault").
null pointer in memory: 00 00 00 00 00 00 00 00
Multiple Indirection:
Pointers can point to other pointers:
int value = 42;
int *ptr = &value;
int **ptrptr = &ptr; // Pointer to pointer
Each level adds 8 bytes of storage (on 64-bit) and one extra memory access to dereference.
Pointers are why data structures like linked lists can have non-contiguous elements. Instead of storing data next to each other, we store pointers that tell us where the next element lives. This trades cache locality for flexibility in insertion/deletion.
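As a preview, here is a minimal linked-list node in C; the `struct Node` shown is a sketch, not a library type. Each node records where the next one lives, wherever the allocator happened to place it:

```c
#include <stdio.h>
#include <stdlib.h>

// A singly linked list node: the data plus the address of the next node.
struct Node {
    int value;
    struct Node *next;   // 8 bytes on a 64-bit system; NULL marks the end
};

int main(void) {
    struct Node *head   = malloc(sizeof *head);
    struct Node *second = malloc(sizeof *second);
    head->value = 10;   head->next = second;
    second->value = 20; second->next = NULL;

    // Traversal follows pointers: one extra memory access per hop.
    for (struct Node *p = head; p != NULL; p = p->next)
        printf("%d at %p\n", p->value, (void *)p);

    free(second);
    free(head);
    return 0;
}
```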
Everything we've learned about binary representation and memory storage directly informs how data structures work. Let's preview these connections:
Arrays and Contiguity: indexing is base-plus-offset address arithmetic, and contiguous layout is what makes sequential traversal cache-friendly.
Linked Lists and Pointers: each node stores the address of the next; traversal is a chain of dereferences, each one a separate memory access.
Hash Tables and Integer Math: hash functions are bit-level integer operations (shifts, XORs, modular arithmetic) that map keys onto array indices.
This knowledge of bits, bytes, and memory storage is the foundation upon which all data structures and algorithms rest. When you study arrays, trees, or hash tables in future chapters, remember: everything reduces to bit patterns stored in linear memory. There is no magic — only well-organized bytes.
We've completed our journey through binary representation — from the fundamental reasons for binary, through bits and bytes, number systems, and now physical data storage. Let's consolidate the key insights:
Module Complete:
This module on Binary Representation & Number Systems provides the essential low-level foundation for everything in computer science. You now understand:
- Why computers use binary, and how bits and bytes are organized into addressable memory.
- How to read and convert between binary, decimal, and hexadecimal.
- How integers (two's complement), characters (ASCII and Unicode), booleans, and floating-point numbers (IEEE 754) are encoded as bit patterns.
- How alignment, padding, contiguity, and pointers determine the physical layout of data.
With this foundation, you're prepared to understand any data structure at its deepest level — not as abstract concepts, but as specific arrangements of bits in physical memory.
Congratulations! You've mastered the essential knowledge of binary representation and number systems. This deep understanding of how data is physically stored will inform every data structure and algorithm you study. The 'magic' of computing is no longer magic — it's organized patterns of ones and zeros in physical memory.