In the previous page, we established that 2D arrays are conceptually grids of data but are ultimately stored in linear memory. This raises a fundamental question: when you have a grid of data, in what order do you lay it out in a one-dimensional sequence?
This isn't a trivial implementation detail—it's a decision that affects every matrix operation you perform. The wrong traversal pattern can make your code 10x to 100x slower due to cache effects, and the difference only grows as data sizes increase.
There are two primary answers to this question: row-major order and column-major order. Understanding both is essential for any serious work with matrices.
By the end of this page, you will understand how row-major and column-major storage work, why programming languages choose one over the other, how this affects cache performance, and how to write code that takes advantage of memory layout for optimal performance.
Let's make the problem concrete. Consider a simple 3×4 matrix:
Matrix (logical view):
```
         Column 0   Column 1   Column 2   Column 3
Row 0:      A          B          C          D
Row 1:      E          F          G          H
Row 2:      I          J          K          L
```
Computer memory is essentially a very long array of bytes, each with an address. To store our 12-element matrix, we need to map these 12 positions to 12 consecutive memory locations.
There are two natural ways to do this:
Visual Comparison:
```
Original 3×4 Matrix:

┌───────┬───────┬───────┬───────┐
│   A   │   B   │   C   │   D   │  Row 0
├───────┼───────┼───────┼───────┤
│   E   │   F   │   G   │   H   │  Row 1
├───────┼───────┼───────┼───────┤
│   I   │   J   │   K   │   L   │  Row 2
└───────┴───────┴───────┴───────┘

Row-Major Order (traverse left-to-right, top-to-bottom):

Reading order: A → B → C → D → E → F → G → H → I → J → K → L
Memory layout: [A][B][C][D][E][F][G][H][I][J][K][L]
               ↑  Row 0  ↑ ↑  Row 1  ↑ ↑  Row 2  ↑

Column-Major Order (traverse top-to-bottom, left-to-right):

Reading order: A → E → I → B → F → J → C → G → K → D → H → L
Memory layout: [A][E][I][B][F][J][C][G][K][D][H][L]
               ↑ Col 0 ↑↑ Col 1 ↑↑ Col 2 ↑↑ Col 3 ↑
```

Both layouts store exactly the same data—12 elements. The difference is purely in the ordering within memory. Logically, matrix[1][2] still returns 'G' in both cases, but the underlying memory address of 'G' differs between layouts.
Row-major order is the most common approach in modern programming. It stores matrix elements one row at a time, completing all columns of a row before moving to the next row.
The Name:
'Row-major' means the major (outer) axis is rows. We iterate through rows first, then columns within each row.
Memory Address Calculation:
For a matrix with `cols` columns, the element at position `[row][col]` is located at:
```
// Row-major address calculation
// Given: base_address, rows, cols, row index, col index

memory_offset  = row * cols + col
memory_address = base_address + (memory_offset * element_size)

// Example: 3×4 matrix, finding element [1][2] (value 'G')
// offset = 1 * 4 + 2 = 6
// If base_address = 1000 and element_size = 4 bytes:
// address = 1000 + (6 * 4) = 1024
```

Languages Using Row-Major:
| Language | Array Type | Notes |
|---|---|---|
| C / C++ | Built-in arrays, std::array | The canonical row-major languages |
| Python | Nested lists, NumPy (default) | NumPy defaults to row-major but supports both |
| JavaScript | Arrays of arrays | Simulates 2D with nested arrays |
| Java | 2D arrays | Arrays of array references |
| Rust | Arrays, ndarray (default) | Follows C convention |
| Go | 2D slices | Row-major by default |
| Swift | Nested arrays | Row-major convention |
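The offset formula above is easy to sanity-check in plain Python. The sketch below (the helper name `row_major_offset` is illustrative, not a library function) flattens the example matrix row by row and confirms that 'G' lands at offset 6:

```python
def row_major_offset(row, col, cols):
    """Offset of matrix[row][col] in a row-major flat array."""
    return row * cols + col

matrix = [
    ["A", "B", "C", "D"],  # Row 0
    ["E", "F", "G", "H"],  # Row 1
    ["I", "J", "K", "L"],  # Row 2
]

# Row-major flattening: concatenate the rows in order.
flat = [element for row in matrix for element in row]

offset = row_major_offset(1, 2, cols=4)
print(offset, flat[offset])  # → 6 G
```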
Why Row-Major Makes Intuitive Sense:
Row-major order matches how we typically read and write: left-to-right, top-to-bottom. When you write nested loops, the outer loop on rows and inner loop on columns produces sequential memory access:
```
// Optimal traversal for row-major storage
matrix = [
    [1,  2,  3,  4],   // Row 0
    [5,  6,  7,  8],   // Row 1
    [9, 10, 11, 12]    // Row 2
]

// This traversal accesses memory SEQUENTIALLY:
for row in range(3):        // Outer loop: rows
    for col in range(4):    // Inner loop: columns
        process(matrix[row][col])

// Access order: [0][0], [0][1], [0][2], [0][3], [1][0], [1][1], ...
// Memory order: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11
// Perfect sequential access! ✓
```

In row-major storage, adjacent elements in the same row are adjacent in memory. This means traversing row-by-row (outer loop on rows, inner loop on columns) produces the most cache-friendly access pattern.
Column-major order is the alternative approach, storing matrix elements one column at a time, completing all rows of a column before moving to the next column.
The Name:
'Column-major' means the major (outer) axis is columns. We iterate through columns first, then rows within each column.
Memory Address Calculation:
For a matrix with `rows` rows, the element at position `[row][col]` is located at:
```
// Column-major address calculation
// Given: base_address, rows, cols, row index, col index

memory_offset  = col * rows + row
memory_address = base_address + (memory_offset * element_size)

// Example: 3×4 matrix (3 rows, 4 cols), finding element [1][2] (value 'G')
// offset = 2 * 3 + 1 = 7
// If base_address = 1000 and element_size = 4 bytes:
// address = 1000 + (7 * 4) = 1028

// Compare: in row-major, 'G' was at offset 6
// In column-major, 'G' is at offset 7 - a different location!
```

Languages Using Column-Major:
| Language/Library | Context | Notes |
|---|---|---|
| Fortran | All arrays | The original column-major language (1950s) |
| MATLAB | All matrices | Follows Fortran convention |
| Julia | Arrays | Optimized for scientific computing |
| R | Matrices | Statistical computing heritage |
| NumPy | 'F' order flag | Fortran-order when explicitly specified |
| OpenGL/GLSL | Matrices | Graphics convention |
| Most BLAS libraries | Matrix operations | Fortran-compatible interfaces |
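As with row-major, the column-major formula can be sanity-checked in a few lines of Python (the helper name `col_major_offset` is illustrative, not a library function). Here the flat list is built column by column, and 'G' lands at offset 7:

```python
def col_major_offset(row, col, rows):
    """Offset of matrix[row][col] in a column-major flat array."""
    return col * rows + row

matrix = [
    ["A", "B", "C", "D"],  # Row 0
    ["E", "F", "G", "H"],  # Row 1
    ["I", "J", "K", "L"],  # Row 2
]

# Column-major flattening: concatenate the columns in order.
flat = [matrix[row][col] for col in range(4) for row in range(3)]

offset = col_major_offset(1, 2, rows=3)
print(offset, flat[offset])  # → 7 G
```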
Why Column-Major Exists:
Column-major order originated with Fortran, one of the first high-level programming languages (1957). Its designers chose this order because mathematical notation for vectors and matrices often emphasizes columns—a vector is a column of numbers, and matrix-vector multiplication operates column by column.
Optimal Traversal for Column-Major:
```
// Optimal traversal for column-major storage
// Note: this is conceptual - most languages use row-major
matrix = [
    [1,  2,  3,  4],   // Row 0
    [5,  6,  7,  8],   // Row 1
    [9, 10, 11, 12]    // Row 2
]

// For column-major: outer loop on COLUMNS, inner on ROWS
for col in range(4):        // Outer loop: columns
    for row in range(3):    // Inner loop: rows
        process(matrix[row][col])

// Access order: [0][0], [1][0], [2][0], [0][1], [1][1], [2][1], ...
// In column-major memory: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11
// Sequential access for column-major! ✓
```

If you use the wrong loop order for your memory layout, performance degrades severely. Using column-first loops in a row-major language (C, Python) or row-first loops in a column-major language (Fortran, MATLAB) causes cache misses on every access, potentially slowing code by 10-100x.
Understanding why memory layout affects performance requires understanding how CPU caches work. This is where abstract knowledge becomes practical engineering.
The CPU Cache Hierarchy:
Modern CPUs don't access RAM directly for every operation—that would be too slow. Instead, they use a hierarchy of increasingly fast (but smaller) caches:
| Cache Level | Size (typical) | Latency (cycles) | Relative Speed |
|---|---|---|---|
| L1 Cache | 32-64 KB | ~4 cycles | Fastest (reference) |
| L2 Cache | 256 KB - 1 MB | ~10-15 cycles | ~3x slower |
| L3 Cache | 4-32 MB | ~40-50 cycles | ~10x slower |
| RAM | 8-64 GB+ | ~200-300 cycles | ~50-75x slower |
Cache Lines: The Key Concept
When the CPU needs data from RAM, it doesn't fetch just one byte—it fetches a cache line, typically 64 bytes. This means when you access matrix[0][0], the CPU also loads matrix[0][1], matrix[0][2], and potentially more—all for 'free' in terms of memory access time.
This is where layout matters:
```
// Consider a 1000×1000 matrix of 4-byte integers
// Total size: 4 MB (too big for L1/L2, fits in L3)
// Cache line: 64 bytes = 16 integers per line

// ROW-MAJOR STORAGE with ROW-WISE TRAVERSAL:
for row in range(1000):
    for col in range(1000):
        process(matrix[row][col])

// Memory access pattern for row 0:
// Access [0][0]: cache miss, load 64 bytes (elements 0-15)
// Access [0][1] to [0][15]: cache HITS! Already loaded
// Access [0][16]: cache miss, load next 64 bytes
// ...
// Result: 1 miss per 16 elements = 6.25% miss rate ✓

// ROW-MAJOR STORAGE with COLUMN-WISE TRAVERSAL (WRONG!):
for col in range(1000):
    for row in range(1000):
        process(matrix[row][col])

// Memory access pattern:
// Access [0][0]: cache miss, load row 0 elements 0-15
// Access [1][0]: cache miss! Row 1 is 4000 bytes away, evicts row 0 data
// Access [2][0]: cache miss! Row 2 is another 4000 bytes away
// ...
// Result: nearly 100% miss rate when the matrix doesn't fit in cache ✗
```

For a large matrix that doesn't fit in cache, the wrong traversal order can cause a cache miss on nearly every access. With each miss costing ~100x the time of a hit, overall performance drops by 10-100x. This is why loop order isn't just style—it's correctness at scale.
Spatial Locality:
The principle at work here is spatial locality: data that is stored nearby in memory tends to be accessed nearby in time. When your traversal pattern matches your memory layout, you maximize spatial locality and therefore cache hit rates.
Summary of Optimal Patterns:
| Storage Layout | Optimal Outer Loop | Optimal Inner Loop | Memory Access |
|---|---|---|---|
| Row-major | Rows | Columns | Sequential |
| Column-major | Columns | Rows | Sequential |
Let's examine a concrete example that demonstrates the real-world impact of traversal order. Consider a common operation: summing all elements in a matrix.
The Two Approaches:
```python
import numpy as np
import time

# Create a large matrix (row-major by default in NumPy)
size = 10000
matrix = np.random.rand(size, size)

# Approach 1: row-wise traversal (cache-friendly for row-major)
def sum_row_wise(matrix):
    total = 0.0
    rows, cols = matrix.shape
    for row in range(rows):
        for col in range(cols):
            total += matrix[row, col]
    return total

# Approach 2: column-wise traversal (cache-unfriendly for row-major)
def sum_column_wise(matrix):
    total = 0.0
    rows, cols = matrix.shape
    for col in range(cols):
        for row in range(rows):
            total += matrix[row, col]
    return total

# Benchmark (conceptual - actual times vary by hardware)
# Row-wise:    ~2-3 seconds on typical hardware
# Column-wise: ~20-30 seconds on typical hardware
# Difference: 10x slower for cache-unfriendly traversal!
```

Why the Difference?
For a 10,000 × 10,000 matrix of 8-byte doubles:

- Total size is 800 MB—far larger than any CPU cache level
- One 64-byte cache line holds 8 doubles
- Row-wise traversal: one miss per 8 elements, the other 7 accesses are hits
- Column-wise traversal: consecutive accesses are 80,000 bytes apart (one full row), so nearly every access is a miss
Visualizing the Access Patterns:
```
Row-Major Storage, Row-Wise Traversal (GOOD):

Memory: [Row0_Col0][Row0_Col1][Row0_Col2]...[Row0_ColN][Row1_Col0]...
Access:     1→         2→         3→    ...    N→         N+1→
        ─────────────────────────────────────────────────────────▶
        Sequential! Cache lines fully utilized.

Row-Major Storage, Column-Wise Traversal (BAD):

Memory: [Row0_Col0][Row0_Col1][Row0_Col2]...[Row0_ColN][Row1_Col0]...
Access:     1          ↓          ↓             ↓          2
            │          │          │             │          │
            └──────────┴──────────┴─────────────┴──────────┘
        Each access jumps N elements. Constant cache misses!
```

For row-major languages (C, Python, Java, etc.): always have the ROW index in the OUTER loop and the COLUMN index in the INNER loop. This ensures sequential memory access. In performance-critical code, this simple rule can make orders of magnitude difference.
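A scaled-down, runnable version of the benchmark is sketched below using plain Python lists. The size and timing harness are illustrative; note that in CPython, interpreter overhead dominates pure-Python loops, so the measured gap will be much smaller than in compiled code. What the sketch does guarantee is that both loop orders visit the same elements and agree on the result:

```python
import random
import time

size = 500
matrix = [[random.random() for _ in range(size)] for _ in range(size)]

def sum_row_wise(m):
    """Outer loop on rows: sequential access for row-major data."""
    total = 0.0
    for row in range(len(m)):
        for col in range(len(m[0])):
            total += m[row][col]
    return total

def sum_column_wise(m):
    """Outer loop on columns: strided access for row-major data."""
    total = 0.0
    for col in range(len(m[0])):
        for row in range(len(m)):
            total += m[row][col]
    return total

start = time.perf_counter()
a = sum_row_wise(matrix)
t_row = time.perf_counter() - start

start = time.perf_counter()
b = sum_column_wise(matrix)
t_col = time.perf_counter() - start

# Both orders sum the same elements, so the results agree
# up to floating-point rounding (the summation order differs).
print(abs(a - b) <= 1e-9 * abs(a), f"row: {t_row:.3f}s  col: {t_col:.3f}s")
```

To see the cache effect itself clearly, the same two loops should be timed in a compiled language over a matrix much larger than the last-level cache.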
Sometimes you need to work with data in a different layout than your language expects—perhaps when reading data from a Fortran library or preparing data for OpenGL. Understanding how to convert between layouts is essential.
The Relationship:
A key insight: the column-major representation of a matrix is the same as the row-major representation of its transpose. Converting between layouts is equivalent to transposing.
```python
# Original 3×4 matrix
matrix = [
    ["A", "B", "C", "D"],  # Row 0
    ["E", "F", "G", "H"],  # Row 1
    ["I", "J", "K", "L"],  # Row 2
]

# Row-major layout:    A B C D E F G H I J K L
# Column-major layout: A E I B F J C G K D H L

# Method 1: transpose for conversion
# To go from row-major to column-major: transpose, then store row-major
# To go from column-major to row-major: transpose that data back

def transpose(matrix):
    rows = len(matrix)
    cols = len(matrix[0])
    # Create a new matrix with swapped dimensions
    transposed = [[0 for _ in range(rows)] for _ in range(cols)]
    for row in range(rows):
        for col in range(cols):
            transposed[col][row] = matrix[row][col]
    return transposed

# Method 2: direct memory reinterpretation
# Often used in low-level code or with NumPy -
# no data copy is needed, just a change of interpretation
```

NumPy Example:
NumPy provides elegant tools for working with both layouts:
```python
import numpy as np

# Create a matrix - the default is row-major ('C' order)
row_major = np.array([[1, 2, 3], [4, 5, 6]], order='C')

# Create the same matrix in column-major ('F' for Fortran order)
col_major = np.array([[1, 2, 3], [4, 5, 6]], order='F')

# Check the memory layout
print(row_major.flags['C_CONTIGUOUS'])  # True - row-major
print(col_major.flags['F_CONTIGUOUS'])  # True - column-major

# Convert between layouts
converted_to_col = np.asfortranarray(row_major)     # Row to column
converted_to_row = np.ascontiguousarray(col_major)  # Column to row

# When transposing, the layout relationship becomes clear
matrix = np.array([[1, 2, 3], [4, 5, 6]], order='C')
transposed = matrix.T
# The transpose's row-major layout equals the original's column-major!
# (This is a view, not a copy - very efficient)
```

Layout conversion typically matters when: (1) interfacing with Fortran libraries like LAPACK, (2) working with OpenGL which uses column-major matrices, (3) reading scientific data stored in Fortran conventions, or (4) optimizing specific algorithms that perform better with a particular layout.
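The transpose identity can also be verified directly: flattening an array in Fortran order yields the same sequence as flattening its transpose in C order. A quick check (the array values here are arbitrary):

```python
import numpy as np

a = np.arange(12).reshape(3, 4)  # any 3×4 array, row-major by default

col_major_sequence = a.flatten(order='F')     # read column by column
transpose_row_major = a.T.flatten(order='C')  # transpose, read row by row

print(np.array_equal(col_major_sequence, transpose_row_major))  # → True

# The transpose itself is free: NumPy just swaps the strides
# (bytes to step per dimension), e.g. (32, 8) becomes (8, 32).
print(a.strides, a.T.strides)
```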
The row-major vs column-major distinction generalizes to higher dimensions. A 3D array (think: a video as a sequence of image frames) must also be linearized into memory.
3D Array Layout:
For a 3D array with dimensions depth × rows × cols:
```
// 3D array: 2 depths × 3 rows × 4 columns = 24 elements
tensor = [
    // Depth 0 (first 'slice')
    [
        [A, B, C, D],   // Row 0
        [E, F, G, H],   // Row 1
        [I, J, K, L]    // Row 2
    ],
    // Depth 1 (second 'slice')
    [
        [M, N, O, P],
        [Q, R, S, T],
        [U, V, W, X]
    ]
]

// Row-major (C order): rightmost index varies fastest
// Memory: A B C D E F G H I J K L M N O P Q R S T U V W X
// Complete all columns, then all rows, then all depths

// Column-major (Fortran order): leftmost index varies fastest
// Memory: A M E Q I U B N F R J V C O G S K W D P H T L X
// Complete all depths, then all rows, then all columns

// Address formula for row-major 3D:
// offset = depth * (rows * cols) + row * cols + col

// Access pattern for optimal row-major 3D:
for depth in range(2):
    for row in range(3):
        for col in range(4):
            process(tensor[depth][row][col])
```

The General Rule:
In row-major order (C, Python, Java): the rightmost (last) index varies fastest in memory, so the optimal loop nest puts the first index outermost and the last index innermost.
In column-major order (Fortran, MATLAB, Julia): the leftmost (first) index varies fastest in memory, so the optimal loop nest puts the last index outermost and the first index innermost.
Practical Example: Image Processing
```python
import numpy as np

# Color image: height × width × channels (RGB)
# Typical shape: (1080, 1920, 3) for Full HD
image = np.zeros((1080, 1920, 3), dtype=np.uint8)

# OPTIMAL: iterate in row-major order (the Python/NumPy default)
for row in range(1080):            # Height - outermost
    for col in range(1920):        # Width - middle
        for channel in range(3):   # Channel - innermost (fastest varying)
            image[row, col, channel] = compute_pixel(row, col, channel)

# This ensures sequential memory access because:
# image[0,0,0], image[0,0,1], image[0,0,2] (RGB of pixel 0,0)
# image[0,1,0], image[0,1,1], image[0,1,2] (RGB of pixel 0,1)
# ... are all sequential in memory
```

For N-dimensional arrays in row-major languages: the LAST index should be in the INNERMOST loop, and the FIRST index should be in the OUTERMOST loop. This ensures sequential memory access regardless of how many dimensions you have.
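The 3D address formula from earlier can be checked against NumPy's own row-major flattening. This sketch (the helper name is illustrative) verifies the formula for every element of a small tensor:

```python
import numpy as np

depths, rows, cols = 2, 3, 4
# Any values work; arange makes each element equal to its row-major offset
tensor = np.arange(depths * rows * cols).reshape(depths, rows, cols)

def row_major_offset_3d(depth, row, col, rows, cols):
    """Row-major offset formula for a 3D array."""
    return depth * (rows * cols) + row * cols + col

flat = tensor.flatten(order='C')  # row-major linearization

# Every element's formula offset matches its position in the flat array
ok = all(
    flat[row_major_offset_3d(d, r, c, rows, cols)] == tensor[d, r, c]
    for d in range(depths) for r in range(rows) for c in range(cols)
)
print(ok)  # → True
```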
Understanding memory layout transforms you from a programmer who writes correct code into an engineer who writes efficient code. Let's consolidate the key insights:

- Row-major order stores rows contiguously (C, Python, Java); column-major stores columns contiguously (Fortran, MATLAB, Julia)
- The address formulas differ: row-major offset = row * cols + col; column-major offset = col * rows + row
- Match loop order to layout: the fastest-varying index belongs in the innermost loop
- Mismatched traversal causes cache misses on nearly every access and can slow code by 10-100x
- Converting between layouts is equivalent to transposing
What's Next:
Now that you understand how matrices are stored and accessed efficiently, we'll explore the patterns for traversing matrices—row-wise, column-wise, and diagonal traversals. These traversal patterns are the foundation of virtually every matrix algorithm, from image processing to pathfinding to dynamic programming on grids.
You now understand the critical distinction between row-major and column-major storage, why it matters for performance, and how to write cache-efficient matrix code. This knowledge separates programmers who wonder why their code is slow from engineers who design for performance from the start.