Spatial Indexes - Learning Module

Loading content...

0/252

R-tree Concept: Hierarchical Bounding Rectangle Trees

The Ingenious Insight Behind R-trees

In 1984, Antonin Guttman published a paper that would become the foundation of spatial database systems worldwide. His insight was elegantly simple: if you can't order spatial objects along a single dimension, group them by proximity and enclose each group in a bounding box.

The R-tree (where 'R' stands for Rectangle) is a height-balanced tree structure that organizes spatial objects hierarchically. Instead of sorting data along one axis like B+-trees, R-trees recursively subdivide space using Minimum Bounding Rectangles (MBRs)—the smallest axis-aligned rectangles that fully contain each group of objects.

This simple idea enables remarkable query efficiency: to find objects near a point, we only examine tree branches whose bounding rectangles intersect our search region. Branches whose rectangles are entirely outside our search can be skipped completely, potentially pruning millions of objects with a single comparison.

What You Will Learn

By the end of this page, you will understand the R-tree's structural properties, how the bounding rectangle hierarchy enables efficient spatial search, the key design parameters that affect performance, and the trade-offs inherent in R-tree design. This conceptual foundation is essential before exploring the detailed algorithms.

R-tree Structure Overview

An R-tree is a height-balanced tree that organizes spatial data in a hierarchy of nested bounding rectangles. The structure shares key properties with B+-trees but adapts them for multi-dimensional data:

Structural Properties:

Height-balanced — All leaf nodes are at the same depth, guaranteeing O(log n) traversal for any query.
Bounded node capacity — Each node contains between m and M entries (except the root, which may have fewer). Typical values: m = M/2 (40% minimum fill).
Leaf nodes contain data — Leaf entries store bounding rectangles of actual spatial objects plus pointers to the object data.
Internal nodes contain MBRs — Internal entries store bounding rectangles that enclose all objects in the child subtree plus pointers to child nodes.

rtree_structure.pseudo
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
// R-tree Node Structure
struct RTreeEntry {
    MBR bounding_box;       // Minimum bounding rectangle
    pointer child_or_data;  // Pointer to child node (internal) or object (leaf)
}
 
struct RTreeNode {
    int level;              // 0 for leaf, increases going up
    int entry_count;        // Current number of entries (m ≤ count ≤ M)
    RTreeEntry entries[M];  // Array of entries
}
 
// R-tree Parameters
const int M = 50;           // Maximum entries per node (typically 50-200)
const int m = 20;           // Minimum entries per node (typically 40% of M)
 
// MBR (Minimum Bounding Rectangle)
struct MBR {
    double x_min, x_max;    // X-axis extent
    double y_min, y_max;    // Y-axis extent
    // For 3D: add z_min, z_max
}
 
// Total MBR storage: 4 doubles × 8 bytes = 32 bytes per entry (2D)
// With child pointer: 32 + 8 = 40 bytes per entry
// A 4KB page holds approximately: 4096 / 40 ≈ 100 entries

Key Differences from B+-trees:

Property	B+-tree	R-tree
Key type	Single value (1D)	Bounding rectangle (2D+)
Ordering	Total order (keys sorted)	No ordering (spatial grouping)
Node overlap	None (children are disjoint)	Overlap allowed and common
Search	Single path to leaf	Multiple paths may need traversal
Equality	Unique answer	Many objects may match
Range query	Contiguous leaf scan	Tree traversal with pruning

The key insight: B+-trees use ordering to prune search; R-trees use spatial containment. A query rectangle either overlaps a node's MBR (must explore) or doesn't (can skip).

Hierarchical Bounding Rectangles

The genius of R-trees lies in their hierarchical containment property: every rectangle at level L is completely contained within exactly one rectangle at level L+1. This nesting creates a powerful pruning mechanism.

rtree_hierarchy_example.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
                    ┌─────────────────────────────────────────────────────────┐
                    │                     ROOT MBR                             │
                    │   ┌───────────────────────┐ ┌─────────────────────────┐ │
      Level 2:      │   │      MBR A            │ │         MBR B           │ │
                    │   │  ┌─────┐ ┌─────────┐   │ │  ┌──────┐  ┌────────┐  │ │
      Level 1:      │   │  │MBR 1│ │  MBR 2  │   │ │  │MBR 3 │  │ MBR 4  │  │ │
                    │   │  │     │ │         │   │ │  │      │  │        │  │ │
      Level 0:      │   │  │■ ●  │ │ ▲ ★ ◆   │   │ │  │♦ ○   │  │► ◄ ♠   │  │ │
      (Leaves)      │   │  │     │ │         │   │ │  │      │  │        │  │ │
                    │   │  └─────┘ └─────────┘   │ │  └──────┘  └────────┘  │ │
                    │   └───────────────────────┘ └─────────────────────────┘ │
                    └─────────────────────────────────────────────────────────┘
 
Object symbols: ■ ● ▲ ★ ◆ ♦ ○ ► ◄ ♠ (spatial objects like points, polygons)
 
Tree View:
                            ROOT
                        [Root MBR]
                       /          \
                    [A]            [B]
                   /    \          /    \
                [1]    [2]       [3]    [4]
               /   \   /|\      /  \   /|\
              ■    ●  ▲ ★ ◆    ♦   ○  ► ◄ ♠
 
Key Property: MBR 1 ⊂ MBR A ⊂ ROOT MBR
             Each child MBR is contained within its parent MBR

How Containment Enables Pruning:

Consider searching for objects within query rectangle Q:

Query Q = [10, 30] × [20, 40]

Root MBR = [0, 100] × [0, 100]     → Q overlaps Root? YES → explore
  ├── MBR A = [0, 50] × [0, 60]   → Q overlaps A? YES → explore
  │     ├── MBR 1 = [5, 25] × [30, 55]  → Q overlaps 1? YES → check objects
  │     └── MBR 2 = [30, 50] × [5, 30]  → Q overlaps 2? NO → SKIP subtree!
  └── MBR B = [60, 100] × [40, 100] → Q overlaps B? NO → SKIP subtree!

By checking MBR B once, we eliminated ALL objects in that subtree (potentially thousands) without examining any of them individually. This is the pruning power of R-trees.

The Overlap Check:

Two rectangles R1 and R2 overlap if and only if they overlap in BOTH dimensions:

overlaps(R1, R2) = 
    (R1.x_max >= R2.x_min) AND (R1.x_min <= R2.x_max) AND
    (R1.y_max >= R2.y_min) AND (R1.y_min <= R2.y_max)

This is an O(1) check regardless of how many objects are beneath that node.

Tight Bounds Enable Better Pruning

The tighter the MBRs (closer to the actual objects they contain), the more effective pruning becomes. Loose, inflated MBRs overlap more queries, forcing unnecessary exploration. This observation drives many R-tree optimization strategies we'll explore later.

Node Structure and Capacity

Like B+-trees, R-tree node capacity is typically chosen to match disk page size for I/O efficiency. Understanding node structure helps predict performance and tune implementations.

rtree_node_layout.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
┌────────────────────────────────────────────────────────────────┐
│                    R-TREE NODE (4KB Page)                       │
├──────────────┬──────────────────────────────────────────────────┤
│ Header       │ Level (4B) | Entry Count (4B) | Flags (8B)       │
│ (16 bytes)   │                                                  │
├──────────────┼──────────────────────────────────────────────────┤
│ Entry[0]     │ x_min(8B) | x_max(8B) | y_min(8B) | y_max(8B)    │
│ (40 bytes)   │ child_ptr(8B)                                    │
├──────────────┼──────────────────────────────────────────────────┤
│ Entry[1]     │ x_min | x_max | y_min | y_max | child_ptr        │
├──────────────┼──────────────────────────────────────────────────┤
│    ...       │    ...                                           │
├──────────────┼──────────────────────────────────────────────────┤
│ Entry[M-1]   │ x_min | x_max | y_min | y_max | child_ptr        │
└──────────────┴──────────────────────────────────────────────────┘
 
Calculation for 4KB page with 2D rectangles:
- Header: 16 bytes
- Entry size: 4×8 (MBR) + 8 (pointer) = 40 bytes
- Maximum entries M: (4096 - 16) / 40 = 102 entries
- Typical minimum m: 40% of M ≈ 40 entries

Fanout and Height:

With average node utilization around 70%, a node holds approximately 70 entries. For n spatial objects:

Height ≈ log₇₀(n)
1 million objects: height ≈ log₇₀(1,000,000) ≈ 3.3 → 4 levels
1 billion objects: height ≈ log₇₀(1,000,000,000) ≈ 4.9 → 5 levels

This means any spatial query on a billion objects requires at most 5 page reads to reach any leaf—a remarkable guarantee.

Minimum Fill Guarantee:

The constraint that nodes must have at least m entries (typically m = M/2 or m = 0.4M) ensures:

Space utilization — Nodes are at least 40-50% full, bounding storage overhead
Height control — Trees don't become pathologically deep from empty nodes
Merge triggers — Underflow during deletion triggers node redistribution or merge

R-tree Height vs. Dataset Size
Dataset Size	Height (M=50)	Max Disk Reads	Approximate Query Time
1,000	2	2	< 1ms
100,000	3	3	1-5ms
10,000,000	4	4	5-20ms
1,000,000,000	5	5	20-100ms

The Overlap Problem

Unlike B+-trees where children are strictly disjoint, R-tree sibling nodes can have overlapping MBRs. This is not a bug but a fundamental consequence of indexing multi-dimensional data. Understanding overlap is critical for R-tree performance analysis.

rtree_overlap.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
B+-tree: Disjoint children (1D ranges don't overlap)
─────────────────────────────────────────────────────────
      [1-10]           [11-20]          [21-30]
   ────────────    ────────────     ────────────
   No overlap! Search follows exactly one path.
 
 
R-tree: Overlapping children (2D rectangles can overlap)
─────────────────────────────────────────────────────────
      ┌──────────────────────┐
      │   MBR A              │
      │       ┌──────────────┼────────────┐
      │       │  OVERLAP     │            │
      │       │    ZONE      │   MBR B    │
      └───────┼──────────────┘            │
              └───────────────────────────┘
 
Query point in overlap zone must explore BOTH A and B!

Why Overlap is Unavoidable:

Spatial data clusters irregularly — Real geographic data isn't uniformly distributed. Cities cluster; oceans are empty. Any partitioning will have awkward boundaries.
No natural cut lines — Unlike 1D where you can cleanly split at any value, 2D has no universally optimal split planes.
Object extent varies — A large polygon may span regions that would otherwise be separate. It must be assigned to ONE node, which expands that node's MBR.

Impact on Query Performance:

Best case (no overlap in query path): O(log n) — single path to leaf
Worst case (high overlap): O(n) — degenerates to full scan
Average case: Depends on overlap and query selectivity

The art of R-tree design is minimizing overlap during construction and maintenance.

Overlap is the Primary R-tree Performance Factor

All R-tree variants (R*, R+, packed R-trees) focus on reducing overlap. An R-tree with 10% overlap between sibling nodes may be 2-3x faster than one with 30% overlap. Insertion order, splitting strategy, and bulk loading algorithms all aim to minimize overlap.

Strategies to Reduce Overlap

•Careful split algorithms — When a node overflows, choose the split that minimizes resulting MBR overlap (R*-tree approach).
•Forced reinsert — Instead of splitting immediately, remove and reinsert some entries to find better placements (R*-tree approach).
•Clipping — Allow an object to be split across multiple MBRs that don't overlap (R+-tree approach).
•Bulk loading — Sort data spatially before building the tree to achieve optimal packing (STR, OMT algorithms).
•Periodic reorganization — Rebuild or rebalance the tree when overlap exceeds thresholds.

R-tree Variants

Since Guttman's original R-tree (1984), researchers have developed numerous variants optimizing different aspects. Understanding the major variants helps you choose appropriate implementations for specific workloads.

Major R-tree Variants
Variant	Year	Key Innovation	Best For
R-tree (Guttman)	1984	Original: hierarchical MBRs	General purpose, baseline
R+-tree	1987	No overlap—objects may be clipped/duplicated	Point lookups, disjoint regions
R-tree*	1990	Forced reinsert, minimize overlap & coverage	Mixed workloads, dynamic data
Packed R-tree	1993	Sort-Tile-Recursive bulk loading	Static data, bulk loading
Hilbert R-tree	1994	Order entries by Hilbert curve value	Sequential disk I/O
X-tree	1996	Supernodes to avoid high-dimensional overlap	High dimensions (5-10D)
Priority R-tree	2004	Worst-case optimal for window queries	Guaranteed performance bounds

R*-tree (Most Common)

•Uses forced reinsert before splitting
•Optimizes overlap AND dead space
•Considers MBR perimeter in splits
•Standard in PostGIS, Oracle Spatial
•20-50% faster than original R-tree
•Slight overhead on inserts

R+-tree (No Overlap)

•Guarantees zero overlap between siblings
•Objects may be clipped at boundaries
•Some objects stored in multiple nodes
•Faster queries (single path guaranteed)
•Higher storage overhead (duplication)
•Complex update and delete operations

Which Variant Do Databases Use?

Most production databases (PostgreSQL/PostGIS, MySQL, SQLite) use R*-tree or close variants. It offers the best balance of query performance and update efficiency. R+-trees are occasionally used for specialized point-only workloads. Packed R-trees are common when data is loaded once and rarely updated.

The Minimum Bounding Rectangle (MBR)

The Minimum Bounding Rectangle (MBR) is the fundamental approximation that makes R-trees efficient. Understanding MBR properties is crucial for grasping both the power and limitations of spatial indexing.

mbr_calculation.pseudo
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
// Computing MBR for a set of points
function computeMBR(points[]):
    min_x = +INFINITY
    max_x = -INFINITY
    min_y = +INFINITY
    max_y = -INFINITY
    
    for each point in points:
        min_x = min(min_x, point.x)
        max_x = max(max_x, point.x)
        min_y = min(min_y, point.y)
        max_y = max(max_y, point.y)
    
    return MBR(min_x, max_x, min_y, max_y)
 
// MBR for complex geometries
function geometryMBR(geometry):
    if geometry.type == POINT:
        return MBR(geometry.x, geometry.x, geometry.y, geometry.y)
    
    elif geometry.type == LINESTRING or POLYGON:
        return computeMBR(geometry.vertices)
    
    elif geometry.type == MULTIPOLYGON:
        mbr = MBR(+INF, -INF, +INF, -INF)
        for polygon in geometry.polygons:
            mbr = unionMBR(mbr, geometryMBR(polygon))
        return mbr
 
// MBR union (used when combining entries)
function unionMBR(mbr1, mbr2):
    return MBR(
        min(mbr1.min_x, mbr2.min_x),
        max(mbr1.max_x, mbr2.max_x),
        min(mbr1.min_y, mbr2.min_y),
        max(mbr1.max_y, mbr2.max_y)
    )

MBR Properties:

Axis-aligned — MBRs are always parallel to coordinate axes. This simplifies intersection tests but can poorly fit diagonal or rotated shapes.
Inclusive filter — If query Q doesn't intersect an MBR, it cannot intersect any object within that MBR. The converse is NOT true—query may intersect MBR but miss all actual objects.
Monotonic growth — Adding an object to a node can only expand its MBR, never shrink it. This expansion is called coverage.
Dead space — Area within MBR not occupied by actual objects. High dead space means more false positives during queries.

mbr_dead_space.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
Good MBR (low dead space, tight fit):
┌─────────────────────────────────┐
│ ████████████████████████████████│
│ ████████████████████████████████│
│ █████████████  POLYGON █████████│
│ ████████████████████████████████│
│ ████████████████████████████████│
└─────────────────────────────────┘
Dead space: ~5%
 
Poor MBR (high dead space, diagonal object):
┌─────────────────────────────────┐
│                              ███│
│                           ███   │
│                        ███      │
│                     ███         │
│                  ███            │
│               ███               │
│            ███                  │
│         ███                     │
│      ███                        │
│   ███                           │
│███                              │
└─────────────────────────────────┘
Dead space: ~95% (queries hitting empty space still explore this MBR!)

MBR is a Two-Pass Filter

Spatial queries use MBRs as a FIRST PASS filter to quickly eliminate non-matching candidates. Objects whose MBRs intersect the query then undergo SECOND PASS exact geometry testing. This 'filter and refine' pattern is fundamental to spatial databases—MBRs do the cheap elimination, exact tests handle the few remaining candidates.

MBR Quality Metrics

R-tree algorithms make decisions based on quantifiable MBR properties. These metrics drive insertion choices, splitting decisions, and quality assessment:

Key MBR Metrics

•Area — Total space enclosed: (max_x - min_x) × (max_y - min_y). Smaller is better; large MBRs intersect more queries.
•Perimeter — Edge length: 2 × [(max_x - min_x) + (max_y - min_y)]. Used by R*-tree to favor square-ish MBRs over elongated ones.
•Overlap — Intersection area with sibling MBRs. Queries landing in overlap must explore multiple paths.
•Dead space — Area within MBR not occupied by any object. High dead space = false positive queries.
•Coverage — Ratio of actual object area to MBR area. Higher coverage means tighter fits.
•Enlargement — How much an MBR must grow to include a new object. Used to decide which node receives a new entry.

mbr_metrics.pseudo
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
// Area calculation
function area(mbr):
    return (mbr.max_x - mbr.min_x) * (mbr.max_y - mbr.min_y)
 
// Perimeter calculation
function perimeter(mbr):
    return 2 * ((mbr.max_x - mbr.min_x) + (mbr.max_y - mbr.min_y))
 
// Overlap between two MBRs
function overlap(mbr1, mbr2):
    // Calculate intersection bounds
    x_overlap = max(0, min(mbr1.max_x, mbr2.max_x) - max(mbr1.min_x, mbr2.min_x))
    y_overlap = max(0, min(mbr1.max_y, mbr2.max_y) - max(mbr1.min_y, mbr2.min_y))
    return x_overlap * y_overlap
 
// Enlargement needed to add object to node
function enlargement(node_mbr, object_mbr):
    combined = unionMBR(node_mbr, object_mbr)
    return area(combined) - area(node_mbr)
 
// Total overlap of an MBR with all siblings
function totalOverlap(mbr, siblings[]):
    total = 0
    for each sibling in siblings:
        total += overlap(mbr, sibling.mbr)
    return total

How Metrics Drive Decisions:

Decision	Metric Used	Strategy
Insert: Choose subtree	Enlargement	Pick subtree requiring least enlargement
Insert: Tie-breaking	Area, then # entries	Prefer smaller MBR, then less-full node
Split: Axis selection	Perimeter sum	Split along axis minimizing total perimeter
Split: Distribution	Overlap, then area	Minimize overlap first, then total area
R*-tree reinsert	Centroid distance	Reinsert entries farthest from centroid

These metrics are computed frequently during tree operations. Efficient implementations cache intermediate values to avoid redundant recalculation.

Summary: R-tree Foundations

We've established the conceptual foundation of R-trees. Let's consolidate the key principles:

Key Takeaways

•R-trees use bounding rectangles — Objects are grouped hierarchically using MBRs that enclose each group.
•Containment enables pruning — If a query doesn't intersect an MBR, all objects within can be skipped.
•Height is logarithmic — With fanout ~70-100, even billion-object datasets have height ≤ 5.
•Overlap is unavoidable — Unlike B+-trees, sibling MBRs may overlap, requiring multiple path exploration.
•Overlap minimization is key — All R-tree variants focus on reducing overlap through better splits and organization.
•MBR quality metrics — Area, overlap, enlargement, and perimeter guide algorithmic decisions.
•Filter-and-refine — MBRs provide fast approximate filtering; exact geometry tests verify remaining candidates.

What's Next:

With the R-tree concept clear, we'll examine the Minimum Bounding Rectangle in deeper detail. The next page explores MBR computation, comparison strategies, and how MBR quality directly impacts query performance. Understanding MBRs thoroughly is essential before tackling the R-tree's insertion, deletion, and query algorithms.

Page Complete

You now understand the R-tree's fundamental structure, how hierarchical bounding rectangles enable spatial search, and the critical role of overlap in performance. This conceptual grounding prepares you for the detailed MBR analysis and algorithmic deep-dives ahead.