Loading content...
In 1984, Antonin Guttman published a paper that would become the foundation of spatial database systems worldwide. His insight was elegantly simple: if you can't order spatial objects along a single dimension, group them by proximity and enclose each group in a bounding box.
The R-tree (where 'R' stands for Rectangle) is a height-balanced tree structure that organizes spatial objects hierarchically. Instead of sorting data along one axis like B+-trees, R-trees recursively subdivide space using Minimum Bounding Rectangles (MBRs)—the smallest axis-aligned rectangles that fully contain each group of objects.
This simple idea enables remarkable query efficiency: to find objects near a point, we only examine tree branches whose bounding rectangles intersect our search region. Branches whose rectangles are entirely outside our search can be skipped completely, potentially pruning millions of objects with a single comparison.
By the end of this page, you will understand the R-tree's structural properties, how the bounding rectangle hierarchy enables efficient spatial search, the key design parameters that affect performance, and the trade-offs inherent in R-tree design. This conceptual foundation is essential before exploring the detailed algorithms.
An R-tree is a height-balanced tree that organizes spatial data in a hierarchy of nested bounding rectangles. The structure shares key properties with B+-trees but adapts them for multi-dimensional data:
Structural Properties:
Height-balanced — All leaf nodes are at the same depth, guaranteeing O(log n) traversal for any query.
Bounded node capacity — Each node contains between m and M entries (except the root, which may have fewer). Typical values: m = M/2 (40% minimum fill).
Leaf nodes contain data — Leaf entries store bounding rectangles of actual spatial objects plus pointers to the object data.
Internal nodes contain MBRs — Internal entries store bounding rectangles that enclose all objects in the child subtree plus pointers to child nodes.
1234567891011121314151617181920212223242526
// R-tree Node Structurestruct RTreeEntry { MBR bounding_box; // Minimum bounding rectangle pointer child_or_data; // Pointer to child node (internal) or object (leaf)} struct RTreeNode { int level; // 0 for leaf, increases going up int entry_count; // Current number of entries (m ≤ count ≤ M) RTreeEntry entries[M]; // Array of entries} // R-tree Parametersconst int M = 50; // Maximum entries per node (typically 50-200)const int m = 20; // Minimum entries per node (typically 40% of M) // MBR (Minimum Bounding Rectangle)struct MBR { double x_min, x_max; // X-axis extent double y_min, y_max; // Y-axis extent // For 3D: add z_min, z_max} // Total MBR storage: 4 doubles × 8 bytes = 32 bytes per entry (2D)// With child pointer: 32 + 8 = 40 bytes per entry// A 4KB page holds approximately: 4096 / 40 ≈ 100 entriesKey Differences from B+-trees:
| Property | B+-tree | R-tree |
|---|---|---|
| Key type | Single value (1D) | Bounding rectangle (2D+) |
| Ordering | Total order (keys sorted) | No ordering (spatial grouping) |
| Node overlap | None (children are disjoint) | Overlap allowed and common |
| Search | Single path to leaf | Multiple paths may need traversal |
| Equality | Unique answer | Many objects may match |
| Range query | Contiguous leaf scan | Tree traversal with pruning |
The key insight: B+-trees use ordering to prune search; R-trees use spatial containment. A query rectangle either overlaps a node's MBR (must explore) or doesn't (can skip).
The genius of R-trees lies in their hierarchical containment property: every rectangle at level L is completely contained within exactly one rectangle at level L+1. This nesting creates a powerful pruning mechanism.
123456789101112131415161718192021222324252627
┌─────────────────────────────────────────────────────────┐ │ ROOT MBR │ │ ┌───────────────────────┐ ┌─────────────────────────┐ │ Level 2: │ │ MBR A │ │ MBR B │ │ │ │ ┌─────┐ ┌─────────┐ │ │ ┌──────┐ ┌────────┐ │ │ Level 1: │ │ │MBR 1│ │ MBR 2 │ │ │ │MBR 3 │ │ MBR 4 │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ Level 0: │ │ │■ ● │ │ ▲ ★ ◆ │ │ │ │♦ ○ │ │► ◄ ♠ │ │ │ (Leaves) │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ └─────┘ └─────────┘ │ │ └──────┘ └────────┘ │ │ │ └───────────────────────┘ └─────────────────────────┘ │ └─────────────────────────────────────────────────────────┘ Object symbols: ■ ● ▲ ★ ◆ ♦ ○ ► ◄ ♠ (spatial objects like points, polygons) Tree View: ROOT [Root MBR] / \ [A] [B] / \ / \ [1] [2] [3] [4] / \ /|\ / \ /|\ ■ ● ▲ ★ ◆ ♦ ○ ► ◄ ♠ Key Property: MBR 1 ⊂ MBR A ⊂ ROOT MBR Each child MBR is contained within its parent MBRHow Containment Enables Pruning:
Consider searching for objects within query rectangle Q:
Query Q = [10, 30] × [20, 40]
Root MBR = [0, 100] × [0, 100] → Q overlaps Root? YES → explore
├── MBR A = [0, 50] × [0, 60] → Q overlaps A? YES → explore
│ ├── MBR 1 = [5, 25] × [30, 55] → Q overlaps 1? YES → check objects
│ └── MBR 2 = [30, 50] × [5, 30] → Q overlaps 2? NO → SKIP subtree!
└── MBR B = [60, 100] × [40, 100] → Q overlaps B? NO → SKIP subtree!
By checking MBR B once, we eliminated ALL objects in that subtree (potentially thousands) without examining any of them individually. This is the pruning power of R-trees.
The Overlap Check:
Two rectangles R1 and R2 overlap if and only if they overlap in BOTH dimensions:
overlaps(R1, R2) =
(R1.x_max >= R2.x_min) AND (R1.x_min <= R2.x_max) AND
(R1.y_max >= R2.y_min) AND (R1.y_min <= R2.y_max)
This is an O(1) check regardless of how many objects are beneath that node.
The tighter the MBRs (closer to the actual objects they contain), the more effective pruning becomes. Loose, inflated MBRs overlap more queries, forcing unnecessary exploration. This observation drives many R-tree optimization strategies we'll explore later.
Like B+-trees, R-tree node capacity is typically chosen to match disk page size for I/O efficiency. Understanding node structure helps predict performance and tune implementations.
123456789101112131415161718192021
┌────────────────────────────────────────────────────────────────┐│ R-TREE NODE (4KB Page) │├──────────────┬──────────────────────────────────────────────────┤│ Header │ Level (4B) | Entry Count (4B) | Flags (8B) ││ (16 bytes) │ │├──────────────┼──────────────────────────────────────────────────┤│ Entry[0] │ x_min(8B) | x_max(8B) | y_min(8B) | y_max(8B) ││ (40 bytes) │ child_ptr(8B) │├──────────────┼──────────────────────────────────────────────────┤│ Entry[1] │ x_min | x_max | y_min | y_max | child_ptr │├──────────────┼──────────────────────────────────────────────────┤│ ... │ ... │├──────────────┼──────────────────────────────────────────────────┤│ Entry[M-1] │ x_min | x_max | y_min | y_max | child_ptr │└──────────────┴──────────────────────────────────────────────────┘ Calculation for 4KB page with 2D rectangles:- Header: 16 bytes- Entry size: 4×8 (MBR) + 8 (pointer) = 40 bytes- Maximum entries M: (4096 - 16) / 40 = 102 entries- Typical minimum m: 40% of M ≈ 40 entriesFanout and Height:
With average node utilization around 70%, a node holds approximately 70 entries. For n spatial objects:
This means any spatial query on a billion objects requires at most 5 page reads to reach any leaf—a remarkable guarantee.
Minimum Fill Guarantee:
The constraint that nodes must have at least m entries (typically m = M/2 or m = 0.4M) ensures:
| Dataset Size | Height (M=50) | Max Disk Reads | Approximate Query Time |
|---|---|---|---|
| 1,000 | 2 | 2 | < 1ms |
| 100,000 | 3 | 3 | 1-5ms |
| 10,000,000 | 4 | 4 | 5-20ms |
| 1,000,000,000 | 5 | 5 | 20-100ms |
Unlike B+-trees where children are strictly disjoint, R-tree sibling nodes can have overlapping MBRs. This is not a bug but a fundamental consequence of indexing multi-dimensional data. Understanding overlap is critical for R-tree performance analysis.
123456789101112131415161718
B+-tree: Disjoint children (1D ranges don't overlap)───────────────────────────────────────────────────────── [1-10] [11-20] [21-30] ──────────── ──────────── ──────────── No overlap! Search follows exactly one path. R-tree: Overlapping children (2D rectangles can overlap)───────────────────────────────────────────────────────── ┌──────────────────────┐ │ MBR A │ │ ┌──────────────┼────────────┐ │ │ OVERLAP │ │ │ │ ZONE │ MBR B │ └───────┼──────────────┘ │ └───────────────────────────┘ Query point in overlap zone must explore BOTH A and B!Why Overlap is Unavoidable:
Spatial data clusters irregularly — Real geographic data isn't uniformly distributed. Cities cluster; oceans are empty. Any partitioning will have awkward boundaries.
No natural cut lines — Unlike 1D where you can cleanly split at any value, 2D has no universally optimal split planes.
Object extent varies — A large polygon may span regions that would otherwise be separate. It must be assigned to ONE node, which expands that node's MBR.
Impact on Query Performance:
The art of R-tree design is minimizing overlap during construction and maintenance.
All R-tree variants (R*, R+, packed R-trees) focus on reducing overlap. An R-tree with 10% overlap between sibling nodes may be 2-3x faster than one with 30% overlap. Insertion order, splitting strategy, and bulk loading algorithms all aim to minimize overlap.
Since Guttman's original R-tree (1984), researchers have developed numerous variants optimizing different aspects. Understanding the major variants helps you choose appropriate implementations for specific workloads.
| Variant | Year | Key Innovation | Best For |
|---|---|---|---|
| R-tree (Guttman) | 1984 | Original: hierarchical MBRs | General purpose, baseline |
| R+-tree | 1987 | No overlap—objects may be clipped/duplicated | Point lookups, disjoint regions |
| R-tree* | 1990 | Forced reinsert, minimize overlap & coverage | Mixed workloads, dynamic data |
| Packed R-tree | 1993 | Sort-Tile-Recursive bulk loading | Static data, bulk loading |
| Hilbert R-tree | 1994 | Order entries by Hilbert curve value | Sequential disk I/O |
| X-tree | 1996 | Supernodes to avoid high-dimensional overlap | High dimensions (5-10D) |
| Priority R-tree | 2004 | Worst-case optimal for window queries | Guaranteed performance bounds |
Most production databases (PostgreSQL/PostGIS, MySQL, SQLite) use R*-tree or close variants. It offers the best balance of query performance and update efficiency. R+-trees are occasionally used for specialized point-only workloads. Packed R-trees are common when data is loaded once and rarely updated.
The Minimum Bounding Rectangle (MBR) is the fundamental approximation that makes R-trees efficient. Understanding MBR properties is crucial for grasping both the power and limitations of spatial indexing.
12345678910111213141516171819202122232425262728293031323334353637
// Computing MBR for a set of pointsfunction computeMBR(points[]): min_x = +INFINITY max_x = -INFINITY min_y = +INFINITY max_y = -INFINITY for each point in points: min_x = min(min_x, point.x) max_x = max(max_x, point.x) min_y = min(min_y, point.y) max_y = max(max_y, point.y) return MBR(min_x, max_x, min_y, max_y) // MBR for complex geometriesfunction geometryMBR(geometry): if geometry.type == POINT: return MBR(geometry.x, geometry.x, geometry.y, geometry.y) elif geometry.type == LINESTRING or POLYGON: return computeMBR(geometry.vertices) elif geometry.type == MULTIPOLYGON: mbr = MBR(+INF, -INF, +INF, -INF) for polygon in geometry.polygons: mbr = unionMBR(mbr, geometryMBR(polygon)) return mbr // MBR union (used when combining entries)function unionMBR(mbr1, mbr2): return MBR( min(mbr1.min_x, mbr2.min_x), max(mbr1.max_x, mbr2.max_x), min(mbr1.min_y, mbr2.min_y), max(mbr1.max_y, mbr2.max_y) )MBR Properties:
Axis-aligned — MBRs are always parallel to coordinate axes. This simplifies intersection tests but can poorly fit diagonal or rotated shapes.
Inclusive filter — If query Q doesn't intersect an MBR, it cannot intersect any object within that MBR. The converse is NOT true—query may intersect MBR but miss all actual objects.
Monotonic growth — Adding an object to a node can only expand its MBR, never shrink it. This expansion is called coverage.
Dead space — Area within MBR not occupied by actual objects. High dead space means more false positives during queries.
12345678910111213141516171819202122232425
Good MBR (low dead space, tight fit):┌─────────────────────────────────┐│ ████████████████████████████████││ ████████████████████████████████││ █████████████ POLYGON █████████││ ████████████████████████████████││ ████████████████████████████████│└─────────────────────────────────┘Dead space: ~5% Poor MBR (high dead space, diagonal object):┌─────────────────────────────────┐│ ███││ ███ ││ ███ ││ ███ ││ ███ ││ ███ ││ ███ ││ ███ ││ ███ ││ ███ ││███ │└─────────────────────────────────┘Dead space: ~95% (queries hitting empty space still explore this MBR!)Spatial queries use MBRs as a FIRST PASS filter to quickly eliminate non-matching candidates. Objects whose MBRs intersect the query then undergo SECOND PASS exact geometry testing. This 'filter and refine' pattern is fundamental to spatial databases—MBRs do the cheap elimination, exact tests handle the few remaining candidates.
R-tree algorithms make decisions based on quantifiable MBR properties. These metrics drive insertion choices, splitting decisions, and quality assessment:
1234567891011121314151617181920212223242526
// Area calculationfunction area(mbr): return (mbr.max_x - mbr.min_x) * (mbr.max_y - mbr.min_y) // Perimeter calculationfunction perimeter(mbr): return 2 * ((mbr.max_x - mbr.min_x) + (mbr.max_y - mbr.min_y)) // Overlap between two MBRsfunction overlap(mbr1, mbr2): // Calculate intersection bounds x_overlap = max(0, min(mbr1.max_x, mbr2.max_x) - max(mbr1.min_x, mbr2.min_x)) y_overlap = max(0, min(mbr1.max_y, mbr2.max_y) - max(mbr1.min_y, mbr2.min_y)) return x_overlap * y_overlap // Enlargement needed to add object to nodefunction enlargement(node_mbr, object_mbr): combined = unionMBR(node_mbr, object_mbr) return area(combined) - area(node_mbr) // Total overlap of an MBR with all siblingsfunction totalOverlap(mbr, siblings[]): total = 0 for each sibling in siblings: total += overlap(mbr, sibling.mbr) return totalHow Metrics Drive Decisions:
| Decision | Metric Used | Strategy |
|---|---|---|
| Insert: Choose subtree | Enlargement | Pick subtree requiring least enlargement |
| Insert: Tie-breaking | Area, then # entries | Prefer smaller MBR, then less-full node |
| Split: Axis selection | Perimeter sum | Split along axis minimizing total perimeter |
| Split: Distribution | Overlap, then area | Minimize overlap first, then total area |
| R*-tree reinsert | Centroid distance | Reinsert entries farthest from centroid |
These metrics are computed frequently during tree operations. Efficient implementations cache intermediate values to avoid redundant recalculation.
We've established the conceptual foundation of R-trees. Let's consolidate the key principles:
What's Next:
With the R-tree concept clear, we'll examine the Minimum Bounding Rectangle in deeper detail. The next page explores MBR computation, comparison strategies, and how MBR quality directly impacts query performance. Understanding MBRs thoroughly is essential before tackling the R-tree's insertion, deletion, and query algorithms.
You now understand the R-tree's fundamental structure, how hierarchical bounding rectangles enable spatial search, and the critical role of overlap in performance. This conceptual grounding prepares you for the detailed MBR analysis and algorithmic deep-dives ahead.