Decision trees are one of the most intuitive and powerful constructs in machine learning. Unlike the opaque weight matrices of neural networks or the abstract hyperplanes of SVMs, decision trees provide a transparent, interpretable model that mirrors how humans naturally reason about decisions—through a series of sequential, hierarchical choices.
Consider how a physician diagnoses a patient: Is there a fever? If yes, is there a cough? If yes, is it productive or dry? Each question refines the diagnosis, narrowing down possibilities until a conclusion is reached. This is precisely how a decision tree operates—a cascade of binary decisions that partition the feature space into regions of homogeneous predictions.
Understanding the structural anatomy of decision trees is foundational to everything that follows: splitting criteria, growing algorithms, pruning strategies, and the ensemble methods (Random Forests, Gradient Boosting) that dominate modern machine learning competitions.
By the end of this page, you will understand every structural component of a decision tree—root nodes, internal nodes, leaf nodes, edges, branches, depth, and the fundamental tree properties that govern complexity and expressiveness. You will be able to read, interpret, and reason about any decision tree structure.
A decision tree is a supervised learning model that represents a function mapping input features to output predictions through a hierarchical tree structure. Formally, a decision tree $T$ consists of:
$$T = (V, E, r, \phi, \psi)$$
Where:

- $V$ is the set of nodes, partitioned into internal nodes and leaf nodes $V_L$
- $E$ is the set of directed edges connecting parents to children
- $r \in V$ is the root node
- $\phi$ assigns each internal node a splitting function (a feature and a threshold)
- $\psi$ assigns each leaf node a prediction value
The tree is directed (edges flow from root toward leaves), acyclic (no loops), and rooted (exactly one root node). For binary decision trees, each internal node has exactly two children, creating a binary branching structure.
While multi-way decision trees (e.g., ID3 for categorical features) can have more than two children per node, the CART algorithm (Classification and Regression Trees) exclusively produces binary trees. Any multi-way split can be decomposed into a sequence of binary splits, making binary trees universal in expressiveness while simplifying implementation.
The Prediction Path:
Given an input vector $\mathbf{x} \in \mathcal{X}$, prediction proceeds by traversing the tree from root to leaf:

1. Start at the root node $r$.
2. At each internal node $v$, evaluate its splitting function on $\mathbf{x}$ and follow the corresponding edge (left if $x_{j(v)} \leq \theta(v)$, right otherwise).
3. Repeat until a leaf $\ell$ is reached, then output its prediction $\psi(\ell)$.
This process partitions the feature space $\mathcal{X}$ into $|V_L|$ disjoint regions, each corresponding to a unique root-to-leaf path.
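The traversal just described can be sketched directly. The `Node` class and the hand-built tree below are illustrative stand-ins, not from any library:

```python
# Minimal sketch of root-to-leaf traversal in a binary decision tree.
# The Node class and the example tree are illustrative, not from any library.

class Node:
    def __init__(self, feature=None, threshold=None, left=None, right=None, value=None):
        self.feature = feature      # index j(v) of the feature to test
        self.threshold = threshold  # threshold theta(v)
        self.left = left            # child for x[feature] <= threshold
        self.right = right          # child for x[feature] > threshold
        self.value = value          # prediction psi(leaf); None for internal nodes

def predict(node, x):
    """Traverse from the given node down to a leaf and return its prediction."""
    while node.value is None:                 # internal node: keep descending
        if x[node.feature] <= node.threshold:
            node = node.left
        else:
            node = node.right
    return node.value                         # leaf: emit the stored prediction

# A tiny hand-built tree: split on feature 0 at 0.5, then on feature 1 at 2.0
root = Node(feature=0, threshold=0.5,
            left=Node(value="A"),
            right=Node(feature=1, threshold=2.0,
                       left=Node(value="B"),
                       right=Node(value="C")))

print(predict(root, [0.3, 9.9]))  # -> A  (feature 0 <= 0.5)
print(predict(root, [0.8, 1.0]))  # -> B  (feature 0 > 0.5, feature 1 <= 2.0)
print(predict(root, [0.8, 3.0]))  # -> C
```

Note that each input follows exactly one root-to-leaf path, which is why the regions are disjoint.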
Every decision tree comprises three fundamental node types, each serving a distinct purpose in the prediction process. Understanding these roles is essential for grasping tree mechanics.
The root node is the entry point of every decision tree—the starting position from which all predictions flow. It possesses several unique properties:

- It has no parent node and sits at depth 0
- It contains all $N$ training samples
- It covers the entire feature space $\mathcal{X}$
- Every prediction path begins there
In mathematical terms, if the training set is $\mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$, the root node contains all $N$ samples. The root split is the most consequential because it operates on the maximum possible sample set.
Internal nodes are the decision-makers of the tree. Each internal node:

- Receives a subset of training samples from its parent
- Tests exactly one feature against a threshold (its split criterion)
- Routes each sample to its left or right child based on the test outcome
The splitting function at internal node $v$ can be represented as:
$$\text{split}_v(\mathbf{x}) = \begin{cases} \text{left} & \text{if } x_{j(v)} \leq \theta(v) \\ \text{right} & \text{if } x_{j(v)} > \theta(v) \end{cases}$$
Where $j(v)$ is the feature index and $\theta(v)$ is the threshold for node $v$.
Critical properties of internal nodes:

- Each has exactly one parent and, in a binary tree, exactly two children
- Each stores a split criterion (feature index and threshold) but no prediction value
- Each covers a subset of its parent's samples and a subregion of its parent's feature space
Leaf nodes are where predictions are made. They have no children and no split criterion—only a prediction value:
For Classification: $$\psi(\ell) = \text{mode}\{y_i : (\mathbf{x}_i, y_i) \in \mathcal{D}_\ell\}$$
The leaf predicts the majority class among training samples that reached it. Alternatively, it can output class probabilities: $$P(Y = k \mid \mathbf{x} \in \ell) = \frac{|\{i : y_i = k,\ \mathbf{x}_i \in \ell\}|}{|\mathcal{D}_\ell|}$$
For Regression: $$\psi(\ell) = \frac{1}{|\mathcal{D}_\ell|} \sum_{(\mathbf{x}_i, y_i) \in \mathcal{D}_\ell} y_i$$
The leaf predicts the mean of target values for training samples that reached it.
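Both leaf rules can be computed in a few lines. The sample values below are made-up illustrations of the samples $\mathcal{D}_\ell$ that reach a leaf:

```python
# Sketch: computing a leaf's prediction from the training samples that reach it.
# The y_class and y_reg values are made-up illustrative data.
from collections import Counter

# Classification leaf: predict the majority class, or output class probabilities
y_class = [1, 0, 1, 1, 0, 1]
majority = Counter(y_class).most_common(1)[0][0]
probs = {k: v / len(y_class) for k, v in Counter(y_class).items()}
print(majority)   # -> 1
print(probs)      # class 1: 4/6, class 0: 2/6

# Regression leaf: predict the mean of the target values in the leaf
y_reg = [2.0, 3.0, 4.0, 7.0]
mean_pred = sum(y_reg) / len(y_reg)
print(mean_pred)  # -> 4.0
```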
Critical properties of leaf nodes:

- No children and no split criterion—only a stored prediction value
- Each corresponds to one terminal region of the feature space
- Every input eventually reaches exactly one leaf
| Property | Root Node | Internal Node | Leaf Node |
|---|---|---|---|
| Parent node | None | One | One |
| Children | 2 (for binary) | 2 (for binary) | 0 |
| Split criterion | Yes (feature, threshold) | Yes (feature, threshold) | No |
| Prediction value | No | No | Yes (class or continuous) |
| Sample coverage | All training samples | Subset of parent | Subset with prediction |
| Feature space | Entire space | Subregion of parent | Terminal subregion |
Edges are the connections between nodes that define the tree's structure and the flow of samples. In a decision tree, edges carry both structural and semantic meaning.
Each edge represents a split outcome—the result of evaluating a node's split criterion:

- The left edge carries samples for which $x_{j(v)} \leq \theta(v)$ holds
- The right edge carries samples for which $x_{j(v)} > \theta(v)$ holds
This binary split convention is standard in CART but can vary in other tree implementations. The key insight is that edges encode logical constraints on feature values.
The sequence of edges from root to any node forms a path, and each path represents a conjunction (AND) of conditions. For a sample to reach a particular leaf, it must satisfy all conditions along the path:
$$\text{Path to leaf } \ell: (x_{j_1} \leq \theta_1) \land (x_{j_2} > \theta_2) \land \cdots \land (x_{j_d} \leq \theta_d)$$
This conjunction defines the axis-aligned hyperrectangle in feature space that the leaf represents.
When interpreting a decision tree, trace any path from root to leaf and read the conditions sequentially. The conjunction of all edge conditions gives you an explicit rule: 'IF age ≤ 30 AND income > 50000 AND credit_score ≤ 700 THEN predict: high_risk'. This is why decision trees are inherently interpretable—every prediction has an explicit rule justification.
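scikit-learn can print these path rules directly via `export_text`. The age/income rule above is illustrative; the sketch below uses the iris features instead:

```python
# Extracting explicit IF-THEN rules from a fitted tree with sklearn's export_text.
# Each printed line is one edge condition; a root-to-leaf path reads as a
# conjunction of conditions ending in a class prediction.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=2, random_state=42)
clf.fit(iris.data, iris.target)

rules = export_text(clf, feature_names=list(iris.feature_names))
print(rules)
```

Reading any indented path top to bottom reproduces exactly the AND-of-conditions rule described above.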
A branch is the collection of all nodes and edges reachable from a given internal node. Equivalently, it is the subtree rooted at that node.
Branch depth and coverage are inversely related. As we descend deeper into the tree:

- Each branch covers fewer training samples, since every split divides its parent's subset
- Each branch covers a smaller, more constrained region of feature space
- Predictions become more specific to the conjunction of conditions above them
This trade-off is fundamental to tree behavior: deeper branches make more specific predictions on smaller regions, risking overfitting but capturing complex patterns.
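This shrinking coverage can be verified on a fitted scikit-learn tree, whose `tree_` attribute exposes per-node sample counts and child indices:

```python
# Sketch: verifying that sample coverage shrinks as we descend, using
# sklearn's per-node arrays (n_node_samples, children_left, children_right).
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)
t = clf.tree_

for node in range(t.node_count):
    left, right = t.children_left[node], t.children_right[node]
    if left != -1:  # internal node (-1 marks "no child")
        # each child covers no more samples than its parent...
        assert t.n_node_samples[left] <= t.n_node_samples[node]
        assert t.n_node_samples[right] <= t.n_node_samples[node]
        # ...and the two children exactly partition the parent's samples
        assert t.n_node_samples[left] + t.n_node_samples[right] == t.n_node_samples[node]

print(t.n_node_samples[0])  # the root covers all 150 iris training samples
```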
Three fundamental metrics characterize a decision tree's structure: depth, height, and size. These metrics directly relate to model complexity, computational cost, and generalization ability.
The depth of a node is the number of edges from the root to that node:
$$\text{depth}(v) = \begin{cases} 0 & \text{if } v = r \text{ (root)} \\ \text{depth}(\text{parent}(v)) + 1 & \text{otherwise} \end{cases}$$
Depth interpretation:
A sample's prediction complexity is proportional to the depth of the leaf it reaches—each edge traversal is one feature comparison.
The height of a tree is the maximum depth among all nodes, equivalently the longest path from root to any leaf:
$$\text{height}(T) = \max_{v \in V} \text{depth}(v)$$
Height bounds the worst-case prediction complexity and is the primary hyperparameter for controlling tree complexity in practice (via max_depth constraints).
Height-complexity relationship:
| Height | Max Leaves (Binary Tree) | Max Features Checked | Typical Risk |
|---|---|---|---|
| 1 | 2 | 1 | Underfitting |
| 3 | 8 | 3 | Low |
| 5 | 32 | 5 | Moderate |
| 10 | 1,024 | 10 | High |
| 20 | 1,048,576 | 20 | Extreme overfitting |
A tree of height $h$ can have at most $2^h$ leaves, creating up to $2^h$ distinct prediction regions.
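A quick sanity check of the $2^h$ bound on a fitted tree, using scikit-learn's `get_depth` and `get_n_leaves`:

```python
# Checking the bound: a binary tree of height h has at most 2**h leaves.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X, y)

h = clf.get_depth()          # realized height (may be less than max_depth)
leaves = clf.get_n_leaves()  # realized leaf count
print(h, leaves)
assert leaves <= 2 ** h      # the 2^h bound always holds
```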
Size can refer to different measures:

- Node count $|V|$: total number of nodes (internal plus leaves)
- Leaf count $|V_L|$: number of terminal prediction regions
- Internal node count: number of split decisions, $|V| - |V_L|$

For a complete binary tree of height $h$:

- Leaf count: $2^h$
- Internal node count: $2^h - 1$
- Total node count: $2^{h+1} - 1$
For practical decision trees, the structure is rarely complete. Imbalanced data leads to asymmetric trees where some branches terminate early while others grow deep.
Size-generalization principles:
Tree capacity grows exponentially with depth. A height-20 tree can represent over a million distinct regions—far more than needed for most datasets. Without regularization (pruning, depth limits, minimum samples), decision trees will partition the feature space until each training sample has its own leaf, achieving zero training error but catastrophic test error. This is why controlling depth and size is crucial.
The geometric interpretation of decision trees reveals their fundamental mechanism: trees are hierarchical partitioning schemes that recursively divide the feature space into axis-aligned hyperrectangles.
Each split in a decision tree creates a hyperplane perpendicular to one feature axis. A split on feature $j$ at threshold $\theta$ corresponds to the hyperplane $x_j = \theta$, dividing the space into the half-spaces $\{\mathbf{x} : x_j \leq \theta\}$ (left) and $\{\mathbf{x} : x_j > \theta\}$ (right).
In 2D, these are vertical and horizontal lines. In 3D, these are planes perpendicular to axes. In $d$ dimensions, these are $(d-1)$-dimensional hyperplanes.
Constraint: Axis-alignment
Decision trees can only split perpendicular to feature axes; they cannot create diagonal decision boundaries directly. This is both a limitation and a source of interpretability:

- Limitation: a diagonal boundary must be approximated by a "staircase" of many axis-aligned splits
- Interpretability: every condition references a single feature and a single threshold, so rules read naturally
Every decision tree induces a partition of the feature space. Formally:
$$\mathcal{X} = \bigsqcup_{\ell \in V_L} R_\ell$$
Where:

- $V_L$ is the set of leaf nodes
- $R_\ell \subseteq \mathcal{X}$ is the region assigned to leaf $\ell$
- $\bigsqcup$ denotes a disjoint union: the regions cover $\mathcal{X}$ without overlapping
Region definition:
For a leaf $\ell$ reached by path $p_1, p_2, \ldots, p_d$ (where each $p_i$ is a split decision):
$$R_\ell = \{\mathbf{x} \in \mathcal{X} : \bigwedge_{i=1}^{d} \text{condition}_i(\mathbf{x})\}$$
Each $R_\ell$ is a hyperrectangle (box) because:

- Every condition on the path constrains a single coordinate to a half-line ($x_j \leq \theta$ or $x_j > \theta$)
- Intersecting such constraints confines each coordinate to an interval, and a product of intervals is an axis-aligned box
```python
# Visualizing decision tree feature space partitioning
import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier

# Generate 2D data with two classes
np.random.seed(42)
X = np.random.randn(200, 2)
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Train decision tree with limited depth for visualization
tree = DecisionTreeClassifier(max_depth=3, random_state=42)
tree.fit(X, y)

# Create mesh for decision boundary visualization
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 300),
                     np.linspace(y_min, y_max, 300))

# Predict on mesh
Z = tree.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

# Plot partitioned feature space
plt.figure(figsize=(10, 8))
plt.contourf(xx, yy, Z, alpha=0.4, cmap='RdBu')
plt.contour(xx, yy, Z, colors='black', linewidths=0.5)
plt.scatter(X[:, 0], X[:, 1], c=y, cmap='RdBu', edgecolors='black')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Decision Tree Partition (max_depth=3)')
plt.colorbar(label='Prediction')
plt.show()

# The resulting plot shows axis-aligned rectangular regions
# Each region corresponds to one leaf in the decision tree
# The "staircase" boundary approximates the true diagonal boundary y = -x
```

Completeness: Every point in the feature space belongs to exactly one region. There are no gaps or overlaps.
Homogeneity Goal: Each region ideally contains samples of only one class (classification) or similar target values (regression). This is the purity objective that drives splitting.
Prediction Constancy: Within each region, the prediction is constant—the predicted value for the corresponding leaf.
Refinement Hierarchy: Each split refines the partition, breaking one region into two. The partition becomes progressively finer as tree depth increases.
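The completeness property can be probed empirically: `clf.apply` maps any input to the id of the single leaf whose region contains it. A sketch:

```python
# Sketch: checking that the induced partition is complete—every input,
# even far outside the training data, is routed to exactly one leaf.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Probe points scattered over the plane: each maps to exactly one leaf id
probes = rng.uniform(-5, 5, size=(1000, 2))
leaf_ids = clf.apply(probes)
print(leaf_ids.shape)  # one leaf id per probe point
assert len(leaf_ids) == 1000
assert len(set(leaf_ids)) <= clf.get_n_leaves()
```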
Decision trees are implemented using binary tree data structures, and understanding this implementation reveals important properties about memory layout, traversal efficiency, and prediction complexity.
The most common implementation uses a compact array representation where all nodes are stored contiguously in flat arrays (scikit-learn, for example, stores them in depth-first order):
Per-node storage:
```
Node {
    feature_index: int     // Which feature to split on (sentinel value for leaf)
    threshold:     float   // Split threshold (unused for leaf)
    value:         float[] // Prediction value(s) for the node
    n_samples:     int     // Number of training samples in node
    impurity:      float   // Node impurity (Gini, entropy, MSE)
}
```
This array layout provides:

- Contiguous memory and good cache locality during traversal
- Simple child lookup via stored child indices rather than pointer chasing
- Compact serialization of the whole tree as a few flat arrays
For a tree with $|V|$ nodes and $C$ classes (for classification):
Per-node memory:

- Value array: $O(C)$ floats (one per class for classification, one for regression)
- Fixed fields (feature index, threshold, sample count, impurity, child indices): $O(k)$ constant

Total memory: $O(|V| \cdot (C + k))$ where $k$ is the constant overhead.
For typical binary classification ($C = 2$), each node requires approximately 48 bytes. A tree with 1000 nodes requires ~48 KB.
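A back-of-envelope check of that estimate, assuming 8 bytes per field and the field list from the `Node` struct above (the actual scikit-learn node layout differs in detail):

```python
# Back-of-envelope per-node memory estimate, assuming 8 bytes per field.
# Field list follows the illustrative Node struct above, not sklearn's exact layout.
def node_bytes(n_classes):
    # feature_index + threshold + n_samples + impurity = 4 fixed fields,
    # plus one 8-byte float per class for the value array
    return 8 * (4 + n_classes)

print(node_bytes(2))         # 48 bytes per node for binary classification
print(1000 * node_bytes(2))  # 48000 bytes, i.e. ~48 KB for a 1000-node tree
```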
Time complexity: $O(\text{height})$ per prediction
Batch prediction: $O(N \cdot \text{height})$ for $N$ samples
Feature access pattern: each prediction reads at most one feature value per level, so at most $\text{height}(T)$ of the $d$ features are ever examined for a given sample.
```python
# Inspecting decision tree internal structure
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
import numpy as np

# Train a simple decision tree
iris = load_iris()
X, y = iris.data, iris.target
tree = DecisionTreeClassifier(max_depth=3, random_state=42)
tree.fit(X, y)

# Access the underlying tree structure
tree_structure = tree.tree_

print("=" * 50)
print("DECISION TREE STRUCTURE ANALYSIS")
print("=" * 50)
print(f"Number of nodes: {tree_structure.node_count}")
print(f"Maximum depth: {tree_structure.max_depth}")
print(f"Number of leaves: {tree_structure.n_leaves}")
print(f"Number of features: {tree_structure.n_features}")
print()

# Analyze each node
print("NODE DETAILS:")
print("-" * 50)
for node_id in range(tree_structure.node_count):
    is_leaf = tree_structure.feature[node_id] == -2  # -2 indicates leaf

    if is_leaf:
        node_type = "LEAF"
        feature_info = "N/A"
        threshold_info = "N/A"
    else:
        node_type = "INTERNAL"
        feature_info = iris.feature_names[tree_structure.feature[node_id]]
        threshold_info = f"{tree_structure.threshold[node_id]:.3f}"

    # Get sample counts per class at this node
    class_counts = tree_structure.value[node_id][0]
    total_samples = tree_structure.n_node_samples[node_id]

    print(f"Node {node_id:2d} [{node_type:8s}]: "
          f"samples={total_samples:3d}, "
          f"feature={feature_info:20s}, "
          f"threshold={threshold_info:10s}, "
          f"class_dist={class_counts}")

# Child relationships (level-order doesn't directly map for sklearn trees)
print()
print("PARENT-CHILD RELATIONSHIPS:")
print("-" * 50)
for node_id in range(tree_structure.node_count):
    left_child = tree_structure.children_left[node_id]
    right_child = tree_structure.children_right[node_id]
    if left_child != -1:  # Not a leaf
        print(f"Node {node_id} -> Left: {left_child}, Right: {right_child}")
```

Understanding tree internals helps you debug unexpected behavior. The `feature[node_id] == -2` convention (`TREE_UNDEFINED` in sklearn) marks leaf nodes.
When inspecting trees, always check n_node_samples to understand how many training samples inform each node's decisions—nodes with very few samples are prone to overfitting.
Unlike binary search trees where balance is actively maintained for efficiency, decision trees grow naturally based on data properties, often resulting in highly unbalanced structures. Understanding this distinction is crucial for anticipating tree behavior.
Class imbalance: When one class dominates, splits may consistently favor one direction, creating long chains. The minority class samples get repeatedly split to achieve purity, while majority class regions terminate early.
Feature correlations: If splitting features are correlated with specific outcomes, the tree may develop deep paths for one outcome type.
Data distribution: Non-uniform feature distributions create uneven splits. A feature with outliers may generate extremely small regions for those outliers.
Stopping conditions: When minimum sample requirements are met for one branch before another, that branch terminates while others continue growing.
Prediction latency variation: In highly unbalanced trees, some samples traverse 3 nodes while others traverse 30. This creates variable prediction times—problematic for real-time systems requiring consistent latency.
Overfitting risk: Deep paths in unbalanced trees often have very few training samples, leading to high-variance predictions. A path with 50 splits but only 5 samples is almost certainly overfit.
Interpretation difficulty: While shallow branches yield simple rules, deep branches become convoluted. A rule with 20 conditions is practically uninterpretable.
Memory inefficiency: Array-based storage for very unbalanced trees wastes space on empty nodes in the implicit structure.
Height-to-leaf-count ratio: $$\text{Balance ratio} = \frac{\text{height}}{\log_2(\text{leaf count})}$$
Depth variance: The variance of leaf depths indicates how consistent path lengths are.
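Both metrics can be computed from a fitted scikit-learn tree's child arrays; the depth walk below is a sketch:

```python
# Sketch: computing balance metrics for a fitted sklearn tree—leaf depths,
# their variance, and the height / log2(leaf count) balance ratio.
import math
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(random_state=0).fit(X, y)
t = clf.tree_

# Depth of every node via an explicit stack walk from the root (node 0)
depth = np.zeros(t.node_count, dtype=int)
stack = [0]
while stack:
    node = stack.pop()
    left, right = t.children_left[node], t.children_right[node]
    if left != -1:  # internal node
        depth[left] = depth[right] = depth[node] + 1
        stack += [left, right]

leaf_depths = depth[t.children_left == -1]  # -1 child index marks a leaf
height = leaf_depths.max()
balance_ratio = height / math.log2(len(leaf_depths))

print(f"height={height}, leaves={len(leaf_depths)}")
print(f"balance ratio={balance_ratio:.2f}, depth variance={leaf_depths.var():.2f}")
```

A balance ratio near 1 indicates a nearly balanced tree; larger values indicate long chains relative to the leaf count.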
We have now established a complete understanding of decision tree structural anatomy. Let us consolidate the key concepts:

- Three node types—root, internal, and leaf—with distinct roles in routing samples and making predictions
- Edges encode split outcomes; each root-to-leaf path is a conjunction of conditions, i.e., an explicit rule
- Depth, height, and size govern model complexity, prediction cost, and overfitting risk
- Trees partition the feature space into disjoint axis-aligned hyperrectangles, one per leaf
- Trees are stored as compact node arrays, giving $O(\text{height})$ predictions
- Real trees are often unbalanced, with consequences for latency, variance, and interpretability
What's next:
With the structural foundation in place, we are ready to explore the most critical question in tree construction: How do we choose splits? The next page dives deep into splitting rules—the algorithms that determine which feature and threshold to use at each internal node, fundamentally shaping the tree's predictive power.
You now possess a thorough understanding of decision tree structure—the hierarchical anatomy that enables trees to partition feature space into prediction regions. This structural foundation is essential for understanding how trees are grown, pruned, and combined into powerful ensemble methods. Next, we explore the splitting rules that drive tree construction.