Oblique trees extend standard decision trees from axis-aligned to linear splits. But what if the decision boundary is inherently nonlinear? What if the optimal split follows a curve, a circle, or a complex manifold in feature space?
Multivariate splits generalize the split mechanism even further, allowing each tree node to use richer split functions (polynomial tests, kernel expansions, even small neural networks) or to fit local models in its leaves.
This creates a spectrum of tree-like models ranging from simple axis-aligned cuts to sophisticated nodes that are themselves powerful classifiers.
By the end of this page, you will understand the full spectrum of multivariate split types, how to embed nonlinear transformations within tree nodes, the tradeoffs between node complexity and tree structure, connections to kernel methods and neural networks, and specific algorithms including model trees, perceptron trees, and neural decision trees.
To understand multivariate splits, we need a systematic taxonomy of split complexity. This reveals the design space between simple trees and complex neural architectures.
Level 0: Axis-Aligned (Univariate)
$$\phi(x) = \mathbb{1}[x_j \leq \theta]$$
Tests a single feature against a threshold. Partitions space with axis-parallel hyperplanes.
Level 1: Linear (Oblique)
$$\phi(x) = \mathbb{1}[w^T x \leq \theta]$$
Tests a linear combination of all features. Partitions space with arbitrary hyperplanes.
Level 2: Polynomial
$$\phi(x) = \mathbb{1}[\sum_{|\alpha| \leq k} c_\alpha x^\alpha \leq \theta]$$
Tests polynomial combinations (e.g., $x_1^2 + 2x_1 x_2 + x_2^2$). Partitions space with algebraic surfaces.
Level 3: Kernel-Based
$$\phi(x) = \mathbb{1}\left[\sum_{i=1}^{n} \alpha_i K(x, x_i) + b \leq 0\right]$$
Uses kernel expansion (like SVM). Can represent complex, smooth boundaries.
Level 4: Neural
$$\phi(x) = \mathbb{1}[\sigma(W_2 \cdot \sigma(W_1 x + b_1) + b_2) > 0.5]$$
Embeds a neural network at each node. Can approximate any smooth function.
| Level | Split Type | Parameters | Boundary Shape | VC Dimension |
|---|---|---|---|---|
| 0 | Axis-aligned | 2 (j, θ) | Axis-parallel hyperplane | O(log d) |
| 1 | Linear/Oblique | d + 1 | Arbitrary hyperplane | O(d) |
| 2 | Degree-k polynomial | O(d^k) | Algebraic surface | O(d^k) |
| 3 | Kernel (n support vectors) | O(n) | Smooth manifold | O(n) |
| 4 | Neural (h hidden units) | O(d·h + h) | Arbitrary | O(d·h log h) |
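To make the hierarchy concrete, here is a minimal sketch of levels 0-2 as predicates on a single sample; the feature values and split parameters below are made-up examples, not taken from any dataset or library.

```python
import numpy as np

# Illustrative split predicates for levels 0-2 of the hierarchy.
# All parameters (j, theta, w, A, b, c) are invented for the example.

x = np.array([1.5, -0.5, 2.0])          # a single sample, d = 3

# Level 0: axis-aligned -- test one feature against a threshold
j, theta = 2, 1.0
level0 = x[j] <= theta                   # x[2] = 2.0 > 1.0 -> False

# Level 1: oblique -- test a linear combination of all features
w = np.array([0.5, -1.0, 0.25])
level1 = w @ x <= theta                  # 0.75 + 0.5 + 0.5 = 1.75 > 1.0 -> False

# Level 2: quadratic -- test x^T A x + b^T x + c <= 0
A = np.eye(3)                            # diagonal quadratic: sum of squares
b = np.zeros(3)
c = -5.0
level2 = x @ A @ x + b @ x + c <= 0      # 2.25 + 0.25 + 4.0 - 5.0 = 1.5 -> False

print(level0, level1, level2)
```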
The Expressiveness-Complexity Tradeoff:
As we move up the hierarchy, each node becomes more expressive, but it also requires more parameters, more data, and more computation to fit, and it becomes harder to interpret.
Key Design Question:
Should we use many simple nodes (a deep tree of cheap splits) or a few complex nodes (a shallow tree of powerful splits)?
Both can achieve the same expressiveness, but with different bias-variance profiles and computational characteristics.
A sufficiently deep axis-aligned tree is a universal approximator. A sufficiently complex single-node neural network is also a universal approximator. Multivariate trees interpolate between these extremes, balancing depth against node complexity.
A natural first step beyond linear combinations is to allow polynomial splits that can capture curved boundaries.
Quadratic Splits:
The most common extension uses second-degree polynomials:
$$\phi(x) = \mathbb{1}[x^T A x + b^T x + c \leq 0]$$
where $A \in \mathbb{R}^{d \times d}$ is a symmetric matrix of quadratic coefficients, $b \in \mathbb{R}^d$ is a vector of linear coefficients, and $c$ is a scalar offset.
This represents quadric decision boundaries: ellipses, parabolas, and hyperbolas in two dimensions, and their higher-dimensional analogues (ellipsoids, paraboloids, hyperboloids) in general.
Number of Parameters:
For a general quadratic split in $d$ dimensions, the parameter count is $\frac{d(d+1)}{2}$ (symmetric $A$) $+ \, d$ (linear $b$) $+ \, 1$ (offset $c$) $= \frac{(d+1)(d+2)}{2}$.
For $d = 10$: 66 parameters per split. For $d = 100$: 5,151 parameters per split.
The O(d²) parameter count makes full quadratic splits impractical for high-dimensional data. Even with generous training data, fitting thousands of parameters per node leads to severe overfitting. Regularization and sparsity are essential.
Restricted Quadratic Forms:
To manage complexity, we can restrict the quadratic form:
Diagonal Quadratic: Only $A_{ii}$ terms (sum of squares) $$\sum_{i=1}^d a_i x_i^2 + \sum_{i=1}^d b_i x_i + c \leq 0$$
Parameters: $2d + 1$
Axis-Pair Interactions: Include only selected $x_i x_j$ terms $$\text{Select } k \ll d^2 \text{ interaction pairs}$$
Parameters: $d + k + 1$
Low-Rank Quadratic: $A = V V^T$ where $V \in \mathbb{R}^{d \times r}$, $r \ll d$
Parameters: $dr + d + 1$
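One practical way to obtain a diagonal-quadratic split without writing a custom splitter is to append squared features and let an ordinary learner operate on the expanded representation. A minimal sketch under that assumption (toy data and feature map are illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def diagonal_quadratic_features(X):
    """Map x -> [x, x^2]: a linear model on these 2d features
    realizes a diagonal-quadratic boundary in the original space."""
    return np.hstack([X, X ** 2])

# Toy data: class 1 lies inside a circle of radius 1 -- a boundary no
# single linear split can represent, but a diagonal quadratic can.
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(500, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 < 1).astype(int)

clf = LogisticRegression(max_iter=1000).fit(diagonal_quadratic_features(X), y)
print("training accuracy:", clf.score(diagonal_quadratic_features(X), y))
```

A standard decision tree fed the same expanded features would likewise gain access to thresholds on $x_i^2$.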
Optimization for Quadratic Splits:
Finding optimal quadratic splits is harder than finding linear ones. Two practical approaches are Quadratic Discriminant Analysis (QDA), whose decision boundary is already quadratic, and gradient-based optimization of a sigmoid-smoothed objective; both are sketched below.
```python
import numpy as np
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from scipy.optimize import minimize


class QuadraticSplitFinder:
    """
    Find quadratic splits using QDA or gradient-based optimization.
    """
    def __init__(self, method='qda', regularization=0.1):
        self.method = method
        self.reg = regularization

    def find_split_qda(self, X, y):
        """
        Use Quadratic Discriminant Analysis to define split.
        QDA learns class-conditional Gaussians -> quadratic boundary.
        """
        qda = QuadraticDiscriminantAnalysis(reg_param=self.reg)
        qda.fit(X, y)
        # QDA decision boundary is quadratic
        # Predict probability and threshold at 0.5
        probs = qda.predict_proba(X)[:, 1]
        return qda, probs > 0.5

    def find_split_gradient(self, X, y, max_iter=100):
        """
        Gradient-based optimization for quadratic split parameters.
        Uses sigmoid smoothing for differentiability.
        """
        n, d = X.shape

        # Initialize: small random quadratic + no linear term
        A_init = 0.1 * np.random.randn(d, d)
        A_init = (A_init + A_init.T) / 2  # Symmetrize
        b_init = np.zeros(d)
        c_init = 0.0

        # Flatten parameters into a single vector
        params_init = np.concatenate([
            A_init[np.triu_indices(d)],   # Upper triangular of A
            b_init,
            [c_init]
        ])

        def sigmoid(z, temperature=1.0):
            return 1 / (1 + np.exp(-z / temperature))

        def unpack_params(params):
            n_A = d * (d + 1) // 2
            A_upper = params[:n_A]
            A = np.zeros((d, d))
            A[np.triu_indices(d)] = A_upper
            A = A + A.T - np.diag(np.diag(A))  # Full symmetric matrix
            b = params[n_A:n_A + d]
            c = params[-1]
            return A, b, c

        def loss(params):
            A, b, c = unpack_params(params)
            # Quadratic form for each sample
            quad_vals = np.array([x @ A @ x + b @ x + c for x in X])
            # Soft predictions
            preds = sigmoid(quad_vals, temperature=1.0)
            # Cross-entropy loss
            eps = 1e-10
            ce = -np.mean(y * np.log(preds + eps) +
                          (1 - y) * np.log(1 - preds + eps))
            # L2 regularization
            reg = self.reg * (np.sum(A**2) + np.sum(b**2))
            return ce + reg

        result = minimize(loss, params_init, method='L-BFGS-B',
                          options={'maxiter': max_iter})
        A, b, c = unpack_params(result.x)
        return {'A': A, 'b': b, 'c': c, 'loss': result.fun}

    def evaluate_quadratic(self, X, A, b, c):
        """Evaluate quadratic form for samples."""
        return np.array([x @ A @ x + b @ x + c for x in X])
```

When Quadratic Splits Help: the true boundary is curved (elliptical clusters, ring- or shell-shaped classes), the dimensionality is modest, and each node has enough samples to fit $O(d^2)$ parameters without overfitting.
When to Avoid: high-dimensional data, small sample sizes at nodes, or settings where the interpretability of a single simple tree is the main reason for using a tree at all.
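As a usage illustration, the QuadraticSplitFinder sketched above can be applied to ring-shaped toy data, where no single linear split separates the classes but one quadratic split can; the dataset and settings here are illustrative.

```python
import numpy as np
from sklearn.datasets import make_circles

# Assumes the QuadraticSplitFinder class defined in the sketch above.
X, y = make_circles(n_samples=300, noise=0.05, factor=0.5, random_state=0)

finder = QuadraticSplitFinder(method='gradient', regularization=0.01)
fit = finder.find_split_gradient(X, y, max_iter=200)

# Route samples by the sign of the fitted quadratic form.
vals = finder.evaluate_quadratic(X, fit['A'], fit['b'], fit['c'])
side = (vals > 0).astype(int)
acc = max((side == y).mean(), (side != y).mean())
print(f"single quadratic split accuracy: {acc:.2f}")
```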
Model trees take a different approach: rather than complicating the splits, they complicate the leaf predictions. Instead of predicting a constant at each leaf, they fit a local model.
Standard Regression Tree: $$\hat{f}(x) = \sum_{m=1}^{M} c_m \cdot \mathbb{1}(x \in R_m)$$
Model Tree (with linear leaves): $$\hat{f}(x) = \sum_{m=1}^{M} (w_m^T x + b_m) \cdot \mathbb{1}(x \in R_m)$$
Each leaf contains a linear regression model fitted to the samples in that region. This creates piecewise linear predictions instead of piecewise constant.
M5 Algorithm (Quinlan, 1992):
The most famous model tree algorithm: M5 grows a regression tree using standard-deviation reduction as the split criterion, fits a linear model at each node using the attributes tested in its subtree, prunes subtrees whenever a node's linear model estimates lower error than the subtree below it, and finally smooths predictions along the path from leaf to root. A simplified implementation is sketched below.
```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import Ridge


class ModelTree:
    """
    Model tree with linear regression at leaves.
    Similar to M5 algorithm.
    """
    def __init__(self, max_depth=5, min_samples_leaf=10, alpha=1.0):
        self.max_depth = max_depth
        self.min_samples_leaf = min_samples_leaf
        self.alpha = alpha  # Ridge regularization
        # First grow a standard tree to get the partition
        self.tree = DecisionTreeRegressor(
            max_depth=max_depth,
            min_samples_leaf=min_samples_leaf
        )
        self.leaf_models = {}

    def fit(self, X, y):
        """Fit model tree."""
        # Step 1: Grow tree to partition feature space
        self.tree.fit(X, y)

        # Step 2: Identify which leaf each sample belongs to
        leaf_indices = self.tree.apply(X)
        unique_leaves = np.unique(leaf_indices)

        # Step 3: Fit linear model at each leaf
        for leaf in unique_leaves:
            mask = leaf_indices == leaf
            X_leaf = X[mask]
            y_leaf = y[mask]

            if len(y_leaf) < X.shape[1] + 1:
                # Not enough samples for a linear model, use the mean
                self.leaf_models[leaf] = {
                    'type': 'constant',
                    'value': y_leaf.mean()
                }
            else:
                # Fit regularized linear regression
                model = Ridge(alpha=self.alpha)
                model.fit(X_leaf, y_leaf)
                self.leaf_models[leaf] = {
                    'type': 'linear',
                    'model': model
                }
        return self

    def predict(self, X):
        """Predict using local linear models."""
        leaf_indices = self.tree.apply(X)
        predictions = np.zeros(len(X))

        for i, (x, leaf) in enumerate(zip(X, leaf_indices)):
            model_info = self.leaf_models[leaf]
            if model_info['type'] == 'constant':
                predictions[i] = model_info['value']
            else:
                predictions[i] = model_info['model'].predict(x.reshape(1, -1))[0]

        return predictions

    def predict_with_smoothing(self, X, smooth_factor=0.5):
        """
        Predict with smoothing toward parent predictions.
        This reduces discontinuities at leaf boundaries.
        """
        # Get leaf (linear-model) predictions
        leaf_preds = self.predict(X)
        # Get tree constant predictions (for smoothing)
        tree_preds = self.tree.predict(X)
        # Blend
        return smooth_factor * leaf_preds + (1 - smooth_factor) * tree_preds
```

Advantages of Model Trees: they capture locally linear structure with far fewer leaves than a piecewise-constant tree, they extrapolate more gracefully within each region, and each leaf remains interpretable as a small linear model.
Smoothing in Model Trees:
Raw model trees have discontinuities at region boundaries. M5 uses smoothing:
$$\hat{f}(x) = \frac{n_\text{leaf} \cdot \hat{f}_\text{leaf}(x) + k \cdot \hat{f}_\text{parent}(x)}{n_\text{leaf} + k}$$
where $k$ is a smoothing constant (typically 15). This blends leaf predictions toward parent predictions, creating smoother transitions.
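For example, with the default $k = 15$ and a leaf containing $n_\text{leaf} = 30$ samples, the smoothed prediction weights the leaf model twice as heavily as the parent:

$$\hat{f}(x) = \frac{30 \cdot \hat{f}_\text{leaf}(x) + 15 \cdot \hat{f}_\text{parent}(x)}{45} = \tfrac{2}{3}\,\hat{f}_\text{leaf}(x) + \tfrac{1}{3}\,\hat{f}_\text{parent}(x)$$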
Extensions include logistic model trees (LMT) for classification and rule-based ensembles of model trees such as Cubist.
Weka's M5P and scikit-lego's model_trees provide production implementations. Model trees often outperform standard regression trees with similar interpretability, making them excellent choices for regression problems with local linear structure.
Taking multivariate splits to the extreme, we can embed neural network components at tree nodes. This creates hybrid models that combine tree structure with neural flexibility.
Perceptron Trees:
The simplest neural tree uses a perceptron (single-layer neural network) at each internal node:
$$\phi(x) = \mathbb{1}[\sigma(w^T x + b) > 0.5]$$
where $\sigma$ is an activation function (sigmoid, tanh). Since $\sigma(z) > 0.5$ exactly when $z > 0$, the hard decision is identical to an oblique split; the squashing function only matters once the split is made soft or trained by gradient descent.
Multi-Layer Perceptron (MLP) Nodes:
More powerful: use a small MLP at each node:
$$h = \sigma(W_1 x + b_1)$$ $$\phi(x) = \mathbb{1}[\sigma(w_2^T h + b_2) > 0.5]$$
This can represent arbitrarily complex decision boundaries at each node.
The Soft Decision Tree:
Instead of hard splits, use probabilistic routing:
$$p(\text{left} | x) = \sigma(w^T x + b)$$ $$p(\text{right} | x) = 1 - p(\text{left} | x)$$
A sample "flows" through both branches with different probabilities. The prediction is a weighted sum:
$$\hat{f}(x) = \sum_{m=1}^{M} \pi_m(x) \cdot c_m$$
where $\pi_m(x)$ is the probability of reaching leaf $m$, computed as the product of routing probabilities along the path.
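For instance, in a depth-2 tree the probability of reaching the leftmost leaf is the product of going left at the root and left again at its left child:

$$\pi_\text{leftmost}(x) = \sigma(w_\text{root}^T x + b_\text{root}) \cdot \sigma(w_\text{left}^T x + b_\text{left})$$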
```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SoftDecisionTree(nn.Module):
    """
    Differentiable soft decision tree.
    Uses sigmoid routing for end-to-end gradient training.
    """
    def __init__(self, input_dim, depth, n_classes):
        super().__init__()
        self.depth = depth
        self.n_classes = n_classes
        self.n_leaves = 2 ** depth
        self.n_internal = 2 ** depth - 1

        # Internal node parameters (oblique splits)
        self.node_weights = nn.Parameter(
            torch.randn(self.n_internal, input_dim) * 0.1
        )
        self.node_biases = nn.Parameter(
            torch.zeros(self.n_internal)
        )
        # Leaf predictions (class logits)
        self.leaf_logits = nn.Parameter(
            torch.randn(self.n_leaves, n_classes) * 0.1
        )
        # Temperature for soft splitting
        self.temperature = nn.Parameter(torch.ones(1))

    def forward(self, x):
        """
        Compute soft predictions: a weighted average of leaf
        predictions based on routing probabilities.
        """
        # Routing probability at each internal node:
        # node_probs[:, i] = sigmoid((w_i^T x + b_i) / temperature)
        node_logits = F.linear(x, self.node_weights, self.node_biases)
        node_probs = torch.sigmoid(node_logits / self.temperature)

        # Probability of reaching each leaf (tree routing)
        path_probs = self._compute_path_probs(node_probs)

        # Weighted sum of leaf predictions
        leaf_preds = F.softmax(self.leaf_logits, dim=-1)   # [n_leaves, n_classes]
        # path_probs: [batch_size, n_leaves]
        output = torch.matmul(path_probs, leaf_preds)      # [batch_size, n_classes]
        return output

    def _compute_path_probs(self, node_probs):
        """
        Compute probability of reaching each leaf.
        For a complete binary tree stored in array form:
        - Left child of node i is 2i + 1
        - Right child of node i is 2i + 2
        - Leaves occupy positions n_internal .. n_internal + n_leaves - 1
        """
        batch_size = node_probs.size(0)
        leaf_probs = []
        for leaf_idx in range(self.n_leaves):
            # Start at the leaf with probability 1 and walk up to the root,
            # multiplying in the routing probability at each ancestor.
            prob = torch.ones(batch_size, device=node_probs.device)
            current = leaf_idx + self.n_internal
            while current > 0:
                parent = (current - 1) // 2
                if current == 2 * parent + 1:            # Left child
                    prob = prob * node_probs[:, parent]
                else:                                    # Right child
                    prob = prob * (1 - node_probs[:, parent])
                current = parent
            leaf_probs.append(prob)
        # Out-of-place products keep autograd happy (no in-place writes
        # to tensors that participate in the graph).
        return torch.stack(leaf_probs, dim=1)            # [batch_size, n_leaves]


class NeuralDecisionForest(nn.Module):
    """
    Ensemble of soft decision trees.
    """
    def __init__(self, input_dim, depth, n_classes, n_trees=10):
        super().__init__()
        self.trees = nn.ModuleList([
            SoftDecisionTree(input_dim, depth, n_classes)
            for _ in range(n_trees)
        ])

    def forward(self, x):
        """Average predictions from all trees."""
        preds = torch.stack([tree(x) for tree in self.trees])
        return preds.mean(dim=0)
```

Advantages of Neural Splits: the entire tree becomes differentiable and can be trained end-to-end with gradient descent, each node can represent highly flexible boundaries, and the model integrates naturally with deep learning pipelines (GPU training, joint feature learning).
Disadvantages: training is slower and requires gradient-based optimization, the resulting trees lose much of their interpretability, and performance is sensitive to initialization, temperature, and other hyperparameters.
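To make the training cost concrete, here is a minimal sketch of how the SoftDecisionTree defined above might be trained end-to-end; the synthetic dataset, optimizer settings, and epoch count are illustrative assumptions.

```python
import torch
import torch.nn.functional as F
from sklearn.datasets import make_moons

# Assumes the SoftDecisionTree class from the sketch above.
X, y = make_moons(n_samples=500, noise=0.2, random_state=0)
X_t = torch.tensor(X, dtype=torch.float32)
y_t = torch.tensor(y, dtype=torch.long)

model = SoftDecisionTree(input_dim=2, depth=4, n_classes=2)
optimizer = torch.optim.Adam(model.parameters(), lr=0.05)

for epoch in range(200):
    optimizer.zero_grad()
    probs = model(X_t)                              # class probabilities
    loss = F.nll_loss(torch.log(probs + 1e-9), y_t) # negative log-likelihood
    loss.backward()
    optimizer.step()

with torch.no_grad():
    accuracy = (model(X_t).argmax(dim=1) == y_t).float().mean().item()
print(f"training accuracy: {accuracy:.2f}")
```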
Key Research: Neural Oblivious Decision Ensembles (NODE)
Popov et al. (2019) introduced NODE for tabular data: it combines differentiable oblivious decision trees (all nodes at a given depth share the same split) with sparse, entmax-based feature and threshold selection, trains end-to-end by gradient descent, and can be stacked into multi-layer ensembles that are competitive with gradient boosting on several tabular benchmarks.
This represents the frontier of neural-tree hybrids.
Zhou and Feng's Deep Forest (2017) takes a different approach: instead of neural networks inside trees, it stacks Random Forests layer-by-layer like a deep network. Each layer's forest output becomes input to the next layer. This achieves deep learning-like representation learning with tree components.
Another approach to nonlinear splits uses kernel methods at each node. This leverages the kernel trick to create complex boundaries without explicit feature expansion.
The Kernel Split:
$$\phi(x) = \mathbb{1}\left[\sum_{i \in \text{SVs}} \alpha_i y_i K(x, x_i) + b \leq 0\right]$$
where the sum runs over the support vectors (SVs), $\alpha_i$ are the learned dual coefficients, $y_i$ are the support-vector labels, $K$ is the kernel function, and $b$ is the bias term.
This is equivalent to training an SVM at each node and using its decision function for routing.
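In scikit-learn terms, this routing rule is just the sign of a fitted SVC's decision_function; a brief sketch on toy data (dataset and parameters are illustrative):

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=300, noise=0.05, factor=0.5, random_state=0)

# decision_function(x) computes sum_i alpha_i y_i K(x, x_i) + b,
# so thresholding it at 0 implements the kernel split above.
node_svm = SVC(kernel='rbf', gamma='scale').fit(X, y)
go_left = node_svm.decision_function(X) <= 0   # routing mask at this node

print("samples routed left:", int(go_left.sum()), "of", len(X))
```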
SVM Trees: train an SVM on the samples reaching each internal node and route samples left or right by the sign of its decision function; leaves predict as usual. An implementation sketch appears later in this section.
Kernel Choice:
| Kernel | Formula | Boundary Type | Best When |
|---|---|---|---|
| Linear | K(x,x') = x^T x' | Hyperplane | Linear separability |
| Polynomial | K(x,x') = (x^T x' + c)^d | Polynomial surface | Known polynomial interactions |
| RBF | K(x,x') = exp(-γ‖x-x'‖²) | Smooth, local | Complex, smooth boundaries |
| Sigmoid | K(x,x') = tanh(αx^T x' + c) | Similar to MLP | Neural-network-like |
RBF Kernel SVM at Nodes:
The RBF (Radial Basis Function) kernel is particularly powerful:
$$K(x, x') = \exp(-\gamma \|x - x'\|^2)$$
With appropriate $\gamma$, a single RBF-SVM can create closed boundaries around clusters, multiple disjoint decision regions, and highly curved surfaces.
This means an SVM tree with RBF kernels can handle extremely complex decision boundaries with just a few nodes.
Computational Considerations:
| Aspect | Standard SVM | SVM Tree Node |
|---|---|---|
| Training samples | All N | n_t (samples at node) |
| Complexity | O(N² to N³) | O(n_t² to n_t³) |
| Support vectors | O(N) | O(n_t) |
| Inference | O(N·d) | O(sum of SVs at nodes) |
Because each node has fewer samples, individual SVMs are faster. But more nodes means more total SVMs.
Hybrid Strategy:
A practical approach: at each node, try a cheap axis-aligned or linear split first, and fall back to a kernel SVM split only when the simple split's impurity reduction is poor (a sketch follows below).
This gives the efficiency of simple splits where possible, with kernel power where needed.
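A minimal sketch of this per-node decision rule; choose_node_splitter is a hypothetical helper, and the 90% score ratio is an illustrative choice, not a standard value.

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score


def choose_node_splitter(X, y, score_ratio=0.9):
    """Hypothetical helper: keep the cheap axis-aligned stump unless an
    RBF-SVM split at this node is clearly better under cross-validation."""
    stump = DecisionTreeClassifier(max_depth=1, random_state=0)
    svm = SVC(kernel='rbf', gamma='scale')

    stump_score = cross_val_score(stump, X, y, cv=3).mean()
    svm_score = cross_val_score(svm, X, y, cv=3).mean()

    if stump_score >= score_ratio * svm_score:
        return 'axis', stump.fit(X, y)    # simple split is good enough
    return 'kernel', svm.fit(X, y)        # pay for the kernel split
```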
```python
from sklearn.svm import SVC
import numpy as np


class SVMTreeNode:
    """
    Tree node that uses an SVM for splitting.
    """
    def __init__(self, kernel='rbf', C=1.0, gamma='scale'):
        self.kernel = kernel
        self.C = C
        self.gamma = gamma
        self.svm = None
        self.left = None
        self.right = None
        self.is_leaf = False
        self.prediction = None

    def fit(self, X, y, max_depth, current_depth=0, min_samples=10):
        """Recursively fit SVM tree."""
        n_samples = len(y)
        unique_classes = np.unique(y)

        # Check stopping conditions
        if (current_depth >= max_depth or
                n_samples < min_samples or
                len(unique_classes) == 1):
            self.is_leaf = True
            self.prediction = np.bincount(y).argmax()
            return

        # Train SVM for splitting
        self.svm = SVC(kernel=self.kernel, C=self.C, gamma=self.gamma)
        self.svm.fit(X, y)

        # Get predictions for splitting
        predictions = self.svm.predict(X)

        # Check if the SVM produced a useful split
        left_mask = predictions == unique_classes[0]
        right_mask = ~left_mask
        if left_mask.sum() < min_samples or right_mask.sum() < min_samples:
            # SVM didn't produce a useful split, make a leaf
            self.is_leaf = True
            self.prediction = np.bincount(y).argmax()
            return

        # Create children
        self.left = SVMTreeNode(self.kernel, self.C, self.gamma)
        self.right = SVMTreeNode(self.kernel, self.C, self.gamma)

        # Recursively fit children
        self.left.fit(X[left_mask], y[left_mask], max_depth,
                      current_depth + 1, min_samples)
        self.right.fit(X[right_mask], y[right_mask], max_depth,
                       current_depth + 1, min_samples)

    def predict(self, X):
        """Predict using SVM tree."""
        if self.is_leaf:
            return np.full(len(X), self.prediction)

        # Route using the node's SVM
        svm_pred = self.svm.predict(X)
        predictions = np.zeros(len(X), dtype=int)
        left_mask = svm_pred == self.svm.classes_[0]

        if left_mask.any():
            predictions[left_mask] = self.left.predict(X[left_mask])
        if (~left_mask).any():
            predictions[~left_mask] = self.right.predict(X[~left_mask])

        return predictions
```

With so many options for multivariate splits, how do we choose? The answer lies in understanding the fundamental tradeoff between node complexity and tree structure.
The Total Complexity Budget:
Think of model complexity as a budget that can be spent in two ways: on tree structure (more and deeper nodes with simple splits) or on node complexity (fewer, shallower nodes with expressive splits).
Both can achieve similar total expressiveness, but with different characteristics:
| Approach | Depth | Nodes | Params/Node | Total Params | Training |
|---|---|---|---|---|---|
| Deep axis-aligned | 20 | 1M leaves | 2 | ~2M | Fast |
| Medium oblique | 10 | ~1K leaves | d+1 | ~1K·d | Moderate |
| Shallow quadratic | 5 | ~32 leaves | d² | ~32·d² | Slow |
| Very shallow neural | 3 | ~8 leaves | 1000+ | ~10K | Very slow |
Empirical Observations:
Research suggests that oblique splits give the largest gains when features are strongly correlated, that quadratic and kernel nodes rarely justify their extra cost on high-dimensional data, and that ensembles of simple trees usually match or beat single trees with complex nodes.
The Model Selection Framework:
To choose the right multivariate split strategy, start with an axis-aligned ensemble as a baseline, check whether interaction or diagonal structure exists (for example, by adding interaction features and re-validating), and only then consider oblique, kernel, or neural nodes, keeping whichever choice wins under cross-validation.
Cross-Validation is King:
Ultimately, the choice must be validated empirically:
$$\text{Best model} = \arg\min_{m \in \{\text{axis}, \text{oblique}, \text{kernel}, \ldots\}} \text{CV-Error}(m)$$
Don't assume that more complex nodes are better—they often aren't!
Adding node complexity should be justified by improved cross-validation performance. In practice, ensembles of simple trees (Random Forest, XGBoost, LightGBM) almost always outperform single complex trees. The complexity budget is better spent on more trees than on more complex nodes, except in special circumstances.
Several libraries provide implementations of multivariate split trees. Here's a practical guide:
Standard Libraries:
| Library | Split Types | Notes |
|---|---|---|
| scikit-learn | Axis-aligned only | Foundation for extensions |
| Weka | CART-LC, M5P (model trees) | Java, comprehensive |
| R/rpart | Axis-aligned | Can extend with custom splits |
Specialized Libraries:
| Library | Split Types | Notes |
|---|---|---|
| OC1 | Oblique (original algorithm) | C implementation |
| LMT | Logistic Model Trees | Weka |
| Cubist | Model trees, ensembles | Industrial strength |
| pytorch-tabular | Neural oblique (NODE) | Deep learning integration |
DIY Approaches:
For custom multivariate splits, you can engineer oblique or interaction features and feed them to a standard tree, wrap per-node learners (SVMs, small networks) in a recursive splitter like the sketches above, or use a differentiable tree implementation and train end-to-end. The workflow below shows a quick way to test whether any of this is likely to help.
"""Practical workflow for choosing multivariate splits.""" import numpy as npfrom sklearn.tree import DecisionTreeClassifierfrom sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifierfrom sklearn.model_selection import cross_val_scorefrom sklearn.preprocessing import PolynomialFeatures, StandardScalerfrom sklearn.pipeline import Pipeline def multivariate_split_analysis(X, y, cv=5): """ Compare different split strategies on a dataset. Helps determine if multivariate splits would help. """ results = {} # 1. Baseline: Axis-aligned tree tree = DecisionTreeClassifier(max_depth=10, random_state=42) results['Axis-aligned tree'] = { 'cv_score': cross_val_score(tree, X, y, cv=cv).mean(), 'std': cross_val_score(tree, X, y, cv=cv).std() } # 2. Axis-aligned ensemble (sets the bar) rf = RandomForestClassifier(n_estimators=100, random_state=42) results['Random Forest'] = { 'cv_score': cross_val_score(rf, X, y, cv=cv).mean(), 'std': cross_val_score(rf, X, y, cv=cv).std() } # 3. Pseudo-oblique via interaction features poly = PolynomialFeatures(degree=2, interaction_only=True) tree_poly = DecisionTreeClassifier(max_depth=10, random_state=42) pipe_poly = Pipeline([ ('scaler', StandardScaler()), ('poly', poly), ('tree', tree_poly) ]) results['Tree + interactions'] = { 'cv_score': cross_val_score(pipe_poly, X, y, cv=cv).mean(), 'std': cross_val_score(pipe_poly, X, y, cv=cv).std() } # 4. Gradient Boosting (iterative oblique approximation) gb = GradientBoostingClassifier(n_estimators=100, random_state=42) results['Gradient Boosting'] = { 'cv_score': cross_val_score(gb, X, y, cv=cv).mean(), 'std': cross_val_score(gb, X, y, cv=cv).std() } # Report results print("Multivariate Split Analysis") print("=" * 50) for name, scores in sorted(results.items(), key=lambda x: -x[1]['cv_score']): print(f"{name:25s}: {scores['cv_score']:.4f} ± {scores['std']:.4f}") # Recommendations print("\nRecommendations:") base_score = results['Axis-aligned tree']['cv_score'] poly_score = results['Tree + interactions']['cv_score'] if poly_score - base_score > 0.02: print("• Data has diagonal/interaction structure.") print("• Consider: OC1, CART-LC, or add interaction features.") else: print("• Axis-aligned splits appear sufficient.") print("• Complex multivariate splits unlikely to help.") if results['Random Forest']['cv_score'] > base_score + 0.05: print("• Ensemble of simple trees beats complex single tree.") print("• Prefer Random Forest/XGBoost over multivariate trees.") return resultsIn most real-world applications, start with Random Forest or XGBoost. These ensembles of axis-aligned trees are fast, robust, and competitive. Only explore multivariate splits if: (1) you have a specific interpretability requirement for single trees, (2) domain knowledge suggests diagonal boundaries, or (3) ensemble performance is disappointing.
We've explored the full spectrum of multivariate splits, from simple axis-aligned to complex neural nodes. The essential takeaways: split complexity forms a hierarchy from axis-aligned to linear, polynomial, kernel, and neural nodes; model complexity is a budget that can be spent on tree depth or on node expressiveness; model trees move complexity into the leaves rather than the splits; soft and neural trees make the whole structure differentiable; and in practice, ensembles of simple trees remain the strongest default, so added node complexity must earn its place through cross-validation.
Looking Ahead:
The final page of this module examines Connections to Other Methods. We'll see how decision trees relate to kernel methods, neural networks, Bayesian models, and rule systems. Understanding these connections reveals why trees have remained relevant as machine learning has evolved, and how modern hybrid approaches draw on multiple traditions.
You now understand the full spectrum of multivariate splits in decision trees—from polynomial and kernel methods to neural architectures. This knowledge helps you recognize when complex nodes might help and when simpler approaches suffice. The next page explores how trees connect to the broader machine learning landscape.