In the realm of machine learning, individual features rarely tell the complete story. Consider predicting house prices: the number of bedrooms alone provides limited information, and square footage by itself is similarly incomplete. But when we combine these features—bedrooms relative to square footage—we unlock a powerful signal: spaciousness per room. This synergy between features, known as feature interaction, often holds more predictive power than any single feature could provide alone.
Gradient boosting algorithms have a remarkable, often underappreciated property: they can automatically discover and leverage feature interactions through their hierarchical tree structure. Yet understanding this mechanism deeply—and knowing when to explicitly engineer interactions—separates practitioners who achieve good results from those who achieve exceptional ones.
By the end of this page, you will understand how gradient boosting models capture feature interactions, the theoretical foundations of interaction effects, techniques for explicit interaction engineering, and advanced methods for discovering high-value interactions in high-dimensional datasets. You'll gain the expertise to systematically enhance boosting model performance through strategic feature interaction design.
A feature interaction occurs when the effect of one feature on the target variable depends on the value of another feature. This is fundamentally different from additive effects, where each feature contributes independently to the prediction.
Formal Definition:
Let $f(x)$ be a model predicting target $y$ from features $x = (x_1, x_2, \ldots, x_p)$. Features $x_i$ and $x_j$ exhibit an interaction if the second-order partial derivative is non-zero:
$$\frac{\partial^2 f(x)}{\partial x_i \partial x_j} \neq 0$$
This mathematical formulation captures the intuition that changing $x_i$ affects how $x_j$ influences the prediction—the hallmark of interaction effects.
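For a quick worked example of this criterion, take a model with a single product term:

$$f(x) = \beta_1 x_1 + \beta_2 x_2 + \beta_{12}\, x_1 x_2 \quad\Longrightarrow\quad \frac{\partial^2 f}{\partial x_1 \partial x_2} = \beta_{12}$$

so $x_1$ and $x_2$ interact exactly when $\beta_{12} \neq 0$; for a purely additive model, the mixed partial derivative vanishes everywhere.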
Types of Interactions:
Interactions manifest in several distinct forms, including multiplicative relationships, threshold (conditional) effects, and more complex non-linear dependencies, each with different implications for modeling.
In real-world datasets, feature interactions are the norm rather than the exception. Physical systems exhibit multiplicative relationships (force = mass × acceleration), biological systems show threshold behaviors (enzyme activation), and economic systems display complex non-linear dependencies. Any model that cannot capture interactions will systematically underperform on realistic problems.
Why Interactions Matter for Prediction:
Consider a concrete example from credit risk modeling. Suppose we have two features: $x_1$, the borrower's debt-to-income ratio, and $x_2$, their employment tenure in years.
A linear model might learn: $P(\text{default}) = \beta_0 + \beta_1 x_1 + \beta_2 x_2$
But the reality is more nuanced: high debt-to-income is dangerous for new employees but manageable for stable, long-tenured workers. The true relationship involves an interaction:
$$P(\text{default}) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_{12} x_1 x_2$$
Where $\beta_{12} < 0$ indicates that employment tenure mitigates the risk associated with high debt ratios. Without modeling this interaction, the predictor systematically misjudges risk for specific population segments.
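To make this concrete, here is a minimal synthetic sketch (the feature definitions and data-generating coefficients are illustrative, not real credit data) comparing a logistic regression with and without the explicit $x_1 x_2$ column:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic data: x1 = debt-to-income ratio, x2 = employment tenure (illustrative)
rng = np.random.default_rng(0)
n = 20_000
x1 = rng.uniform(0, 1, n)          # debt-to-income ratio
x2 = rng.uniform(0, 20, n)         # years of tenure
# Default risk rises with debt load, but tenure dampens it (negative interaction)
logit = -2.0 + 3.0 * x1 - 0.05 * x2 - 0.12 * x1 * x2
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))

X_main = np.column_stack([x1, x2])
X_int = np.column_stack([x1, x2, x1 * x2])   # add the explicit interaction term

for name, X in [("main effects only", X_main), ("with x1*x2 term", X_int)]:
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
    print(f"{name}: test AUC = {auc:.3f}")
```

On data generated with a genuine interaction, the model that sees the explicit product column typically scores higher, because the linear model cannot otherwise represent the tenure-dependent effect of debt load.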
One of the most elegant properties of tree-based models, including gradient boosting decision trees (GBDT), is their natural ability to capture feature interactions without explicit specification. This capability emerges from the hierarchical structure of decision trees.
The Mechanics of Implicit Interaction Learning:
When a decision tree makes successive splits on different features, it partitions the feature space into rectangular regions. Each region corresponds to a unique path from root to leaf, and importantly, the prediction for samples in that region depends on the combination of feature values—not just individual features.
Consider a tree that first splits on feature $A$ at threshold $t_A$, then splits on feature $B$ at threshold $t_B$ in both branches. The resulting four leaf nodes represent:

- $A \le t_A$ and $B \le t_B$
- $A \le t_A$ and $B > t_B$
- $A > t_A$ and $B \le t_B$
- $A > t_A$ and $B > t_B$
Each leaf can have a different prediction, meaning the effect of $B$ on the prediction depends on whether $A$ is above or below $t_A$. This is precisely the definition of an interaction!
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeRegressor

# Generate XOR-like interaction data
np.random.seed(42)
n_samples = 1000
X = np.random.randn(n_samples, 2)
# XOR interaction: target high when both features same sign
y = (X[:, 0] * X[:, 1] > 0).astype(float) + 0.1 * np.random.randn(n_samples)

# Fit a decision tree - it will naturally capture the interaction
tree = DecisionTreeRegressor(max_depth=3, random_state=42)
tree.fit(X, y)

# Visualize the learned decision boundaries
xx, yy = np.meshgrid(np.linspace(-3, 3, 100), np.linspace(-3, 3, 100))
Z = tree.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# True interaction surface
axes[0].contourf(xx, yy, (xx * yy > 0).astype(float), alpha=0.8, cmap='RdBu')
axes[0].scatter(X[:, 0], X[:, 1], c=y, cmap='RdBu', edgecolor='k', s=20)
axes[0].set_title('True XOR Interaction Pattern', fontsize=12)
axes[0].set_xlabel('Feature A')
axes[0].set_ylabel('Feature B')

# Tree's learned approximation
axes[1].contourf(xx, yy, Z, alpha=0.8, cmap='RdBu')
axes[1].scatter(X[:, 0], X[:, 1], c=y, cmap='RdBu', edgecolor='k', s=20)
axes[1].set_title('Tree-Learned Decision Boundaries', fontsize=12)
axes[1].set_xlabel('Feature A')
axes[1].set_ylabel('Feature B')

plt.tight_layout()
plt.show()

# The tree captures the interaction by creating axis-aligned rectangles
# that approximate the diagonal XOR decision boundary
```

Interaction Depth and Tree Depth:
The depth of a tree directly determines the maximum order of interactions it can capture:
| Tree Depth | Maximum Interaction Order | Example |
|---|---|---|
| 1 (stump) | 1 (main effects only, no interactions) | Single feature threshold |
| 2 | 2-way interactions | A × B |
| 3 | 3-way interactions | A × B × C |
| d | d-way interactions | Up to d features combined |
This relationship has profound implications for gradient boosting. When using shallow trees (depth 2-4), the ensemble builds up a rich additive combination of low-order interactions across many trees, but the maximum interaction order remains bounded by the depth of the individual trees. This differs fundamentally from a single deep tree, which captures higher-order interactions within one structure.
The Boosting Advantage:
Gradient boosting's iterative nature provides a specific advantage for learning interactions: each subsequent tree can focus on residual errors in interaction-rich regions of the feature space. If the first tree captures a main effect, the second tree fits the residual, which often contains interaction patterns that the main effect missed.
When tuning gradient boosting models, max_depth controls interaction complexity. Setting max_depth=1 (stumps) creates an additive model with no interactions. Increasing max_depth allows higher-order interactions but risks overfitting. For most tabular problems, a max_depth between 3 and 8 provides a good balance, permitting meaningful 3-way to 8-way interactions while keeping each tree small enough to act as a regularizer.
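This is easy to verify empirically. The sketch below (synthetic data, illustrative hyperparameters) fits gradient boosting to a pure $x_1 \times x_2$ target: depth-1 stumps yield an additive model that cannot fit it, while depth-3 trees can:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(5000, 2))
y = X[:, 0] * X[:, 1] + 0.1 * rng.normal(size=5000)   # pure interaction target

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

for depth in [1, 3]:
    gbr = GradientBoostingRegressor(n_estimators=300, max_depth=depth,
                                    learning_rate=0.1, random_state=0)
    gbr.fit(X_tr, y_tr)
    print(f"max_depth={depth}: test R^2 = {gbr.score(X_te, y_te):.3f}")

# Depth-1 stumps form an additive model and cannot fit x1*x2 (test R^2 near 0);
# trees of depth >= 2 can combine both features along one path and fit it well.
```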
While trees naturally capture interactions, this capability has significant limitations that practitioners must understand. Relying solely on implicit interaction learning can lead to suboptimal models in several scenarios.
The "Split Dilution" Problem:
Consider an interaction between features $A$ and $B$ where the true relationship is $y = A \times B$. For a tree to capture this:

- it must first split on $A$ (or $B$), even though that feature alone explains little of the variance, and
- it must then split on the other feature within each resulting branch, so that the leaf values can approximate the product.
The challenge: the first split on $A$ alone provides relatively weak signal (since $A$ by itself has limited predictive power). The algorithm might prefer a different feature $C$ with stronger marginal signal, never discovering the $A \times B$ interaction.
This phenomenon—where strong interactions between weak marginal features get overlooked—is called split dilution.
| Scenario | Problem | Solution |
|---|---|---|
| Weak marginal features | Individual features have low importance but interact strongly | Explicitly create interaction features |
| High-cardinality categoricals | Too many unique values to split effectively | Target encoding with interaction awareness |
| Rare interaction patterns | Interaction only matters in small subpopulation | Segment-specific features or oversampling |
| Symmetric interactions | Order of splits doesn't matter but tree must choose | Create symmetric interaction features (e.g., A×B) |
| Continuous × continuous | Multiplicative relationship hard to approximate with steps | Explicit polynomial or ratio features |
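The "weak marginal features" row is easy to reproduce. In the minimal sketch below (synthetic data, illustrative coefficients), the target depends on $A \times B$ plus a strong main effect from $C$; adding the explicit product column typically lifts the cross-validated score of a shallow boosted model under the same budget:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(7)
n = 10_000
A = rng.normal(size=n)
B = rng.normal(size=n)
C = rng.normal(size=n)
# A and B matter only through their product (weak marginals); C has a strong main effect
y = 2.0 * A * B + 1.0 * C + 0.5 * rng.normal(size=n)

X_implicit = np.column_stack([A, B, C])
X_explicit = np.column_stack([A, B, C, A * B])   # add the interaction explicitly

model = GradientBoostingRegressor(n_estimators=200, max_depth=3, random_state=0)
for name, X in [("implicit (A, B, C)", X_implicit), ("explicit (A, B, C, A*B)", X_explicit)]:
    r2 = cross_val_score(model, X, y, cv=3, scoring="r2").mean()
    print(f"{name}: mean CV R^2 = {r2:.3f}")
```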
Sample Efficiency:
Implicitly learning interactions requires sufficient samples in each leaf to estimate the interaction effect reliably. For a depth-$d$ tree with balanced splits, the number of leaves is $2^d$. If total training samples is $n$, each leaf has approximately $n/2^d$ samples.
For a depth-6 tree with 100,000 samples: each leaf has ~1,500 samples. For 1,000,000 samples: ~15,000 per leaf. This seems adequate, but remember: splits are rarely balanced, the interaction may only matter inside a small subpopulation, and boosting typically subsamples rows and features, so the effective number of samples available to estimate any particular interaction is often far smaller than the average suggests.
Explicitly engineered interaction features often provide stronger signal with fewer samples because the model directly observes the combined effect.
Computational Considerations:
Deep trees require more computation both in training (more splits to evaluate) and inference (longer paths to traverse). If a multiplicative interaction $A \times B$ is known to be important, adding it as an explicit feature allows shallower trees to capture the pattern, improving both speed and generalization.
There's a tension in practice: domain experts who know which interactions matter can engineer powerful features, but this requires significant domain knowledge. Meanwhile, tree-based models can discover interactions automatically but may miss the most valuable ones. The best practitioners combine both approaches—using domain knowledge for known interactions while letting the model discover unexpected ones.
Explicit interaction engineering involves creating new features that directly capture the combined effect of two or more original features. This transforms implicit patterns into explicit signals that the model can leverage more efficiently.
Common Interaction Operators:
For numerical features $A$ and $B$, common interaction constructs include the product $A \times B$, the ratios $A/B$ and $B/A$, the difference $A - B$, the sum $A + B$, and the signed geometric mean $\operatorname{sign}(AB)\sqrt{|AB|}$, each of which appears in the helper functions below:
```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures
from itertools import combinations


def create_pairwise_interactions(df, numerical_cols,
                                 interaction_types=['product', 'ratio', 'diff']):
    """
    Create explicit pairwise interaction features for numerical columns.

    Parameters:
    -----------
    df : DataFrame
        Input data with numerical features
    numerical_cols : list
        List of column names to create interactions for
    interaction_types : list
        Types of interactions to create: 'product', 'ratio', 'diff', 'sum', 'sqrt_product'

    Returns:
    --------
    DataFrame with original columns plus interaction features
    """
    result = df.copy()
    epsilon = 1e-8  # Prevent division by zero

    for col_a, col_b in combinations(numerical_cols, 2):
        a, b = df[col_a], df[col_b]

        if 'product' in interaction_types:
            result[f'{col_a}_x_{col_b}'] = a * b

        if 'ratio' in interaction_types:
            # Create both directions for asymmetric ratios
            result[f'{col_a}_div_{col_b}'] = a / (b + epsilon)
            result[f'{col_b}_div_{col_a}'] = b / (a + epsilon)

        if 'diff' in interaction_types:
            result[f'{col_a}_minus_{col_b}'] = a - b

        if 'sum' in interaction_types:
            result[f'{col_a}_plus_{col_b}'] = a + b

        if 'sqrt_product' in interaction_types:
            # Geometric mean - handle negative values
            result[f'{col_a}_geomean_{col_b}'] = np.sign(a * b) * np.sqrt(np.abs(a * b))

    return result


def create_domain_interactions(df):
    """
    Example: Domain-specific interactions for housing price prediction.
    These encode expert knowledge about meaningful feature combinations.
    """
    result = df.copy()

    # Spaciousness: square footage relative to rooms
    if 'sqft' in df.columns and 'bedrooms' in df.columns:
        result['sqft_per_bedroom'] = df['sqft'] / (df['bedrooms'] + 1)

    # Bathroom ratio: bathrooms per bedroom (indicates luxury)
    if 'bathrooms' in df.columns and 'bedrooms' in df.columns:
        result['bath_bedroom_ratio'] = df['bathrooms'] / (df['bedrooms'] + 1)

    # Age-condition interaction: old but renovated vs old and dated
    if 'year_built' in df.columns and 'year_renovated' in df.columns:
        current_year = 2024
        result['effective_age'] = current_year - np.maximum(df['year_built'], df['year_renovated'])

    # Price per sqft for comparable features
    if 'lot_size' in df.columns and 'sqft' in df.columns:
        result['building_coverage'] = df['sqft'] / (df['lot_size'] + 1)

    return result


# Example usage with sklearn's PolynomialFeatures for systematic expansion
def polynomial_interaction_expansion(X, degree=2, include_bias=False):
    """
    Create polynomial feature expansion up to specified degree.

    For 3 features [a, b, c] with degree=2:
    Output: [a, b, c, a², ab, ac, b², bc, c²]
    """
    poly = PolynomialFeatures(degree=degree, include_bias=include_bias,
                              interaction_only=False)
    X_poly = poly.fit_transform(X)

    # Get feature names for interpretability
    feature_names = poly.get_feature_names_out()

    return pd.DataFrame(X_poly, columns=feature_names)


# Demonstration
if __name__ == "__main__":
    # Create sample data
    np.random.seed(42)
    df = pd.DataFrame({
        'sqft': np.random.uniform(1000, 3000, 100),
        'bedrooms': np.random.randint(2, 6, 100),
        'bathrooms': np.random.uniform(1, 4, 100),
        'lot_size': np.random.uniform(5000, 20000, 100),
    })

    # Create pairwise interactions
    df_interactions = create_pairwise_interactions(
        df,
        numerical_cols=['sqft', 'bedrooms', 'bathrooms'],
        interaction_types=['product', 'ratio']
    )

    print("Original features:", df.shape[1])
    print("With interactions:", df_interactions.shape[1])
    print("\nNew feature names:")
    for col in df_interactions.columns:
        if col not in df.columns:
            print(f"  - {col}")
```

Categorical × Categorical Interactions:
For categorical features, interactions create new combined categories:
Category A: {"small", "medium", "large"}
Category B: {"red", "blue"}
Interaction A×B: {"small_red", "small_blue", "medium_red", "medium_blue", "large_red", "large_blue"}
This can lead to cardinality explosion: if $A$ has $|A|$ categories and $B$ has $|B|$ categories, the interaction has $|A| \times |B|$ categories. For high-cardinality features, this becomes problematic—a feature with 1000 categories interacted with one having 500 categories produces 500,000 combined categories!
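A minimal sketch of building such a cross while keeping cardinality under control by pooling rare combinations (the helper name, column names, and the `min_count` threshold are illustrative):

```python
import pandas as pd

def cross_categories(df, col_a, col_b, min_count=30, other_token="__other__"):
    """Combine two categorical columns into one crossed feature,
    pooling rare combinations into a single 'other' bucket."""
    crossed = df[col_a].astype(str) + "_" + df[col_b].astype(str)
    counts = crossed.value_counts()
    rare = counts[counts < min_count].index
    return crossed.where(~crossed.isin(rare), other_token)

# Example
df = pd.DataFrame({
    "size": ["small", "medium", "large", "small", "large"] * 20,
    "color": ["red", "blue", "red", "blue", "red"] * 20,
})
df["size_x_color"] = cross_categories(df, "size", "color", min_count=5)
print(df["size_x_color"].value_counts())
```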
Categorical × Numerical Interactions:
A powerful pattern is creating category-conditional statistics:
```python
# For each category in 'region', compute the mean of 'price' within that category
df['price_mean_by_region'] = df.groupby('region')['price'].transform('mean')

# The deviation from category mean captures individual vs. group behavior
df['price_vs_region_mean'] = df['price'] - df['price_mean_by_region']
```
This pattern—computing within-group statistics—is foundational to target encoding, which we cover in detail in the next section.
With $p$ features, there are $\binom{p}{2} = \frac{p(p-1)}{2}$ possible pairwise interactions and exponentially more higher-order interactions. Systematically searching this space requires principled approaches to identify the most valuable interactions.
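To get a feel for how fast this space grows, a quick calculation using the formula above (plain arithmetic, no modeling assumptions):

```python
from math import comb

for p in [10, 50, 200, 1000]:
    print(f"p={p:>4}: pairwise={comb(p, 2):>8,}  three-way={comb(p, 3):>12,}")
# p=  10: pairwise=      45  three-way=         120
# p=1000: pairwise= 499,500  three-way= 166,167,000
```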
Method 1: Tree-Based Interaction Detection
Decision trees provide a natural mechanism for detecting interactions: features that frequently appear together in the same path through the tree are likely interacting. We can quantify this by computing co-occurrence statistics across an ensemble:
```python
import numpy as np
from collections import defaultdict
from sklearn.ensemble import GradientBoostingRegressor
import xgboost as xgb


def detect_tree_interactions(model, feature_names, min_depth_diff=1):
    """
    Detect feature interactions by analyzing decision paths in tree ensemble.
    Two features interact if they appear together in the same path,
    with one feature splitting on a node that is ancestor of the other.

    Parameters:
    -----------
    model : tree ensemble (sklearn or xgboost)
        Trained tree-based model
    feature_names : list
        Names of features
    min_depth_diff : int
        Minimum depth difference to consider as interaction

    Returns:
    --------
    dict : interaction scores for feature pairs
    """
    interaction_counts = defaultdict(int)
    total_paths = 0

    # Extract trees from model (works for sklearn GB and RF)
    if hasattr(model, 'estimators_'):
        # sklearn GradientBoosting/RandomForest
        estimators = [est[0] if hasattr(est, '__getitem__') else est
                      for est in model.estimators_]
    else:
        raise ValueError("Unsupported model type")

    for tree_model in estimators:
        tree = tree_model.tree_

        # Traverse all root-to-leaf paths
        def traverse_path(node_id, path_features, depth_map):
            if tree.children_left[node_id] == tree.children_right[node_id]:
                # Leaf node - analyze the path
                path_feature_list = list(path_features)
                for i, feat_i in enumerate(path_feature_list):
                    for feat_j in path_feature_list[i+1:]:
                        depth_i = depth_map[feat_i]
                        depth_j = depth_map[feat_j]
                        if abs(depth_i - depth_j) >= min_depth_diff:
                            pair = tuple(sorted([feat_i, feat_j]))
                            interaction_counts[pair] += 1
                return 1

            # Internal node - get feature and recurse
            feature = tree.feature[node_id]
            current_depth = len(path_features)

            new_path = path_features | {feature}
            new_depth_map = depth_map.copy()
            if feature not in new_depth_map:
                new_depth_map[feature] = current_depth

            left_paths = traverse_path(tree.children_left[node_id], new_path, new_depth_map)
            right_paths = traverse_path(tree.children_right[node_id], new_path, new_depth_map)
            return left_paths + right_paths

        total_paths += traverse_path(0, set(), {})

    # Normalize by total paths
    interaction_scores = {
        (feature_names[pair[0]], feature_names[pair[1]]): count / total_paths
        for pair, count in interaction_counts.items()
    }

    return dict(sorted(interaction_scores.items(), key=lambda x: -x[1]))


def detect_interactions_friedman_h(model, X, feature_names, n_samples=1000):
    """
    Compute Friedman's H-statistic for pairwise interactions.

    H(i,j) measures the fraction of variance of F(x_i, x_j)
    not captured by the sum of the partial dependence functions
    PD(x_i) + PD(x_j).

    H = 0 indicates no interaction
    H = 1 indicates pure interaction (no main effects)

    Note: This is computationally expensive O(n² * p²)
    """
    from sklearn.inspection import partial_dependence

    n_features = len(feature_names)
    h_statistics = {}

    # Sample subset for efficiency
    if len(X) > n_samples:
        idx = np.random.choice(len(X), n_samples, replace=False)
        X_sample = X[idx]
    else:
        X_sample = X

    for i in range(n_features):
        for j in range(i + 1, n_features):
            # Get partial dependence for individual features
            pd_i = partial_dependence(model, X_sample, [i], kind='average')
            pd_j = partial_dependence(model, X_sample, [j], kind='average')

            # Get joint partial dependence
            pd_ij = partial_dependence(model, X_sample, [i, j], kind='average')

            # Compute H-statistic (simplified version)
            # Full computation requires integration over the grid
            var_pdij = np.var(pd_ij['average'])
            var_sum = np.var(pd_i['average']) + np.var(pd_j['average'])

            if var_pdij > 0:
                h_stat = 1 - var_sum / var_pdij
                h_stat = max(0, h_stat)  # Clamp to [0, 1]
            else:
                h_stat = 0

            h_statistics[(feature_names[i], feature_names[j])] = h_stat

    return dict(sorted(h_statistics.items(), key=lambda x: -x[1]))
```

Method 2: Friedman's H-Statistic
Friedman's H-statistic provides a theoretically grounded measure of interaction strength based on partial dependence decomposition. For features $x_i$ and $x_j$:
$$H^2_{ij} = \frac{\sum_k \left[ f_{ij}(x_i^{(k)}, x_j^{(k)}) - f_i(x_i^{(k)}) - f_j(x_j^{(k)}) \right]^2}{\sum_k f_{ij}^2(x_i^{(k)}, x_j^{(k)})}$$
Where:

- $f_{ij}$ is the two-feature partial dependence function of $x_i$ and $x_j$,
- $f_i$ and $f_j$ are the single-feature partial dependence functions (all centered to have mean zero), and
- the sums run over the observed data points indexed by $k$.
The H-statistic ranges from 0 (no interaction, fully additive) to 1 (pure interaction, no main effects).
Method 3: ANOVA-based Interaction Testing
For designed experiments or when computational resources are limited, classical ANOVA approaches can identify significant interactions: fit a model with main effects only, fit a second model that adds the $x_i x_j$ term, and use an F-test to decide whether the interaction coefficient significantly improves the fit.
This approach is statistically rigorous but assumes linear/polynomial functional forms.
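A minimal sketch of this workflow using statsmodels' formula interface on toy data (the data-generating coefficients are illustrative); the nested-model F-test asks whether the $x_1 x_2$ term significantly improves the fit:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Toy data with a genuine x1:x2 interaction (coefficients are illustrative)
rng = np.random.default_rng(1)
df = pd.DataFrame({"x1": rng.normal(size=500), "x2": rng.normal(size=500)})
df["y"] = 1.0 + 0.5 * df.x1 + 0.8 * df.x2 + 1.5 * df.x1 * df.x2 + rng.normal(size=500)

main_only = smf.ols("y ~ x1 + x2", data=df).fit()
with_int = smf.ols("y ~ x1 * x2", data=df).fit()   # x1 * x2 expands to x1 + x2 + x1:x2

# F-test: does adding the x1:x2 term significantly improve the fit?
print(sm.stats.anova_lm(main_only, with_int))
print("interaction coefficient:", with_int.params["x1:x2"])
```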
| Method | Strengths | Limitations | Complexity |
|---|---|---|---|
| Tree co-occurrence | Fast, model-specific, captures what the model actually learned | Biased toward features with many splits | O(n_trees × tree_size) |
| H-statistic | Theoretically grounded, interpretable scale | Computationally expensive, requires many samples | O(p² × n² × grid_size) |
| ANOVA/F-test | Statistical significance testing, confidence intervals | Assumes parametric form, misses non-polynomial interactions | O(p² × n) |
| Permutation-based | Model-agnostic, captures any interaction type | Very slow, high variance estimates | O(p² × n × n_permutations) |
Translating interaction theory into practice requires balancing multiple concerns: feature redundancy, computational overhead, overfitting risk, and interpretability. The following guidelines synthesize best practices from production machine learning systems.
Guideline 1: Start with Domain Knowledge
Before automated interaction detection, consult domain experts. Known physical, economic, or business relationships should be encoded explicitly: physical laws (e.g., force = mass × acceleration), ratio features such as square footage per bedroom or debt-to-income, and risk modifiers such as the debt-ratio × employment-tenure interaction from the credit example above.
These domain interactions are almost certainly valuable and should be included unconditionally.
Guideline 2: Handle Correlation with Care
Interaction features are often highly correlated with their constituent features, which can cause issues: diluted or double-counted feature importance scores, redundant splits that waste model capacity, and unstable coefficients if the features are later reused in linear or regularized models.
For tree-based models, correlation is less problematic than for linear models, but it affects interpretability and can waste capacity.
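One practical diagnostic, sketched below with hypothetical helper and column names, is to measure how strongly each candidate interaction correlates with its parent features and drop near-duplicates before training:

```python
import numpy as np
import pandas as pd

def drop_redundant_interactions(df, interaction_cols, parent_map, max_corr=0.95):
    """Drop interaction columns that are nearly collinear with one of their parents.

    parent_map: dict mapping interaction column -> list of parent column names.
    """
    keep = []
    for col in interaction_cols:
        corrs = [abs(df[col].corr(df[p])) for p in parent_map[col]]
        if max(corrs) < max_corr:
            keep.append(col)
    return df.drop(columns=[c for c in interaction_cols if c not in keep])

# Example with housing-style features: when bedrooms closely tracks sqft,
# the product adds little information beyond sqft itself
rng = np.random.default_rng(3)
df = pd.DataFrame({"sqft": rng.uniform(800, 3000, 500)})
df["bedrooms"] = np.clip(np.round(df["sqft"] / 600), 1, 6)
df["sqft_x_bedrooms"] = df["sqft"] * df["bedrooms"]
print(df.corr().round(2))           # inspect redundancy before deciding what to keep
df = drop_redundant_interactions(df, ["sqft_x_bedrooms"],
                                 {"sqft_x_bedrooms": ["sqft", "bedrooms"]})
print(df.columns.tolist())
```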
Guideline 3: Scaling Considerations
Product interactions can produce features with very different scales:
Original: feature_A in [0, 100], feature_B in [0, 100]
Product: feature_A × feature_B in [0, 10000]
For tree-based models, scaling is not critical since splits adapt to any scale. But for regularized models or when combining with non-tree methods, standardization after creating interactions is recommended.
In most practical datasets, a small number of high-value interactions provide the majority of improvement. Resist the temptation to add many weak interactions—the computational and complexity costs typically outweigh marginal gains. Focus on the top 5-10 interactions as measured by detection methods or domain knowledge.
Beyond basic pairwise interactions, advanced techniques can capture more complex relationships or operate more efficiently at scale.
Interaction Networks and Factorization Machines:
Factorization Machines (FMs) represent interactions through latent factor decomposition:
$$\hat{y}(x) = w_0 + \sum_{i=1}^{p} w_i x_i + \sum_{i=1}^{p} \sum_{j=i+1}^{p} \langle v_i, v_j \rangle x_i x_j$$
Where $v_i \in \mathbb{R}^k$ is a $k$-dimensional latent vector for feature $i$, and $\langle v_i, v_j \rangle$ is their inner product.
The key insight: instead of learning $O(p^2)$ interaction weights, FMs learn $O(p \times k)$ latent factors, enabling efficient modeling of sparse, high-dimensional interactions (e.g., user-item interactions in recommender systems).
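A minimal numpy sketch of the FM prediction (random, untrained parameters; no training loop shown), using the standard identity $\sum_{i<j}\langle v_i, v_j\rangle x_i x_j = \tfrac{1}{2}\sum_{f=1}^{k}\big[(\sum_i v_{if} x_i)^2 - \sum_i v_{if}^2 x_i^2\big]$, which reduces the pairwise term from $O(p^2)$ to $O(pk)$:

```python
import numpy as np

def fm_predict(X, w0, w, V):
    """Factorization Machine prediction.

    X : (n, p) feature matrix
    w0: scalar bias, w: (p,) linear weights, V: (p, k) latent factors
    """
    linear = w0 + X @ w
    # Pairwise term via the O(p*k) identity:
    #   sum_{i<j} <v_i, v_j> x_i x_j = 0.5 * sum_f [ (X V)_f^2 - (X^2)(V^2)_f ]
    XV = X @ V                      # (n, k)
    X2V2 = (X ** 2) @ (V ** 2)      # (n, k)
    pairwise = 0.5 * np.sum(XV ** 2 - X2V2, axis=1)
    return linear + pairwise

# Sanity check against the naive O(p^2) double sum
rng = np.random.default_rng(0)
n, p, k = 4, 6, 3
X = rng.normal(size=(n, p))
w0, w, V = 0.1, rng.normal(size=p), rng.normal(size=(p, k))

naive = w0 + X @ w + np.array([
    sum(V[i] @ V[j] * x[i] * x[j] for i in range(p) for j in range(i + 1, p))
    for x in X
])
assert np.allclose(fm_predict(X, w0, w, V), naive)
print("FM fast pairwise term matches the naive double sum")
```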
Neural Network Interaction Learning:
Deep neural networks can learn complex, non-linear interactions through hidden layer representations. Architectures specifically designed for learning interactions in tabular data include Deep & Cross Networks (DCN), DeepFM, and attention-based models such as AutoInt.
For gradient boosting, neural-learned interaction features can be added as inputs to GBDT models, combining the strengths of both approaches.
```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import StandardScaler


class AutoInteractionTransformer(BaseEstimator, TransformerMixin):
    """
    Automatic interaction feature generator that identifies and creates
    valuable interaction features based on a pre-trained model.

    Designed to integrate with sklearn Pipelines and boosting workflows.
    """

    def __init__(self, base_model=None, top_k=10,
                 interaction_types=['product', 'ratio'],
                 min_importance_threshold=0.01):
        """
        Parameters:
        -----------
        base_model : estimator
            Tree-based model for interaction detection
            (if None, uses GradientBoostingRegressor)
        top_k : int
            Number of top interactions to create
        interaction_types : list
            Types of interactions: 'product', 'ratio', 'diff', 'min', 'max'
        min_importance_threshold : float
            Minimum importance score to consider a feature for interactions
        """
        self.base_model = base_model
        self.top_k = top_k
        self.interaction_types = interaction_types
        self.min_importance_threshold = min_importance_threshold
        self.selected_interactions_ = []
        self.scaler_ = None

    def fit(self, X, y):
        """Fit the transformer by identifying valuable interactions."""
        from sklearn.ensemble import GradientBoostingRegressor

        if self.base_model is None:
            self.base_model = GradientBoostingRegressor(
                n_estimators=100, max_depth=4, random_state=42
            )

        # Fit the base model
        self.base_model.fit(X, y)

        # Get feature importances
        importances = self.base_model.feature_importances_

        # Filter features by importance threshold
        important_features = np.where(importances >= self.min_importance_threshold)[0]

        # Compute interaction candidates (using co-occurrence in trees)
        interaction_scores = self._compute_interaction_scores(X, important_features)

        # Select top-k interactions
        sorted_interactions = sorted(interaction_scores.items(), key=lambda x: -x[1])
        self.selected_interactions_ = [pair for pair, score in sorted_interactions[:self.top_k]]

        # Fit scaler on training data
        X_interactions = self._create_interactions(X)
        if X_interactions.shape[1] > 0:
            self.scaler_ = StandardScaler()
            self.scaler_.fit(X_interactions)

        return self

    def _compute_interaction_scores(self, X, important_features):
        """Compute interaction scores based on model structure."""
        scores = {}

        # Simple heuristic: product of importances as interaction potential
        importances = self.base_model.feature_importances_

        for i, feat_i in enumerate(important_features):
            for feat_j in important_features[i+1:]:
                # Score by geometric mean of importances
                score = np.sqrt(importances[feat_i] * importances[feat_j])
                scores[(feat_i, feat_j)] = score

        return scores

    def _create_interactions(self, X):
        """Create interaction features for selected pairs."""
        if len(self.selected_interactions_) == 0:
            return np.empty((X.shape[0], 0))

        interactions = []
        epsilon = 1e-8

        for feat_i, feat_j in self.selected_interactions_:
            a, b = X[:, feat_i], X[:, feat_j]

            if 'product' in self.interaction_types:
                interactions.append(a * b)
            if 'ratio' in self.interaction_types:
                interactions.append(a / (b + epsilon))
            if 'diff' in self.interaction_types:
                interactions.append(a - b)
            if 'min' in self.interaction_types:
                interactions.append(np.minimum(a, b))
            if 'max' in self.interaction_types:
                interactions.append(np.maximum(a, b))

        return np.column_stack(interactions) if interactions else np.empty((X.shape[0], 0))

    def transform(self, X):
        """Transform by adding interaction features."""
        X_interactions = self._create_interactions(X)

        if X_interactions.shape[1] > 0 and self.scaler_ is not None:
            X_interactions = self.scaler_.transform(X_interactions)

        return np.hstack([X, X_interactions])

    def get_interaction_names(self, feature_names):
        """Get names of created interaction features."""
        names = []
        for feat_i, feat_j in self.selected_interactions_:
            name_i = feature_names[feat_i] if feature_names else f"f{feat_i}"
            name_j = feature_names[feat_j] if feature_names else f"f{feat_j}"

            for int_type in self.interaction_types:
                if int_type == 'product':
                    names.append(f"{name_i}_x_{name_j}")
                elif int_type == 'ratio':
                    names.append(f"{name_i}_div_{name_j}")
                elif int_type == 'diff':
                    names.append(f"{name_i}_minus_{name_j}")
                elif int_type in ['min', 'max']:
                    names.append(f"{int_type}_{name_i}_{name_j}")

        return names
```

Higher-Order Interactions:
For problems with complex, multi-way relationships, systematically generating higher-order interactions follows the same combinatorial pattern: $\binom{p}{2}$ pairwise terms, $\binom{p}{3}$ three-way terms, and in general $\binom{p}{k}$ terms of order $k$.
Exhaustive generation is infeasible for even moderate $p$. Practical strategies for handling higher-order interactions include restricting candidates to the most important features, building higher-order terms only on top of validated pairwise interactions, relying on latent-factor methods such as factorization machines, and letting deeper trees capture the remainder implicitly; a sketch of the first strategy follows.
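The short sketch below (the helper name and the choice of $m$ are illustrative) generates three-way products only among the $m$ most important features:

```python
import numpy as np
from itertools import combinations

def top_m_threeway_products(X, importances, feature_names, m=5):
    """Generate 3-way product features only among the m most important features."""
    top = np.argsort(importances)[::-1][:m]
    new_cols, new_names = [], []
    for i, j, k in combinations(top, 3):
        new_cols.append(X[:, i] * X[:, j] * X[:, k])
        new_names.append(f"{feature_names[i]}_x_{feature_names[j]}_x_{feature_names[k]}")
    return np.column_stack(new_cols), new_names

# With m=5 this adds only C(5,3)=10 columns instead of C(p,3) for the full feature set
```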
Feature interactions are fundamental to capturing the complexity of real-world relationships in predictive models. This page has covered the theory, detection, engineering, and advanced techniques for leveraging interactions in gradient boosting systems.
What's Next:
The next page explores Target Encoding—a powerful technique for handling categorical features that naturally incorporates target information while managing overfitting risk. Target encoding can be viewed as a sophisticated form of categorical-to-numerical interaction, connecting directly to the concepts we've covered here.
You now have a comprehensive understanding of feature interactions in gradient boosting—from theoretical foundations to practical engineering. You can identify when explicit interactions are needed, detect the most valuable interaction candidates, and implement interaction features that enhance model performance.