In the realm of machine learning, individual features rarely tell the complete story. Consider predicting house prices: the number of bedrooms alone provides limited information, and square footage by itself is similarly incomplete. But when we combine these features—bedrooms relative to square footage—we unlock a powerful signal: spaciousness per room. This synergy between features, known as feature interaction, often holds more predictive power than any single feature could provide alone.
Gradient boosting algorithms have a remarkable, often underappreciated property: they can automatically discover and leverage feature interactions through their hierarchical tree structure. Yet understanding this mechanism deeply—and knowing when to explicitly engineer interactions—separates practitioners who achieve good results from those who achieve exceptional ones.
By the end of this page, you will understand how gradient boosting models capture feature interactions, the theoretical foundations of interaction effects, techniques for explicit interaction engineering, and advanced methods for discovering high-value interactions in high-dimensional datasets. You'll gain the expertise to systematically enhance boosting model performance through strategic feature interaction design.
A feature interaction occurs when the effect of one feature on the target variable depends on the value of another feature. This is fundamentally different from additive effects, where each feature contributes independently to the prediction.
Formal Definition:
Let $f(x)$ be a model predicting target $y$ from features $x = (x_1, x_2, \ldots, x_p)$. Features $x_i$ and $x_j$ exhibit an interaction if the second-order partial derivative is non-zero:
$$\frac{\partial^2 f(x)}{\partial x_i \partial x_j} \neq 0$$
This mathematical formulation captures the intuition that changing $x_i$ affects how $x_j$ influences the prediction—the hallmark of interaction effects.
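For a quick worked example of this criterion, take a model with a single product term:

$$f(x) = \beta_1 x_1 + \beta_2 x_2 + \beta_{12}\, x_1 x_2 \quad\Longrightarrow\quad \frac{\partial^2 f}{\partial x_1 \partial x_2} = \beta_{12}$$

so $x_1$ and $x_2$ interact exactly when $\beta_{12} \neq 0$; for a purely additive model, the mixed partial derivative vanishes everywhere.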
Types of Interactions:
Interactions manifest in several distinct forms, including multiplicative relationships, threshold (conditional) effects, and more complex non-linear dependencies, each with different implications for modeling.
In real-world datasets, feature interactions are the norm rather than the exception. Physical systems exhibit multiplicative relationships (force = mass × acceleration), biological systems show threshold behaviors (enzyme activation), and economic systems display complex non-linear dependencies. Any model that cannot capture interactions will systematically underperform on realistic problems.
Why Interactions Matter for Prediction:
Consider a concrete example from credit risk modeling. Suppose we have two features: $x_1$, the borrower's debt-to-income ratio, and $x_2$, their employment tenure in years.
A linear model might learn: $P(\text{default}) = \beta_0 + \beta_1 x_1 + \beta_2 x_2$
But the reality is more nuanced: high debt-to-income is dangerous for new employees but manageable for stable, long-tenured workers. The true relationship involves an interaction:
$$P(\text{default}) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_{12} x_1 x_2$$
Where $\beta_{12} < 0$ indicates that employment tenure mitigates the risk associated with high debt ratios. Without modeling this interaction, the predictor systematically misjudges risk for specific population segments.
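To make this concrete, here is a minimal synthetic sketch (the feature definitions and data-generating coefficients are illustrative, not real credit data) comparing a logistic regression with and without the explicit $x_1 x_2$ column:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic data: x1 = debt-to-income ratio, x2 = employment tenure (illustrative)
rng = np.random.default_rng(0)
n = 20_000
x1 = rng.uniform(0, 1, n)          # debt-to-income ratio
x2 = rng.uniform(0, 20, n)         # years of tenure
# Default risk rises with debt load, but tenure dampens it (negative interaction)
logit = -2.0 + 3.0 * x1 - 0.05 * x2 - 0.12 * x1 * x2
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))

X_main = np.column_stack([x1, x2])
X_int = np.column_stack([x1, x2, x1 * x2])   # add the explicit interaction term

for name, X in [("main effects only", X_main), ("with x1*x2 term", X_int)]:
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
    print(f"{name}: test AUC = {auc:.3f}")
```

On data generated with a genuine interaction, the model that sees the explicit product column typically scores higher, because the linear model cannot otherwise represent the tenure-dependent effect of debt load.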
One of the most elegant properties of tree-based models, including gradient boosting decision trees (GBDT), is their natural ability to capture feature interactions without explicit specification. This capability emerges from the hierarchical structure of decision trees.
The Mechanics of Implicit Interaction Learning:
When a decision tree makes successive splits on different features, it partitions the feature space into rectangular regions. Each region corresponds to a unique path from root to leaf, and importantly, the prediction for samples in that region depends on the combination of feature values—not just individual features.
Consider a tree that first splits on feature $A$ at threshold $t_A$, then splits on feature $B$ at threshold $t_B$ in both branches. The resulting four leaf nodes represent:

- $A \le t_A$ and $B \le t_B$
- $A \le t_A$ and $B > t_B$
- $A > t_A$ and $B \le t_B$
- $A > t_A$ and $B > t_B$
Each leaf can have a different prediction, meaning the effect of $B$ on the prediction depends on whether $A$ is above or below $t_A$. This is precisely the definition of an interaction!
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeRegressor

# Generate XOR-like interaction data
np.random.seed(42)
n_samples = 1000
X = np.random.randn(n_samples, 2)
# XOR interaction: target high when both features same sign
y = (X[:, 0] * X[:, 1] > 0).astype(float) + 0.1 * np.random.randn(n_samples)

# Fit a decision tree - it will naturally capture the interaction
tree = DecisionTreeRegressor(max_depth=3, random_state=42)
tree.fit(X, y)

# Visualize the learned decision boundaries
xx, yy = np.meshgrid(np.linspace(-3, 3, 100), np.linspace(-3, 3, 100))
Z = tree.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# True interaction surface
axes[0].contourf(xx, yy, (xx * yy > 0).astype(float), alpha=0.8, cmap='RdBu')
axes[0].scatter(X[:, 0], X[:, 1], c=y, cmap='RdBu', edgecolor='k', s=20)
axes[0].set_title('True XOR Interaction Pattern', fontsize=12)
axes[0].set_xlabel('Feature A')
axes[0].set_ylabel('Feature B')

# Tree's learned approximation
axes[1].contourf(xx, yy, Z, alpha=0.8, cmap='RdBu')
axes[1].scatter(X[:, 0], X[:, 1], c=y, cmap='RdBu', edgecolor='k', s=20)
axes[1].set_title('Tree-Learned Decision Boundaries', fontsize=12)
axes[1].set_xlabel('Feature A')
axes[1].set_ylabel('Feature B')

plt.tight_layout()
plt.show()

# The tree captures the interaction by creating axis-aligned rectangles
# that approximate the diagonal XOR decision boundary
```

Interaction Depth and Tree Depth:
The depth of a tree directly determines the maximum order of interactions it can capture:
| Tree Depth | Maximum Interaction Order | Example |
|---|---|---|
| 1 (stump) | 1 (main effects only, no interactions) | Single feature threshold |
| 2 | 2-way interactions | A × B |
| 3 | 3-way interactions | A × B × C |
| d | d-way interactions | Up to d features combined |
This relationship has profound implications for gradient boosting. When using shallow trees (depth 2-4), the ensemble builds up a rich additive combination of low-order interactions across many trees, but the maximum interaction order remains bounded by the depth of the individual trees. This differs fundamentally from a single deep tree, which captures higher-order interactions within one structure.
The Boosting Advantage:
Gradient boosting's iterative nature provides a specific advantage for learning interactions: each subsequent tree can focus on residual errors in interaction-rich regions of the feature space. If the first tree captures a main effect, the second tree fits the residual, which often contains interaction patterns that the main effect missed.
When tuning gradient boosting models, max_depth controls interaction complexity. Setting max_depth=1 (stumps) creates an additive model with no interactions. Increasing max_depth allows higher-order interactions but risks overfitting. For most tabular problems, a max_depth between 3 and 8 provides a good balance, permitting meaningful 3-way to 8-way interactions while keeping each tree small enough to act as a regularizer.
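This is easy to verify empirically. The sketch below (synthetic data, illustrative hyperparameters) fits gradient boosting to a pure $x_1 \times x_2$ target: depth-1 stumps yield an additive model that cannot fit it, while depth-3 trees can:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(5000, 2))
y = X[:, 0] * X[:, 1] + 0.1 * rng.normal(size=5000)   # pure interaction target

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

for depth in [1, 3]:
    gbr = GradientBoostingRegressor(n_estimators=300, max_depth=depth,
                                    learning_rate=0.1, random_state=0)
    gbr.fit(X_tr, y_tr)
    print(f"max_depth={depth}: test R^2 = {gbr.score(X_te, y_te):.3f}")

# Depth-1 stumps form an additive model and cannot fit x1*x2 (test R^2 near 0);
# trees of depth >= 2 can combine both features along one path and fit it well.
```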
While trees naturally capture interactions, this capability has significant limitations that practitioners must understand. Relying solely on implicit interaction learning can lead to suboptimal models in several scenarios.
The "Split Dilution" Problem:
Consider an interaction between features $A$ and $B$ where the true relationship is $y = A \times B$. For a tree to capture this:

- it must first split on $A$ (or $B$), even though that feature alone explains little of the variance, and
- it must then split on the other feature within each resulting branch, so that the leaf values can approximate the product.
The challenge: the first split on $A$ alone provides relatively weak signal (since $A$ by itself has limited predictive power). The algorithm might prefer a different feature $C$ with stronger marginal signal, never discovering the $A \times B$ interaction.
This phenomenon—where strong interactions between weak marginal features get overlooked—is called split dilution.
| Scenario | Problem | Solution |
|---|---|---|
| Weak marginal features | Individual features have low importance but interact strongly | Explicitly create interaction features |
| High-cardinality categoricals | Too many unique values to split effectively | Target encoding with interaction awareness |
| Rare interaction patterns | Interaction only matters in small subpopulation | Segment-specific features or oversampling |
| Symmetric interactions | Order of splits doesn't matter but tree must choose | Create symmetric interaction features (e.g., A×B) |
| Continuous × continuous | Multiplicative relationship hard to approximate with steps | Explicit polynomial or ratio features |
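The "weak marginal features" row is easy to reproduce. In the minimal sketch below (synthetic data, illustrative coefficients), the target depends on $A \times B$ plus a strong main effect from $C$; adding the explicit product column typically lifts the cross-validated score of a shallow boosted model under the same budget:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(7)
n = 10_000
A = rng.normal(size=n)
B = rng.normal(size=n)
C = rng.normal(size=n)
# A and B matter only through their product (weak marginals); C has a strong main effect
y = 2.0 * A * B + 1.0 * C + 0.5 * rng.normal(size=n)

X_implicit = np.column_stack([A, B, C])
X_explicit = np.column_stack([A, B, C, A * B])   # add the interaction explicitly

model = GradientBoostingRegressor(n_estimators=200, max_depth=3, random_state=0)
for name, X in [("implicit (A, B, C)", X_implicit), ("explicit (A, B, C, A*B)", X_explicit)]:
    r2 = cross_val_score(model, X, y, cv=3, scoring="r2").mean()
    print(f"{name}: mean CV R^2 = {r2:.3f}")
```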
Sample Efficiency:
Implicitly learning interactions requires sufficient samples in each leaf to estimate the interaction effect reliably. For a depth-$d$ tree with balanced splits, the number of leaves is $2^d$. If total training samples is $n$, each leaf has approximately $n/2^d$ samples.
For a depth-6 tree with 100,000 samples: each leaf has ~1,500 samples. For 1,000,000 samples: ~15,000 per leaf. This seems adequate, but remember: splits are rarely balanced, the interaction may only matter inside a small subpopulation, and boosting typically subsamples rows and features, so the effective number of samples available to estimate any particular interaction is often far smaller than the average suggests.
Explicitly engineered interaction features often provide stronger signal with fewer samples because the model directly observes the combined effect.
Computational Considerations:
Deep trees require more computation both in training (more splits to evaluate) and inference (longer paths to traverse). If a multiplicative interaction $A \times B$ is known to be important, adding it as an explicit feature allows shallower trees to capture the pattern, improving both speed and generalization.
There's a tension in practice: domain experts who know which interactions matter can engineer powerful features, but this requires significant domain knowledge. Meanwhile, tree-based models can discover interactions automatically but may miss the most valuable ones. The best practitioners combine both approaches—using domain knowledge for known interactions while letting the model discover unexpected ones.
Explicit interaction engineering involves creating new features that directly capture the combined effect of two or more original features. This transforms implicit patterns into explicit signals that the model can leverage more efficiently.
Common Interaction Operators:
For numerical features $A$ and $B$, common interaction constructs include the product $A \times B$, the ratios $A/B$ and $B/A$, the difference $A - B$, the sum $A + B$, and the signed geometric mean $\operatorname{sign}(AB)\sqrt{|AB|}$, each of which appears in the helper functions below:
```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures
from itertools import combinations


def create_pairwise_interactions(df, numerical_cols,
                                 interaction_types=['product', 'ratio', 'diff']):
    """
    Create explicit pairwise interaction features for numerical columns.

    Parameters:
    -----------
    df : DataFrame
        Input data with numerical features
    numerical_cols : list
        List of column names to create interactions for
    interaction_types : list
        Types of interactions to create: 'product', 'ratio', 'diff', 'sum', 'sqrt_product'

    Returns:
    --------
    DataFrame with original columns plus interaction features
    """
    result = df.copy()
    epsilon = 1e-8  # Prevent division by zero

    for col_a, col_b in combinations(numerical_cols, 2):
        a, b = df[col_a], df[col_b]

        if 'product' in interaction_types:
            result[f'{col_a}_x_{col_b}'] = a * b

        if 'ratio' in interaction_types:
            # Create both directions for asymmetric ratios
            result[f'{col_a}_div_{col_b}'] = a / (b + epsilon)
            result[f'{col_b}_div_{col_a}'] = b / (a + epsilon)

        if 'diff' in interaction_types:
            result[f'{col_a}_minus_{col_b}'] = a - b

        if 'sum' in interaction_types:
            result[f'{col_a}_plus_{col_b}'] = a + b

        if 'sqrt_product' in interaction_types:
            # Geometric mean - handle negative values
            result[f'{col_a}_geomean_{col_b}'] = np.sign(a * b) * np.sqrt(np.abs(a * b))

    return result


def create_domain_interactions(df):
    """
    Example: Domain-specific interactions for housing price prediction.
    These encode expert knowledge about meaningful feature combinations.
    """
    result = df.copy()

    # Spaciousness: square footage relative to rooms
    if 'sqft' in df.columns and 'bedrooms' in df.columns:
        result['sqft_per_bedroom'] = df['sqft'] / (df['bedrooms'] + 1)

    # Bathroom ratio: bathrooms per bedroom (indicates luxury)
    if 'bathrooms' in df.columns and 'bedrooms' in df.columns:
        result['bath_bedroom_ratio'] = df['bathrooms'] / (df['bedrooms'] + 1)

    # Age-condition interaction: old but renovated vs old and dated
    if 'year_built' in df.columns and 'year_renovated' in df.columns:
        current_year = 2024
        result['effective_age'] = current_year - np.maximum(df['year_built'], df['year_renovated'])

    # Price per sqft for comparable features
    if 'lot_size' in df.columns and 'sqft' in df.columns:
        result['building_coverage'] = df['sqft'] / (df['lot_size'] + 1)

    return result


# Example usage with sklearn's PolynomialFeatures for systematic expansion
def polynomial_interaction_expansion(X, degree=2, include_bias=False):
    """
    Create polynomial feature expansion up to specified degree.

    For 3 features [a, b, c] with degree=2:
    Output: [a, b, c, a², ab, ac, b², bc, c²]
    """
    poly = PolynomialFeatures(degree=degree, include_bias=include_bias,
                              interaction_only=False)
    X_poly = poly.fit_transform(X)

    # Get feature names for interpretability
    feature_names = poly.get_feature_names_out()

    return pd.DataFrame(X_poly, columns=feature_names)


# Demonstration
if __name__ == "__main__":
    # Create sample data
    np.random.seed(42)
    df = pd.DataFrame({
        'sqft': np.random.uniform(1000, 3000, 100),
        'bedrooms': np.random.randint(2, 6, 100),
        'bathrooms': np.random.uniform(1, 4, 100),
        'lot_size': np.random.uniform(5000, 20000, 100),
    })

    # Create pairwise interactions
    df_interactions = create_pairwise_interactions(
        df,
        numerical_cols=['sqft', 'bedrooms', 'bathrooms'],
        interaction_types=['product', 'ratio']
    )

    print("Original features:", df.shape[1])
    print("With interactions:", df_interactions.shape[1])
    print("\nNew feature names:")
    for col in df_interactions.columns:
        if col not in df.columns:
            print(f"  - {col}")
```

Categorical × Categorical Interactions:
For categorical features, interactions create new combined categories:
Category A: {"small", "medium", "large"}
Category B: {"red", "blue"}
Interaction A×B: {"small_red", "small_blue", "medium_red", "medium_blue", "large_red", "large_blue"}
This can lead to cardinality explosion: if $A$ has $|A|$ categories and $B$ has $|B|$ categories, the interaction has $|A| \times |B|$ categories. For high-cardinality features, this becomes problematic—a feature with 1000 categories interacted with one having 500 categories produces 500,000 combined categories!
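A minimal sketch of building such a cross while keeping cardinality under control by pooling rare combinations (the helper name, column names, and the `min_count` threshold are illustrative):

```python
import pandas as pd

def cross_categories(df, col_a, col_b, min_count=30, other_token="__other__"):
    """Combine two categorical columns into one crossed feature,
    pooling rare combinations into a single 'other' bucket."""
    crossed = df[col_a].astype(str) + "_" + df[col_b].astype(str)
    counts = crossed.value_counts()
    rare = counts[counts < min_count].index
    return crossed.where(~crossed.isin(rare), other_token)

# Example
df = pd.DataFrame({
    "size": ["small", "medium", "large", "small", "large"] * 20,
    "color": ["red", "blue", "red", "blue", "red"] * 20,
})
df["size_x_color"] = cross_categories(df, "size", "color", min_count=5)
print(df["size_x_color"].value_counts())
```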
Categorical × Numerical Interactions:
A powerful pattern is creating category-conditional statistics:
```python
# For each category in 'region', compute the mean of 'price' within that category
df['price_mean_by_region'] = df.groupby('region')['price'].transform('mean')

# The deviation from category mean captures individual vs. group behavior
df['price_vs_region_mean'] = df['price'] - df['price_mean_by_region']
```
This pattern—computing within-group statistics—is foundational to target encoding, which we cover in detail in the next section.
With $p$ features, there are $\binom{p}{2} = \frac{p(p-1)}{2}$ possible pairwise interactions and exponentially more higher-order interactions. Systematically searching this space requires principled approaches to identify the most valuable interactions.
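To get a feel for how fast this space grows, a quick calculation using the formula above (plain arithmetic, no modeling assumptions):

```python
from math import comb

for p in [10, 50, 200, 1000]:
    print(f"p={p:>4}: pairwise={comb(p, 2):>8,}  three-way={comb(p, 3):>12,}")
# p=  10: pairwise=      45  three-way=         120
# p=1000: pairwise= 499,500  three-way= 166,167,000
```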
Method 1: Tree-Based Interaction Detection
Decision trees provide a natural mechanism for detecting interactions: features that frequently appear together in the same path through the tree are likely interacting. We can quantify this by computing co-occurrence statistics across an ensemble:
```python
import numpy as np
from collections import defaultdict
from sklearn.ensemble import GradientBoostingRegressor
import xgboost as xgb


def detect_tree_interactions(model, feature_names, min_depth_diff=1):
    """
    Detect feature interactions by analyzing decision paths in tree ensemble.
    Two features interact if they appear together in the same path,
    with one feature splitting on a node that is ancestor of the other.

    Parameters:
    -----------
    model : tree ensemble (sklearn or xgboost)
        Trained tree-based model
    feature_names : list
        Names of features
    min_depth_diff : int
        Minimum depth difference to consider as interaction

    Returns:
    --------
    dict : interaction scores for feature pairs
    """
    interaction_counts = defaultdict(int)
    total_paths = 0

    # Extract trees from model (works for sklearn GB and RF)
    if hasattr(model, 'estimators_'):
        # sklearn GradientBoosting/RandomForest
        estimators = [est[0] if hasattr(est, '__getitem__') else est
                      for est in model.estimators_]
    else:
        raise ValueError("Unsupported model type")

    for tree_model in estimators:
        tree = tree_model.tree_

        # Traverse all root-to-leaf paths
        def traverse_path(node_id, path_features, depth_map):
            if tree.children_left[node_id] == tree.children_right[node_id]:
                # Leaf node - analyze the path
                path_feature_list = list(path_features)
                for i, feat_i in enumerate(path_feature_list):
                    for feat_j in path_feature_list[i+1:]:
                        depth_i = depth_map[feat_i]
                        depth_j = depth_map[feat_j]
                        if abs(depth_i - depth_j) >= min_depth_diff:
                            pair = tuple(sorted([feat_i, feat_j]))
                            interaction_counts[pair] += 1
                return 1

            # Internal node - get feature and recurse
            feature = tree.feature[node_id]
            current_depth = len(path_features)

            new_path = path_features | {feature}
            new_depth_map = depth_map.copy()
            if feature not in new_depth_map:
                new_depth_map[feature] = current_depth

            left_paths = traverse_path(tree.children_left[node_id], new_path, new_depth_map)
            right_paths = traverse_path(tree.children_right[node_id], new_path, new_depth_map)
            return left_paths + right_paths

        total_paths += traverse_path(0, set(), {})

    # Normalize by total paths
    interaction_scores = {
        (feature_names[pair[0]], feature_names[pair[1]]): count / total_paths
        for pair, count in interaction_counts.items()
    }

    return dict(sorted(interaction_scores.items(), key=lambda x: -x[1]))


def detect_interactions_friedman_h(model, X, feature_names, n_samples=1000):
    """
    Compute Friedman's H-statistic for pairwise interactions.

    H(i,j) measures the fraction of variance of F(x_i, x_j)
    not captured by the sum of the partial dependence functions
    PD(x_i) + PD(x_j).

    H = 0 indicates no interaction
    H = 1 indicates pure interaction (no main effects)

    Note: This is computationally expensive O(n² * p²)
    """
    from sklearn.inspection import partial_dependence

    n_features = len(feature_names)
    h_statistics = {}

    # Sample subset for efficiency
    if len(X) > n_samples:
        idx = np.random.choice(len(X), n_samples, replace=False)
        X_sample = X[idx]
    else:
        X_sample = X

    for i in range(n_features):
        for j in range(i + 1, n_features):
            # Get partial dependence for individual features
            pd_i = partial_dependence(model, X_sample, [i], kind='average')
            pd_j = partial_dependence(model, X_sample, [j], kind='average')

            # Get joint partial dependence
            pd_ij = partial_dependence(model, X_sample, [i, j], kind='average')

            # Compute H-statistic (simplified version)
            # Full computation requires integration over the grid
            var_pdij = np.var(pd_ij['average'])
            var_sum = np.var(pd_i['average']) + np.var(pd_j['average'])

            if var_pdij > 0:
                h_stat = 1 - var_sum / var_pdij
                h_stat = max(0, h_stat)  # Clamp to [0, 1]
            else:
                h_stat = 0

            h_statistics[(feature_names[i], feature_names[j])] = h_stat

    return dict(sorted(h_statistics.items(), key=lambda x: -x[1]))
```

Method 2: Friedman's H-Statistic
Friedman's H-statistic provides a theoretically grounded measure of interaction strength based on partial dependence decomposition. For features $x_i$ and $x_j$:
$$H^2_{ij} = \frac{\sum_k \left[ f_{ij}(x_i^{(k)}, x_j^{(k)}) - f_i(x_i^{(k)}) - f_j(x_j^{(k)}) \right]^2}{\sum_k f_{ij}^2(x_i^{(k)}, x_j^{(k)})}$$
Where:

- $f_{ij}$ is the two-feature partial dependence function of $x_i$ and $x_j$,
- $f_i$ and $f_j$ are the single-feature partial dependence functions (all centered to have mean zero), and
- the sums run over the observed data points indexed by $k$.
The H-statistic ranges from 0 (no interaction, fully additive) to 1 (pure interaction, no main effects).
Method 3: ANOVA-based Interaction Testing
For designed experiments or when computational resources are limited, classical ANOVA approaches can identify significant interactions: fit a model with main effects only, fit a second model that adds the $x_i x_j$ term, and use an F-test to decide whether the interaction coefficient significantly improves the fit.
This approach is statistically rigorous but assumes linear/polynomial functional forms.
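A minimal sketch of this workflow using statsmodels' formula interface on toy data (the data-generating coefficients are illustrative); the nested-model F-test asks whether the $x_1 x_2$ term significantly improves the fit:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Toy data with a genuine x1:x2 interaction (coefficients are illustrative)
rng = np.random.default_rng(1)
df = pd.DataFrame({"x1": rng.normal(size=500), "x2": rng.normal(size=500)})
df["y"] = 1.0 + 0.5 * df.x1 + 0.8 * df.x2 + 1.5 * df.x1 * df.x2 + rng.normal(size=500)

main_only = smf.ols("y ~ x1 + x2", data=df).fit()
with_int = smf.ols("y ~ x1 * x2", data=df).fit()   # x1 * x2 expands to x1 + x2 + x1:x2

# F-test: does adding the x1:x2 term significantly improve the fit?
print(sm.stats.anova_lm(main_only, with_int))
print("interaction coefficient:", with_int.params["x1:x2"])
```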
| Method | Strengths | Limitations | Complexity |
|---|---|---|---|
| Tree co-occurrence | Fast, model-specific, captures what the model actually learned | Biased toward features with many splits | O(n_trees × tree_size) |
| H-statistic | Theoretically grounded, interpretable scale | Computationally expensive, requires many samples | O(p² × n² × grid_size) |
| ANOVA/F-test | Statistical significance testing, confidence intervals | Assumes parametric form, misses non-polynomial interactions | O(p² × n) |
| Permutation-based | Model-agnostic, captures any interaction type | Very slow, high variance estimates | O(p² × n × n_permutations) |
Translating interaction theory into practice requires balancing multiple concerns: feature redundancy, computational overhead, overfitting risk, and interpretability. The following guidelines synthesize best practices from production machine learning systems.
Guideline 1: Start with Domain Knowledge
Before automated interaction detection, consult domain experts. Known physical, economic, or business relationships should be encoded explicitly: physical laws (e.g., force = mass × acceleration), ratio features such as square footage per bedroom or debt-to-income, and risk modifiers such as the debt-ratio × employment-tenure interaction from the credit example above.
These domain interactions are almost certainly valuable and should be included unconditionally.
Guideline 2: Handle Correlation with Care
Interaction features are often highly correlated with their constituent features, which can cause issues: diluted or double-counted feature importance scores, redundant splits that waste model capacity, and unstable coefficients if the features are later reused in linear or regularized models.
For tree-based models, correlation is less problematic than for linear models, but it affects interpretability and can waste capacity.
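One practical diagnostic, sketched below with hypothetical helper and column names, is to measure how strongly each candidate interaction correlates with its parent features and drop near-duplicates before training:

```python
import numpy as np
import pandas as pd

def drop_redundant_interactions(df, interaction_cols, parent_map, max_corr=0.95):
    """Drop interaction columns that are nearly collinear with one of their parents.

    parent_map: dict mapping interaction column -> list of parent column names.
    """
    keep = []
    for col in interaction_cols:
        corrs = [abs(df[col].corr(df[p])) for p in parent_map[col]]
        if max(corrs) < max_corr:
            keep.append(col)
    return df.drop(columns=[c for c in interaction_cols if c not in keep])

# Example with housing-style features: when bedrooms closely tracks sqft,
# the product adds little information beyond sqft itself
rng = np.random.default_rng(3)
df = pd.DataFrame({"sqft": rng.uniform(800, 3000, 500)})
df["bedrooms"] = np.clip(np.round(df["sqft"] / 600), 1, 6)
df["sqft_x_bedrooms"] = df["sqft"] * df["bedrooms"]
print(df.corr().round(2))           # inspect redundancy before deciding what to keep
df = drop_redundant_interactions(df, ["sqft_x_bedrooms"],
                                 {"sqft_x_bedrooms": ["sqft", "bedrooms"]})
print(df.columns.tolist())
```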
Guideline 3: Scaling Considerations
Product interactions can produce features with very different scales:
Original: feature_A in [0, 100], feature_B in [0, 100]
Product: feature_A × feature_B in [0, 10000]
For tree-based models, scaling is not critical since splits adapt to any scale. But for regularized models or when combining with non-tree methods, standardization after creating interactions is recommended.
In most practical datasets, a small number of high-value interactions provide the majority of improvement. Resist the temptation to add many weak interactions—the computational and complexity costs typically outweigh marginal gains. Focus on the top 5-10 interactions as measured by detection methods or domain knowledge.
Beyond basic pairwise interactions, advanced techniques can capture more complex relationships or operate more efficiently at scale.
Interaction Networks and Factorization Machines:
Factorization Machines (FMs) represent interactions through latent factor decomposition:
$$\hat{y}(x) = w_0 + \sum_{i=1}^{p} w_i x_i + \sum_{i=1}^{p} \sum_{j=i+1}^{p} \langle v_i, v_j \rangle x_i x_j$$
Where $v_i \in \mathbb{R}^k$ is a $k$-dimensional latent vector for feature $i$, and $\langle v_i, v_j \rangle$ is their inner product.
The key insight: instead of learning $O(p^2)$ interaction weights, FMs learn $O(p \times k)$ latent factors, enabling efficient modeling of sparse, high-dimensional interactions (e.g., user-item interactions in recommender systems).
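A minimal numpy sketch of the FM prediction (random, untrained parameters; no training loop shown), using the standard identity $\sum_{i<j}\langle v_i, v_j\rangle x_i x_j = \tfrac{1}{2}\sum_{f=1}^{k}\big[(\sum_i v_{if} x_i)^2 - \sum_i v_{if}^2 x_i^2\big]$, which reduces the pairwise term from $O(p^2)$ to $O(pk)$:

```python
import numpy as np

def fm_predict(X, w0, w, V):
    """Factorization Machine prediction.

    X : (n, p) feature matrix
    w0: scalar bias, w: (p,) linear weights, V: (p, k) latent factors
    """
    linear = w0 + X @ w
    # Pairwise term via the O(p*k) identity:
    #   sum_{i<j} <v_i, v_j> x_i x_j = 0.5 * sum_f [ (X V)_f^2 - (X^2)(V^2)_f ]
    XV = X @ V                      # (n, k)
    X2V2 = (X ** 2) @ (V ** 2)      # (n, k)
    pairwise = 0.5 * np.sum(XV ** 2 - X2V2, axis=1)
    return linear + pairwise

# Sanity check against the naive O(p^2) double sum
rng = np.random.default_rng(0)
n, p, k = 4, 6, 3
X = rng.normal(size=(n, p))
w0, w, V = 0.1, rng.normal(size=p), rng.normal(size=(p, k))

naive = w0 + X @ w + np.array([
    sum(V[i] @ V[j] * x[i] * x[j] for i in range(p) for j in range(i + 1, p))
    for x in X
])
assert np.allclose(fm_predict(X, w0, w, V), naive)
print("FM fast pairwise term matches the naive double sum")
```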
Neural Network Interaction Learning:
Deep neural networks can learn complex, non-linear interactions through hidden layer representations. Architectures specifically designed for learning interactions in tabular data include Deep & Cross Networks (DCN), DeepFM, and attention-based models such as AutoInt.
For gradient boosting, neural-learned interaction features can be added as inputs to GBDT models, combining the strengths of both approaches.
```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import StandardScaler


class AutoInteractionTransformer(BaseEstimator, TransformerMixin):
    """
    Automatic interaction feature generator that identifies and creates
    valuable interaction features based on a pre-trained model.

    Designed to integrate with sklearn Pipelines and boosting workflows.
    """

    def __init__(self, base_model=None, top_k=10,
                 interaction_types=['product', 'ratio'],
                 min_importance_threshold=0.01):
        """
        Parameters:
        -----------
        base_model : estimator
            Tree-based model for interaction detection
            (if None, uses GradientBoostingRegressor)
        top_k : int
            Number of top interactions to create
        interaction_types : list
            Types of interactions: 'product', 'ratio', 'diff', 'min', 'max'
        min_importance_threshold : float
            Minimum importance score to consider a feature for interactions
        """
        self.base_model = base_model
        self.top_k = top_k
        self.interaction_types = interaction_types
        self.min_importance_threshold = min_importance_threshold
        self.selected_interactions_ = []
        self.scaler_ = None

    def fit(self, X, y):
        """Fit the transformer by identifying valuable interactions."""
        from sklearn.ensemble import GradientBoostingRegressor

        if self.base_model is None:
            self.base_model = GradientBoostingRegressor(
                n_estimators=100, max_depth=4, random_state=42
            )

        # Fit the base model
        self.base_model.fit(X, y)

        # Get feature importances
        importances = self.base_model.feature_importances_

        # Filter features by importance threshold
        important_features = np.where(importances >= self.min_importance_threshold)[0]

        # Compute interaction candidates (using co-occurrence in trees)
        interaction_scores = self._compute_interaction_scores(X, important_features)

        # Select top-k interactions
        sorted_interactions = sorted(interaction_scores.items(), key=lambda x: -x[1])
        self.selected_interactions_ = [pair for pair, score in sorted_interactions[:self.top_k]]

        # Fit scaler on training data
        X_interactions = self._create_interactions(X)
        if X_interactions.shape[1] > 0:
            self.scaler_ = StandardScaler()
            self.scaler_.fit(X_interactions)

        return self

    def _compute_interaction_scores(self, X, important_features):
        """Compute interaction scores based on model structure."""
        scores = {}

        # Simple heuristic: product of importances as interaction potential
        importances = self.base_model.feature_importances_

        for i, feat_i in enumerate(important_features):
            for feat_j in important_features[i+1:]:
                # Score by geometric mean of importances
                score = np.sqrt(importances[feat_i] * importances[feat_j])
                scores[(feat_i, feat_j)] = score

        return scores

    def _create_interactions(self, X):
        """Create interaction features for selected pairs."""
        if len(self.selected_interactions_) == 0:
            return np.empty((X.shape[0], 0))

        interactions = []
        epsilon = 1e-8

        for feat_i, feat_j in self.selected_interactions_:
            a, b = X[:, feat_i], X[:, feat_j]

            if 'product' in self.interaction_types:
                interactions.append(a * b)
            if 'ratio' in self.interaction_types:
                interactions.append(a / (b + epsilon))
            if 'diff' in self.interaction_types:
                interactions.append(a - b)
            if 'min' in self.interaction_types:
                interactions.append(np.minimum(a, b))
            if 'max' in self.interaction_types:
                interactions.append(np.maximum(a, b))

        return np.column_stack(interactions) if interactions else np.empty((X.shape[0], 0))

    def transform(self, X):
        """Transform by adding interaction features."""
        X_interactions = self._create_interactions(X)

        if X_interactions.shape[1] > 0 and self.scaler_ is not None:
            X_interactions = self.scaler_.transform(X_interactions)

        return np.hstack([X, X_interactions])

    def get_interaction_names(self, feature_names):
        """Get names of created interaction features."""
        names = []
        for feat_i, feat_j in self.selected_interactions_:
            name_i = feature_names[feat_i] if feature_names else f"f{feat_i}"
            name_j = feature_names[feat_j] if feature_names else f"f{feat_j}"

            for int_type in self.interaction_types:
                if int_type == 'product':
                    names.append(f"{name_i}_x_{name_j}")
                elif int_type == 'ratio':
                    names.append(f"{name_i}_div_{name_j}")
                elif int_type == 'diff':
                    names.append(f"{name_i}_minus_{name_j}")
                elif int_type in ['min', 'max']:
                    names.append(f"{int_type}_{name_i}_{name_j}")

        return names
```

Higher-Order Interactions:
For problems with complex, multi-way relationships, systematically generating higher-order interactions follows the same combinatorial pattern: $\binom{p}{2}$ pairwise terms, $\binom{p}{3}$ three-way terms, and in general $\binom{p}{k}$ terms of order $k$.
Exhaustive generation is infeasible for even moderate $p$. Practical strategies for handling higher-order interactions include restricting candidates to the most important features, building higher-order terms only on top of validated pairwise interactions, relying on latent-factor methods such as factorization machines, and letting deeper trees capture the remainder implicitly; a sketch of the first strategy follows.
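The short sketch below (the helper name and the choice of $m$ are illustrative) generates three-way products only among the $m$ most important features:

```python
import numpy as np
from itertools import combinations

def top_m_threeway_products(X, importances, feature_names, m=5):
    """Generate 3-way product features only among the m most important features."""
    top = np.argsort(importances)[::-1][:m]
    new_cols, new_names = [], []
    for i, j, k in combinations(top, 3):
        new_cols.append(X[:, i] * X[:, j] * X[:, k])
        new_names.append(f"{feature_names[i]}_x_{feature_names[j]}_x_{feature_names[k]}")
    return np.column_stack(new_cols), new_names

# With m=5 this adds only C(5,3)=10 columns instead of C(p,3) for the full feature set
```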
Feature interactions are fundamental to capturing the complexity of real-world relationships in predictive models. This page has covered the theory, detection, engineering, and advanced techniques for leveraging interactions in gradient boosting systems.
What's Next:
The next page explores Target Encoding—a powerful technique for handling categorical features that naturally incorporates target information while managing overfitting risk. Target encoding can be viewed as a sophisticated form of categorical-to-numerical interaction, connecting directly to the concepts we've covered here.
You now have a comprehensive understanding of feature interactions in gradient boosting—from theoretical foundations to practical engineering. You can identify when explicit interactions are needed, detect the most valuable interaction candidates, and implement interaction features that enhance model performance.