Embedded methods represent the elegant middle ground between filter and wrapper approaches. They perform feature selection as an integral part of the model training process, rather than as a separate preprocessing step (filter) or through external search (wrapper).
The key insight: many learning algorithms can be modified—or naturally possess—mechanisms that drive certain feature weights toward zero or explicitly select features during optimization. The selection emerges embedded within the learning process itself.
This integration offers compelling advantages:
- Efficiency: selection happens during a single training run, avoiding the repeated model fits that wrapper methods require
- Model awareness: features are judged by the same objective the model optimizes, unlike model-agnostic filter scores
- Built-in regularization: the mechanism that removes features also guards against overfitting
Embedded methods answer the question: 'Instead of selecting features then training, or training to evaluate feature subsets, can we do both simultaneously?' The answer is yes—through regularization, tree-based splitting rules, and attention mechanisms that naturally induce sparsity or feature weighting.
Regularization adds a penalty term to the loss function that discourages model complexity. When this penalty is structured appropriately, it drives irrelevant feature weights to zero, effectively performing feature selection.
A regularized learning objective takes the form:
$$\min_w \mathcal{L}(w; X, y) + \lambda \Omega(w)$$
where:
- $\mathcal{L}(w; X, y)$ is the model's loss on the training data
- $\Omega(w)$ is a penalty on the weights
- $\lambda \geq 0$ controls the strength of the penalty
Different forms of $\Omega(w)$ induce different behaviors:
| Regularization | Formula | Sparsity | Behavior |
|---|---|---|---|
| L2 (Ridge) | $\|w\|_2^2 = \sum_i w_i^2$ | No | Shrinks all weights uniformly toward zero |
| L1 (Lasso) | $\|w\|_1 = \sum_i \lvert w_i \rvert$ | Yes | Drives irrelevant weights exactly to zero |
| Elastic Net | $\alpha\|w\|_1 + (1-\alpha)\|w\|_2^2$ | Yes | Combines L1 sparsity with L2 stability |
The L1 penalty has corners at the axes in the constraint region. When optimizing, the solution often lands exactly on these corners, making some weights exactly zero. L2's circular constraint region has no corners—solutions almost never hit the axes, so weights shrink but rarely vanish entirely. This geometric intuition explains why L1 selects features while L2 merely shrinks them.
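This geometric picture is easy to verify empirically. Below is a minimal sketch on synthetic data (the penalty strengths are illustrative choices): only 3 of 10 features matter, and we count how many coefficients each penalty drives exactly to zero.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: only the first 3 of 10 features matter
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = X[:, 0] * 3 + X[:, 1] * 2 + X[:, 2] + rng.normal(scale=0.5, size=200)

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)

# L1 lands on the corners: irrelevant weights become exactly zero.
# L2 only shrinks: weights get small but stay nonzero.
print("Lasso exact zeros:", np.sum(np.abs(lasso.coef_) < 1e-10))
print("Ridge exact zeros:", np.sum(np.abs(ridge.coef_) < 1e-10))
```

Typically the Lasso zeros out most of the seven irrelevant features while Ridge zeros out none.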
LASSO (Least Absolute Shrinkage and Selection Operator) is the foundational embedded method for linear models. By adding an L1 penalty, LASSO simultaneously fits the model and selects features.
For linear regression with features $X \in \mathbb{R}^{n \times p}$ and target $y \in \mathbb{R}^n$:
$$\min_w \frac{1}{2n}|y - Xw|_2^2 + \lambda |w|_1$$
The solution $w^*$ will have many entries exactly equal to zero when $\lambda$ is sufficiently large. Features with $w_i^* = 0$ are effectively removed.
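In the special case of an orthonormal design, this solution has a closed form: soft-thresholding of the OLS coefficients, which makes the exact zeros directly visible. A small numpy sketch with illustrative values:

```python
import numpy as np

def soft_threshold(w_hat, lam):
    """Per-coordinate LASSO solution when the design matrix is orthonormal."""
    return np.sign(w_hat) * np.maximum(np.abs(w_hat) - lam, 0.0)

w_ols = np.array([3.0, -0.4, 0.05, 1.2, -0.02])
w_lasso = soft_threshold(w_ols, 0.5)
print(w_lasso)
# Entries smaller than lambda (here 0.5) become exactly 0;
# larger entries are shrunk toward zero by 0.5
```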
As $\lambda$ varies:
- $\lambda = 0$: ordinary least squares; all features are retained
- Moderate $\lambda$: weights shrink and the least useful ones hit exactly zero
- Very large $\lambda$: every weight is driven to zero
The regularization path shows how each weight evolves as $\lambda$ changes. Features that persist across a wide range of $\lambda$ values are generally more important.
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Lasso, LassoCV
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_diabetes

# Load data
data = load_diabetes()
X, y = data.data, data.target
feature_names = data.feature_names

# IMPORTANT: Standardize features for fair penalization
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Explore regularization path
alphas = np.logspace(-4, 1, 100)
coefs = []

for alpha in alphas:
    lasso = Lasso(alpha=alpha, max_iter=10000)
    lasso.fit(X_scaled, y)
    coefs.append(lasso.coef_)

coefs = np.array(coefs)

# Plot regularization path
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
for i, name in enumerate(feature_names):
    plt.plot(alphas, coefs[:, i], label=name)
plt.xscale('log')
plt.xlabel('Alpha (regularization strength)')
plt.ylabel('Coefficient value')
plt.title('LASSO Regularization Path')
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
plt.axhline(y=0, color='k', linestyle='--', alpha=0.3)

# Use cross-validation to find optimal alpha
plt.subplot(1, 2, 2)
lasso_cv = LassoCV(alphas=alphas, cv=5)
lasso_cv.fit(X_scaled, y)

print(f"Optimal alpha: {lasso_cv.alpha_:.6f}")
print("Feature selection results:")
for name, coef in zip(feature_names, lasso_cv.coef_):
    status = "SELECTED" if np.abs(coef) > 1e-10 else "removed"
    print(f"  {name:10s}: {coef:8.4f} [{status}]")

# Plot CV scores
mse_mean = lasso_cv.mse_path_.mean(axis=1)
mse_std = lasso_cv.mse_path_.std(axis=1)
plt.errorbar(lasso_cv.alphas_, mse_mean, yerr=mse_std, alpha=0.5)
plt.axvline(lasso_cv.alpha_, color='r', linestyle='--',
            label=f'Optimal α={lasso_cv.alpha_:.4f}')
plt.xscale('log')
plt.xlabel('Alpha')
plt.ylabel('Mean Squared Error')
plt.title('LASSO Cross-Validation')
plt.legend()

plt.tight_layout()
plt.show()

n_selected = np.sum(np.abs(lasso_cv.coef_) > 1e-10)
print(f"Selected {n_selected} of {len(feature_names)} features")
```

Theoretical Properties:
- Convexity: the LASSO objective is convex, so it can be solved efficiently to a global optimum
- Exact sparsity: for sufficiently large $\lambda$, many coefficients are exactly zero
- Selection consistency: under conditions on the design (e.g., the irrepresentable condition), LASSO recovers the true set of relevant features as $n$ grows
Practical Limitations:
- Requires standardized features; otherwise the penalty unfairly punishes features measured on large scales
- Selects at most $n$ features when $p > n$
- Coefficient estimates are biased toward zero by the shrinkage
- Makes unstable choices among highly correlated features
If features X₁ and X₂ are highly correlated and both predict Y, LASSO typically selects only one and sets the other to zero—the choice can be essentially arbitrary. This means LASSO's feature selection shouldn't be interpreted as 'X₂ is irrelevant.' For grouped correlated features, consider Elastic Net or Group LASSO.
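A tiny sketch of this arbitrariness, using synthetic data with two nearly identical copies of one signal (the noise scales and penalty are illustrative):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
x = rng.normal(size=500)
# Two nearly identical copies of the same predictor
X = np.column_stack([x, x + rng.normal(scale=1e-6, size=500)])
y = 2 * x + rng.normal(scale=0.1, size=500)

coef = Lasso(alpha=0.05).fit(X, y).coef_
print(coef)
# Nearly all the weight lands on one copy; the other is ~0,
# even though both predictors are equally informative
```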
Elastic Net addresses LASSO's limitations with correlated features by combining L1 and L2 penalties:
$$\min_w \frac{1}{2n}|y - Xw|_2^2 + \lambda \left( \alpha |w|_1 + \frac{1-\alpha}{2}|w|_2^2 \right)$$
where $\alpha \in [0, 1]$ controls the mix:
- $\alpha = 1$: pure LASSO (L1 only)
- $\alpha = 0$: pure Ridge (L2 only)
- $0 < \alpha < 1$: a weighted blend of both penalties
The L2 component provides:
- A grouping effect: correlated features receive similar weights instead of one being arbitrarily chosen
- Numerical stability, including when $p > n$
The L1 component provides:
- Sparsity: weights of irrelevant features are still driven exactly to zero
- An interpretable, reduced feature set
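scikit-learn's ElasticNet follows this same parameterization: its `alpha` plays the role of $\lambda$ and `l1_ratio` the role of $\alpha$. A quick sanity check on illustrative synthetic data that `l1_ratio=1.0` collapses to plain LASSO:

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([2.0, -1.0, 0.0, 0.0, 0.5]) + rng.normal(scale=0.1, size=100)

enet = ElasticNet(alpha=0.1, l1_ratio=1.0).fit(X, y)  # pure L1 mix
lasso = Lasso(alpha=0.1).fit(X, y)
print(np.allclose(enet.coef_, lasso.coef_))  # True: identical objectives
```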
```python
import numpy as np
from sklearn.linear_model import ElasticNetCV, LassoCV
from sklearn.preprocessing import StandardScaler

# Create data with correlated features
np.random.seed(42)
n_samples = 200

# Base features
X_base = np.random.randn(n_samples, 5)

# Create correlated features (duplicates with noise)
X_correlated = np.hstack([
    X_base,
    X_base + np.random.randn(n_samples, 5) * 0.1,        # Highly correlated
    X_base * 0.5 + np.random.randn(n_samples, 5) * 0.5,  # Moderately correlated
    np.random.randn(n_samples, 5)                        # Independent noise features
])

# Target depends on first 5 features
y = X_base @ np.array([3, 2, 1, 0.5, 0.25]) + np.random.randn(n_samples) * 0.5

# Standardize
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_correlated)

# Compare LASSO vs Elastic Net
lasso_cv = LassoCV(cv=5, random_state=42)
lasso_cv.fit(X_scaled, y)

elastic_cv = ElasticNetCV(l1_ratio=[0.1, 0.3, 0.5, 0.7, 0.9], cv=5, random_state=42)
elastic_cv.fit(X_scaled, y)

print("Coefficient comparison (features 0-9 are correlated versions of same base):")
print(f"{'Feature':>10} {'LASSO':>10} {'ElasticNet':>10}")
print("-" * 32)
for i in range(20):
    l_coef = lasso_cv.coef_[i]
    e_coef = elastic_cv.coef_[i]
    print(f"{i:>10} {l_coef:>10.4f} {e_coef:>10.4f}")

print(f"LASSO alpha: {lasso_cv.alpha_:.4f}")
print(f"ElasticNet alpha: {elastic_cv.alpha_:.4f}, l1_ratio: {elastic_cv.l1_ratio_:.2f}")

# Count non-zero features
lasso_selected = np.sum(np.abs(lasso_cv.coef_) > 1e-10)
elastic_selected = np.sum(np.abs(elastic_cv.coef_) > 1e-10)
print(f"LASSO selected: {lasso_selected} features")
print(f"ElasticNet selected: {elastic_selected} features")

# Notice: ElasticNet keeps correlated features with similar weights
# while LASSO picks one and discards others
```

| Scenario | Recommended Method | Rationale |
|---|---|---|
| Few features are relevant, low correlation | LASSO | Clean sparse selection |
| Correlated feature groups, want all representatives | Elastic Net (low α) | Grouping effect keeps correlated features |
| p >> n (more features than samples) | Elastic Net | Handles rank-deficiency better than LASSO |
| Need interpretable sparse model | LASSO | Simplest sparse output |
| Uncertain about correlation structure | Elastic Net with CV over α | Let data decide optimal balance |
Decision trees and their ensembles (Random Forest, Gradient Boosting) naturally provide feature importance through their splitting mechanism. Unlike regularization that zeros out weights, trees implicitly select features by choosing which features to split on.
The most common tree-based importance measure computes how much each feature reduces impurity across all splits:
$$\text{MDI}(f) = \sum_{\text{nodes } t \text{ splitting on } f} p(t) \cdot \Delta i(t)$$
where:
- $p(t)$ is the proportion of samples that reach node $t$
- $\Delta i(t)$ is the impurity decrease produced by the split at node $t$
For a Random Forest, MDI is averaged across all trees.
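As a sanity check on the formula, MDI can be recomputed by hand from a single fitted tree's `tree_` arrays. The sketch below sums $p(t) \cdot \Delta i(t)$ over every internal node and reproduces scikit-learn's `feature_importances_`, which is this sum normalized to total 1:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
clf = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)

t = clf.tree_
n = t.weighted_n_node_samples[0]  # total samples at the root
mdi = np.zeros(X.shape[1])
for node in range(t.node_count):
    left, right = t.children_left[node], t.children_right[node]
    if left == -1:  # leaf: no split, no contribution
        continue
    # p(t) * Δi(t): node probability times the impurity decrease of its split
    contrib = (t.weighted_n_node_samples[node] * t.impurity[node]
               - t.weighted_n_node_samples[left] * t.impurity[left]
               - t.weighted_n_node_samples[right] * t.impurity[right]) / n
    mdi[t.feature[node]] += contrib

mdi /= mdi.sum()  # sklearn normalizes importances to sum to 1
print(np.allclose(mdi, clf.feature_importances_))  # True
```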
An alternative that doesn't rely on impurity: permutation importance. It measures how much the model's score drops when a single feature's values are randomly shuffled, breaking that feature's relationship with the target while leaving everything else intact.
Advantages over MDI:
- Can be computed on held-out data, so it reflects generalization rather than training-set fit
- Not biased toward high-cardinality features
- Applies to any fitted model, not just trees
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer

# Load data
data = load_breast_cancer()
X, y = data.data, data.target
feature_names = data.feature_names

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Train Random Forest
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Get MDI importance
mdi_importance = rf.feature_importances_

# Get permutation importance on test set
perm_importance = permutation_importance(
    rf, X_test, y_test, n_repeats=30, random_state=42, n_jobs=-1)

# Compare importances
fig, axes = plt.subplots(1, 2, figsize=(14, 8))

# Sort by MDI importance
mdi_sorted_idx = np.argsort(mdi_importance)[-15:]  # Top 15

axes[0].barh(range(15), mdi_importance[mdi_sorted_idx])
axes[0].set_yticks(range(15))
axes[0].set_yticklabels(feature_names[mdi_sorted_idx])
axes[0].set_xlabel('Mean Decrease in Impurity')
axes[0].set_title('MDI Importance (Top 15)')

# Sort by permutation importance
perm_sorted_idx = np.argsort(perm_importance.importances_mean)[-15:]

axes[1].barh(range(15), perm_importance.importances_mean[perm_sorted_idx])
axes[1].errorbar(
    perm_importance.importances_mean[perm_sorted_idx], range(15),
    xerr=perm_importance.importances_std[perm_sorted_idx],
    fmt='none', color='black', capsize=3)
axes[1].set_yticks(range(15))
axes[1].set_yticklabels(feature_names[perm_sorted_idx])
axes[1].set_xlabel('Mean Accuracy Decrease')
axes[1].set_title('Permutation Importance (Top 15)')

plt.tight_layout()
plt.show()

# Feature selection based on importance threshold
importance_threshold = 0.01
selected_mdi = feature_names[mdi_importance > importance_threshold]
selected_perm = feature_names[perm_importance.importances_mean > importance_threshold]

print(f"Selected by MDI (threshold={importance_threshold}): {len(selected_mdi)} features")
print(f"Selected by Permutation (threshold={importance_threshold}): {len(selected_perm)} features")
```

MDI is biased toward features with many unique values (high cardinality). A random ID column might appear highly important because it provides many potential split points, each perfectly separating some samples. Permutation importance avoids this bias but is more computationally expensive.
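The cardinality bias is easy to demonstrate: append a pure-noise column with a unique value per row to a weakly informative dataset and compare the two importance measures for it. A sketch on synthetic data (exact numbers will vary with the seed):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Weakly informative data plus one high-cardinality pure-noise "ID" column
X, y = make_classification(n_samples=300, n_features=5, n_informative=2,
                           n_redundant=0, random_state=0)
rng = np.random.default_rng(0)
X_id = np.column_stack([X, rng.permutation(len(X))])  # unique value per row

X_tr, X_te, y_tr, y_te = train_test_split(X_id, y, random_state=0)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

perm = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=0)
# MDI credits the ID column for its many split points;
# permutation importance on held-out data shows it is worthless
print(f"ID column MDI importance:         {rf.feature_importances_[-1]:.3f}")
print(f"ID column permutation importance: {perm.importances_mean[-1]:.3f}")
```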
Gradient Boosting methods (XGBoost, LightGBM, CatBoost) have become dominant in structured data tasks and offer sophisticated embedded feature selection capabilities.
1. Regularized Objectives
XGBoost and LightGBM include regularization in their objectives:
$$\mathcal{L} = \sum_i l(y_i, \hat{y}_i) + \sum_k \left( \gamma T_k + \frac{1}{2}\lambda |w_k|^2 + \alpha |w_k|_1 \right)$$
where for tree $k$: $T_k$ is the number of leaves, $w_k$ are leaf weights, and $\gamma, \lambda, \alpha$ are regularization parameters.
2. Minimum Split Gain
A split is only made if it improves the objective by at least a minimum gain (`gamma`/`min_split_loss` in XGBoost, `min_gain_to_split` in LightGBM). This prevents splits on marginally useful features.
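scikit-learn trees expose an analogous knob, `min_impurity_decrease`, which makes the effect easy to see without a gradient boosting library: raising the threshold prunes marginal splits, so fewer distinct features end up being used. A small sketch:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# A split happens only if it reduces (sample-weighted) impurity
# by at least min_impurity_decrease
for threshold in [0.0, 0.01]:
    tree = DecisionTreeClassifier(min_impurity_decrease=threshold,
                                  random_state=0).fit(X, y)
    # tree_.feature is -2 for leaves; >= 0 marks a real split feature
    used = np.unique(tree.tree_.feature[tree.tree_.feature >= 0])
    print(f"threshold={threshold}: {len(used)} distinct features used in splits")
```

With the threshold raised, deep low-gain splits disappear and the tree concentrates on fewer features.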
3. Feature Sampling
`colsample_bytree` and `colsample_bylevel` randomly select feature subsets, reducing reliance on any single feature and enabling importance estimation.
4. Maximum Features
Limit which features can be used, forcing the model to identify the most essential ones.
```python
import numpy as np
import xgboost as xgb
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

# Load data
data = load_breast_cancer()
X, y = data.data, data.target
feature_names = list(data.feature_names)  # DMatrix expects a list of names

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Create DMatrix for XGBoost
dtrain = xgb.DMatrix(X_train, label=y_train, feature_names=feature_names)
dtest = xgb.DMatrix(X_test, label=y_test, feature_names=feature_names)

# Train with regularization for implicit feature selection
params = {
    'objective': 'binary:logistic',
    'eval_metric': 'auc',
    'max_depth': 4,
    'learning_rate': 0.1,
    'reg_alpha': 1.0,         # L1 regularization (induces sparsity)
    'reg_lambda': 1.0,        # L2 regularization
    'min_split_loss': 0.1,    # Minimum gain for split (gamma)
    'colsample_bytree': 0.8,  # Feature sampling
    'subsample': 0.8,
    'seed': 42
}

model = xgb.train(
    params, dtrain,
    num_boost_round=100,
    evals=[(dtest, 'test')],
    early_stopping_rounds=10,
    verbose_eval=False)

# Get feature importance (multiple types)
importance_types = ['weight', 'gain', 'cover']
importances = {}
for imp_type in importance_types:
    importances[imp_type] = model.get_score(importance_type=imp_type)

# Compare importance types
fig, axes = plt.subplots(1, 3, figsize=(15, 6))

for ax, imp_type in zip(axes, importance_types):
    imp_dict = importances[imp_type]
    if imp_dict:
        sorted_imp = sorted(imp_dict.items(), key=lambda x: x[1], reverse=True)[:10]
        features, values = zip(*sorted_imp)
        ax.barh(range(len(features)), values)
        ax.set_yticks(range(len(features)))
        ax.set_yticklabels(features)
        ax.set_xlabel(imp_type.capitalize())
        ax.set_title(f'Top 10 by {imp_type}')
        ax.invert_yaxis()

plt.tight_layout()
plt.show()

# Features with zero importance are effectively not selected
all_features = set(feature_names)
used_features = set(importances['weight'].keys())
unused_features = all_features - used_features

print(f"Features used: {len(used_features)}/{len(all_features)}")
print(f"Unused features: {unused_features if unused_features else 'None'}")
```

| Type | Meaning | When to Use |
|---|---|---|
| weight (frequency) | Number of times feature used in splits | Quick overview of feature usage |
| gain | Average gain when feature is used | Best for assessing predictive contribution |
| cover | Average number of samples affected by splits | Understanding feature reach |
| total_gain | Sum of gains (gain × weight) | Overall importance considering frequency |
| total_cover | Sum of coverage across splits | Total sample impact |
Neural networks can learn to attend to relevant features, providing a differentiable form of soft feature selection. Unlike hard selection (feature in or out), attention assigns continuous importance weights that can vary per sample.
Models like TabNet use sequential attention to select features step-by-step:
- At each decision step, an attentive transformer produces a sparse mask over the features (using sparsemax rather than softmax)
- The masked features are processed by a feature transformer that contributes to the final prediction
- A prior term discourages reusing the same features at every step, spreading attention across complementary features
- Aggregating the masks across steps yields per-sample and global importance scores
The attention weights provide interpretable feature importance: features receiving high attention are deemed important for prediction.
Unlike global methods that select the same features for all samples, attention mechanisms can select different features for different instances. A loan application from a student might focus on education features, while one from a retiree focuses on assets.
```python
import torch
import torch.nn as nn
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


class AttentionFeatureSelector(nn.Module):
    """
    Simple attention-based feature selection network.
    Learns to weight features based on their relevance.
    """
    def __init__(self, n_features, hidden_dim=64):
        super().__init__()
        # Attention mechanism
        self.attention = nn.Sequential(
            nn.Linear(n_features, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, n_features),
            nn.Sigmoid()  # Attention weights in [0, 1]
        )
        # Classifier on attended features
        self.classifier = nn.Sequential(
            nn.Linear(n_features, hidden_dim),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(hidden_dim, 1),
            nn.Sigmoid()
        )

    def forward(self, x):
        # Compute per-sample attention weights
        attention_weights = self.attention(x)
        # Apply attention (soft feature selection)
        attended = x * attention_weights
        # Classify
        output = self.classifier(attended)
        return output, attention_weights

    def get_feature_importance(self, X):
        """Average attention weights across samples."""
        self.eval()
        with torch.no_grad():
            _, weights = self.forward(X)
        return weights.mean(dim=0).numpy()


# Example usage
data = load_breast_cancer()
X, y = data.data, data.target

# Preprocess
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42)

# Convert to tensors
X_train_t = torch.FloatTensor(X_train)
y_train_t = torch.FloatTensor(y_train).unsqueeze(1)
X_test_t = torch.FloatTensor(X_test)
y_test_t = torch.FloatTensor(y_test).unsqueeze(1)

# Train model
model = AttentionFeatureSelector(n_features=X.shape[1])
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
criterion = nn.BCELoss()

# Training loop
for epoch in range(100):
    model.train()
    optimizer.zero_grad()
    outputs, attn = model(X_train_t)  # one forward pass gives both outputs
    loss = criterion(outputs, y_train_t)
    # Optional: add an L1-style penalty on attention weights for sparsity
    sparsity_loss = 0.01 * attn.mean()  # Encourage lower attention values
    total_loss = loss + sparsity_loss
    total_loss.backward()
    optimizer.step()

# Get learned feature importance
importance = model.get_feature_importance(X_test_t)

# Display results
sorted_idx = np.argsort(importance)[::-1]
print("Learned feature importance (top 10):")
for idx in sorted_idx[:10]:
    print(f"  {data.feature_names[idx]}: {importance[idx]:.4f}")
```

TabNet (Google, 2019) uses sequential sparse attention for interpretable feature selection on tabular data. It achieves competitive accuracy with gradient boosting while providing instance-wise feature importance. Other modern approaches include TabTransformer, FT-Transformer, and SAINT—all incorporating attention-based feature weighting.
| Method | Model Type | Selection Type | Handles Correlation | Interpretability |
|---|---|---|---|---|
| LASSO | Linear | Hard (exact zeros) | Poorly (arbitrary) | High (coefficients) |
| Elastic Net | Linear | Hard | Well (grouping) | High (coefficients) |
| Random Forest MDI | Tree ensemble | Soft (importance scores) | Moderately | Medium |
| XGBoost/LightGBM | Boosted trees | Soft + Hard (via regularization) | Well | Medium-High |
| Neural Attention | Neural network | Soft (instance-wise) | Well | Medium (attention maps) |
Use LASSO when:
- The relationship is roughly linear and you expect only a few relevant features
- You want a simple, interpretable sparse model
- Features are not strongly correlated

Use Elastic Net when:
- Correlated feature groups exist and you want to keep representatives of each
- $p \gg n$, where LASSO's selection becomes unstable
- You are unsure about the correlation structure and can cross-validate the L1/L2 mix

Use Tree-based importance when:
- Relationships are nonlinear or involve feature interactions
- Features are on mixed scales or types and you want to skip standardization
- You are training a tree ensemble anyway and importance comes for free

Use Attention-based when:
- Different samples may rely on different features (instance-wise selection)
- You have enough data to train a neural network reliably
- The selector needs to sit inside an end-to-end deep learning pipeline
We've covered the three main paradigms: filter, wrapper, and embedded methods. Next, we'll explore Stability Selection—a meta-approach that addresses a crucial question: how do we know if our selected features are truly important, or just lucky artifacts of the particular data sample?