Embedded methods represent the elegant middle ground between filter and wrapper approaches. They perform feature selection as an integral part of the model training process, rather than as a separate preprocessing step (filter) or through external search (wrapper).
The key insight: many learning algorithms can be modified—or naturally possess—mechanisms that drive certain feature weights toward zero or explicitly select features during optimization. The selection emerges embedded within the learning process itself.
This integration offers compelling advantages:
- Efficiency: selection happens during a single training run, avoiding the repeated model fits that wrapper methods require
- Model awareness: features are judged by the same objective the model optimizes, unlike model-agnostic filter scores
- Built-in regularization: the mechanism that removes features also guards against overfitting
Embedded methods answer the question: 'Instead of selecting features then training, or training to evaluate feature subsets, can we do both simultaneously?' The answer is yes—through regularization, tree-based splitting rules, and attention mechanisms that naturally induce sparsity or feature weighting.
Regularization adds a penalty term to the loss function that discourages model complexity. When this penalty is structured appropriately, it drives irrelevant feature weights to zero, effectively performing feature selection.
A regularized learning objective takes the form:
$$\min_w \mathcal{L}(w; X, y) + \lambda \Omega(w)$$
where:
- $\mathcal{L}(w; X, y)$ is the model's loss on the training data
- $\Omega(w)$ is a penalty on the weights
- $\lambda \geq 0$ controls the strength of the penalty
Different forms of $\Omega(w)$ induce different behaviors:
| Regularization | Formula | Sparsity | Behavior |
|---|---|---|---|
| L2 (Ridge) | $\|w\|_2^2 = \sum_i w_i^2$ | No | Shrinks all weights uniformly toward zero |
| L1 (Lasso) | $\|w\|_1 = \sum_i \lvert w_i \rvert$ | Yes | Drives irrelevant weights exactly to zero |
| Elastic Net | $\alpha\|w\|_1 + (1-\alpha)\|w\|_2^2$ | Yes | Combines L1 sparsity with L2 stability |
The L1 penalty has corners at the axes in the constraint region. When optimizing, the solution often lands exactly on these corners, making some weights exactly zero. L2's circular constraint region has no corners—solutions almost never hit the axes, so weights shrink but rarely vanish entirely. This geometric intuition explains why L1 selects features while L2 merely shrinks them.
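This geometric picture is easy to verify empirically. Below is a minimal sketch on synthetic data (the penalty strengths are illustrative choices): only 3 of 10 features matter, and we count how many coefficients each penalty drives exactly to zero.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: only the first 3 of 10 features matter
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = X[:, 0] * 3 + X[:, 1] * 2 + X[:, 2] + rng.normal(scale=0.5, size=200)

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)

# L1 lands on the corners: irrelevant weights become exactly zero.
# L2 only shrinks: weights get small but stay nonzero.
print("Lasso exact zeros:", np.sum(np.abs(lasso.coef_) < 1e-10))
print("Ridge exact zeros:", np.sum(np.abs(ridge.coef_) < 1e-10))
```

Typically the Lasso zeros out most of the seven irrelevant features while Ridge zeros out none.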
LASSO (Least Absolute Shrinkage and Selection Operator) is the foundational embedded method for linear models. By adding an L1 penalty, LASSO simultaneously fits the model and selects features.
For linear regression with features $X \in \mathbb{R}^{n \times p}$ and target $y \in \mathbb{R}^n$:
$$\min_w \frac{1}{2n}|y - Xw|_2^2 + \lambda |w|_1$$
The solution $w^*$ will have many entries exactly equal to zero when $\lambda$ is sufficiently large. Features with $w_i^* = 0$ are effectively removed.
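In the special case of an orthonormal design, this solution has a closed form: soft-thresholding of the OLS coefficients, which makes the exact zeros directly visible. A small numpy sketch with illustrative values:

```python
import numpy as np

def soft_threshold(w_hat, lam):
    """Per-coordinate LASSO solution when the design matrix is orthonormal."""
    return np.sign(w_hat) * np.maximum(np.abs(w_hat) - lam, 0.0)

w_ols = np.array([3.0, -0.4, 0.05, 1.2, -0.02])
w_lasso = soft_threshold(w_ols, 0.5)
print(w_lasso)
# Entries smaller than lambda (here 0.5) become exactly 0;
# larger entries are shrunk toward zero by 0.5
```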
As $\lambda$ varies:
- $\lambda = 0$: ordinary least squares; all features are retained
- Moderate $\lambda$: weights shrink and the least useful ones hit exactly zero
- Very large $\lambda$: every weight is driven to zero
The regularization path shows how each weight evolves as $\lambda$ changes. Features that persist across a wide range of $\lambda$ values are generally more important.
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Lasso, LassoCV
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_diabetes

# Load data
data = load_diabetes()
X, y = data.data, data.target
feature_names = data.feature_names

# IMPORTANT: Standardize features for fair penalization
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Explore regularization path
alphas = np.logspace(-4, 1, 100)
coefs = []

for alpha in alphas:
    lasso = Lasso(alpha=alpha, max_iter=10000)
    lasso.fit(X_scaled, y)
    coefs.append(lasso.coef_)

coefs = np.array(coefs)

# Plot regularization path
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
for i, name in enumerate(feature_names):
    plt.plot(alphas, coefs[:, i], label=name)
plt.xscale('log')
plt.xlabel('Alpha (regularization strength)')
plt.ylabel('Coefficient value')
plt.title('LASSO Regularization Path')
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
plt.axhline(y=0, color='k', linestyle='--', alpha=0.3)

# Use cross-validation to find optimal alpha
plt.subplot(1, 2, 2)
lasso_cv = LassoCV(alphas=alphas, cv=5)
lasso_cv.fit(X_scaled, y)

print(f"Optimal alpha: {lasso_cv.alpha_:.6f}")
print("Feature selection results:")
for name, coef in zip(feature_names, lasso_cv.coef_):
    status = "SELECTED" if np.abs(coef) > 1e-10 else "removed"
    print(f"  {name:10s}: {coef:8.4f} [{status}]")

# Plot CV scores
mse_mean = lasso_cv.mse_path_.mean(axis=1)
mse_std = lasso_cv.mse_path_.std(axis=1)
plt.errorbar(lasso_cv.alphas_, mse_mean, yerr=mse_std, alpha=0.5)
plt.axvline(lasso_cv.alpha_, color='r', linestyle='--',
            label=f'Optimal α={lasso_cv.alpha_:.4f}')
plt.xscale('log')
plt.xlabel('Alpha')
plt.ylabel('Mean Squared Error')
plt.title('LASSO Cross-Validation')
plt.legend()

plt.tight_layout()
plt.show()

n_selected = np.sum(np.abs(lasso_cv.coef_) > 1e-10)
print(f"Selected {n_selected} of {len(feature_names)} features")
```

Theoretical Properties:
- Convexity: the LASSO objective is convex, so it can be solved efficiently to a global optimum
- Exact sparsity: for sufficiently large $\lambda$, many coefficients are exactly zero
- Selection consistency: under conditions on the design (e.g., the irrepresentable condition), LASSO recovers the true set of relevant features as $n$ grows
Practical Limitations:
- Requires standardized features; otherwise the penalty unfairly punishes features measured on large scales
- Selects at most $n$ features when $p > n$
- Coefficient estimates are biased toward zero by the shrinkage
- Makes unstable choices among highly correlated features
If features X₁ and X₂ are highly correlated and both predict Y, LASSO typically selects only one and sets the other to zero—the choice can be essentially arbitrary. This means LASSO's feature selection shouldn't be interpreted as 'X₂ is irrelevant.' For grouped correlated features, consider Elastic Net or Group LASSO.
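A tiny sketch of this arbitrariness, using synthetic data with two nearly identical copies of one signal (the noise scales and penalty are illustrative):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
x = rng.normal(size=500)
# Two nearly identical copies of the same predictor
X = np.column_stack([x, x + rng.normal(scale=1e-6, size=500)])
y = 2 * x + rng.normal(scale=0.1, size=500)

coef = Lasso(alpha=0.05).fit(X, y).coef_
print(coef)
# Nearly all the weight lands on one copy; the other is ~0,
# even though both predictors are equally informative
```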
Elastic Net addresses LASSO's limitations with correlated features by combining L1 and L2 penalties:
$$\min_w \frac{1}{2n}|y - Xw|_2^2 + \lambda \left( \alpha |w|_1 + \frac{1-\alpha}{2}|w|_2^2 \right)$$
where $\alpha \in [0, 1]$ controls the mix:
- $\alpha = 1$: pure LASSO (L1 only)
- $\alpha = 0$: pure Ridge (L2 only)
- $0 < \alpha < 1$: a weighted blend of both penalties
The L2 component provides:
- A grouping effect: correlated features receive similar weights instead of one being arbitrarily chosen
- Numerical stability, including when $p > n$
The L1 component provides:
- Sparsity: weights of irrelevant features are still driven exactly to zero
- An interpretable, reduced feature set
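scikit-learn's ElasticNet follows this same parameterization: its `alpha` plays the role of $\lambda$ and `l1_ratio` the role of $\alpha$. A quick sanity check on illustrative synthetic data that `l1_ratio=1.0` collapses to plain LASSO:

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([2.0, -1.0, 0.0, 0.0, 0.5]) + rng.normal(scale=0.1, size=100)

enet = ElasticNet(alpha=0.1, l1_ratio=1.0).fit(X, y)  # pure L1 mix
lasso = Lasso(alpha=0.1).fit(X, y)
print(np.allclose(enet.coef_, lasso.coef_))  # True: identical objectives
```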
```python
import numpy as np
from sklearn.linear_model import ElasticNetCV, LassoCV
from sklearn.preprocessing import StandardScaler

# Create data with correlated features
np.random.seed(42)
n_samples = 200

# Base features
X_base = np.random.randn(n_samples, 5)

# Create correlated features (duplicates with noise)
X_correlated = np.hstack([
    X_base,
    X_base + np.random.randn(n_samples, 5) * 0.1,        # Highly correlated
    X_base * 0.5 + np.random.randn(n_samples, 5) * 0.5,  # Moderately correlated
    np.random.randn(n_samples, 5)                        # Independent noise features
])

# Target depends on first 5 features
y = X_base @ np.array([3, 2, 1, 0.5, 0.25]) + np.random.randn(n_samples) * 0.5

# Standardize
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_correlated)

# Compare LASSO vs Elastic Net
lasso_cv = LassoCV(cv=5, random_state=42)
lasso_cv.fit(X_scaled, y)

elastic_cv = ElasticNetCV(l1_ratio=[0.1, 0.3, 0.5, 0.7, 0.9], cv=5, random_state=42)
elastic_cv.fit(X_scaled, y)

print("Coefficient comparison (features 0-9 are correlated versions of same base):")
print(f"{'Feature':>10} {'LASSO':>10} {'ElasticNet':>10}")
print("-" * 32)
for i in range(20):
    l_coef = lasso_cv.coef_[i]
    e_coef = elastic_cv.coef_[i]
    print(f"{i:>10} {l_coef:>10.4f} {e_coef:>10.4f}")

print(f"LASSO alpha: {lasso_cv.alpha_:.4f}")
print(f"ElasticNet alpha: {elastic_cv.alpha_:.4f}, l1_ratio: {elastic_cv.l1_ratio_:.2f}")

# Count non-zero features
lasso_selected = np.sum(np.abs(lasso_cv.coef_) > 1e-10)
elastic_selected = np.sum(np.abs(elastic_cv.coef_) > 1e-10)
print(f"LASSO selected: {lasso_selected} features")
print(f"ElasticNet selected: {elastic_selected} features")

# Notice: ElasticNet keeps correlated features with similar weights
# while LASSO picks one and discards others
```

| Scenario | Recommended Method | Rationale |
|---|---|---|
| Few features are relevant, low correlation | LASSO | Clean sparse selection |
| Correlated feature groups, want all representatives | Elastic Net (low α) | Grouping effect keeps correlated features |
| p >> n (more features than samples) | Elastic Net | Handles rank-deficiency better than LASSO |
| Need interpretable sparse model | LASSO | Simplest sparse output |
| Uncertain about correlation structure | Elastic Net with CV over α | Let data decide optimal balance |
Decision trees and their ensembles (Random Forest, Gradient Boosting) naturally provide feature importance through their splitting mechanism. Unlike regularization that zeros out weights, trees implicitly select features by choosing which features to split on.
The most common tree-based importance measure computes how much each feature reduces impurity across all splits:
$$\text{MDI}(f) = \sum_{\text{nodes } t \text{ splitting on } f} p(t) \cdot \Delta i(t)$$
where:
- $p(t)$ is the proportion of samples that reach node $t$
- $\Delta i(t)$ is the impurity decrease produced by the split at node $t$
For a Random Forest, MDI is averaged across all trees.
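As a sanity check on the formula, MDI can be recomputed by hand from a single fitted tree's `tree_` arrays. The sketch below sums $p(t) \cdot \Delta i(t)$ over every internal node and reproduces scikit-learn's `feature_importances_`, which is this sum normalized to total 1:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
clf = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)

t = clf.tree_
n = t.weighted_n_node_samples[0]  # total samples at the root
mdi = np.zeros(X.shape[1])
for node in range(t.node_count):
    left, right = t.children_left[node], t.children_right[node]
    if left == -1:  # leaf: no split, no contribution
        continue
    # p(t) * Δi(t): node probability times the impurity decrease of its split
    contrib = (t.weighted_n_node_samples[node] * t.impurity[node]
               - t.weighted_n_node_samples[left] * t.impurity[left]
               - t.weighted_n_node_samples[right] * t.impurity[right]) / n
    mdi[t.feature[node]] += contrib

mdi /= mdi.sum()  # sklearn normalizes importances to sum to 1
print(np.allclose(mdi, clf.feature_importances_))  # True
```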
An alternative that doesn't rely on impurity: permutation importance. It measures how much the model's score drops when a single feature's values are randomly shuffled, breaking that feature's relationship with the target while leaving everything else intact.
Advantages over MDI:
- Can be computed on held-out data, so it reflects generalization rather than training-set fit
- Not biased toward high-cardinality features
- Applies to any fitted model, not just trees
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer

# Load data
data = load_breast_cancer()
X, y = data.data, data.target
feature_names = data.feature_names

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Train Random Forest
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Get MDI importance
mdi_importance = rf.feature_importances_

# Get permutation importance on test set
perm_importance = permutation_importance(
    rf, X_test, y_test, n_repeats=30, random_state=42, n_jobs=-1)

# Compare importances
fig, axes = plt.subplots(1, 2, figsize=(14, 8))

# Sort by MDI importance
mdi_sorted_idx = np.argsort(mdi_importance)[-15:]  # Top 15

axes[0].barh(range(15), mdi_importance[mdi_sorted_idx])
axes[0].set_yticks(range(15))
axes[0].set_yticklabels(feature_names[mdi_sorted_idx])
axes[0].set_xlabel('Mean Decrease in Impurity')
axes[0].set_title('MDI Importance (Top 15)')

# Sort by permutation importance
perm_sorted_idx = np.argsort(perm_importance.importances_mean)[-15:]

axes[1].barh(range(15), perm_importance.importances_mean[perm_sorted_idx])
axes[1].errorbar(
    perm_importance.importances_mean[perm_sorted_idx], range(15),
    xerr=perm_importance.importances_std[perm_sorted_idx],
    fmt='none', color='black', capsize=3)
axes[1].set_yticks(range(15))
axes[1].set_yticklabels(feature_names[perm_sorted_idx])
axes[1].set_xlabel('Mean Accuracy Decrease')
axes[1].set_title('Permutation Importance (Top 15)')

plt.tight_layout()
plt.show()

# Feature selection based on importance threshold
importance_threshold = 0.01
selected_mdi = feature_names[mdi_importance > importance_threshold]
selected_perm = feature_names[perm_importance.importances_mean > importance_threshold]

print(f"Selected by MDI (threshold={importance_threshold}): {len(selected_mdi)} features")
print(f"Selected by Permutation (threshold={importance_threshold}): {len(selected_perm)} features")
```

MDI is biased toward features with many unique values (high cardinality). A random ID column might appear highly important because it provides many potential split points, each perfectly separating some samples. Permutation importance avoids this bias but is more computationally expensive.
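The cardinality bias is easy to demonstrate: append a pure-noise column with a unique value per row to a weakly informative dataset and compare the two importance measures for it. A sketch on synthetic data (exact numbers will vary with the seed):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Weakly informative data plus one high-cardinality pure-noise "ID" column
X, y = make_classification(n_samples=300, n_features=5, n_informative=2,
                           n_redundant=0, random_state=0)
rng = np.random.default_rng(0)
X_id = np.column_stack([X, rng.permutation(len(X))])  # unique value per row

X_tr, X_te, y_tr, y_te = train_test_split(X_id, y, random_state=0)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

perm = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=0)
# MDI credits the ID column for its many split points;
# permutation importance on held-out data shows it is worthless
print(f"ID column MDI importance:         {rf.feature_importances_[-1]:.3f}")
print(f"ID column permutation importance: {perm.importances_mean[-1]:.3f}")
```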
Gradient Boosting methods (XGBoost, LightGBM, CatBoost) have become dominant in structured data tasks and offer sophisticated embedded feature selection capabilities.
1. Regularized Objectives
XGBoost and LightGBM include regularization in their objectives:
$$\mathcal{L} = \sum_i l(y_i, \hat{y}_i) + \sum_k \left( \gamma T_k + \frac{1}{2}\lambda |w_k|^2 + \alpha |w_k|_1 \right)$$
where for tree $k$: $T_k$ is the number of leaves, $w_k$ are leaf weights, and $\gamma, \lambda, \alpha$ are regularization parameters.
2. Minimum Split Gain
A split is only made if it improves the objective by at least a minimum gain (`gamma`/`min_split_loss` in XGBoost, `min_gain_to_split` in LightGBM). This prevents splits on marginally useful features.
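scikit-learn trees expose an analogous knob, `min_impurity_decrease`, which makes the effect easy to see without a gradient boosting library: raising the threshold prunes marginal splits, so fewer distinct features end up being used. A small sketch:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# A split happens only if it reduces (sample-weighted) impurity
# by at least min_impurity_decrease
for threshold in [0.0, 0.01]:
    tree = DecisionTreeClassifier(min_impurity_decrease=threshold,
                                  random_state=0).fit(X, y)
    # tree_.feature is -2 for leaves; >= 0 marks a real split feature
    used = np.unique(tree.tree_.feature[tree.tree_.feature >= 0])
    print(f"threshold={threshold}: {len(used)} distinct features used in splits")
```

With the threshold raised, deep low-gain splits disappear and the tree concentrates on fewer features.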
3. Feature Sampling
`colsample_bytree` and `colsample_bylevel` randomly select feature subsets, reducing reliance on any single feature and enabling importance estimation.
4. Maximum Features
Limit which features can be used, forcing the model to identify the most essential ones.
```python
import numpy as np
import xgboost as xgb
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

# Load data
data = load_breast_cancer()
X, y = data.data, data.target
feature_names = list(data.feature_names)  # DMatrix expects a list of names

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Create DMatrix for XGBoost
dtrain = xgb.DMatrix(X_train, label=y_train, feature_names=feature_names)
dtest = xgb.DMatrix(X_test, label=y_test, feature_names=feature_names)

# Train with regularization for implicit feature selection
params = {
    'objective': 'binary:logistic',
    'eval_metric': 'auc',
    'max_depth': 4,
    'learning_rate': 0.1,
    'reg_alpha': 1.0,         # L1 regularization (induces sparsity)
    'reg_lambda': 1.0,        # L2 regularization
    'min_split_loss': 0.1,    # Minimum gain for split (gamma)
    'colsample_bytree': 0.8,  # Feature sampling
    'subsample': 0.8,
    'seed': 42
}

model = xgb.train(
    params, dtrain,
    num_boost_round=100,
    evals=[(dtest, 'test')],
    early_stopping_rounds=10,
    verbose_eval=False)

# Get feature importance (multiple types)
importance_types = ['weight', 'gain', 'cover']
importances = {}
for imp_type in importance_types:
    importances[imp_type] = model.get_score(importance_type=imp_type)

# Compare importance types
fig, axes = plt.subplots(1, 3, figsize=(15, 6))

for ax, imp_type in zip(axes, importance_types):
    imp_dict = importances[imp_type]
    if imp_dict:
        sorted_imp = sorted(imp_dict.items(), key=lambda x: x[1], reverse=True)[:10]
        features, values = zip(*sorted_imp)
        ax.barh(range(len(features)), values)
        ax.set_yticks(range(len(features)))
        ax.set_yticklabels(features)
        ax.set_xlabel(imp_type.capitalize())
        ax.set_title(f'Top 10 by {imp_type}')
        ax.invert_yaxis()

plt.tight_layout()
plt.show()

# Features with zero importance are effectively not selected
all_features = set(feature_names)
used_features = set(importances['weight'].keys())
unused_features = all_features - used_features

print(f"Features used: {len(used_features)}/{len(all_features)}")
print(f"Unused features: {unused_features if unused_features else 'None'}")
```

| Type | Meaning | When to Use |
|---|---|---|
| weight (frequency) | Number of times feature used in splits | Quick overview of feature usage |
| gain | Average gain when feature is used | Best for assessing predictive contribution |
| cover | Average number of samples affected by splits | Understanding feature reach |
| total_gain | Sum of gains (gain × weight) | Overall importance considering frequency |
| total_cover | Sum of coverage across splits | Total sample impact |
Neural networks can learn to attend to relevant features, providing a differentiable form of soft feature selection. Unlike hard selection (feature in or out), attention assigns continuous importance weights that can vary per sample.
Models like TabNet use sequential attention to select features step-by-step:
- At each decision step, an attentive transformer produces a sparse mask over the features (using sparsemax rather than softmax)
- The masked features are processed by a feature transformer that contributes to the final prediction
- A prior term discourages reusing the same features at every step, spreading attention across complementary features
- Aggregating the masks across steps yields per-sample and global importance scores
The attention weights provide interpretable feature importance: features receiving high attention are deemed important for prediction.
Unlike global methods that select the same features for all samples, attention mechanisms can select different features for different instances. A loan application from a student might focus on education features, while one from a retiree focuses on assets.
```python
import torch
import torch.nn as nn
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


class AttentionFeatureSelector(nn.Module):
    """
    Simple attention-based feature selection network.
    Learns to weight features based on their relevance.
    """
    def __init__(self, n_features, hidden_dim=64):
        super().__init__()
        # Attention mechanism
        self.attention = nn.Sequential(
            nn.Linear(n_features, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, n_features),
            nn.Sigmoid()  # Attention weights in [0, 1]
        )
        # Classifier on attended features
        self.classifier = nn.Sequential(
            nn.Linear(n_features, hidden_dim),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(hidden_dim, 1),
            nn.Sigmoid()
        )

    def forward(self, x):
        # Compute per-sample attention weights
        attention_weights = self.attention(x)
        # Apply attention (soft feature selection)
        attended = x * attention_weights
        # Classify
        output = self.classifier(attended)
        return output, attention_weights

    def get_feature_importance(self, X):
        """Average attention weights across samples."""
        self.eval()
        with torch.no_grad():
            _, weights = self.forward(X)
        return weights.mean(dim=0).numpy()


# Example usage
data = load_breast_cancer()
X, y = data.data, data.target

# Preprocess
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42)

# Convert to tensors
X_train_t = torch.FloatTensor(X_train)
y_train_t = torch.FloatTensor(y_train).unsqueeze(1)
X_test_t = torch.FloatTensor(X_test)
y_test_t = torch.FloatTensor(y_test).unsqueeze(1)

# Train model
model = AttentionFeatureSelector(n_features=X.shape[1])
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
criterion = nn.BCELoss()

# Training loop
for epoch in range(100):
    model.train()
    optimizer.zero_grad()
    outputs, attn = model(X_train_t)  # one forward pass gives both outputs
    loss = criterion(outputs, y_train_t)
    # Optional: add an L1-style penalty on attention weights for sparsity
    sparsity_loss = 0.01 * attn.mean()  # Encourage lower attention values
    total_loss = loss + sparsity_loss
    total_loss.backward()
    optimizer.step()

# Get learned feature importance
importance = model.get_feature_importance(X_test_t)

# Display results
sorted_idx = np.argsort(importance)[::-1]
print("Learned feature importance (top 10):")
for idx in sorted_idx[:10]:
    print(f"  {data.feature_names[idx]}: {importance[idx]:.4f}")
```

TabNet (Google, 2019) uses sequential sparse attention for interpretable feature selection on tabular data. It achieves competitive accuracy with gradient boosting while providing instance-wise feature importance. Other modern approaches include TabTransformer, FT-Transformer, and SAINT—all incorporating attention-based feature weighting.
| Method | Model Type | Selection Type | Handles Correlation | Interpretability |
|---|---|---|---|---|
| LASSO | Linear | Hard (exact zeros) | Poorly (arbitrary) | High (coefficients) |
| Elastic Net | Linear | Hard | Well (grouping) | High (coefficients) |
| Random Forest MDI | Tree ensemble | Soft (importance scores) | Moderately | Medium |
| XGBoost/LightGBM | Boosted trees | Soft + Hard (via regularization) | Well | Medium-High |
| Neural Attention | Neural network | Soft (instance-wise) | Well | Medium (attention maps) |
Use LASSO when:
- The relationship is roughly linear and you expect only a few relevant features
- You want a simple, interpretable sparse model
- Features are not strongly correlated

Use Elastic Net when:
- Correlated feature groups exist and you want to keep representatives of each
- $p \gg n$, where LASSO's selection becomes unstable
- You are unsure about the correlation structure and can cross-validate the L1/L2 mix

Use Tree-based importance when:
- Relationships are nonlinear or involve feature interactions
- Features are on mixed scales or types and you want to skip standardization
- You are training a tree ensemble anyway and importance comes for free

Use Attention-based when:
- Different samples may rely on different features (instance-wise selection)
- You have enough data to train a neural network reliably
- The selector needs to sit inside an end-to-end deep learning pipeline
We've covered the three main paradigms: filter, wrapper, and embedded methods. Next, we'll explore Stability Selection—a meta-approach that addresses a crucial question: how do we know if our selected features are truly important, or just lucky artifacts of the particular data sample?