Filter methods represent the foundational approach to feature selection in machine learning. They evaluate features based on their intrinsic properties and statistical relationships with the target variable, operating entirely independently of any learning algorithm. This independence is both their greatest strength and their most significant limitation.
Unlike wrapper or embedded methods that evaluate features within the context of a specific model, filter methods compute scores or rankings for each feature using statistical measures. Features scoring above a threshold—or the top-k ranked features—are retained for downstream modeling.
The term "filter" aptly describes their role: they act as a preprocessing step that filters out irrelevant or redundant features before any model training occurs. This makes them exceptionally fast and scalable, capable of handling datasets with millions of features where other approaches would be computationally prohibitive.
In high-dimensional settings like genomics (thousands of genes), text analysis (millions of n-grams), or sensor networks (hundreds of measurements), filter methods are often the only computationally feasible first pass. Understanding their theoretical foundations enables you to select appropriate filters and interpret their results correctly.
Filter methods rest on a fundamental assumption: features that have strong statistical relationships with the target variable are likely to be useful for prediction. This assumption, while intuitive, carries important implications and limitations that practitioners must understand.
Most filter methods evaluate features univariately—each feature is scored in isolation, without considering its relationship to other features. Mathematically, if we have features $X_1, X_2, ..., X_p$ and target $Y$, a univariate filter computes scores $s(X_i, Y)$ for each feature independently.
This univariate approach enables very fast scoring that scales linearly with the number of features, trivial parallelization, and a simple, interpretable ranking of individual features.
However, it also means filter methods can miss feature interactions, where features are informative only in combination, and cannot detect redundancy among the features they retain.
Consider two binary features X₁ and X₂ where Y = X₁ XOR X₂. Each feature individually has zero correlation with Y (knowing X₁ alone tells you nothing about Y). A univariate filter would assign both features zero importance and potentially discard them—yet together they perfectly determine Y. This illustrates the fundamental limitation of univariate approaches.
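A minimal sketch of this failure mode, using synthetic XOR data, scikit-learn's `mutual_info_classif`, and a small decision tree (the sample size and random seed are arbitrary choices):

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(2000, 2))   # two independent binary features
y = X[:, 0] ^ X[:, 1]                    # target is their XOR

# Univariate scores: each feature on its own looks useless
scores = mutual_info_classif(X, y, discrete_features=True, random_state=0)
print(scores)                            # both values are approximately 0

# Together, the two features determine y exactly
clf = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(clf.score(X, y))                   # 1.0
```

Any univariate filter ranking by these per-feature scores would discard both features.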
From an information-theoretic standpoint, feature selection seeks to identify a feature subset $S$ that maximizes the mutual information with the target:
$$I(S; Y) = H(Y) - H(Y|S)$$
where $H(Y)$ is the entropy of the target and $H(Y|S)$ is the conditional entropy given the features in $S$. The goal is to find features that maximally reduce our uncertainty about $Y$.
Univariately, this reduces to ranking features by $I(X_i; Y)$. However, the optimal subset $S^*$ is generally not the union of the top-k individually informative features. The subset selection problem is NP-hard, motivating the use of greedy or approximate algorithms.
Correlation-based methods measure the linear relationship between each feature and the target. They are the simplest and most widely used filter techniques, particularly effective when relationships are approximately linear.
For continuous features and targets, the Pearson correlation coefficient measures linear dependence:
$$r(X, Y) = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2} \sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}$$
The coefficient ranges from -1 (perfect negative correlation) to +1 (perfect positive correlation), with 0 indicating no linear relationship.
For feature selection, we typically use the absolute value $|r(X, Y)|$ since both strong positive and negative correlations indicate predictive power.
```python
import numpy as np
from scipy import stats
from typing import Tuple, List

def pearson_filter(X: np.ndarray, y: np.ndarray,
                   threshold: float = None, k: int = None) -> Tuple[np.ndarray, List[int]]:
    """
    Filter features using Pearson correlation coefficient.

    Parameters:
    -----------
    X : np.ndarray of shape (n_samples, n_features)
        Feature matrix
    y : np.ndarray of shape (n_samples,)
        Target vector (continuous)
    threshold : float, optional
        Minimum absolute correlation to retain feature
    k : int, optional
        Number of top features to retain

    Returns:
    --------
    correlations : array of correlation values for each feature
    selected_indices : list of indices of selected features
    """
    n_samples, n_features = X.shape
    correlations = np.zeros(n_features)
    p_values = np.zeros(n_features)

    for i in range(n_features):
        # Compute Pearson correlation and p-value
        corr, pval = stats.pearsonr(X[:, i], y)
        correlations[i] = corr
        p_values[i] = pval

    abs_correlations = np.abs(correlations)

    # Select features based on threshold or top-k
    if threshold is not None:
        selected_indices = np.where(abs_correlations >= threshold)[0].tolist()
    elif k is not None:
        selected_indices = np.argsort(abs_correlations)[-k:][::-1].tolist()
    else:
        # Default: return all features
        selected_indices = list(range(n_features))

    return correlations, selected_indices

# Example usage with synthetic data
np.random.seed(42)
n_samples, n_features = 1000, 20

# Create features: some correlated with target, some not
X = np.random.randn(n_samples, n_features)
# Target is linear combination of first 5 features + noise
true_weights = np.array([3, 2, 1, 0.5, 0.25] + [0] * 15)
y = X @ true_weights + np.random.randn(n_samples) * 0.5

correlations, selected = pearson_filter(X, y, k=5)
print(f"Selected features: {selected}")
print(f"Their correlations: {correlations[selected]}")
```

When relationships may be monotonic but non-linear, Spearman's rank correlation is more appropriate. It computes the Pearson correlation on the ranks of the data rather than the raw values:
$$\rho(X, Y) = r(\text{rank}(X), \text{rank}(Y))$$
Spearman correlation captures any monotonic relationship rather than only linear ones, is robust to outliers because it operates on ranks, and applies naturally to ordinal features.
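A short illustrative sketch on synthetic monotonic but non-linear data (the exponential form and noise level are arbitrary choices):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
x = rng.uniform(0, 5, size=500)
y = np.exp(2 * x) + rng.normal(scale=5.0, size=500)  # monotonic, strongly non-linear

pearson_r, _ = stats.pearsonr(x, y)
spearman_rho, _ = stats.spearmanr(x, y)
print(f"Pearson:  {pearson_r:.3f}")     # understates the dependence
print(f"Spearman: {spearman_rho:.3f}")  # close to 1 for a monotonic relationship
```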
For binary targets with continuous features, the point-biserial correlation is the appropriate measure. It's mathematically equivalent to Pearson correlation when one variable is dichotomous:
$$r_{pb} = \frac{\bar{X}_1 - \bar{X}_0}{s_X}\sqrt{\frac{n_0 n_1}{n^2}}$$
where $\bar{X}_1$ and $\bar{X}_0$ are the means of the feature for the two classes, $s_X$ is the overall standard deviation, and $n_0$, $n_1$ are the class counts.
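A brief sketch using scipy's `pointbiserialr` on synthetic data (the class labels and the mean shift of 1.5 are arbitrary); it returns the same value as `pearsonr` when one variable is binary:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=300)          # binary class labels
x = rng.normal(loc=1.5 * y, scale=1.0)    # continuous feature shifted by class

r_pb, p = stats.pointbiserialr(y, x)      # dichotomous variable goes first
r, _ = stats.pearsonr(x, y)
print(f"point-biserial: {r_pb:.3f} (p = {p:.2g})")
print(f"Pearson:        {r:.3f}")         # identical value
```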
Use Pearson for continuous targets with approximately linear relationships. Use Spearman when you expect monotonic but possibly non-linear relationships, or when outliers are present. Use Point-Biserial (or just Pearson) for binary classification with continuous features. For categorical features, consider mutual information or chi-squared tests instead.
Statistical hypothesis tests provide a principled framework for feature selection, allowing us to assess whether a feature's relationship with the target is statistically significant or likely due to random chance.
The chi-squared test of independence evaluates whether two categorical variables are associated. For feature selection with a categorical target, we compute:
$$\chi^2 = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$
where $O_{ij}$ is the observed frequency in cell $(i,j)$ of the contingency table and $E_{ij} = \frac{(\text{row } i \text{ total})(\text{col } j \text{ total})}{n}$ is the expected frequency under independence.
Higher chi-squared values indicate stronger association. Features with low p-values (rejecting the null hypothesis of independence) are considered informative.
```python
from sklearn.feature_selection import chi2, SelectKBest
from sklearn.preprocessing import MinMaxScaler
import numpy as np

def chi_squared_filter(X: np.ndarray, y: np.ndarray, k: int = 10) -> tuple:
    """
    Select top-k features using chi-squared test.

    Note: Chi-squared requires non-negative features. Typically used
    with count data or after scaling to [0, 1].

    Parameters:
    -----------
    X : Feature matrix (must be non-negative)
    y : Target vector (categorical/discrete)
    k : Number of features to select

    Returns:
    --------
    chi2_scores : Chi-squared statistics for each feature
    p_values : P-values for each feature
    selected_indices : Indices of selected features
    """
    # Ensure non-negative (required for chi2)
    scaler = MinMaxScaler()
    X_scaled = scaler.fit_transform(X)

    # Compute chi-squared statistics
    chi2_scores, p_values = chi2(X_scaled, y)

    # Select top-k features
    selector = SelectKBest(chi2, k=k)
    selector.fit(X_scaled, y)
    selected_indices = selector.get_support(indices=True)

    return chi2_scores, p_values, selected_indices

# Example: Text classification feature selection
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer

# Load a subset of 20 newsgroups
categories = ['sci.med', 'sci.space']
newsgroups = fetch_20newsgroups(subset='train', categories=categories)

# Create bag-of-words features
vectorizer = CountVectorizer(max_features=1000, stop_words='english')
X = vectorizer.fit_transform(newsgroups.data).toarray()
y = newsgroups.target

# Apply chi-squared filter
scores, pvals, selected = chi_squared_filter(X, y, k=20)

# Show top features
feature_names = vectorizer.get_feature_names_out()
top_features = [(feature_names[i], scores[i]) for i in selected]
top_features.sort(key=lambda x: x[1], reverse=True)
print("Top discriminative words:")
for word, score in top_features[:10]:
    print(f"  {word}: χ² = {score:.2f}")
```

When features are continuous and the target is categorical, the ANOVA F-test (Analysis of Variance) assesses whether feature means differ significantly across classes:
$$F = \frac{\text{Between-group variance}}{\text{Within-group variance}} = \frac{\sum_{k} n_k (\bar{x}_k - \bar{x})^2 / (K-1)}{\sum_{k}\sum_{i \in k} (x_i - \bar{x}_k)^2 / (n-K)}$$
where $K$ is the number of classes, $n_k$ is the count in class $k$, $\bar{x}_k$ is the class mean, and $\bar{x}$ is the overall mean.
Higher F-values indicate greater class separation along that feature dimension.
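In scikit-learn this test is available as `f_classif`; a brief sketch on the Iris data (the choice of dataset and k = 2 is arbitrary):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

# F-statistic and p-value per feature
F, pvals = f_classif(X, y)
print("F-values:", F.round(1))

# Keep the two features with the largest F-values
selector = SelectKBest(f_classif, k=2).fit(X, y)
print("Selected feature indices:", selector.get_support(indices=True))
```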
Mutual information (MI) provides a non-parametric measure of dependency that captures both linear and non-linear relationships:
$$I(X; Y) = \sum_{x,y} p(x,y) \log\frac{p(x,y)}{p(x)p(y)}$$
For continuous variables, estimation requires density estimation (typically via k-nearest neighbors). MI has several advantages: it captures arbitrary dependencies, including non-monotonic ones; it applies to both discrete and continuous variables; and it is zero exactly when the feature and target are independent.
However, MI estimation is more computationally expensive and requires careful hyperparameter tuning (e.g., number of neighbors).
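A small sketch contrasting correlation with the kNN-based MI estimate (synthetic data; the sin² target and the two `n_neighbors` settings are illustrative choices):

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(42)
n = 1000
X = rng.uniform(-3, 3, size=(n, 3))
# Non-monotonic target: only feature 0 matters
y = np.sin(X[:, 0]) ** 2 + rng.normal(scale=0.1, size=n)

# Pearson correlation is near zero even for the informative feature...
print("corr:", [round(np.corrcoef(X[:, i], y)[0, 1], 3) for i in range(3)])

# ...while MI detects it; n_neighbors trades off bias and variance of the estimate
print("MI (k=3): ", mutual_info_regression(X, y, n_neighbors=3, random_state=0).round(3))
print("MI (k=20):", mutual_info_regression(X, y, n_neighbors=20, random_state=0).round(3))
```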
| Feature Type | Target Type | Recommended Methods |
|---|---|---|
| Continuous | Continuous | Pearson correlation, Spearman correlation, Mutual information |
| Continuous | Categorical | ANOVA F-test, Point-biserial correlation, Mutual information |
| Categorical | Continuous | ANOVA (feature as grouping variable), Mutual information |
| Categorical | Categorical | Chi-squared test, Cramér's V, Mutual information |
Not all filter methods require target labels. Unsupervised filters evaluate features based solely on their intrinsic properties, which is useful for unlabeled datasets, for preprocessing before clustering, and as a cheap first pass that discards constant or quasi-constant features before any supervised scoring.
The simplest unsupervised filter removes features with low variance. The intuition: a feature that barely varies across samples provides little discriminative information.
$$\text{Var}(X) = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2$$
For binary features, variance is maximized when the feature is split 50/50: $\text{Var} = 0.5 \times 0.5 = 0.25$. A feature that is 0 for 99% of samples has variance $0.99 \times 0.01 \approx 0.01$.
```python
from sklearn.feature_selection import VarianceThreshold
import numpy as np

def variance_filter(X: np.ndarray, threshold: float = 0.0) -> tuple:
    """
    Remove features with variance below threshold.

    Parameters:
    -----------
    X : Feature matrix
    threshold : Minimum variance to retain feature
        - For binary features, 0.8*(1-0.8) = 0.16 removes features
          where >80% of samples have the same value

    Returns:
    --------
    X_filtered : Filtered feature matrix
    selected_indices : Indices of retained features
    variances : Variance of each original feature
    """
    variances = np.var(X, axis=0)
    selector = VarianceThreshold(threshold=threshold)
    X_filtered = selector.fit_transform(X)
    selected_indices = selector.get_support(indices=True)
    return X_filtered, selected_indices, variances

# Example: Remove quasi-constant features
np.random.seed(42)
n_samples = 1000

# Create features with varying variance
X = np.column_stack([
    np.random.randn(n_samples),                           # High variance (normal)
    np.random.randn(n_samples) * 0.01,                    # Very low variance
    np.random.choice([0, 1], n_samples, p=[0.99, 0.01]),  # Quasi-constant
    np.random.randn(n_samples) * 2,                       # High variance
    np.zeros(n_samples) + 5,                              # Zero variance (constant)
])

X_filtered, selected, variances = variance_filter(X, threshold=0.01)
print(f"Original features: {X.shape[1]}")
print(f"Features after filtering: {X_filtered.shape[1]}")
print(f"Variances: {variances.round(4)}")
print(f"Selected indices: {selected}")
```

While supervised filters identify relevant features, they don't account for redundancy: multiple features carrying the same information. Highly correlated features provide diminishing returns and can inflate dimensionality without adding information, cause instability (multicollinearity) in linear models, and make the selected feature set harder to interpret.
A common approach: for each pair of features with correlation above a threshold, remove one (typically the one with lower target correlation).
```python
import numpy as np

def remove_correlated_features(X: np.ndarray, threshold: float = 0.95,
                               target_correlations: np.ndarray = None) -> list:
    """
    Remove highly correlated features, keeping the more predictive one.

    Parameters:
    -----------
    X : Feature matrix
    threshold : Correlation threshold for considering features redundant
    target_correlations : If provided, keep feature with higher target correlation

    Returns:
    --------
    features_to_remove : List of feature indices to remove
    """
    n_features = X.shape[1]
    corr_matrix = np.corrcoef(X.T)
    features_to_remove = set()

    for i in range(n_features):
        if i in features_to_remove:
            continue
        for j in range(i + 1, n_features):
            if j in features_to_remove:
                continue
            if abs(corr_matrix[i, j]) > threshold:
                # Features are highly correlated - remove one
                if target_correlations is not None:
                    # Remove the one less correlated with target
                    if abs(target_correlations[i]) >= abs(target_correlations[j]):
                        features_to_remove.add(j)
                    else:
                        features_to_remove.add(i)
                        break  # i is removed, move to next i
                else:
                    # Default: remove the second one
                    features_to_remove.add(j)

    return sorted(list(features_to_remove))

# Example
np.random.seed(42)
n_samples = 500

# Create features with known correlations
base_feature = np.random.randn(n_samples)
X = np.column_stack([
    base_feature,                                     # Feature 0
    base_feature + np.random.randn(n_samples) * 0.1,  # Highly correlated with 0
    np.random.randn(n_samples),                       # Independent
    base_feature * 2 + 1,                             # Perfectly correlated with 0
    np.random.randn(n_samples),                       # Independent
])

to_remove = remove_correlated_features(X, threshold=0.9)
print(f"Features to remove: {to_remove}")
print(f"Correlation matrix:\n{np.corrcoef(X.T).round(2)}")
```

To address the limitations of univariate filters, several multivariate filter methods have been developed. These consider feature interactions and redundancy while maintaining computational efficiency.
The mRMR algorithm balances two criteria: maximizing each candidate feature's relevance to the target while minimizing its redundancy with the features already selected.
Starting from an empty set $S$, mRMR greedily adds the feature that maximizes:
$$\text{mRMR}(X_i) = I(X_i; Y) - \frac{1}{|S|}\sum_{X_j \in S} I(X_i; X_j)$$
The first term measures relevance (mutual information with target), the second penalizes redundancy with already-selected features.
```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression

def mrmr_filter(X: np.ndarray, y: np.ndarray, k: int = 10,
                is_classification: bool = True) -> list:
    """
    Minimum Redundancy Maximum Relevance feature selection.

    Parameters:
    -----------
    X : Feature matrix of shape (n_samples, n_features)
    y : Target vector
    k : Number of features to select
    is_classification : Whether this is a classification task

    Returns:
    --------
    selected_features : List of selected feature indices in order of selection
    """
    n_features = X.shape[1]

    # Compute relevance: mutual information with target
    if is_classification:
        relevance = mutual_info_classif(X, y, random_state=42)
    else:
        relevance = mutual_info_regression(X, y, random_state=42)

    # Initialize
    selected = []
    remaining = list(range(n_features))

    # First feature: highest relevance
    first = np.argmax(relevance)
    selected.append(first)
    remaining.remove(first)

    # Precompute pairwise redundancy (for efficiency)
    # Using absolute correlation as a proxy for MI (faster, often sufficient)
    corr_matrix = np.abs(np.corrcoef(X.T))

    # Greedily select remaining features
    while len(selected) < k and remaining:
        best_score = -np.inf
        best_feature = None

        for f in remaining:
            # Relevance term
            rel = relevance[f]

            # Redundancy term: average correlation with selected features
            red = np.mean([corr_matrix[f, s] for s in selected]) if selected else 0

            # mRMR score
            score = rel - red
            if score > best_score:
                best_score = score
                best_feature = f

        selected.append(best_feature)
        remaining.remove(best_feature)

    return selected

# Example usage
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
X, y = data.data, data.target

selected_features = mrmr_filter(X, y, k=10)
print("Selected features (in order of selection):")
for i, idx in enumerate(selected_features):
    print(f"  {i+1}. {data.feature_names[idx]}")
```

ReliefF is a feature weighting algorithm that evaluates features by their ability to distinguish between instances that are close to each other. For each sampled instance, it finds the $k$ nearest neighbors of the same class (the nearest hits) and the $k$ nearest neighbors from each other class (the nearest misses), then updates every feature's weight according to how much that feature's value differs between the instance and those neighbors.
A feature is deemed important if it takes similar values for an instance and its nearest hits but clearly different values for the instance and its nearest misses.
The update rule for feature $f$ based on instance $R_i$:
$$W_f \leftarrow W_f - \sum_{j=1}^{k}\frac{\text{diff}(f, R_i, H_j)}{mk} + \sum_{c \neq \text{class}(R_i)} \frac{P(c)}{1-P(\text{class}(R_i))} \sum_{j=1}^{k}\frac{\text{diff}(f, R_i, M_j(c))}{mk}$$
where $m$ is the number of sampled instances, $k$ is the number of neighbors, and $H_j$ and $M_j(c)$ denote the $j$-th nearest hit and the $j$-th nearest miss from class $c$.
ReliefF naturally handles feature interactions because it considers local neighborhoods. If two features together distinguish classes (like XOR), instances will have similar hit distances and different miss distances based on the combination. This makes ReliefF one of the few filter methods that can detect interaction effects.
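As a rough illustration of the neighborhood idea, here is a heavily simplified sketch of the original Relief algorithm (binary classes, one nearest hit and one nearest miss per sampled instance, i.e. k = 1) applied to XOR-style data; it is not a full ReliefF implementation:

```python
import numpy as np

def relief(X: np.ndarray, y: np.ndarray, n_iterations: int = 300,
           random_state: int = 0) -> np.ndarray:
    """Simplified Relief: binary classes, one nearest hit/miss per sampled instance."""
    rng = np.random.default_rng(random_state)
    n_samples, n_features = X.shape
    span = X.max(axis=0) - X.min(axis=0)          # normalise diffs to [0, 1]
    span[span == 0] = 1.0
    W = np.zeros(n_features)

    for _ in range(n_iterations):
        i = rng.integers(n_samples)
        dist = np.abs(X - X[i]).sum(axis=1)       # L1 distance to every instance
        dist[i] = np.inf                          # exclude the instance itself
        same = (y == y[i])
        hit = np.argmin(np.where(same, dist, np.inf))    # nearest same-class instance
        miss = np.argmin(np.where(~same, dist, np.inf))  # nearest other-class instance
        W -= np.abs(X[i] - X[hit]) / span / n_iterations
        W += np.abs(X[i] - X[miss]) / span / n_iterations
    return W

# XOR-style data: neither binary feature is univariately informative,
# but Relief ranks both above the irrelevant noise feature
rng = np.random.default_rng(1)
A = rng.integers(0, 2, size=(1000, 2)).astype(float)
X = np.hstack([A, rng.random((1000, 1))])         # third column is pure noise
y = A[:, 0].astype(int) ^ A[:, 1].astype(int)
print(relief(X, y).round(3))
```

The interacting features receive clearly positive weights while the noise feature stays near zero, mirroring the argument above.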
Applying filter methods effectively requires attention to several practical concerns that can significantly impact results.
Many filter metrics are not scale-invariant. Before applying variance thresholds or distance-based methods like ReliefF, consider standardization:
$$X_{\text{standardized}} = \frac{X - \mu}{\sigma}$$
However, correlation-based methods and mutual information are naturally scale-invariant.
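A small sketch of why this matters for a variance threshold (synthetic data; the column scales merely mimic a change of measurement units):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import VarianceThreshold

rng = np.random.default_rng(0)
X = np.column_stack([
    rng.normal(size=500) * 100,     # large numeric scale
    rng.normal(size=500) * 0.001,   # same kind of signal, tiny numeric scale
])

# Raw data: the second column is dropped purely because of its units
print(VarianceThreshold(0.01).fit_transform(X).shape)        # (500, 1)

# After standardization both columns have unit variance and survive
X_std = StandardScaler().fit_transform(X)
print(VarianceThreshold(0.01).fit_transform(X_std).shape)    # (500, 2)
```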
When computing p-values for many features, the risk of false discoveries increases. With 10,000 features and α = 0.05, we expect ~500 false positives by chance.
Common corrections include the Bonferroni correction, which divides α by the number of tests and controls the family-wise error rate, and the Benjamini-Hochberg procedure, which controls the false discovery rate and is usually preferred for high-dimensional screening because it is far less conservative.
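A minimal sketch of both corrections on synthetic data (assuming statsmodels is available for the Benjamini-Hochberg step; the effect size and dimensions are arbitrary):

```python
import numpy as np
from sklearn.feature_selection import f_classif
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(0)
n, p = 200, 2000
X = rng.normal(size=(n, p))
y = rng.integers(0, 2, size=n)
X[:, :10] += 1.5 * y[:, None]      # only the first 10 features are truly informative

_, pvals = f_classif(X, y)
print("uncorrected p < 0.05:", int((pvals < 0.05).sum()))  # roughly 10 true + ~100 false positives

# Bonferroni: control the family-wise error rate
print("Bonferroni:", int((pvals < 0.05 / p).sum()))

# Benjamini-Hochberg: control the false discovery rate
reject, _, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")
print("BH (FDR):", int(reject.sum()))
```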
| Criterion | Recommended Approach |
|---|---|
| Very high dimensionality (p >> 10,000) | Start with variance threshold, then univariate correlation/MI |
| Moderate dimensionality | mRMR or ReliefF for interaction-aware selection |
| Speed is critical | Univariate methods (parallelizable, O(np) complexity) |
| Non-linear relationships expected | Mutual information over correlation |
| Binary/categorical features | Chi-squared test or mutual information |
| Interpretability required | Correlation coefficients (directly interpretable) |
Filter methods should typically be the first step, not the final word, in feature selection. They efficiently reduce dimensionality from thousands/millions to a manageable set, which can then be refined using wrapper or embedded methods that account for the specific learning algorithm.
Filter methods provide fast, model-agnostic feature screening. In the next section, we'll explore Wrapper Methods—approaches that evaluate feature subsets using actual model performance, enabling selection optimized for specific learning algorithms at the cost of increased computation.