Filter methods represent the foundational approach to feature selection in machine learning. They evaluate features based on their intrinsic properties and statistical relationships with the target variable, operating entirely independently of any learning algorithm. This independence is both their greatest strength and their most significant limitation.
Unlike wrapper or embedded methods that evaluate features within the context of a specific model, filter methods compute scores or rankings for each feature using statistical measures. Features scoring above a threshold—or the top-k ranked features—are retained for downstream modeling.
The term "filter" aptly describes their role: they act as a preprocessing step that filters out irrelevant or redundant features before any model training occurs. This makes them exceptionally fast and scalable, capable of handling datasets with millions of features where other approaches would be computationally prohibitive.
In high-dimensional settings like genomics (thousands of genes), text analysis (millions of n-grams), or sensor networks (hundreds of measurements), filter methods are often the only computationally feasible first pass. Understanding their theoretical foundations enables you to select appropriate filters and interpret their results correctly.
Filter methods rest on a fundamental assumption: features that have strong statistical relationships with the target variable are likely to be useful for prediction. This assumption, while intuitive, carries important implications and limitations that practitioners must understand.
Most filter methods evaluate features univariately—each feature is scored in isolation, without considering its relationship to other features. Mathematically, if we have features $X_1, X_2, ..., X_p$ and target $Y$, a univariate filter computes scores $s(X_i, Y)$ for each feature independently.
This univariate approach enables very fast scoring that scales linearly with the number of features, trivial parallelization, and a simple, interpretable ranking of individual features.
However, it also means filter methods can miss feature interactions, where features are informative only in combination, and cannot detect redundancy among the features they retain.
Consider two binary features X₁ and X₂ where Y = X₁ XOR X₂. Each feature individually has zero correlation with Y (knowing X₁ alone tells you nothing about Y). A univariate filter would assign both features zero importance and potentially discard them—yet together they perfectly determine Y. This illustrates the fundamental limitation of univariate approaches.
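A minimal sketch of this failure mode, using synthetic XOR data, scikit-learn's `mutual_info_classif`, and a small decision tree (the sample size and random seed are arbitrary choices):

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(2000, 2))   # two independent binary features
y = X[:, 0] ^ X[:, 1]                    # target is their XOR

# Univariate scores: each feature on its own looks useless
scores = mutual_info_classif(X, y, discrete_features=True, random_state=0)
print(scores)                            # both values are approximately 0

# Together, the two features determine y exactly
clf = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(clf.score(X, y))                   # 1.0
```

Any univariate filter ranking by these per-feature scores would discard both features.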
From an information-theoretic standpoint, feature selection seeks to identify a feature subset $S$ that maximizes the mutual information with the target:
$$I(S; Y) = H(Y) - H(Y|S)$$
where $H(Y)$ is the entropy of the target and $H(Y|S)$ is the conditional entropy given the features in $S$. The goal is to find features that maximally reduce our uncertainty about $Y$.
Univariately, this reduces to ranking features by $I(X_i; Y)$. However, the optimal subset $S^*$ is generally not the union of the top-k individually informative features. The subset selection problem is NP-hard, motivating the use of greedy or approximate algorithms.
Correlation-based methods measure the linear relationship between each feature and the target. They are the simplest and most widely used filter techniques, particularly effective when relationships are approximately linear.
For continuous features and targets, the Pearson correlation coefficient measures linear dependence:
$$r(X, Y) = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2} \sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}$$
The coefficient ranges from -1 (perfect negative correlation) to +1 (perfect positive correlation), with 0 indicating no linear relationship.
For feature selection, we typically use the absolute value $|r(X, Y)|$ since both strong positive and negative correlations indicate predictive power.
```python
import numpy as np
from scipy import stats
from typing import Tuple, List

def pearson_filter(X: np.ndarray, y: np.ndarray,
                   threshold: float = None, k: int = None) -> Tuple[np.ndarray, List[int]]:
    """
    Filter features using Pearson correlation coefficient.

    Parameters:
    -----------
    X : np.ndarray of shape (n_samples, n_features)
        Feature matrix
    y : np.ndarray of shape (n_samples,)
        Target vector (continuous)
    threshold : float, optional
        Minimum absolute correlation to retain feature
    k : int, optional
        Number of top features to retain

    Returns:
    --------
    correlations : array of correlation values for each feature
    selected_indices : list of indices of selected features
    """
    n_samples, n_features = X.shape
    correlations = np.zeros(n_features)
    p_values = np.zeros(n_features)

    for i in range(n_features):
        # Compute Pearson correlation and p-value
        corr, pval = stats.pearsonr(X[:, i], y)
        correlations[i] = corr
        p_values[i] = pval

    abs_correlations = np.abs(correlations)

    # Select features based on threshold or top-k
    if threshold is not None:
        selected_indices = np.where(abs_correlations >= threshold)[0].tolist()
    elif k is not None:
        selected_indices = np.argsort(abs_correlations)[-k:][::-1].tolist()
    else:
        # Default: return all features
        selected_indices = list(range(n_features))

    return correlations, selected_indices

# Example usage with synthetic data
np.random.seed(42)
n_samples, n_features = 1000, 20

# Create features: some correlated with target, some not
X = np.random.randn(n_samples, n_features)
# Target is linear combination of first 5 features + noise
true_weights = np.array([3, 2, 1, 0.5, 0.25] + [0] * 15)
y = X @ true_weights + np.random.randn(n_samples) * 0.5

correlations, selected = pearson_filter(X, y, k=5)
print(f"Selected features: {selected}")
print(f"Their correlations: {correlations[selected]}")
```

When relationships may be monotonic but non-linear, Spearman's rank correlation is more appropriate. It computes the Pearson correlation on the ranks of the data rather than the raw values:
$$\rho(X, Y) = r(\text{rank}(X), \text{rank}(Y))$$
Spearman correlation captures any monotonic relationship rather than only linear ones, is robust to outliers because it operates on ranks, and applies naturally to ordinal features.
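A short illustrative sketch on synthetic monotonic but non-linear data (the exponential form and noise level are arbitrary choices):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
x = rng.uniform(0, 5, size=500)
y = np.exp(2 * x) + rng.normal(scale=5.0, size=500)  # monotonic, strongly non-linear

pearson_r, _ = stats.pearsonr(x, y)
spearman_rho, _ = stats.spearmanr(x, y)
print(f"Pearson:  {pearson_r:.3f}")     # understates the dependence
print(f"Spearman: {spearman_rho:.3f}")  # close to 1 for a monotonic relationship
```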
For binary targets with continuous features, the point-biserial correlation is the appropriate measure. It's mathematically equivalent to Pearson correlation when one variable is dichotomous:
$$r_{pb} = \frac{\bar{X}_1 - \bar{X}_0}{s_X}\sqrt{\frac{n_0 n_1}{n^2}}$$
where $\bar{X}_1$ and $\bar{X}_0$ are the means of the feature for the two classes, $s_X$ is the overall standard deviation, and $n_0$, $n_1$ are the class counts.
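A brief sketch using scipy's `pointbiserialr` on synthetic data (the class labels and the mean shift of 1.5 are arbitrary); it returns the same value as `pearsonr` when one variable is binary:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=300)          # binary class labels
x = rng.normal(loc=1.5 * y, scale=1.0)    # continuous feature shifted by class

r_pb, p = stats.pointbiserialr(y, x)      # dichotomous variable goes first
r, _ = stats.pearsonr(x, y)
print(f"point-biserial: {r_pb:.3f} (p = {p:.2g})")
print(f"Pearson:        {r:.3f}")         # identical value
```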
Use Pearson for continuous targets with approximately linear relationships. Use Spearman when you expect monotonic but possibly non-linear relationships, or when outliers are present. Use Point-Biserial (or just Pearson) for binary classification with continuous features. For categorical features, consider mutual information or chi-squared tests instead.
Statistical hypothesis tests provide a principled framework for feature selection, allowing us to assess whether a feature's relationship with the target is statistically significant or likely due to random chance.
The chi-squared test of independence evaluates whether two categorical variables are associated. For feature selection with a categorical target, we compute:
$$\chi^2 = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$
where $O_{ij}$ is the observed frequency in cell $(i,j)$ of the contingency table and $E_{ij} = \frac{(\text{row } i \text{ total})(\text{col } j \text{ total})}{n}$ is the expected frequency under independence.
Higher chi-squared values indicate stronger association. Features with low p-values (rejecting the null hypothesis of independence) are considered informative.
```python
from sklearn.feature_selection import chi2, SelectKBest
from sklearn.preprocessing import MinMaxScaler
import numpy as np

def chi_squared_filter(X: np.ndarray, y: np.ndarray, k: int = 10) -> tuple:
    """
    Select top-k features using chi-squared test.

    Note: Chi-squared requires non-negative features. Typically used
    with count data or after scaling to [0, 1].

    Parameters:
    -----------
    X : Feature matrix (must be non-negative)
    y : Target vector (categorical/discrete)
    k : Number of features to select

    Returns:
    --------
    chi2_scores : Chi-squared statistics for each feature
    p_values : P-values for each feature
    selected_indices : Indices of selected features
    """
    # Ensure non-negative (required for chi2)
    scaler = MinMaxScaler()
    X_scaled = scaler.fit_transform(X)

    # Compute chi-squared statistics
    chi2_scores, p_values = chi2(X_scaled, y)

    # Select top-k features
    selector = SelectKBest(chi2, k=k)
    selector.fit(X_scaled, y)
    selected_indices = selector.get_support(indices=True)

    return chi2_scores, p_values, selected_indices

# Example: Text classification feature selection
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer

# Load a subset of 20 newsgroups
categories = ['sci.med', 'sci.space']
newsgroups = fetch_20newsgroups(subset='train', categories=categories)

# Create bag-of-words features
vectorizer = CountVectorizer(max_features=1000, stop_words='english')
X = vectorizer.fit_transform(newsgroups.data).toarray()
y = newsgroups.target

# Apply chi-squared filter
scores, pvals, selected = chi_squared_filter(X, y, k=20)

# Show top features
feature_names = vectorizer.get_feature_names_out()
top_features = [(feature_names[i], scores[i]) for i in selected]
top_features.sort(key=lambda x: x[1], reverse=True)
print("Top discriminative words:")
for word, score in top_features[:10]:
    print(f"  {word}: χ² = {score:.2f}")
```

When features are continuous and the target is categorical, the ANOVA F-test (Analysis of Variance) assesses whether feature means differ significantly across classes:
$$F = \frac{\text{Between-group variance}}{\text{Within-group variance}} = \frac{\sum_{k} n_k (\bar{x}_k - \bar{x})^2 / (K-1)}{\sum_{k}\sum_{i \in k} (x_i - \bar{x}_k)^2 / (n-K)}$$
where $K$ is the number of classes, $n_k$ is the count in class $k$, $\bar{x}_k$ is the class mean, and $\bar{x}$ is the overall mean.
Higher F-values indicate greater class separation along that feature dimension.
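In scikit-learn this test is available as `f_classif`; a brief sketch on the Iris data (the choice of dataset and k = 2 is arbitrary):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

# F-statistic and p-value per feature
F, pvals = f_classif(X, y)
print("F-values:", F.round(1))

# Keep the two features with the largest F-values
selector = SelectKBest(f_classif, k=2).fit(X, y)
print("Selected feature indices:", selector.get_support(indices=True))
```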
Mutual information (MI) provides a non-parametric measure of dependency that captures both linear and non-linear relationships:
$$I(X; Y) = \sum_{x,y} p(x,y) \log\frac{p(x,y)}{p(x)p(y)}$$
For continuous variables, estimation requires density estimation (typically via k-nearest neighbors). MI has several advantages: it captures arbitrary dependencies, including non-monotonic ones; it applies to both discrete and continuous variables; and it is zero exactly when the feature and target are independent.
However, MI estimation is more computationally expensive and requires careful hyperparameter tuning (e.g., number of neighbors).
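A small sketch contrasting correlation with the kNN-based MI estimate (synthetic data; the sin² target and the two `n_neighbors` settings are illustrative choices):

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(42)
n = 1000
X = rng.uniform(-3, 3, size=(n, 3))
# Non-monotonic target: only feature 0 matters
y = np.sin(X[:, 0]) ** 2 + rng.normal(scale=0.1, size=n)

# Pearson correlation is near zero even for the informative feature...
print("corr:", [round(np.corrcoef(X[:, i], y)[0, 1], 3) for i in range(3)])

# ...while MI detects it; n_neighbors trades off bias and variance of the estimate
print("MI (k=3): ", mutual_info_regression(X, y, n_neighbors=3, random_state=0).round(3))
print("MI (k=20):", mutual_info_regression(X, y, n_neighbors=20, random_state=0).round(3))
```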
| Feature Type | Target Type | Recommended Methods |
|---|---|---|
| Continuous | Continuous | Pearson correlation, Spearman correlation, Mutual information |
| Continuous | Categorical | ANOVA F-test, Point-biserial correlation, Mutual information |
| Categorical | Continuous | ANOVA (feature as grouping variable), Mutual information |
| Categorical | Categorical | Chi-squared test, Cramér's V, Mutual information |
Not all filter methods require target labels. Unsupervised filters evaluate features based solely on their intrinsic properties, which is useful for unlabeled datasets, for preprocessing before clustering, and as a cheap first pass that discards constant or quasi-constant features before any supervised scoring.
The simplest unsupervised filter removes features with low variance. The intuition: a feature that barely varies across samples provides little discriminative information.
$$\text{Var}(X) = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2$$
For binary features, variance is maximized when the feature is split 50/50: $\text{Var} = 0.5 \times 0.5 = 0.25$. A feature that is 0 for 99% of samples has variance $0.99 \times 0.01 \approx 0.01$.
```python
from sklearn.feature_selection import VarianceThreshold
import numpy as np

def variance_filter(X: np.ndarray, threshold: float = 0.0) -> tuple:
    """
    Remove features with variance below threshold.

    Parameters:
    -----------
    X : Feature matrix
    threshold : Minimum variance to retain feature
        - For binary features, 0.8*(1-0.8) = 0.16 removes features
          where >80% of samples have the same value

    Returns:
    --------
    X_filtered : Filtered feature matrix
    selected_indices : Indices of retained features
    variances : Variance of each original feature
    """
    variances = np.var(X, axis=0)
    selector = VarianceThreshold(threshold=threshold)
    X_filtered = selector.fit_transform(X)
    selected_indices = selector.get_support(indices=True)
    return X_filtered, selected_indices, variances

# Example: Remove quasi-constant features
np.random.seed(42)
n_samples = 1000

# Create features with varying variance
X = np.column_stack([
    np.random.randn(n_samples),                           # High variance (normal)
    np.random.randn(n_samples) * 0.01,                    # Very low variance
    np.random.choice([0, 1], n_samples, p=[0.99, 0.01]),  # Quasi-constant
    np.random.randn(n_samples) * 2,                       # High variance
    np.zeros(n_samples) + 5,                              # Zero variance (constant)
])

X_filtered, selected, variances = variance_filter(X, threshold=0.01)
print(f"Original features: {X.shape[1]}")
print(f"Features after filtering: {X_filtered.shape[1]}")
print(f"Variances: {variances.round(4)}")
print(f"Selected indices: {selected}")
```

While supervised filters identify relevant features, they don't account for redundancy: multiple features carrying the same information. Highly correlated features provide diminishing returns and can inflate dimensionality without adding information, cause instability (multicollinearity) in linear models, and make the selected feature set harder to interpret.
A common approach: for each pair of features with correlation above a threshold, remove one (typically the one with lower target correlation).
```python
import numpy as np

def remove_correlated_features(X: np.ndarray, threshold: float = 0.95,
                               target_correlations: np.ndarray = None) -> list:
    """
    Remove highly correlated features, keeping the more predictive one.

    Parameters:
    -----------
    X : Feature matrix
    threshold : Correlation threshold for considering features redundant
    target_correlations : If provided, keep feature with higher target correlation

    Returns:
    --------
    features_to_remove : List of feature indices to remove
    """
    n_features = X.shape[1]
    corr_matrix = np.corrcoef(X.T)
    features_to_remove = set()

    for i in range(n_features):
        if i in features_to_remove:
            continue
        for j in range(i + 1, n_features):
            if j in features_to_remove:
                continue
            if abs(corr_matrix[i, j]) > threshold:
                # Features are highly correlated - remove one
                if target_correlations is not None:
                    # Remove the one less correlated with target
                    if abs(target_correlations[i]) >= abs(target_correlations[j]):
                        features_to_remove.add(j)
                    else:
                        features_to_remove.add(i)
                        break  # i is removed, move to next i
                else:
                    # Default: remove the second one
                    features_to_remove.add(j)

    return sorted(list(features_to_remove))

# Example
np.random.seed(42)
n_samples = 500

# Create features with known correlations
base_feature = np.random.randn(n_samples)
X = np.column_stack([
    base_feature,                                     # Feature 0
    base_feature + np.random.randn(n_samples) * 0.1,  # Highly correlated with 0
    np.random.randn(n_samples),                       # Independent
    base_feature * 2 + 1,                             # Perfectly correlated with 0
    np.random.randn(n_samples),                       # Independent
])

to_remove = remove_correlated_features(X, threshold=0.9)
print(f"Features to remove: {to_remove}")
print(f"Correlation matrix:\n{np.corrcoef(X.T).round(2)}")
```

To address the limitations of univariate filters, several multivariate filter methods have been developed. These consider feature interactions and redundancy while maintaining computational efficiency.
The mRMR algorithm balances two criteria: maximizing each candidate feature's relevance to the target while minimizing its redundancy with the features already selected.
Starting from an empty set $S$, mRMR greedily adds the feature that maximizes:
$$\text{mRMR}(X_i) = I(X_i; Y) - \frac{1}{|S|}\sum_{X_j \in S} I(X_i; X_j)$$
The first term measures relevance (mutual information with target), the second penalizes redundancy with already-selected features.
```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression

def mrmr_filter(X: np.ndarray, y: np.ndarray, k: int = 10,
                is_classification: bool = True) -> list:
    """
    Minimum Redundancy Maximum Relevance feature selection.

    Parameters:
    -----------
    X : Feature matrix of shape (n_samples, n_features)
    y : Target vector
    k : Number of features to select
    is_classification : Whether this is a classification task

    Returns:
    --------
    selected_features : List of selected feature indices in order of selection
    """
    n_features = X.shape[1]

    # Compute relevance: mutual information with target
    if is_classification:
        relevance = mutual_info_classif(X, y, random_state=42)
    else:
        relevance = mutual_info_regression(X, y, random_state=42)

    # Initialize
    selected = []
    remaining = list(range(n_features))

    # First feature: highest relevance
    first = np.argmax(relevance)
    selected.append(first)
    remaining.remove(first)

    # Precompute pairwise redundancy (for efficiency)
    # Using absolute correlation as a proxy for MI (faster, often sufficient)
    corr_matrix = np.abs(np.corrcoef(X.T))

    # Greedily select remaining features
    while len(selected) < k and remaining:
        best_score = -np.inf
        best_feature = None

        for f in remaining:
            # Relevance term
            rel = relevance[f]

            # Redundancy term: average correlation with selected features
            red = np.mean([corr_matrix[f, s] for s in selected]) if selected else 0

            # mRMR score
            score = rel - red
            if score > best_score:
                best_score = score
                best_feature = f

        selected.append(best_feature)
        remaining.remove(best_feature)

    return selected

# Example usage
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
X, y = data.data, data.target

selected_features = mrmr_filter(X, y, k=10)
print("Selected features (in order of selection):")
for i, idx in enumerate(selected_features):
    print(f"  {i+1}. {data.feature_names[idx]}")
```

ReliefF is a feature weighting algorithm that evaluates features by their ability to distinguish between instances that are close to each other. For each sampled instance, it finds the $k$ nearest neighbors of the same class (the nearest hits) and the $k$ nearest neighbors from each other class (the nearest misses), then updates every feature's weight according to how much that feature's value differs between the instance and those neighbors.
A feature is deemed important if it takes similar values for an instance and its nearest hits but clearly different values for the instance and its nearest misses.
The update rule for feature $f$ based on instance $R_i$:
$$W_f \leftarrow W_f - \sum_{j=1}^{k}\frac{\text{diff}(f, R_i, H_j)}{mk} + \sum_{c \neq \text{class}(R_i)} \frac{P(c)}{1-P(\text{class}(R_i))} \sum_{j=1}^{k}\frac{\text{diff}(f, R_i, M_j(c))}{mk}$$
where $m$ is the number of sampled instances, $k$ is the number of neighbors, and $H_j$ and $M_j(c)$ denote the $j$-th nearest hit and the $j$-th nearest miss from class $c$.
ReliefF naturally handles feature interactions because it considers local neighborhoods. If two features together distinguish classes (like XOR), instances will have similar hit distances and different miss distances based on the combination. This makes ReliefF one of the few filter methods that can detect interaction effects.
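As a rough illustration of the neighborhood idea, here is a heavily simplified sketch of the original Relief algorithm (binary classes, one nearest hit and one nearest miss per sampled instance, i.e. k = 1) applied to XOR-style data; it is not a full ReliefF implementation:

```python
import numpy as np

def relief(X: np.ndarray, y: np.ndarray, n_iterations: int = 300,
           random_state: int = 0) -> np.ndarray:
    """Simplified Relief: binary classes, one nearest hit/miss per sampled instance."""
    rng = np.random.default_rng(random_state)
    n_samples, n_features = X.shape
    span = X.max(axis=0) - X.min(axis=0)          # normalise diffs to [0, 1]
    span[span == 0] = 1.0
    W = np.zeros(n_features)

    for _ in range(n_iterations):
        i = rng.integers(n_samples)
        dist = np.abs(X - X[i]).sum(axis=1)       # L1 distance to every instance
        dist[i] = np.inf                          # exclude the instance itself
        same = (y == y[i])
        hit = np.argmin(np.where(same, dist, np.inf))    # nearest same-class instance
        miss = np.argmin(np.where(~same, dist, np.inf))  # nearest other-class instance
        W -= np.abs(X[i] - X[hit]) / span / n_iterations
        W += np.abs(X[i] - X[miss]) / span / n_iterations
    return W

# XOR-style data: neither binary feature is univariately informative,
# but Relief ranks both above the irrelevant noise feature
rng = np.random.default_rng(1)
A = rng.integers(0, 2, size=(1000, 2)).astype(float)
X = np.hstack([A, rng.random((1000, 1))])         # third column is pure noise
y = A[:, 0].astype(int) ^ A[:, 1].astype(int)
print(relief(X, y).round(3))
```

The interacting features receive clearly positive weights while the noise feature stays near zero, mirroring the argument above.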
Applying filter methods effectively requires attention to several practical concerns that can significantly impact results.
Many filter metrics are not scale-invariant. Before applying variance thresholds or distance-based methods like ReliefF, consider standardization:
$$X_{\text{standardized}} = \frac{X - \mu}{\sigma}$$
However, correlation-based methods and mutual information are naturally scale-invariant.
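A small sketch of why this matters for a variance threshold (synthetic data; the column scales merely mimic a change of measurement units):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import VarianceThreshold

rng = np.random.default_rng(0)
X = np.column_stack([
    rng.normal(size=500) * 100,     # large numeric scale
    rng.normal(size=500) * 0.001,   # same kind of signal, tiny numeric scale
])

# Raw data: the second column is dropped purely because of its units
print(VarianceThreshold(0.01).fit_transform(X).shape)        # (500, 1)

# After standardization both columns have unit variance and survive
X_std = StandardScaler().fit_transform(X)
print(VarianceThreshold(0.01).fit_transform(X_std).shape)    # (500, 2)
```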
When computing p-values for many features, the risk of false discoveries increases. With 10,000 features and α = 0.05, we expect ~500 false positives by chance.
Common corrections include the Bonferroni correction, which divides α by the number of tests and controls the family-wise error rate, and the Benjamini-Hochberg procedure, which controls the false discovery rate and is usually preferred for high-dimensional screening because it is far less conservative.
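A minimal sketch of both corrections on synthetic data (assuming statsmodels is available for the Benjamini-Hochberg step; the effect size and dimensions are arbitrary):

```python
import numpy as np
from sklearn.feature_selection import f_classif
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(0)
n, p = 200, 2000
X = rng.normal(size=(n, p))
y = rng.integers(0, 2, size=n)
X[:, :10] += 1.5 * y[:, None]      # only the first 10 features are truly informative

_, pvals = f_classif(X, y)
print("uncorrected p < 0.05:", int((pvals < 0.05).sum()))  # roughly 10 true + ~100 false positives

# Bonferroni: control the family-wise error rate
print("Bonferroni:", int((pvals < 0.05 / p).sum()))

# Benjamini-Hochberg: control the false discovery rate
reject, _, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")
print("BH (FDR):", int(reject.sum()))
```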
| Criterion | Recommended Approach |
|---|---|
| Very high dimensionality (p >> 10,000) | Start with variance threshold, then univariate correlation/MI |
| Moderate dimensionality | mRMR or ReliefF for interaction-aware selection |
| Speed is critical | Univariate methods (parallelizable, O(np) complexity) |
| Non-linear relationships expected | Mutual information over correlation |
| Binary/categorical features | Chi-squared test or mutual information |
| Interpretability required | Correlation coefficients (directly interpretable) |
Filter methods should typically be the first step, not the final word, in feature selection. They efficiently reduce dimensionality from thousands/millions to a manageable set, which can then be refined using wrapper or embedded methods that account for the specific learning algorithm.
Filter methods provide fast, model-agnostic feature screening. In the next section, we'll explore Wrapper Methods—approaches that evaluate feature subsets using actual model performance, enabling selection optimized for specific learning algorithms at the cost of increased computation.