High-dimensional feature spaces are a common challenge in machine learning. Sparse features—where most values are zero—are particularly prevalent in text processing, categorical encoding (one-hot encoding), and feature engineering. Traditional algorithms must scan every feature when finding splits, making high dimensionality extremely costly.
Exclusive Feature Bundling (EFB) is LightGBM's elegant solution to this problem. The key insight is remarkably simple: many features never take non-zero values simultaneously. If feature A is non-zero only when feature B is zero (and vice versa), we can safely combine them into a single derived feature without losing information.
By bundling mutually exclusive features, EFB dramatically reduces the effective feature dimensionality, accelerating both histogram construction and split finding—all without sacrificing model accuracy.
By the end of this page, you will understand why sparse features create computational challenges, the concept of mutual exclusivity between features, how EFB identifies features that can be bundled together, the graph coloring algorithm used for bundle construction, the encoding scheme that preserves distinguishability within bundles, and practical scenarios where EFB provides the greatest benefit.
Before understanding EFB's solution, we need to understand the problem it solves: the computational burden of sparse, high-dimensional features.
Sources of Sparse Features:
Sparse features arise from many common data preprocessing techniques:
One-Hot Encoding: A categorical feature with 1000 categories becomes 1000 binary features. For any sample, exactly one is 1, and 999 are 0.
Text Processing (Bag of Words): A vocabulary of 50,000 words creates 50,000 features per document. Most documents contain only a few hundred unique words.
Feature Interactions: Creating polynomial features from 100 base features yields thousands of interaction terms, most of which are zero for any given sample.
Bucketed Numericals: Discretizing a continuous feature into 100 bins creates 100 binary features, with only one non-zero per sample.
The Computational Cost:
In histogram-based gradient boosting, building histograms costs $O(\text{samples} \times \text{features})$ per pass over the data, and evaluating splits costs $O(\text{features} \times \text{bins})$ per node. With 10,000 sparse features and 256 bins each, every node must scan roughly 2.56 million candidate bins, and every sample touches all 10,000 feature histograms—even though most of those feature values are zero!
| Scenario | Raw Features | Non-zero per Sample | Sparsity |
|---|---|---|---|
| One-hot encoding (1000 categories) | 1,000 | 1 | 99.9% |
| Text classification (50K vocab) | 50,000 | ~200 | 99.6% |
| Click-through rate prediction | 100,000+ | ~100 | 99.9% |
| Polynomial features (d=2, n=100) | 5,050 | ~100 | 98.0% |
| Genomic data | 20,000+ | ~500 | 97.5% |
Sparse feature representations are often chosen for their representational convenience (one-hot encoding is clean and avoids ordinal assumptions). But they create massive computational overhead. EFB allows us to keep the representational benefits while eliminating most of the computational cost.
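To make the cost arithmetic above concrete, here is a back-of-envelope sketch; the sample, feature, and bundle counts are assumed purely for illustration, not measured:

```python
# Back-of-envelope sketch of histogram work before and after bundling.
# The counts below are illustrative assumptions, not benchmarks.
n_samples, n_features, n_bundles = 100_000, 10_000, 50

updates_raw = n_samples * n_features   # one bin update per (sample, feature)
updates_efb = n_samples * n_bundles    # one bin update per (sample, bundle)

print(f"raw:     {updates_raw:,} histogram updates per full pass")
print(f"bundled: {updates_efb:,} histogram updates per full pass")
print(f"reduction: {updates_raw // updates_efb}x")  # 200x fewer updates
```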
The concept of mutual exclusivity is central to EFB. Two features are mutually exclusive if they never both take non-zero values for the same sample.
Formal Definition:
Features $f_j$ and $f_k$ are mutually exclusive if: $$\forall i:\quad x_{ij} \neq 0 \implies x_{ik} = 0$$
In other words, whenever feature $j$ is active (non-zero), feature $k$ must be inactive (zero), and vice versa.
Perfect vs. Near Exclusivity:
In practice, strictly perfect exclusivity is rare except in one-hot encoded features. EFB relaxes this requirement by allowing near exclusivity—features that conflict on only a small fraction of samples can still be bundled together.
If features $f_j$ and $f_k$ are simultaneously non-zero on at most an $\epsilon$ fraction of samples, we say they are $\epsilon$-exclusive; they can still be bundled at the cost of a small loss of precision.
```python
import numpy as np
from scipy import sparse


def compute_conflict_matrix(X: np.ndarray, threshold: float = 0.0) -> np.ndarray:
    """
    Compute the conflict count between all pairs of features.
    Two features conflict on a sample if both are non-zero for that sample.

    Parameters:
    -----------
    X : array of shape (n_samples, n_features)
        Feature matrix (can be dense or sparse)
    threshold : float
        Values with |x| <= threshold are considered zero

    Returns:
    --------
    conflicts : array of shape (n_features, n_features)
        conflicts[i, j] = number of samples where both features i and j are non-zero
    """
    # Convert to binary (non-zero indicator)
    if sparse.issparse(X):
        X_binary = (np.abs(X) > threshold).astype(np.float32)
        # X^T @ X gives dot products of binary columns = conflict counts
        conflicts = (X_binary.T @ X_binary).toarray()
    else:
        X_binary = (np.abs(X) > threshold).astype(np.float32)
        conflicts = X_binary.T @ X_binary
    return conflicts.astype(int)


def find_exclusive_pairs(X: np.ndarray, max_conflict_rate: float = 0.01):
    """
    Find pairs of features that are nearly mutually exclusive.

    Parameters:
    -----------
    X : Feature matrix
    max_conflict_rate : Maximum fraction of conflicting samples allowed

    Returns:
    --------
    exclusive_pairs : list of (feature_i, feature_j) that are epsilon-exclusive
    """
    n_samples, n_features = X.shape
    max_conflicts = int(n_samples * max_conflict_rate)
    conflicts = compute_conflict_matrix(X)

    exclusive_pairs = []
    for i in range(n_features):
        for j in range(i + 1, n_features):
            if conflicts[i, j] <= max_conflicts:
                exclusive_pairs.append((i, j))
    return exclusive_pairs, conflicts


# Demonstration with one-hot encoded features
def demonstrate_exclusivity():
    np.random.seed(42)
    n_samples = 1000

    # Create a realistic scenario:
    # - 3 categorical features with different cardinalities
    # - Each gets one-hot encoded
    cat1 = np.random.randint(0, 5, n_samples)   # 5 categories
    cat2 = np.random.randint(0, 10, n_samples)  # 10 categories
    cat3 = np.random.randint(0, 8, n_samples)   # 8 categories

    # One-hot encode (creates 5 + 10 + 8 = 23 features)
    from sklearn.preprocessing import OneHotEncoder
    encoder = OneHotEncoder(sparse_output=False)
    X_encoded = encoder.fit_transform(np.column_stack([cat1, cat2, cat3]))

    print("Original: 3 categorical features")
    print(f"After one-hot encoding: {X_encoded.shape[1]} features")

    # Compute conflicts
    conflicts = compute_conflict_matrix(X_encoded)
    print(f"\nConflict matrix shape: {conflicts.shape}")
    print(f"Total pairs of features: {conflicts.shape[0] * (conflicts.shape[0] - 1) // 2}")

    # Count perfectly exclusive pairs (upper triangle only, excluding the diagonal)
    iu = np.triu_indices(X_encoded.shape[1], k=1)
    n_exclusive = int(np.sum(conflicts[iu] == 0))
    n_total_pairs = X_encoded.shape[1] * (X_encoded.shape[1] - 1) // 2
    print(f"Perfectly exclusive pairs: {n_exclusive} ({100*n_exclusive/n_total_pairs:.1f}%)")

    # Features within the same one-hot group are perfectly exclusive
    # Features across groups have conflicts based on joint distribution

    # Identify bundles (features from same original categorical)
    print("\nFeature bundles (from same categorical):")
    categories = encoder.categories_
    offset = 0
    for i, cat in enumerate(categories):
        bundle_indices = list(range(offset, offset + len(cat)))
        print(f"  Bundle {i+1}: features {bundle_indices} ({len(cat)} features)")
        offset += len(cat)

    return X_encoded, conflicts


if __name__ == "__main__":
    X, conflicts = demonstrate_exclusivity()
```

Features derived from one-hot encoding the same categorical variable are guaranteed to be mutually exclusive—exactly one is 1 and all others are 0 for every sample.
This makes one-hot encoded data the ideal use case for EFB, as bundles are obvious and loss-free.
Finding optimal bundles of mutually exclusive features is equivalent to a classic graph theory problem: graph coloring.
The Conflict Graph:
We construct an undirected conflict graph $G = (V, E)$ where:
- each vertex in $V$ corresponds to one feature, and
- an edge in $E$ connects two features whenever they conflict, i.e., they are both non-zero on more samples than the allowed tolerance (edges can be weighted by the conflict count).
Features that can be bundled together must form an independent set in this graph (no edges between them). Finding the minimum number of bundles is equivalent to finding the chromatic number of the conflict graph.
The Graph Coloring Problem:
Graph coloring assigns colors to vertices such that no two adjacent vertices share the same color. The chromatic number $\chi(G)$ is the minimum number of colors needed.
In the EFB context:
- vertices are features,
- edges are conflicts between features,
- each color corresponds to a bundle, and
- the number of colors used equals the number of bundles.
Computational Complexity:
Finding the chromatic number is NP-hard in general. However, EFB doesn't need the optimal solution—a good greedy approximation is sufficient.
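To see the idea, here is a minimal greedy coloring sketch on a made-up conflict graph (the adjacency matrix is a hypothetical example); LightGBM's actual procedure, described next, additionally orders features by conflict count rather than by index:

```python
import numpy as np

# Vertices are features; adjacency[i, j] = 1 means features i and j conflict
# (both non-zero on too many samples) and therefore cannot share a bundle.
adjacency = np.array([
    [0, 1, 0, 0, 1],
    [1, 0, 0, 1, 0],
    [0, 0, 0, 1, 1],
    [0, 1, 1, 0, 0],
    [1, 0, 1, 0, 0],
])

colors = {}  # vertex -> color index (each color class becomes one bundle)
for v in range(adjacency.shape[0]):
    taken = {colors[u] for u in colors if adjacency[v, u]}
    color = 0
    while color in taken:  # smallest color unused by any conflicting neighbor
        color += 1
    colors[v] = color

bundles = {}
for v, c in colors.items():
    bundles.setdefault(c, []).append(v)
print(bundles)  # {0: [0, 2], 1: [1, 4], 2: [3]} -> 5 features compress to 3 bundles
```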
LightGBM's Greedy Bundling Algorithm:
LightGBM uses a greedy algorithm based on the observation that features with more conflicts are harder to bundle. The algorithm:
1. Build the conflict graph and count, for each feature, its total number of conflicts with all other features.
2. Sort features by total conflict count in descending order, so the hardest-to-place features are handled first.
3. For each feature in that order, add it to the first existing bundle where the added conflicts stay within the tolerance; if no such bundle exists, start a new bundle.
This greedy approach doesn't guarantee optimality, but it works well in practice: sparse features usually come in naturally exclusive groups (such as one-hot blocks), and even a somewhat suboptimal bundling captures most of the dimensionality reduction without hurting accuracy.
```python
import numpy as np
from typing import List


class GreedyBundler:
    """
    Greedy feature bundling algorithm as used in LightGBM.

    This implements the graph coloring approach to finding bundles
    of mutually (or nearly) exclusive features.
    """

    def __init__(self, max_conflict: int = 0):
        """
        Initialize bundler.

        Parameters:
        -----------
        max_conflict : int
            Maximum number of conflicting samples allowed for features
            to be bundled together. 0 = perfectly exclusive only.
        """
        self.max_conflict = max_conflict
        self.bundles: List[List[int]] = []

    def fit(self, X: np.ndarray) -> 'GreedyBundler':
        """
        Find feature bundles using greedy algorithm.

        Parameters:
        -----------
        X : array of shape (n_samples, n_features)

        Returns:
        --------
        self
        """
        n_samples, n_features = X.shape

        # Step 1: Compute conflict counts for each feature pair
        X_binary = (X != 0).astype(np.float32)
        conflict_matrix = (X_binary.T @ X_binary).astype(int)
        np.fill_diagonal(conflict_matrix, 0)  # No self-conflicts

        # Step 2: Compute total conflicts per feature
        total_conflicts = conflict_matrix.sum(axis=1)

        # Step 3: Sort features by total conflicts (descending)
        # Features with more conflicts are harder to bundle, process first
        sorted_features = np.argsort(total_conflicts)[::-1]

        # Step 4: Greedy bundling
        self.bundles = []
        for feature_idx in sorted_features:
            # Try to add to an existing bundle
            added = False
            for bundle_idx, bundle in enumerate(self.bundles):
                # Check if this feature conflicts too much with the bundle
                bundle_conflict_count = sum(
                    conflict_matrix[feature_idx, f] for f in bundle
                )
                if bundle_conflict_count <= self.max_conflict:
                    # Can add to this bundle
                    self.bundles[bundle_idx].append(feature_idx)
                    added = True
                    break

            if not added:
                # Create new bundle
                self.bundles.append([feature_idx])

        return self

    def get_bundles(self) -> List[List[int]]:
        """Return the computed bundles."""
        return self.bundles

    def get_statistics(self, n_features: int) -> dict:
        """Get bundling statistics."""
        bundle_sizes = [len(b) for b in self.bundles]
        return {
            'original_features': n_features,
            'num_bundles': len(self.bundles),
            'reduction_ratio': n_features / len(self.bundles) if self.bundles else 1,
            'avg_bundle_size': np.mean(bundle_sizes) if bundle_sizes else 0,
            'max_bundle_size': max(bundle_sizes) if bundle_sizes else 0,
            'min_bundle_size': min(bundle_sizes) if bundle_sizes else 0,
        }


# Demonstration
def demonstrate_bundling():
    np.random.seed(42)
    n_samples = 1000

    # Create sparse one-hot-like features from multiple categoricals
    # This simulates a common scenario in tabular data

    # Categorical 1: 20 categories (one-hot → 20 features)
    cat1 = np.zeros((n_samples, 20))
    cat1[np.arange(n_samples), np.random.randint(0, 20, n_samples)] = 1

    # Categorical 2: 15 categories
    cat2 = np.zeros((n_samples, 15))
    cat2[np.arange(n_samples), np.random.randint(0, 15, n_samples)] = 1

    # Categorical 3: 25 categories
    cat3 = np.zeros((n_samples, 25))
    cat3[np.arange(n_samples), np.random.randint(0, 25, n_samples)] = 1

    # Some additional sparse features (not one-hot, but sparse)
    sparse_features = np.random.choice([0, 1], size=(n_samples, 10), p=[0.95, 0.05])

    # Dense numerical features
    dense_features = np.random.randn(n_samples, 5)

    # Combine all
    X = np.hstack([cat1, cat2, cat3, sparse_features, dense_features])

    print("Feature Composition:")
    print("  One-hot from cat1: 20 features")
    print("  One-hot from cat2: 15 features")
    print("  One-hot from cat3: 25 features")
    print("  Sparse features: 10 features")
    print("  Dense features: 5 features")
    print(f"  Total: {X.shape[1]} features")

    # Apply greedy bundling
    bundler = GreedyBundler(max_conflict=0)  # Perfect exclusivity
    bundler.fit(X)

    stats = bundler.get_statistics(X.shape[1])
    print("\nBundling Results (strict, max_conflict=0):")
    print(f"  Number of bundles: {stats['num_bundles']}")
    print(f"  Reduction ratio: {stats['reduction_ratio']:.2f}x")
    print(f"  Average bundle size: {stats['avg_bundle_size']:.1f}")
    print(f"  Max bundle size: {stats['max_bundle_size']}")

    # Show bundle composition
    print("\nBundle contents:")
    for i, bundle in enumerate(bundler.get_bundles()[:5]):  # Show first 5
        print(f"  Bundle {i}: features {sorted(bundle)[:10]}"
              + (f"... ({len(bundle)} total)" if len(bundle) > 10 else f" ({len(bundle)} total)"))
    if len(bundler.get_bundles()) > 5:
        print(f"  ... and {len(bundler.get_bundles()) - 5} more bundles")

    # Now with relaxed constraints
    bundler_relaxed = GreedyBundler(max_conflict=10)
    bundler_relaxed.fit(X)
    stats_relaxed = bundler_relaxed.get_statistics(X.shape[1])
    print("\nBundling Results (relaxed, max_conflict=10):")
    print(f"  Number of bundles: {stats_relaxed['num_bundles']}")
    print(f"  Reduction ratio: {stats_relaxed['reduction_ratio']:.2f}x")

    return X, bundler


if __name__ == "__main__":
    X, bundler = demonstrate_bundling()
```

Once features are grouped into bundles, we need a way to encode multiple original features into a single derived feature. The encoding must preserve distinguishability—we need to be able to recover which original feature's value we're seeing.
The Offset Encoding Scheme:
LightGBM uses a simple but effective scheme. For each bundle:
Let features $f_1, f_2, \ldots, f_k$ be in a bundle, with value ranges $[0, r_1], [0, r_2], \ldots, [0, r_k]$.
Assign offsets so that each feature occupies a disjoint range of bundled values: $$\text{offset}_1 = 0, \qquad \text{offset}_{j+1} = \text{offset}_j + (r_j + 1)$$
The bundled value is: $$v_{\text{bundle}} = v_{\text{original}} + \text{offset}_{\text{feature}}$$
Since features are mutually exclusive, only one contributes a non-zero value at a time.
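As a quick illustration (with made-up ranges): suppose a bundle holds two features with value ranges $[0, 3]$ and $[0, 2]$. Then $\text{offset}_1 = 0$ and $\text{offset}_2 = 3 + 1 = 4$, so a value of $2$ in the first feature is stored as $2$, while a value of $2$ in the second feature is stored as $2 + 4 = 6$. The encoded ranges $[0, 3]$ and $[4, 6]$ never overlap, so the original feature and its value can always be recovered from the bundled value.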
```python
import numpy as np
from typing import List


class BundleEncoder:
    """
    Encodes bundles of mutually exclusive features into single features.

    Uses offset encoding to preserve distinguishability.
    """

    def __init__(self, bundles: List[List[int]], feature_ranges: np.ndarray):
        """
        Initialize encoder.

        Parameters:
        -----------
        bundles : List of feature index lists (one per bundle)
        feature_ranges : Array of (min, max) for each original feature
        """
        self.bundles = bundles
        self.feature_ranges = feature_ranges
        self.offsets = self._compute_offsets()

    def _compute_offsets(self) -> List[dict]:
        """Compute offsets for each feature in each bundle."""
        bundle_offsets = []
        for bundle in self.bundles:
            offsets = {}
            current_offset = 0
            for feature_idx in bundle:
                offsets[feature_idx] = current_offset
                # Range is max_value - min_value + 1 to leave room for the encoding
                feature_range = (self.feature_ranges[feature_idx][1]
                                 - self.feature_ranges[feature_idx][0] + 1)
                current_offset += feature_range
            bundle_offsets.append(offsets)
        return bundle_offsets

    def encode(self, X: np.ndarray) -> np.ndarray:
        """
        Transform original features to bundled features.

        Parameters:
        -----------
        X : array of shape (n_samples, n_features)

        Returns:
        --------
        X_bundled : array of shape (n_samples, n_bundles)
        """
        n_samples = X.shape[0]
        X_bundled = np.zeros((n_samples, len(self.bundles)))

        for bundle_idx, bundle in enumerate(self.bundles):
            for feature_idx in bundle:
                # Get non-zero mask for this feature
                mask = X[:, feature_idx] != 0
                # Add offset and assign
                offset = self.offsets[bundle_idx][feature_idx]
                min_val = self.feature_ranges[feature_idx][0]
                # Bundled value = original value - min + offset
                X_bundled[mask, bundle_idx] = X[mask, feature_idx] - min_val + offset

        return X_bundled

    def get_bundle_info(self) -> List[dict]:
        """Get information about each bundle."""
        info = []
        for bundle_idx, bundle in enumerate(self.bundles):
            bundle_range = (max(self.offsets[bundle_idx].values())
                            + (self.feature_ranges[bundle[-1]][1]
                               - self.feature_ranges[bundle[-1]][0] + 1))
            info.append({
                'bundle_idx': bundle_idx,
                'n_features': len(bundle),
                'feature_indices': bundle,
                'total_range': bundle_range,
                'offsets': self.offsets[bundle_idx]
            })
        return info


# Demonstration
def demonstrate_encoding():
    np.random.seed(42)
    n_samples = 10

    # Create a simple example with two one-hot encoded categoricals
    # Category A: 4 classes → features 0-3
    # Category B: 3 classes → features 4-6
    cat_a = np.random.randint(0, 4, n_samples)
    cat_b = np.random.randint(0, 3, n_samples)

    # One-hot encode
    X = np.zeros((n_samples, 7))
    X[np.arange(n_samples), cat_a] = 1      # Features 0-3
    X[np.arange(n_samples), cat_b + 4] = 1  # Features 4-6

    print("Original Feature Matrix:")
    print("(Features 0-3 from category A, 4-6 from category B)")
    print(X.astype(int))

    # Define bundles
    bundles = [
        [0, 1, 2, 3],  # All from category A
        [4, 5, 6]      # All from category B
    ]

    # Feature ranges (all binary: 0 or 1)
    feature_ranges = np.array([[0, 1]] * 7)

    # Encode
    encoder = BundleEncoder(bundles, feature_ranges)
    X_bundled = encoder.encode(X)

    print(f"\nBundled Feature Matrix ({len(bundles)} bundles):")
    print(X_bundled.astype(int))

    # Explain the encoding
    print("\nBundle 0 (features 0-3):")
    for feat_idx in bundles[0]:
        offset = encoder.offsets[0][feat_idx]
        print(f"  Feature {feat_idx}: value 1 encodes to {offset + 1}")

    print("\nBundle 1 (features 4-6):")
    for feat_idx in bundles[1]:
        offset = encoder.offsets[1][feat_idx]
        print(f"  Feature {feat_idx}: value 1 encodes to {offset + 1}")

    # Verify encoding preserves information
    print("\nVerification - First 3 samples:")
    for i in range(3):
        orig_a = np.where(X[i, :4] == 1)[0][0]
        orig_b = np.where(X[i, 4:] == 1)[0][0]
        print(f"  Sample {i}: A={orig_a}, B={orig_b} → "
              f"Bundle0={int(X_bundled[i, 0])}, Bundle1={int(X_bundled[i, 1])}")

    return X, X_bundled, encoder


if __name__ == "__main__":
    X, X_bundled, encoder = demonstrate_encoding()
```

After bundling, histogram construction works normally on the bundled features. Split points in the bundled feature space implicitly encode conditions on the original features. For example, with the encoding above, a split at 'bundled_value < 4' separates original features 0-1 from features 2-3.
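To make that mapping explicit, here is a small sketch that traces a bundled split back to the original features. It assumes the BundleEncoder class from the example above is in scope, and describe_bundled_split is a hypothetical helper (it ignores the case where a feature's encoded range straddles the threshold):

```python
import numpy as np

# One bundle of four binary features; offsets become {0: 0, 1: 2, 2: 4, 3: 6}
bundles = [[0, 1, 2, 3]]
feature_ranges = np.array([[0, 1]] * 4)
encoder = BundleEncoder(bundles, feature_ranges)

def describe_bundled_split(encoder, bundle_idx, threshold):
    """List original features whose entire encoded range falls below the threshold."""
    left, right = [], []
    for feature_idx, offset in encoder.offsets[bundle_idx].items():
        min_val, max_val = encoder.feature_ranges[feature_idx]
        hi = offset + (max_val - min_val)  # largest encoded value for this feature
        (left if hi < threshold else right).append(feature_idx)
    return left, right

# A split at bundled_value < 4 sends features 0-1 left and features 2-3 right
print(describe_bundled_split(encoder, bundle_idx=0, threshold=4))  # ([0, 1], [2, 3])
```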
LightGBM implements EFB transparently—you typically don't need to configure it explicitly. However, understanding its behavior helps you optimize for your specific data.
LightGBM's EFB Implementation:
LightGBM applies EFB during preprocessing:
- when the training `Dataset` is constructed, feature sparsity is analyzed,
- nearly exclusive features are greedily grouped into bundles, and
- histograms are subsequently built over the bundled features rather than the originals.
The key parameter is max_conflict_rate (internal), which controls how much conflict is tolerated when bundling.
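As a minimal sketch of the user-facing knob (assuming a recent LightGBM version, where the dataset parameter enable_bundle toggles EFB and is on by default; the conflict tolerance itself is handled internally as described above):

```python
import numpy as np
import lightgbm as lgb

# Tiny synthetic stand-in for a sparse dataset (sizes and values are arbitrary)
rng = np.random.default_rng(0)
X_train = (rng.random((1000, 50)) < 0.05).astype(float)  # ~95% zeros
y_train = rng.integers(0, 2, size=1000)

params = {
    'objective': 'binary',
    'enable_bundle': True,  # dataset parameter controlling EFB (True by default)
    'verbose': -1,
}

train_data = lgb.Dataset(X_train, label=y_train, params=params)
model = lgb.train(params, train_data, num_boost_round=10)
```

Setting 'enable_bundle': False is mainly useful for debugging or benchmarking; in normal use the default behavior is what you want.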
When EFB is Most Effective:
EFB provides dramatic speedups when:
- many features are sparse (mostly zero),
- large blocks of features come from one-hot or similar encodings, and
- the number of features is large relative to the number of truly dense columns.
| Scenario | Original Features | After EFB | Speedup Factor |
|---|---|---|---|
| One-hot (1000 categories × 5 cat features) | 5,000 | ~5-10 | 500-1000× |
| Bag of words (10K vocab) | 10,000 | ~100-500 | 20-100× |
| CTR prediction (100K sparse) | 100,000 | ~5,000 | 20× |
| Mixed (sparse + dense) | 1,000 (90% sparse) | ~100 | 10× |
| Dense only | 100 (all dense) | ~100 | 1× (no benefit) |
```python
import lightgbm as lgb
import numpy as np
from sklearn.datasets import make_classification
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
import time


def create_sparse_high_dim_dataset(n_samples=50000):
    """
    Create a dataset with many sparse one-hot features.
    This is where EFB shines.
    """
    np.random.seed(42)

    # Generate base classification task
    X_numeric, y = make_classification(
        n_samples=n_samples,
        n_features=10,
        n_informative=5,
        n_redundant=2,
        random_state=42
    )

    # Add high-cardinality categoricals that will be one-hot encoded
    categoricals = [
        np.random.randint(0, 100, n_samples),  # 100 categories
        np.random.randint(0, 50, n_samples),   # 50 categories
        np.random.randint(0, 200, n_samples),  # 200 categories
        np.random.randint(0, 75, n_samples),   # 75 categories
    ]

    # One-hot encode (will create 100+50+200+75=425 sparse features)
    encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
    X_categorical = encoder.fit_transform(np.column_stack(categoricals))

    # Combine: 10 numeric + 425 one-hot = 435 features
    X = np.hstack([X_numeric, X_categorical])

    return X, y, encoder


def compare_with_and_without_sparse_handling():
    """
    Compare LightGBM with and without taking advantage of sparsity.
    """
    print("Creating sparse high-dimensional dataset...")
    X, y, encoder = create_sparse_high_dim_dataset(n_samples=50000)
    print(f"Dataset shape: {X.shape}")
    print("  Numeric features: 10")
    print(f"  One-hot features: {X.shape[1] - 10}")
    print(f"  Sparsity: {100 * (X == 0).mean():.1f}%")

    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

    # LightGBM with default settings (EFB active)
    # Note: LightGBM automatically enables EFB when beneficial
    params_with_efb = {
        'objective': 'binary',
        'metric': 'auc',
        'boosting_type': 'gbdt',
        'num_leaves': 31,
        'learning_rate': 0.05,
        'force_col_wise': True,  # Column-wise histogram (better for sparse)
        'verbose': -1,
    }

    # LightGBM forcing row-wise (less efficient for sparse)
    params_row_wise = {
        'objective': 'binary',
        'metric': 'auc',
        'boosting_type': 'gbdt',
        'num_leaves': 31,
        'learning_rate': 0.05,
        'force_row_wise': True,  # Row-wise (less optimal for sparse)
        'verbose': -1,
    }

    train_data = lgb.Dataset(X_train, label=y_train)
    val_data = lgb.Dataset(X_val, label=y_val, reference=train_data)

    print("\nTraining with column-wise (EFB-friendly)...")
    start = time.time()
    model_efb = lgb.train(
        params_with_efb,
        train_data,
        num_boost_round=100,
        valid_sets=[val_data],
        callbacks=[lgb.log_evaluation(period=0)]
    )
    time_efb = time.time() - start

    print("Training with row-wise (less optimal)...")
    start = time.time()
    model_row = lgb.train(
        params_row_wise,
        train_data,
        num_boost_round=100,
        valid_sets=[val_data],
        callbacks=[lgb.log_evaluation(period=0)]
    )
    time_row = time.time() - start

    print("\nResults:")
    print(f"  Column-wise: {time_efb:.2f}s, AUC={model_efb.best_score['valid_0']['auc']:.4f}")
    print(f"  Row-wise:    {time_row:.2f}s, AUC={model_row.best_score['valid_0']['auc']:.4f}")
    print(f"  Speedup: {time_row / time_efb:.2f}x")


def demonstrate_categorical_native():
    """
    Demonstrate LightGBM's native categorical handling vs one-hot + EFB.
    """
    np.random.seed(42)
    n_samples = 50000

    # Create categorical features
    cat1 = np.random.randint(0, 100, n_samples)
    cat2 = np.random.randint(0, 50, n_samples)
    cat3 = np.random.randint(0, 200, n_samples)

    # Target correlated with categories
    y = ((cat1 < 30) | (cat2 > 25) | (cat3 % 7 == 0)).astype(int)

    # Method 1: One-hot encoding + EFB
    encoder = OneHotEncoder(sparse_output=False)
    X_onehot = encoder.fit_transform(np.column_stack([cat1, cat2, cat3]))

    # Method 2: Native categorical (LightGBM handles internally)
    X_native = np.column_stack([cat1, cat2, cat3])

    X_oh_train, X_oh_val, y_train, y_val = train_test_split(X_onehot, y, test_size=0.2, random_state=42)
    X_nat_train, X_nat_val, _, _ = train_test_split(X_native, y, test_size=0.2, random_state=42)

    print(f"One-hot features: {X_onehot.shape[1]}")
    print(f"Native features: {X_native.shape[1]}")

    # Shared training parameters
    params = {
        'objective': 'binary',
        'metric': 'auc',
        'num_leaves': 31,
        'learning_rate': 0.05,
        'verbose': -1
    }

    print("\nTraining with one-hot encoding + EFB...")
    start = time.time()
    train_oh = lgb.Dataset(X_oh_train, label=y_train)
    val_oh = lgb.Dataset(X_oh_val, label=y_val, reference=train_oh)
    model_oh = lgb.train(params, train_oh, num_boost_round=100, valid_sets=[val_oh])
    time_oh = time.time() - start

    print("Training with native categorical handling...")
    start = time.time()
    train_nat = lgb.Dataset(X_nat_train, label=y_train, categorical_feature=[0, 1, 2])
    val_nat = lgb.Dataset(X_nat_val, label=y_val, reference=train_nat)
    model_nat = lgb.train(params, train_nat, num_boost_round=100, valid_sets=[val_nat])
    time_nat = time.time() - start

    print("\nResults:")
    print(f"  One-hot + EFB:      {time_oh:.2f}s, AUC={model_oh.best_score['valid_0']['auc']:.4f}")
    print(f"  Native categorical: {time_nat:.2f}s, AUC={model_nat.best_score['valid_0']['auc']:.4f}")


if __name__ == "__main__":
    compare_with_and_without_sparse_handling()
    print("\n" + "=" * 60 + "\n")
    demonstrate_categorical_native()
```

LightGBM supports native categorical features (specify via the categorical_feature parameter). For high-cardinality categoricals, native handling is often faster and can be more accurate than one-hot encoding, as LightGBM can find optimal category partitions rather than considering each level as a separate split.
While EFB is powerful, it has limitations and scenarios where it provides less benefit or may even hurt performance.
When Not to Rely on EFB:
All features are dense numericals: Use standard LightGBM; EFB overhead isn't worthwhile.
Extreme feature counts (>1M): Consider dimensionality reduction or feature selection before training.
Interpretability is critical: Use native categorical handling or ensure you can trace back from bundles.
Near-exclusivity tolerance is too high: Setting a high conflict tolerance can degrade model quality.
Best Practices:
When interpreting a model trained on bundled or one-hot features, use the booster's feature_importance() method and map the results back to the original features.

Exclusive Feature Bundling is LightGBM's elegant solution to the curse of high-dimensional sparse features. By identifying mutually exclusive features and combining them through offset encoding, EFB dramatically reduces the effective feature dimensionality—often by 10× to 100× for sparse datasets.
What's Next:
We've seen how EFB reduces feature dimensionality. The next page covers Histogram-based Splitting, the technique that makes split finding efficient even before bundling—and which works synergistically with EFB for maximum performance.
You now understand Exclusive Feature Bundling—from the sparsity problem it solves, through the graph coloring formulation, to the offset encoding scheme. Combined with GOSS and leaf-wise growth, EFB forms the trifecta of LightGBM's efficiency innovations.