In gradient boosting, not all data points are created equal. Some samples carry far more information about the current model's errors than others. Gradient-based One-Side Sampling (GOSS) is LightGBM's ingenious approach to exploiting this inequality—keeping the informative samples while intelligently downsampling the rest.
GOSS represents a fundamental insight: the gradient magnitude of a sample tells us how "wrong" the current model is about that sample. Samples with large gradients are poorly predicted and carry substantial information for improvement. Samples with small gradients are already well-predicted and contribute less to further learning.
By retaining all high-gradient samples and randomly sampling from low-gradient samples (with appropriate reweighting), GOSS achieves dramatic speedups while maintaining—and sometimes even improving—model accuracy.
By the end of this page, you will understand the theoretical motivation behind gradient-based sampling, the complete GOSS algorithm with mathematical derivation, why low-gradient samples can be safely downsampled, the bias correction mechanism that preserves convergence, and practical guidelines for tuning GOSS parameters.
Traditional sampling methods in machine learning—like random sampling or stratified sampling—treat all data points equally. They select subsets based on label distribution or simple randomization, ignoring the learning dynamics of the model.
GOSS takes a fundamentally different approach by recognizing that gradient magnitude is a direct measure of sample importance for the current training iteration.
Understanding Gradient Magnitude:
In gradient boosting, for each sample $i$, we compute a gradient $g_i = \partial L(y_i, \hat{y}_i) / \partial \hat{y}_i$, where $L$ is the loss function. For common loss functions this has a simple form: with squared-error loss, $g_i = \hat{y}_i - y_i$ (the residual), and with binary logistic loss on the raw score, $g_i = \sigma(\hat{y}_i) - y_i$ (predicted probability minus label).
The magnitude $|g_i|$ tells us how far off our prediction is: a value near zero means the sample is already well fit, while a large value means the current prediction is far from the target.
The gradient naturally identifies which samples the model struggles with. Rather than treating all samples equally, GOSS focuses computational resources where they matter most—on the samples that will drive the biggest improvements. This is analogous to focusing study time on concepts you don't understand rather than reviewing what you already know.
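To make this concrete, here is a minimal sketch (the array values and the choice of squared-error and logistic loss are illustrative assumptions, not part of GOSS itself) showing how per-sample gradients separate well-predicted from poorly predicted samples:

```python
import numpy as np

# Gradients for two common losses, using the convention from above:
# g_i = dL / d(prediction).

# Squared error: L = 0.5 * (y - p)^2  ->  g = p - y (the residual)
y_true = np.array([1.0, 1.0, 0.0])
y_pred = np.array([0.95, 0.10, 0.05])   # second sample is badly predicted
g_mse = y_pred - y_true
print("Squared-error gradients:", g_mse)   # [-0.05, -0.90, 0.05]

# Binary logistic loss on raw scores: g = sigmoid(score) - y
scores = np.array([3.0, -2.0, -3.5])       # raw margins from the model
probs = 1.0 / (1.0 + np.exp(-scores))
labels = np.array([1.0, 1.0, 0.0])
g_logloss = probs - labels
print("Logistic-loss gradients:", g_logloss)

# Well-predicted samples have |g| near zero; the mispredicted second
# sample has a large |g| and would land in GOSS's high-gradient set.
```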
The GOSS algorithm involves three key steps: sorting by gradient magnitude, selective retention, and bias-corrected reweighting. Let's formalize each step.
Algorithm: Gradient-based One-Side Sampling
Given: a training set of n samples with gradients g_i, a top sampling ratio a, and a low-gradient sampling ratio b (both expressed as fractions of n):
1. Sort all samples by |g_i| in descending order
2. Select top a × n samples → set A (high-gradient set)
3. Randomly sample b × n samples from the remaining (1 − a) × n samples → set B (sampled low-gradient set)
4. Amplify gradients of samples in B by weight factor (1 - a) / b
5. Use combined set A ∪ B for tree construction
The amplification factor $(1-a)/b$ is crucial—it compensates for the underrepresentation of low-gradient samples in the sampled dataset.
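As a quick worked example with LightGBM's default settings $a = 0.2$ and $b = 0.1$:

$$\frac{1-a}{b} = \frac{1 - 0.2}{0.1} = 8$$

Each retained low-gradient sample therefore counts as 8 samples, so the 10% that were actually drawn stand in for the full 80% of low-gradient data they came from.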
| Parameter | Meaning | Typical Values | Effect |
|---|---|---|---|
| a (top_rate) | Fraction of samples to keep based on gradient magnitude | 0.1 - 0.3 | Higher = more samples kept, slower but more accurate |
| b (other_rate) | Fraction of the full dataset randomly sampled from the low-gradient remainder | 0.05 - 0.2 | Higher = more samples kept, slower but lower variance |
| 1 - a - b | Fraction of samples effectively discarded | 0.5 - 0.8 | Data reduction ratio |
| (1-a)/b | Weight amplification for sampled low-gradient samples | 4.0 - 18.0 | Bias correction factor |
```python
import numpy as np
from typing import Tuple


class GOSSampler:
    """
    Implementation of Gradient-based One-Side Sampling (GOSS).

    GOSS achieves efficient training by keeping all samples with large
    gradients and randomly sampling from samples with small gradients,
    with appropriate reweighting to maintain unbiased gradient estimates.
    """

    def __init__(self, top_rate: float = 0.2, other_rate: float = 0.1):
        """
        Initialize GOSS sampler.

        Parameters:
        -----------
        top_rate (a): Fraction of all samples to keep based on gradient
            magnitude. These are the samples with the largest |gradient|.
        other_rate (b): Fraction of all samples to randomly draw from the
            low-gradient remainder. The effective data usage is a + b of
            the full dataset (about 30% with the defaults).
        """
        if top_rate < 0 or top_rate > 1:
            raise ValueError("top_rate must be in [0, 1]")
        if other_rate < 0 or other_rate > 1:
            raise ValueError("other_rate must be in [0, 1]")
        if top_rate + other_rate > 1:
            raise ValueError("top_rate + other_rate must not exceed 1")

        self.top_rate = top_rate
        self.other_rate = other_rate

        # Weight amplification factor for low-gradient samples
        if other_rate > 0:
            self.weight_factor = (1 - top_rate) / other_rate
        else:
            self.weight_factor = 1.0

    def sample(self, gradients: np.ndarray,
               random_state: int = None) -> Tuple[np.ndarray, np.ndarray]:
        """
        Perform GOSS sampling on the given gradients.

        Parameters:
        -----------
        gradients : array of shape (n_samples,)
            First-order gradients for each sample
        random_state : int, optional
            Random seed for reproducibility

        Returns:
        --------
        selected_indices : array of selected sample indices
        sample_weights : array of weights for selected samples
        """
        if random_state is not None:
            np.random.seed(random_state)

        n_samples = len(gradients)

        # Step 1: Sort samples by absolute gradient (descending)
        abs_gradients = np.abs(gradients)
        sorted_indices = np.argsort(abs_gradients)[::-1]  # Descending order

        # Step 2: Determine split points
        # (both rates are fractions of n, matching LightGBM's top_rate/other_rate)
        n_top = int(n_samples * self.top_rate)
        n_other = int(n_samples * self.other_rate)

        # Step 3: Select top gradient samples (set A)
        top_indices = sorted_indices[:n_top]

        # Step 4: Randomly sample from the remaining samples (set B)
        remaining_indices = sorted_indices[n_top:]
        if n_other > 0 and len(remaining_indices) > 0:
            sampled_other_indices = np.random.choice(
                remaining_indices,
                size=min(n_other, len(remaining_indices)),
                replace=False
            )
        else:
            sampled_other_indices = np.array([], dtype=int)

        # Step 5: Combine and assign weights
        selected_indices = np.concatenate([top_indices, sampled_other_indices])

        # Top samples get weight 1.0, sampled others get amplified weight
        weights = np.ones(len(selected_indices))
        weights[n_top:] = self.weight_factor

        return selected_indices, weights

    def get_statistics(self, n_samples: int) -> dict:
        """Get sampling statistics for a given sample size."""
        n_top = int(n_samples * self.top_rate)
        n_other = int(n_samples * self.other_rate)
        return {
            'total_samples': n_samples,
            'top_kept': n_top,
            'other_sampled': n_other,
            'total_selected': n_top + n_other,
            'effective_sample_rate': (n_top + n_other) / n_samples,
            'data_reduction': 1 - (n_top + n_other) / n_samples,
            'weight_factor': self.weight_factor
        }


# Demonstration
def demonstrate_goss():
    """Demonstrate GOSS on synthetic gradient data."""
    np.random.seed(42)

    # Simulate gradients from a partially trained model:
    # most samples have small gradients (well-learned),
    # few samples have large gradients (hard examples).
    n_samples = 10000

    # 80% of samples are well-predicted (small gradients)
    well_predicted = np.random.normal(0, 0.1, int(n_samples * 0.8))
    # 20% of samples are poorly predicted (large gradients)
    poorly_predicted = np.random.normal(0, 1.0, int(n_samples * 0.2))
    gradients = np.concatenate([well_predicted, poorly_predicted])
    np.random.shuffle(gradients)

    print("=" * 60)
    print("GOSS Demonstration")
    print("=" * 60)
    print(f"\nOriginal data: {n_samples} samples")
    print("Gradient distribution:")
    print(f"  Mean |gradient|: {np.abs(gradients).mean():.4f}")
    print(f"  Max |gradient|:  {np.abs(gradients).max():.4f}")
    print(f"  Min |gradient|:  {np.abs(gradients).min():.6f}")

    # Apply GOSS with typical parameters
    sampler = GOSSampler(top_rate=0.2, other_rate=0.1)
    selected_idx, weights = sampler.sample(gradients, random_state=42)
    stats = sampler.get_statistics(n_samples)

    print(f"\nGOSS Parameters: top_rate={sampler.top_rate}, other_rate={sampler.other_rate}")
    print("\nSampling Results:")
    print(f"  Top gradient samples kept:    {stats['top_kept']}")
    print(f"  Low gradient samples sampled: {stats['other_sampled']}")
    print(f"  Total selected:               {stats['total_selected']}")
    print(f"  Effective sample rate:        {stats['effective_sample_rate']:.1%}")
    print(f"  Data reduction:               {stats['data_reduction']:.1%}")
    print(f"  Weight amplification factor:  {stats['weight_factor']:.2f}x")

    # Verify gradient preservation
    selected_grads = gradients[selected_idx]
    print("\nGradient Statistics After Sampling:")
    print(f"  Mean |gradient| (unweighted): {np.abs(selected_grads).mean():.4f}")

    # Weighted mean should approximate the original mean
    weighted_sum = np.sum(np.abs(selected_grads) * weights)
    weighted_count = np.sum(weights)
    weighted_mean = weighted_sum / weighted_count
    print(f"  Mean |gradient| (weighted):   {weighted_mean:.4f}")
    print(f"  Original mean |gradient|:     {np.abs(gradients).mean():.4f}")

    # Show gradient distribution of selected samples
    print("\nSelected Sample Analysis:")
    top_samples_grads = np.abs(gradients[selected_idx[:stats['top_kept']]])
    other_samples_grads = np.abs(gradients[selected_idx[stats['top_kept']:]])
    print(f"  Top samples    - Mean |gradient|: {top_samples_grads.mean():.4f}")
    print(f"  Sampled others - Mean |gradient|: {other_samples_grads.mean():.4f}")


if __name__ == "__main__":
    demonstrate_goss()
```

The mathematical foundation of GOSS rests on two key insights: (1) the connection between gradient magnitude and information gain, and (2) the theory of importance sampling.
The Information Gain Connection:
Recall from the previous page that the gain from splitting a leaf is:
$$\text{Gain} = \frac{1}{2} \left( \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{G^2}{H + \lambda} \right) - \gamma$$
Where $G = \sum_i g_i$ is the sum of gradients and $H = \sum_i h_i$ is the sum of Hessians.
Key Observation: The gradient sums $G_L$ and $G_R$ determine the split quality. Samples with larger $|g_i|$ contribute more to these sums and thus have more influence on which split is selected.
If we remove samples with small $|g_i|$, the gradient sums change only minimally, since each such sample contributes at most its tiny $|g_i|$ to $G_L$ or $G_R$; removing a high-gradient sample, by contrast, can shift the sums substantially and change which split wins.
This asymmetry justifies keeping high-gradient samples while subsampling low-gradient samples.
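The toy check below illustrates this asymmetry; the leaf size, the split, and the gradient mixture are invented for illustration, and the gain formula above is left implicit since the effect already shows up in the raw sums $G_L$ and $G_R$:

```python
import numpy as np

np.random.seed(0)

# A leaf with 1,000 samples: a small minority carries large gradients,
# the rest are already well fit (gradients near zero).
g = np.concatenate([np.random.normal(0, 1.0, 100),    # hard samples
                    np.random.normal(0, 0.02, 900)])  # easy samples
np.random.shuffle(g)

# Hypothetical split: left child = first 400 samples, right child = the rest.
left = np.zeros(len(g), dtype=bool)
left[:400] = True

G_L_full, G_R_full = g[left].sum(), g[~left].sum()

# Now discard the low-gradient majority, keeping only the top 20% by |g|.
keep = np.abs(g) >= np.quantile(np.abs(g), 0.8)
G_L_top, G_R_top = g[left & keep].sum(), g[~left & keep].sum()

print(f"G_L: full={G_L_full:+.3f}  top-only={G_L_top:+.3f}")
print(f"G_R: full={G_R_full:+.3f}  top-only={G_R_top:+.3f}")
# The sums are dominated by the high-gradient samples, so dropping the
# low-gradient majority changes G_L and G_R only slightly relative to
# their magnitude.
```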
GOSS can be viewed through the lens of importance sampling. Instead of sampling uniformly across all data, GOSS keeps the high-gradient samples deterministically and samples the low-gradient remainder uniformly at a reduced rate. The weight factor (1-a)/b acts as an importance-weight correction, ensuring the gradient estimate remains unbiased despite the non-uniform treatment.
Variance-Bias Trade-off in GOSS:
Let $G_{\text{true}} = \sum_{i=1}^{n} g_i$ be the true gradient sum over all samples.
With GOSS, we estimate: $$G_{\text{GOSS}} = \sum_{i \in A} g_i + \frac{1-a}{b} \sum_{j \in B} g_j$$
Where $A$ is the retained high-gradient set (the top $a \cdot n$ samples by $|g_i|$) and $B$ is a uniform random sample of $b \cdot n$ samples drawn from the remaining low-gradient samples.
Theorem (Unbiasedness of GOSS):
$\mathbb{E}[G_{\text{GOSS}}] = G_{\text{true}}$
Proof:
Let $C$ be the set of low-gradient samples (all samples not in $A$), so $|C| = (1-a)n$.
$\mathbb{E}[G_{\text{GOSS}}] = \sum_{i \in A} g_i + \frac{1-a}{b} \cdot \mathbb{E}\left[ \sum_{j \in B} g_j \right]$
Since $B$ is a uniform random sample of size $b \cdot n$ drawn from $C$, each sample in $C$ is included with probability $\frac{bn}{|C|} = \frac{b}{1-a}$, so:

$\mathbb{E}\left[ \sum_{j \in B} g_j \right] = \frac{bn}{|C|} \sum_{k \in C} g_k = \frac{b}{1-a} \sum_{k \in C} g_k$

Therefore:

$\mathbb{E}[G_{\text{GOSS}}] = \sum_{i \in A} g_i + \frac{1-a}{b} \cdot \frac{b}{1-a} \sum_{k \in C} g_k = \sum_{i \in A} g_i + \sum_{k \in C} g_k = G_{\text{true}}$

In other words, the weight $(1-a)/b$ is exactly the reciprocal of each low-gradient sample's selection probability $b/(1-a)$, and that is precisely what makes the estimate unbiased. $\blacksquare$
Variance Analysis:
While GOSS maintains unbiasedness, it introduces variance. The variance comes from the random sampling of set $B$.
$\text{Var}(G_{\text{GOSS}}) = \left( \frac{1-a}{b} \right)^2 \text{Var}\left( \sum_{j \in B} g_j \right)$
The variance decreases when $b$ is larger (more low-gradient samples are drawn) and when the gradients in $C$ are small and homogeneous, so that $\text{Var}\left( \sum_{j \in B} g_j \right)$ is small.
This analysis explains why GOSS works well: by construction, set $C$ contains samples with small gradients, so even high variance in estimating their sum contributes little to the overall gradient estimate.
Practical Implication: GOSS is most effective when there's a clear separation between high-gradient and low-gradient samples. If gradients are uniformly distributed, the benefit is smaller.
```python
import numpy as np


def analyze_goss_variance(gradients, top_rate, other_rate, n_trials=1000):
    """
    Analyze the variance introduced by GOSS sampling.

    Parameters:
    -----------
    gradients : array - True gradients for all samples
    top_rate, other_rate : GOSS parameters (both fractions of the full data)
    n_trials : Number of sampling trials to estimate variance
    """
    n_samples = len(gradients)
    n_top = int(n_samples * top_rate)
    n_other = int(n_samples * other_rate)
    weight_factor = (1 - top_rate) / other_rate

    # True gradient sum
    true_sum = gradients.sum()

    # Sort by absolute gradient
    sorted_idx = np.argsort(np.abs(gradients))[::-1]
    top_idx = sorted_idx[:n_top]
    remaining_idx = sorted_idx[n_top:]

    # Top contribution (deterministic)
    top_sum = gradients[top_idx].sum()

    # Monte Carlo estimation of GOSS variance
    goss_estimates = []
    for _ in range(n_trials):
        # Random sample from the remaining low-gradient pool
        sampled_idx = np.random.choice(remaining_idx, size=n_other, replace=False)
        sampled_sum = gradients[sampled_idx].sum() * weight_factor
        goss_estimates.append(top_sum + sampled_sum)

    goss_estimates = np.array(goss_estimates)

    print("GOSS Variance Analysis")
    print("=" * 50)
    print(f"True gradient sum:  {true_sum:.4f}")
    print(f"GOSS mean estimate: {goss_estimates.mean():.4f}")
    print(f"Bias:               {(goss_estimates.mean() - true_sum):.6f}")
    print(f"GOSS std deviation: {goss_estimates.std():.4f}")
    print(f"Relative std (CV):  {100 * goss_estimates.std() / abs(true_sum):.2f}%")

    return goss_estimates, true_sum


# Example with bimodal gradient distribution (typical in boosting)
np.random.seed(42)
n = 10000

# Simulate late-stage boosting: most samples have small gradients
gradients = np.concatenate([
    np.random.normal(0, 0.05, int(n * 0.85)),  # Well-predicted: ~85%
    np.random.normal(0, 0.5, int(n * 0.15))    # Challenging: ~15%
])

print("\nScenario 1: Typical gradient distribution (bimodal)")
estimates, true_val = analyze_goss_variance(gradients, 0.2, 0.1)

# Compare with uniform gradient distribution
print("\nScenario 2: Uniform gradient distribution (worst case for GOSS)")
uniform_gradients = np.random.uniform(-1, 1, n)
estimates2, true_val2 = analyze_goss_variance(uniform_gradients, 0.2, 0.1)

# The variance in Scenario 1 should be much lower than in Scenario 2
print("\nConclusion:")
print("GOSS works best when gradients are heterogeneous -")
print("high variance in gradient magnitudes means the 'top' set is meaningful.")
```

Why not just use random sampling instead of the more complex GOSS procedure? The answer lies in the distribution of gradient information and the efficiency of different sampling strategies.
Random Sampling:
With uniform random sampling at rate $r$, we select $r \cdot n$ samples randomly and weight them by $1/r$. This is unbiased and simple, but it treats all samples equally regardless of their informativeness.
The Problem with Random Sampling in Boosting:
As boosting progresses, more and more samples become well predicted and their gradients shrink toward zero, so the useful learning signal concentrates in an ever smaller fraction of the data. Uniform random sampling keeps spending most of its budget on this uninformative majority.
Consider a dataset where 95% of samples have gradients near zero and 5% have large gradients. Random sampling at a 30% rate keeps each high-gradient sample with probability 0.3, so in expectation it retains only 30% of the informative samples and fills the rest of its budget with near-zero gradients.
GOSS with top_rate = 0.2 and other_rate = 0.1 keeps every one of the 5% high-gradient samples (they all fall within the top 20% by gradient magnitude) and adds a reweighted 10% sample of the remainder, using a similar overall budget; see the sketch below.
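A back-of-the-envelope version of that comparison (the dataset size is an illustrative assumption; the 95%/5% split is the figure from above):

```python
n = 100_000
n_high = int(0.05 * n)          # 5% high-gradient ("informative") samples

# Random 30% sampling: each high-gradient sample survives with prob 0.3,
# so on average only 30% of the informative samples are kept.
expected_random = 0.30 * n_high

# GOSS with top_rate=0.2, other_rate=0.1: the 5% high-gradient samples all
# fall inside the top 20% by |gradient|, so every one of them is kept,
# and a reweighted 10% of the data is drawn from the low-gradient rest.
kept_goss = n_high

print(f"High-gradient samples:          {n_high}")
print(f"Kept by random 30% (expected):  {expected_random:.0f}")
print(f"Kept by GOSS (guaranteed):      {kept_goss}")
```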
| Criterion | Random Sampling | GOSS |
|---|---|---|
| High-gradient retention | Probabilistic (may miss) | Guaranteed (top a kept) |
| Compute per iteration | Similar | Similar |
| Sorting overhead | None | O(n log n) per iteration |
| Variance in gradient estimate | Higher | Lower (deterministic for top) |
| Best when gradients are | Uniformly distributed | Heterogeneously distributed |
| Typical speedup | Proportional to sampling rate | Proportional to sampling rate, with less accuracy loss at the same rate |
```python
import time

import numpy as np
import lightgbm as lgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Generate a medium-sized dataset
X, y = make_classification(
    n_samples=50000, n_features=50, n_informative=20,
    n_redundant=10, n_clusters_per_class=3, random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Common parameters
base_params = {
    'objective': 'binary',
    'metric': 'binary_logloss',
    'boosting_type': 'gbdt',
    'num_leaves': 31,
    'learning_rate': 0.05,
    'verbose': -1,
    'num_threads': 4
}

train_set = lgb.Dataset(X_train, label=y_train)
val_set = lgb.Dataset(X_val, label=y_val, reference=train_set)


def train_and_evaluate(params, name):
    """Train a LightGBM model and report performance."""
    start_time = time.time()
    model = lgb.train(
        params,
        train_set,
        num_boost_round=300,
        valid_sets=[val_set],
        valid_names=['val'],
        callbacks=[lgb.early_stopping(50, verbose=False)]
    )
    train_time = time.time() - start_time
    best_iter = model.best_iteration
    best_score = model.best_score['val']['binary_logloss']
    return {
        'name': name,
        'train_time': train_time,
        'best_iteration': best_iter,
        'val_loss': best_score
    }


# Configuration 1: Full data (baseline)
results_full = train_and_evaluate(base_params, "Full Data")

# Configuration 2: Random subsampling
params_random = base_params.copy()
params_random.update({
    'bagging_fraction': 0.3,  # Use 30% of data each iteration
    'bagging_freq': 1         # Resample every iteration
})
results_random = train_and_evaluate(params_random, "Random 30%")

# Configuration 3: GOSS (using LightGBM's built-in implementation)
# Note: boosting_type='goss' enables GOSS automatically
params_goss = base_params.copy()
params_goss.update({
    'boosting_type': 'goss',
    'top_rate': 0.2,    # Keep top 20% by gradient
    'other_rate': 0.1   # Sample 10% of the full data (effective: ~30%)
})
results_goss = train_and_evaluate(params_goss, "GOSS (20%, 10%)")

# Print results
print("\n" + "=" * 60)
print("COMPARISON: GOSS vs Random Subsampling")
print("=" * 60)
print(f"{'Configuration':<20} {'Time(s)':<12} {'Best Iter':<12} {'Val Loss':<12}")
print("-" * 60)

for result in [results_full, results_random, results_goss]:
    print(f"{result['name']:<20} {result['train_time']:<12.2f} "
          f"{result['best_iteration']:<12} {result['val_loss']:<12.6f}")

print("\nObservations:")
print("- GOSS maintains accuracy close to full data while reducing training time")
print("- Random subsampling may need more iterations to converge")
print("- GOSS's intelligent sampling preserves the learning signal better")
```

GOSS shines on large datasets (>100K samples) where full training is slow. It's particularly effective when gradients are heterogeneous—common in later boosting iterations or with imbalanced data. For small datasets, the sorting overhead may not be worth the sampling benefit.
LightGBM provides two ways to enable GOSS: through the boosting type parameter or manual configuration.
Method 1: boosting_type='goss'
The simplest way to enable GOSS is setting boosting_type='goss'. This activates GOSS with default parameters (top_rate=0.2, other_rate=0.1) and disables bagging, since GOSS provides its own sampling mechanism. (In LightGBM 4.x the same behavior is also exposed through data_sample_strategy='goss'; check the documentation of your installed version for the preferred spelling.)
```python
import lightgbm as lgb

# Method 1: Simple GOSS activation
params_simple = {
    'objective': 'binary',
    'metric': 'auc',
    'boosting_type': 'goss',   # Enables GOSS with default top_rate=0.2, other_rate=0.1
    'num_leaves': 31,
    'learning_rate': 0.05,
}

# Method 2: GOSS with custom parameters
params_custom = {
    'objective': 'binary',
    'metric': 'auc',
    'boosting_type': 'goss',
    'top_rate': 0.25,          # Keep top 25% by gradient magnitude
    'other_rate': 0.15,        # Sample 15% of the full data
    # Note: top_rate + other_rate should not exceed 1.0
    'num_leaves': 63,
    'learning_rate': 0.03,
    'min_data_in_leaf': 20,    # Regularization (still important with GOSS)
    'lambda_l1': 0.1,
    'lambda_l2': 0.1,
}

# Important notes about GOSS:
# 1. GOSS does NOT combine with bagging_fraction/bagging_freq
#    (those parameters are ignored when boosting_type='goss')
# 2. GOSS can combine with feature subsampling (feature_fraction)
# 3. Early stopping is particularly important with GOSS to prevent overfitting


# Training with GOSS
def train_with_goss(X_train, y_train, X_val, y_val):
    train_data = lgb.Dataset(X_train, label=y_train)
    val_data = lgb.Dataset(X_val, label=y_val, reference=train_data)

    model = lgb.train(
        params_custom,
        train_data,
        num_boost_round=500,
        valid_sets=[train_data, val_data],
        valid_names=['train', 'val'],
        callbacks=[
            lgb.early_stopping(stopping_rounds=50),
            lgb.log_evaluation(period=50)
        ]
    )
    return model


# Method 3: Using the LGBMClassifier API
from lightgbm import LGBMClassifier

clf = LGBMClassifier(
    boosting_type='goss',
    top_rate=0.2,
    other_rate=0.1,
    n_estimators=300,
    num_leaves=31,
    learning_rate=0.05,
    random_state=42
)

# clf.fit(X_train, y_train,
#         eval_set=[(X_val, y_val)],
#         callbacks=[lgb.early_stopping(50)])
```

Tuning GOSS parameters requires balancing training speed against model accuracy. Here are practical guidelines based on dataset characteristics.
| Dataset Size | top_rate | other_rate | Effective Rate | Notes |
|---|---|---|---|---|
| < 50K samples | — | — | 100% (use regular GBDT) | GOSS overhead may not pay off |
| 50K - 200K | 0.2 | 0.15 | ~35% | Balanced speed/accuracy |
| 200K - 1M | 0.15 | 0.1 | ~25% | Aggressive sampling appropriate |
| 1M - 10M | 0.1 | 0.1 | ~20% | Focus on speed, gradients well-distributed |
| > 10M | 0.1 | 0.05 | ~15% | Maximum speed, requires careful monitoring |
```python
import time

import lightgbm as lgb
from sklearn.model_selection import train_test_split


def get_goss_params(n_samples: int, base_params: dict = None) -> dict:
    """
    Get recommended GOSS parameters based on dataset size.

    Parameters:
    -----------
    n_samples : Number of training samples
    base_params : Base LightGBM parameters to extend

    Returns:
    --------
    dict : LightGBM parameters with appropriate GOSS settings
    """
    params = base_params.copy() if base_params else {}

    if n_samples < 50000:
        # Don't use GOSS for small datasets
        params['boosting_type'] = 'gbdt'
        print(f"Dataset size {n_samples}: Using regular GBDT (GOSS overhead not worthwhile)")
        return params

    params['boosting_type'] = 'goss'

    if n_samples < 200000:
        params['top_rate'] = 0.2
        params['other_rate'] = 0.15
    elif n_samples < 1000000:
        params['top_rate'] = 0.15
        params['other_rate'] = 0.1
    elif n_samples < 10000000:
        params['top_rate'] = 0.1
        params['other_rate'] = 0.1
    else:
        params['top_rate'] = 0.1
        params['other_rate'] = 0.05

    # Both rates are fractions of the full dataset, so the effective
    # sample rate is simply their sum.
    effective_rate = params['top_rate'] + params['other_rate']
    print(f"Dataset size {n_samples:,}: GOSS with top_rate={params['top_rate']}, "
          f"other_rate={params['other_rate']} (effective: {effective_rate:.1%})")

    return params


def tune_goss_parameters(X, y, param_grid: list = None):
    """
    Tune GOSS parameters using a holdout validation set.

    Parameters:
    -----------
    X, y : Training data
    param_grid : List of (top_rate, other_rate) tuples to try
    """
    if param_grid is None:
        param_grid = [
            (0.3, 0.2),    # Conservative
            (0.2, 0.15),   # Moderate
            (0.2, 0.1),    # Default
            (0.15, 0.1),   # Aggressive
            (0.1, 0.1),    # More aggressive
        ]

    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
    train_data = lgb.Dataset(X_train, label=y_train)
    val_data = lgb.Dataset(X_val, label=y_val, reference=train_data)

    results = []
    for top_rate, other_rate in param_grid:
        params = {
            'objective': 'binary',
            'metric': 'auc',
            'boosting_type': 'goss',
            'top_rate': top_rate,
            'other_rate': other_rate,
            'num_leaves': 31,
            'learning_rate': 0.05,
            'verbose': -1
        }

        start = time.time()
        model = lgb.train(
            params,
            train_data,
            num_boost_round=300,
            valid_sets=[val_data],
            callbacks=[lgb.early_stopping(30, verbose=False)]
        )
        elapsed = time.time() - start

        results.append({
            'top_rate': top_rate,
            'other_rate': other_rate,
            'effective_rate': top_rate + other_rate,
            'val_auc': model.best_score['valid_0']['auc'],
            'best_iter': model.best_iteration,
            'time': elapsed
        })

    print("\nGOSS Parameter Tuning Results:")
    print("=" * 80)
    print(f"{'top_rate':<12} {'other_rate':<12} {'Eff. Rate':<12} "
          f"{'Val AUC':<12} {'Best Iter':<12} {'Time(s)':<12}")
    print("-" * 80)
    for r in results:
        print(f"{r['top_rate']:<12.2f} {r['other_rate']:<12.2f} "
              f"{r['effective_rate']:<12.1%} {r['val_auc']:<12.6f} "
              f"{r['best_iter']:<12} {r['time']:<12.2f}")

    # Find the best configuration by validation AUC
    best = max(results, key=lambda x: x['val_auc'])
    print(f"\nBest configuration: top_rate={best['top_rate']}, other_rate={best['other_rate']}")

    return results
```

On highly imbalanced datasets, minority class samples often have larger gradients (they're harder to predict). GOSS naturally retains more minority samples through its gradient-based selection—providing an implicit form of rebalancing. However, if imbalance is extreme, consider combining GOSS with is_unbalance=True or explicit class weights.
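A minimal configuration sketch for that situation (the parameter values are illustrative rather than tuned recommendations; is_unbalance and scale_pos_weight are standard LightGBM options):

```python
import lightgbm as lgb

# Sketch: combining GOSS with LightGBM's class-imbalance handling.
# Use either is_unbalance or scale_pos_weight, not both.
params_imbalanced = {
    'objective': 'binary',
    'metric': 'auc',
    'boosting_type': 'goss',
    'top_rate': 0.2,
    'other_rate': 0.1,
    'is_unbalance': True,   # or: 'scale_pos_weight': n_negative / n_positive
    'num_leaves': 31,
    'learning_rate': 0.05,
}

# booster = lgb.train(params_imbalanced,
#                     lgb.Dataset(X_train, label=y_train),
#                     num_boost_round=300)
```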
GOSS represents a clever application of the insight that gradient magnitude indicates sample importance. By keeping all high-gradient samples and intelligently sampling from low-gradient samples, GOSS achieves significant speedups while maintaining model quality.
What's Next:
We've seen how GOSS reduces sample complexity. The next page explores Exclusive Feature Bundling (EFB), LightGBM's technique for reducing feature dimensionality by bundling mutually exclusive features—another key innovation that contributes to LightGBM's speed advantage.
You now understand GOSS—its intuition, algorithm, mathematical justification, and practical usage. Combined with leaf-wise growth, GOSS enables LightGBM to handle large datasets efficiently. Next, we'll explore EFB, which addresses feature dimensionality.