In gradient boosting, not all data points are created equal. Some samples carry far more information about the current model's errors than others. Gradient-based One-Side Sampling (GOSS) is LightGBM's ingenious approach to exploiting this inequality—keeping the informative samples while intelligently downsampling the rest.
GOSS represents a fundamental insight: the gradient magnitude of a sample tells us how "wrong" the current model is about that sample. Samples with large gradients are poorly predicted and carry substantial information for improvement. Samples with small gradients are already well-predicted and contribute less to further learning.
By retaining all high-gradient samples and randomly sampling from low-gradient samples (with appropriate reweighting), GOSS achieves dramatic speedups while maintaining—and sometimes even improving—model accuracy.
By the end of this page, you will understand the theoretical motivation behind gradient-based sampling, the complete GOSS algorithm with mathematical derivation, why low-gradient samples can be safely downsampled, the bias correction mechanism that preserves convergence, and practical guidelines for tuning GOSS parameters.
Traditional sampling methods in machine learning—like random sampling or stratified sampling—treat all data points equally. They select subsets based on label distribution or simple randomization, ignoring the learning dynamics of the model.
GOSS takes a fundamentally different approach by recognizing that gradient magnitude is a direct measure of sample importance for the current training iteration.
Understanding Gradient Magnitude:
In gradient boosting, for each sample $i$, we compute a gradient $g_i = \partial L(y_i, \hat{y}_i) / \partial \hat{y}_i$, where $L$ is the loss function. For common loss functions this has a simple form: with squared-error loss, $g_i = \hat{y}_i - y_i$ (the residual), and with binary logistic loss on the raw score, $g_i = \sigma(\hat{y}_i) - y_i$ (predicted probability minus label).
The magnitude $|g_i|$ tells us how far off our prediction is: a value near zero means the sample is already well fit, while a large value means the current prediction is far from the target.
The gradient naturally identifies which samples the model struggles with. Rather than treating all samples equally, GOSS focuses computational resources where they matter most—on the samples that will drive the biggest improvements. This is analogous to focusing study time on concepts you don't understand rather than reviewing what you already know.
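To make this concrete, here is a minimal sketch (the array values and the choice of squared-error and logistic loss are illustrative assumptions, not part of GOSS itself) showing how per-sample gradients separate well-predicted from poorly predicted samples:

```python
import numpy as np

# Gradients for two common losses, using the convention from above:
# g_i = dL / d(prediction).

# Squared error: L = 0.5 * (y - p)^2  ->  g = p - y (the residual)
y_true = np.array([1.0, 1.0, 0.0])
y_pred = np.array([0.95, 0.10, 0.05])   # second sample is badly predicted
g_mse = y_pred - y_true
print("Squared-error gradients:", g_mse)   # [-0.05, -0.90, 0.05]

# Binary logistic loss on raw scores: g = sigmoid(score) - y
scores = np.array([3.0, -2.0, -3.5])       # raw margins from the model
probs = 1.0 / (1.0 + np.exp(-scores))
labels = np.array([1.0, 1.0, 0.0])
g_logloss = probs - labels
print("Logistic-loss gradients:", g_logloss)

# Well-predicted samples have |g| near zero; the mispredicted second
# sample has a large |g| and would land in GOSS's high-gradient set.
```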
The GOSS algorithm involves three key steps: sorting by gradient magnitude, selective retention, and bias-corrected reweighting. Let's formalize each step.
Algorithm: Gradient-based One-Side Sampling
Given: a training set of n samples with gradients g_i, a top sampling ratio a, and a low-gradient sampling ratio b (both expressed as fractions of n):
1. Sort all samples by |g_i| in descending order
2. Select top a × n samples → set A (high-gradient set)
3. Randomly sample b × n samples from the remaining (1 − a) × n samples → set B (sampled low-gradient set)
4. Amplify gradients of samples in B by weight factor (1 - a) / b
5. Use combined set A ∪ B for tree construction
The amplification factor $(1-a)/b$ is crucial—it compensates for the underrepresentation of low-gradient samples in the sampled dataset.
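As a quick worked example with LightGBM's default settings $a = 0.2$ and $b = 0.1$:

$$\frac{1-a}{b} = \frac{1 - 0.2}{0.1} = 8$$

Each retained low-gradient sample therefore counts as 8 samples, so the 10% that were actually drawn stand in for the full 80% of low-gradient data they came from.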
| Parameter | Meaning | Typical Values | Effect |
|---|---|---|---|
| a (top_rate) | Fraction of samples to keep based on gradient magnitude | 0.1 - 0.3 | Higher = more samples kept, slower but more accurate |
| b (other_rate) | Fraction of the full dataset randomly sampled from the low-gradient remainder | 0.05 - 0.2 | Higher = more samples kept, slower but lower variance |
| 1 - a - b | Fraction of samples effectively discarded | 0.5 - 0.8 | Data reduction ratio |
| (1-a)/b | Weight amplification for sampled low-gradient samples | 4.0 - 18.0 | Bias correction factor |
```python
import numpy as np
from typing import Tuple


class GOSSampler:
    """
    Implementation of Gradient-based One-Side Sampling (GOSS).

    GOSS achieves efficient training by keeping all samples with large
    gradients and randomly sampling from samples with small gradients,
    with appropriate reweighting to maintain unbiased gradient estimates.
    """

    def __init__(self, top_rate: float = 0.2, other_rate: float = 0.1):
        """
        Initialize GOSS sampler.

        Parameters:
        -----------
        top_rate (a): Fraction of all samples to keep based on gradient
            magnitude. These are the samples with the largest |gradient|.
        other_rate (b): Fraction of all samples to randomly draw from the
            low-gradient remainder. The effective data usage is a + b of
            the full dataset (about 30% with the defaults).
        """
        if top_rate < 0 or top_rate > 1:
            raise ValueError("top_rate must be in [0, 1]")
        if other_rate < 0 or other_rate > 1:
            raise ValueError("other_rate must be in [0, 1]")
        if top_rate + other_rate > 1:
            raise ValueError("top_rate + other_rate must not exceed 1")

        self.top_rate = top_rate
        self.other_rate = other_rate

        # Weight amplification factor for low-gradient samples
        if other_rate > 0:
            self.weight_factor = (1 - top_rate) / other_rate
        else:
            self.weight_factor = 1.0

    def sample(self, gradients: np.ndarray,
               random_state: int = None) -> Tuple[np.ndarray, np.ndarray]:
        """
        Perform GOSS sampling on the given gradients.

        Parameters:
        -----------
        gradients : array of shape (n_samples,)
            First-order gradients for each sample
        random_state : int, optional
            Random seed for reproducibility

        Returns:
        --------
        selected_indices : array of selected sample indices
        sample_weights : array of weights for selected samples
        """
        if random_state is not None:
            np.random.seed(random_state)

        n_samples = len(gradients)

        # Step 1: Sort samples by absolute gradient (descending)
        abs_gradients = np.abs(gradients)
        sorted_indices = np.argsort(abs_gradients)[::-1]  # Descending order

        # Step 2: Determine split points
        # (both rates are fractions of n, matching LightGBM's top_rate/other_rate)
        n_top = int(n_samples * self.top_rate)
        n_other = int(n_samples * self.other_rate)

        # Step 3: Select top gradient samples (set A)
        top_indices = sorted_indices[:n_top]

        # Step 4: Randomly sample from the remaining samples (set B)
        remaining_indices = sorted_indices[n_top:]
        if n_other > 0 and len(remaining_indices) > 0:
            sampled_other_indices = np.random.choice(
                remaining_indices,
                size=min(n_other, len(remaining_indices)),
                replace=False
            )
        else:
            sampled_other_indices = np.array([], dtype=int)

        # Step 5: Combine and assign weights
        selected_indices = np.concatenate([top_indices, sampled_other_indices])

        # Top samples get weight 1.0, sampled others get amplified weight
        weights = np.ones(len(selected_indices))
        weights[n_top:] = self.weight_factor

        return selected_indices, weights

    def get_statistics(self, n_samples: int) -> dict:
        """Get sampling statistics for a given sample size."""
        n_top = int(n_samples * self.top_rate)
        n_other = int(n_samples * self.other_rate)
        return {
            'total_samples': n_samples,
            'top_kept': n_top,
            'other_sampled': n_other,
            'total_selected': n_top + n_other,
            'effective_sample_rate': (n_top + n_other) / n_samples,
            'data_reduction': 1 - (n_top + n_other) / n_samples,
            'weight_factor': self.weight_factor
        }


# Demonstration
def demonstrate_goss():
    """Demonstrate GOSS on synthetic gradient data."""
    np.random.seed(42)

    # Simulate gradients from a partially trained model:
    # most samples have small gradients (well-learned),
    # few samples have large gradients (hard examples).
    n_samples = 10000

    # 80% of samples are well-predicted (small gradients)
    well_predicted = np.random.normal(0, 0.1, int(n_samples * 0.8))
    # 20% of samples are poorly predicted (large gradients)
    poorly_predicted = np.random.normal(0, 1.0, int(n_samples * 0.2))
    gradients = np.concatenate([well_predicted, poorly_predicted])
    np.random.shuffle(gradients)

    print("=" * 60)
    print("GOSS Demonstration")
    print("=" * 60)
    print(f"\nOriginal data: {n_samples} samples")
    print("Gradient distribution:")
    print(f"  Mean |gradient|: {np.abs(gradients).mean():.4f}")
    print(f"  Max |gradient|:  {np.abs(gradients).max():.4f}")
    print(f"  Min |gradient|:  {np.abs(gradients).min():.6f}")

    # Apply GOSS with typical parameters
    sampler = GOSSampler(top_rate=0.2, other_rate=0.1)
    selected_idx, weights = sampler.sample(gradients, random_state=42)
    stats = sampler.get_statistics(n_samples)

    print(f"\nGOSS Parameters: top_rate={sampler.top_rate}, other_rate={sampler.other_rate}")
    print("\nSampling Results:")
    print(f"  Top gradient samples kept:    {stats['top_kept']}")
    print(f"  Low gradient samples sampled: {stats['other_sampled']}")
    print(f"  Total selected:               {stats['total_selected']}")
    print(f"  Effective sample rate:        {stats['effective_sample_rate']:.1%}")
    print(f"  Data reduction:               {stats['data_reduction']:.1%}")
    print(f"  Weight amplification factor:  {stats['weight_factor']:.2f}x")

    # Verify gradient preservation
    selected_grads = gradients[selected_idx]
    print("\nGradient Statistics After Sampling:")
    print(f"  Mean |gradient| (unweighted): {np.abs(selected_grads).mean():.4f}")

    # Weighted mean should approximate the original mean
    weighted_sum = np.sum(np.abs(selected_grads) * weights)
    weighted_count = np.sum(weights)
    weighted_mean = weighted_sum / weighted_count
    print(f"  Mean |gradient| (weighted):   {weighted_mean:.4f}")
    print(f"  Original mean |gradient|:     {np.abs(gradients).mean():.4f}")

    # Show gradient distribution of selected samples
    print("\nSelected Sample Analysis:")
    top_samples_grads = np.abs(gradients[selected_idx[:stats['top_kept']]])
    other_samples_grads = np.abs(gradients[selected_idx[stats['top_kept']:]])
    print(f"  Top samples    - Mean |gradient|: {top_samples_grads.mean():.4f}")
    print(f"  Sampled others - Mean |gradient|: {other_samples_grads.mean():.4f}")


if __name__ == "__main__":
    demonstrate_goss()
```

The mathematical foundation of GOSS rests on two key insights: (1) the connection between gradient magnitude and information gain, and (2) the theory of importance sampling.
The Information Gain Connection:
Recall from the previous page that the gain from splitting a leaf is:
$$\text{Gain} = \frac{1}{2} \left( \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{G^2}{H + \lambda} \right) - \gamma$$
Where $G = \sum_i g_i$ is the sum of gradients and $H = \sum_i h_i$ is the sum of Hessians.
Key Observation: The gradient sums $G_L$ and $G_R$ determine the split quality. Samples with larger $|g_i|$ contribute more to these sums and thus have more influence on which split is selected.
If we remove samples with small $|g_i|$, the gradient sums change only minimally, since each such sample contributes at most its tiny $|g_i|$ to $G_L$ or $G_R$; removing a high-gradient sample, by contrast, can shift the sums substantially and change which split wins.
This asymmetry justifies keeping high-gradient samples while subsampling low-gradient samples.
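The toy check below illustrates this asymmetry; the leaf size, the split, and the gradient mixture are invented for illustration, and the gain formula above is left implicit since the effect already shows up in the raw sums $G_L$ and $G_R$:

```python
import numpy as np

np.random.seed(0)

# A leaf with 1,000 samples: a small minority carries large gradients,
# the rest are already well fit (gradients near zero).
g = np.concatenate([np.random.normal(0, 1.0, 100),    # hard samples
                    np.random.normal(0, 0.02, 900)])  # easy samples
np.random.shuffle(g)

# Hypothetical split: left child = first 400 samples, right child = the rest.
left = np.zeros(len(g), dtype=bool)
left[:400] = True

G_L_full, G_R_full = g[left].sum(), g[~left].sum()

# Now discard the low-gradient majority, keeping only the top 20% by |g|.
keep = np.abs(g) >= np.quantile(np.abs(g), 0.8)
G_L_top, G_R_top = g[left & keep].sum(), g[~left & keep].sum()

print(f"G_L: full={G_L_full:+.3f}  top-only={G_L_top:+.3f}")
print(f"G_R: full={G_R_full:+.3f}  top-only={G_R_top:+.3f}")
# The sums are dominated by the high-gradient samples, so dropping the
# low-gradient majority changes G_L and G_R only slightly relative to
# their magnitude.
```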
GOSS can be viewed through the lens of importance sampling. Instead of sampling uniformly across all data, GOSS keeps the high-gradient samples deterministically and samples the low-gradient remainder uniformly at a reduced rate. The weight factor (1-a)/b acts as an importance-weight correction, ensuring the gradient estimate remains unbiased despite the non-uniform treatment.
Variance-Bias Trade-off in GOSS:
Let $G_{\text{true}} = \sum_{i=1}^{n} g_i$ be the true gradient sum over all samples.
With GOSS, we estimate: $$G_{\text{GOSS}} = \sum_{i \in A} g_i + \frac{1-a}{b} \sum_{j \in B} g_j$$
Where $A$ is the retained high-gradient set (the top $a \cdot n$ samples by $|g_i|$) and $B$ is a uniform random sample of $b \cdot n$ samples drawn from the remaining low-gradient samples.
Theorem (Unbiasedness of GOSS):
$\mathbb{E}[G_{\text{GOSS}}] = G_{\text{true}}$
Proof:
Let $C$ be the set of low-gradient samples (all samples not in $A$), so $|C| = (1-a)n$.
$\mathbb{E}[G_{\text{GOSS}}] = \sum_{i \in A} g_i + \frac{1-a}{b} \cdot \mathbb{E}\left[ \sum_{j \in B} g_j \right]$
Since $B$ is a uniform random sample of size $b \cdot n$ drawn from $C$, each sample in $C$ is included with probability $\frac{bn}{|C|} = \frac{b}{1-a}$, so:

$\mathbb{E}\left[ \sum_{j \in B} g_j \right] = \frac{bn}{|C|} \sum_{k \in C} g_k = \frac{b}{1-a} \sum_{k \in C} g_k$

Therefore:

$\mathbb{E}[G_{\text{GOSS}}] = \sum_{i \in A} g_i + \frac{1-a}{b} \cdot \frac{b}{1-a} \sum_{k \in C} g_k = \sum_{i \in A} g_i + \sum_{k \in C} g_k = G_{\text{true}}$

In other words, the weight $(1-a)/b$ is exactly the reciprocal of each low-gradient sample's selection probability $b/(1-a)$, and that is precisely what makes the estimate unbiased. $\blacksquare$
Variance Analysis:
While GOSS maintains unbiasedness, it introduces variance. The variance comes from the random sampling of set $B$.
$\text{Var}(G_{\text{GOSS}}) = \left( \frac{1-a}{b} \right)^2 \text{Var}\left( \sum_{j \in B} g_j \right)$
The variance decreases when $b$ is larger (more low-gradient samples are drawn) and when the gradients in $C$ are small and homogeneous, so that $\text{Var}\left( \sum_{j \in B} g_j \right)$ is small.
This analysis explains why GOSS works well: by construction, set $C$ contains samples with small gradients, so even high variance in estimating their sum contributes little to the overall gradient estimate.
Practical Implication: GOSS is most effective when there's a clear separation between high-gradient and low-gradient samples. If gradients are uniformly distributed, the benefit is smaller.
```python
import numpy as np


def analyze_goss_variance(gradients, top_rate, other_rate, n_trials=1000):
    """
    Analyze the variance introduced by GOSS sampling.

    Parameters:
    -----------
    gradients : array - True gradients for all samples
    top_rate, other_rate : GOSS parameters (both fractions of the full data)
    n_trials : Number of sampling trials to estimate variance
    """
    n_samples = len(gradients)
    n_top = int(n_samples * top_rate)
    n_other = int(n_samples * other_rate)
    weight_factor = (1 - top_rate) / other_rate

    # True gradient sum
    true_sum = gradients.sum()

    # Sort by absolute gradient
    sorted_idx = np.argsort(np.abs(gradients))[::-1]
    top_idx = sorted_idx[:n_top]
    remaining_idx = sorted_idx[n_top:]

    # Top contribution (deterministic)
    top_sum = gradients[top_idx].sum()

    # Monte Carlo estimation of GOSS variance
    goss_estimates = []
    for _ in range(n_trials):
        # Random sample from the remaining low-gradient pool
        sampled_idx = np.random.choice(remaining_idx, size=n_other, replace=False)
        sampled_sum = gradients[sampled_idx].sum() * weight_factor
        goss_estimates.append(top_sum + sampled_sum)

    goss_estimates = np.array(goss_estimates)

    print("GOSS Variance Analysis")
    print("=" * 50)
    print(f"True gradient sum:  {true_sum:.4f}")
    print(f"GOSS mean estimate: {goss_estimates.mean():.4f}")
    print(f"Bias:               {(goss_estimates.mean() - true_sum):.6f}")
    print(f"GOSS std deviation: {goss_estimates.std():.4f}")
    print(f"Relative std (CV):  {100 * goss_estimates.std() / abs(true_sum):.2f}%")

    return goss_estimates, true_sum


# Example with bimodal gradient distribution (typical in boosting)
np.random.seed(42)
n = 10000

# Simulate late-stage boosting: most samples have small gradients
gradients = np.concatenate([
    np.random.normal(0, 0.05, int(n * 0.85)),  # Well-predicted: ~85%
    np.random.normal(0, 0.5, int(n * 0.15))    # Challenging: ~15%
])

print("\nScenario 1: Typical gradient distribution (bimodal)")
estimates, true_val = analyze_goss_variance(gradients, 0.2, 0.1)

# Compare with uniform gradient distribution
print("\nScenario 2: Uniform gradient distribution (worst case for GOSS)")
uniform_gradients = np.random.uniform(-1, 1, n)
estimates2, true_val2 = analyze_goss_variance(uniform_gradients, 0.2, 0.1)

# The variance in Scenario 1 should be much lower than in Scenario 2
print("\nConclusion:")
print("GOSS works best when gradients are heterogeneous -")
print("high variance in gradient magnitudes means the 'top' set is meaningful.")
```

Why not just use random sampling instead of the more complex GOSS procedure? The answer lies in the distribution of gradient information and the efficiency of different sampling strategies.
Random Sampling:
With uniform random sampling at rate $r$, we select $r \cdot n$ samples randomly and weight them by $1/r$. This is unbiased and simple, but it treats all samples equally regardless of their informativeness.
The Problem with Random Sampling in Boosting:
As boosting progresses, more and more samples become well predicted and their gradients shrink toward zero, so the useful learning signal concentrates in an ever smaller fraction of the data. Uniform random sampling keeps spending most of its budget on this uninformative majority.
Consider a dataset where 95% of samples have gradients near zero and 5% have large gradients. Random sampling at a 30% rate keeps each high-gradient sample with probability 0.3, so in expectation it retains only 30% of the informative samples and fills the rest of its budget with near-zero gradients.
GOSS with top_rate = 0.2 and other_rate = 0.1 keeps every one of the 5% high-gradient samples (they all fall within the top 20% by gradient magnitude) and adds a reweighted 10% sample of the remainder, using a similar overall budget; see the sketch below.
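A back-of-the-envelope version of that comparison (the dataset size is an illustrative assumption; the 95%/5% split is the figure from above):

```python
n = 100_000
n_high = int(0.05 * n)          # 5% high-gradient ("informative") samples

# Random 30% sampling: each high-gradient sample survives with prob 0.3,
# so on average only 30% of the informative samples are kept.
expected_random = 0.30 * n_high

# GOSS with top_rate=0.2, other_rate=0.1: the 5% high-gradient samples all
# fall inside the top 20% by |gradient|, so every one of them is kept,
# and a reweighted 10% of the data is drawn from the low-gradient rest.
kept_goss = n_high

print(f"High-gradient samples:          {n_high}")
print(f"Kept by random 30% (expected):  {expected_random:.0f}")
print(f"Kept by GOSS (guaranteed):      {kept_goss}")
```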
| Criterion | Random Sampling | GOSS |
|---|---|---|
| High-gradient retention | Probabilistic (may miss) | Guaranteed (top a kept) |
| Compute per iteration | Similar | Similar |
| Sorting overhead | None | O(n log n) per iteration |
| Variance in gradient estimate | Higher | Lower (deterministic for top) |
| Best when gradients are | Uniformly distributed | Heterogeneously distributed |
| Typical speedup | Proportional to sampling rate | Proportional to sampling rate, with less accuracy loss at the same rate |
```python
import time

import numpy as np
import lightgbm as lgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Generate a medium-sized dataset
X, y = make_classification(
    n_samples=50000, n_features=50, n_informative=20,
    n_redundant=10, n_clusters_per_class=3, random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Common parameters
base_params = {
    'objective': 'binary',
    'metric': 'binary_logloss',
    'boosting_type': 'gbdt',
    'num_leaves': 31,
    'learning_rate': 0.05,
    'verbose': -1,
    'num_threads': 4
}

train_set = lgb.Dataset(X_train, label=y_train)
val_set = lgb.Dataset(X_val, label=y_val, reference=train_set)


def train_and_evaluate(params, name):
    """Train a LightGBM model and report performance."""
    start_time = time.time()
    model = lgb.train(
        params,
        train_set,
        num_boost_round=300,
        valid_sets=[val_set],
        valid_names=['val'],
        callbacks=[lgb.early_stopping(50, verbose=False)]
    )
    train_time = time.time() - start_time
    best_iter = model.best_iteration
    best_score = model.best_score['val']['binary_logloss']
    return {
        'name': name,
        'train_time': train_time,
        'best_iteration': best_iter,
        'val_loss': best_score
    }


# Configuration 1: Full data (baseline)
results_full = train_and_evaluate(base_params, "Full Data")

# Configuration 2: Random subsampling
params_random = base_params.copy()
params_random.update({
    'bagging_fraction': 0.3,  # Use 30% of data each iteration
    'bagging_freq': 1         # Resample every iteration
})
results_random = train_and_evaluate(params_random, "Random 30%")

# Configuration 3: GOSS (using LightGBM's built-in implementation)
# Note: boosting_type='goss' enables GOSS automatically
params_goss = base_params.copy()
params_goss.update({
    'boosting_type': 'goss',
    'top_rate': 0.2,    # Keep top 20% by gradient
    'other_rate': 0.1   # Sample 10% of the full data (effective: ~30%)
})
results_goss = train_and_evaluate(params_goss, "GOSS (20%, 10%)")

# Print results
print("\n" + "=" * 60)
print("COMPARISON: GOSS vs Random Subsampling")
print("=" * 60)
print(f"{'Configuration':<20} {'Time(s)':<12} {'Best Iter':<12} {'Val Loss':<12}")
print("-" * 60)

for result in [results_full, results_random, results_goss]:
    print(f"{result['name']:<20} {result['train_time']:<12.2f} "
          f"{result['best_iteration']:<12} {result['val_loss']:<12.6f}")

print("\nObservations:")
print("- GOSS maintains accuracy close to full data while reducing training time")
print("- Random subsampling may need more iterations to converge")
print("- GOSS's intelligent sampling preserves the learning signal better")
```

GOSS shines on large datasets (>100K samples) where full training is slow. It's particularly effective when gradients are heterogeneous—common in later boosting iterations or with imbalanced data. For small datasets, the sorting overhead may not be worth the sampling benefit.
LightGBM provides two ways to enable GOSS: through the boosting type parameter or manual configuration.
Method 1: boosting_type='goss'
The simplest way to enable GOSS is setting boosting_type='goss'. This activates GOSS with default parameters (top_rate=0.2, other_rate=0.1) and disables bagging, since GOSS provides its own sampling mechanism. (In LightGBM 4.x the same behavior is also exposed through data_sample_strategy='goss'; check the documentation of your installed version for the preferred spelling.)
```python
import lightgbm as lgb

# Method 1: Simple GOSS activation
params_simple = {
    'objective': 'binary',
    'metric': 'auc',
    'boosting_type': 'goss',   # Enables GOSS with default top_rate=0.2, other_rate=0.1
    'num_leaves': 31,
    'learning_rate': 0.05,
}

# Method 2: GOSS with custom parameters
params_custom = {
    'objective': 'binary',
    'metric': 'auc',
    'boosting_type': 'goss',
    'top_rate': 0.25,          # Keep top 25% by gradient magnitude
    'other_rate': 0.15,        # Sample 15% of the full data
    # Note: top_rate + other_rate should not exceed 1.0
    'num_leaves': 63,
    'learning_rate': 0.03,
    'min_data_in_leaf': 20,    # Regularization (still important with GOSS)
    'lambda_l1': 0.1,
    'lambda_l2': 0.1,
}

# Important notes about GOSS:
# 1. GOSS does NOT combine with bagging_fraction/bagging_freq
#    (those parameters are ignored when boosting_type='goss')
# 2. GOSS can combine with feature subsampling (feature_fraction)
# 3. Early stopping is particularly important with GOSS to prevent overfitting


# Training with GOSS
def train_with_goss(X_train, y_train, X_val, y_val):
    train_data = lgb.Dataset(X_train, label=y_train)
    val_data = lgb.Dataset(X_val, label=y_val, reference=train_data)

    model = lgb.train(
        params_custom,
        train_data,
        num_boost_round=500,
        valid_sets=[train_data, val_data],
        valid_names=['train', 'val'],
        callbacks=[
            lgb.early_stopping(stopping_rounds=50),
            lgb.log_evaluation(period=50)
        ]
    )
    return model


# Method 3: Using the LGBMClassifier API
from lightgbm import LGBMClassifier

clf = LGBMClassifier(
    boosting_type='goss',
    top_rate=0.2,
    other_rate=0.1,
    n_estimators=300,
    num_leaves=31,
    learning_rate=0.05,
    random_state=42
)

# clf.fit(X_train, y_train,
#         eval_set=[(X_val, y_val)],
#         callbacks=[lgb.early_stopping(50)])
```

Tuning GOSS parameters requires balancing training speed against model accuracy. Here are practical guidelines based on dataset characteristics.
| Dataset Size | top_rate | other_rate | Effective Rate | Notes |
|---|---|---|---|---|
| < 50K samples | — | — | 100% (use regular GBDT) | GOSS overhead may not pay off |
| 50K - 200K | 0.2 | 0.15 | ~35% | Balanced speed/accuracy |
| 200K - 1M | 0.15 | 0.1 | ~25% | Aggressive sampling appropriate |
| 1M - 10M | 0.1 | 0.1 | ~20% | Focus on speed, gradients well-distributed |
| > 10M | 0.1 | 0.05 | ~15% | Maximum speed, requires careful monitoring |
```python
import time

import lightgbm as lgb
from sklearn.model_selection import train_test_split


def get_goss_params(n_samples: int, base_params: dict = None) -> dict:
    """
    Get recommended GOSS parameters based on dataset size.

    Parameters:
    -----------
    n_samples : Number of training samples
    base_params : Base LightGBM parameters to extend

    Returns:
    --------
    dict : LightGBM parameters with appropriate GOSS settings
    """
    params = base_params.copy() if base_params else {}

    if n_samples < 50000:
        # Don't use GOSS for small datasets
        params['boosting_type'] = 'gbdt'
        print(f"Dataset size {n_samples}: Using regular GBDT (GOSS overhead not worthwhile)")
        return params

    params['boosting_type'] = 'goss'

    if n_samples < 200000:
        params['top_rate'] = 0.2
        params['other_rate'] = 0.15
    elif n_samples < 1000000:
        params['top_rate'] = 0.15
        params['other_rate'] = 0.1
    elif n_samples < 10000000:
        params['top_rate'] = 0.1
        params['other_rate'] = 0.1
    else:
        params['top_rate'] = 0.1
        params['other_rate'] = 0.05

    # Both rates are fractions of the full dataset, so the effective
    # sample rate is simply their sum.
    effective_rate = params['top_rate'] + params['other_rate']
    print(f"Dataset size {n_samples:,}: GOSS with top_rate={params['top_rate']}, "
          f"other_rate={params['other_rate']} (effective: {effective_rate:.1%})")

    return params


def tune_goss_parameters(X, y, param_grid: list = None):
    """
    Tune GOSS parameters using a holdout validation set.

    Parameters:
    -----------
    X, y : Training data
    param_grid : List of (top_rate, other_rate) tuples to try
    """
    if param_grid is None:
        param_grid = [
            (0.3, 0.2),    # Conservative
            (0.2, 0.15),   # Moderate
            (0.2, 0.1),    # Default
            (0.15, 0.1),   # Aggressive
            (0.1, 0.1),    # More aggressive
        ]

    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
    train_data = lgb.Dataset(X_train, label=y_train)
    val_data = lgb.Dataset(X_val, label=y_val, reference=train_data)

    results = []
    for top_rate, other_rate in param_grid:
        params = {
            'objective': 'binary',
            'metric': 'auc',
            'boosting_type': 'goss',
            'top_rate': top_rate,
            'other_rate': other_rate,
            'num_leaves': 31,
            'learning_rate': 0.05,
            'verbose': -1
        }

        start = time.time()
        model = lgb.train(
            params,
            train_data,
            num_boost_round=300,
            valid_sets=[val_data],
            callbacks=[lgb.early_stopping(30, verbose=False)]
        )
        elapsed = time.time() - start

        results.append({
            'top_rate': top_rate,
            'other_rate': other_rate,
            'effective_rate': top_rate + other_rate,
            'val_auc': model.best_score['valid_0']['auc'],
            'best_iter': model.best_iteration,
            'time': elapsed
        })

    print("\nGOSS Parameter Tuning Results:")
    print("=" * 80)
    print(f"{'top_rate':<12} {'other_rate':<12} {'Eff. Rate':<12} "
          f"{'Val AUC':<12} {'Best Iter':<12} {'Time(s)':<12}")
    print("-" * 80)
    for r in results:
        print(f"{r['top_rate']:<12.2f} {r['other_rate']:<12.2f} "
              f"{r['effective_rate']:<12.1%} {r['val_auc']:<12.6f} "
              f"{r['best_iter']:<12} {r['time']:<12.2f}")

    # Find the best configuration by validation AUC
    best = max(results, key=lambda x: x['val_auc'])
    print(f"\nBest configuration: top_rate={best['top_rate']}, other_rate={best['other_rate']}")

    return results
```

On highly imbalanced datasets, minority class samples often have larger gradients (they're harder to predict). GOSS naturally retains more minority samples through its gradient-based selection—providing an implicit form of rebalancing. However, if imbalance is extreme, consider combining GOSS with is_unbalance=True or explicit class weights.
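A minimal configuration sketch for that situation (the parameter values are illustrative rather than tuned recommendations; is_unbalance and scale_pos_weight are standard LightGBM options):

```python
import lightgbm as lgb

# Sketch: combining GOSS with LightGBM's class-imbalance handling.
# Use either is_unbalance or scale_pos_weight, not both.
params_imbalanced = {
    'objective': 'binary',
    'metric': 'auc',
    'boosting_type': 'goss',
    'top_rate': 0.2,
    'other_rate': 0.1,
    'is_unbalance': True,   # or: 'scale_pos_weight': n_negative / n_positive
    'num_leaves': 31,
    'learning_rate': 0.05,
}

# booster = lgb.train(params_imbalanced,
#                     lgb.Dataset(X_train, label=y_train),
#                     num_boost_round=300)
```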
GOSS represents a clever application of the insight that gradient magnitude indicates sample importance. By keeping all high-gradient samples and intelligently sampling from low-gradient samples, GOSS achieves significant speedups while maintaining model quality.
What's Next:
We've seen how GOSS reduces sample complexity. The next page explores Exclusive Feature Bundling (EFB), LightGBM's technique for reducing feature dimensionality by bundling mutually exclusive features—another key innovation that contributes to LightGBM's speed advantage.
You now understand GOSS—its intuition, algorithm, mathematical justification, and practical usage. Combined with leaf-wise growth, GOSS enables LightGBM to handle large datasets efficiently. Next, we'll explore EFB, which addresses feature dimensionality.