M-estimators approach robustness by down-weighting outliers—reducing their influence on the final estimate. But what if you could simply ignore outliers entirely? What if, instead of trying to accommodate bad data, you could find the good data and estimate from that alone?
RANSAC (Random Sample Consensus), introduced by Martin Fischler and Robert Bolles in 1981, takes exactly this approach. It represents a fundamental paradigm shift: rather than treating all data as coming from one (possibly contaminated) distribution, RANSAC explicitly models data as a mixture of inliers (data that fits the model) and outliers (data that doesn't).
The key insight is beautifully simple: if you randomly select a small subset of data points, there's a reasonable probability that subset contains only inliers. From that clean subset, you can fit a perfect model—and then identify which other points are also inliers by checking consistency.
RANSAC has achieved legendary status in computer vision, where it solved problems that had resisted decades of traditional approaches. Fitting lines to edge detections, estimating camera transformations, matching features across images—tasks where 50% or more of the data might be spurious—became tractable through RANSAC.
This page provides complete coverage of RANSAC: the algorithm, its theoretical foundations, parameter selection, variants, and practical implementation considerations.
By the end of this page, you will understand the RANSAC algorithm and its theoretical justification, how to select the number of iterations and threshold parameters, the relationship between RANSAC and statistical robustness concepts, and when to use RANSAC versus M-estimators.
The Core Algorithm:
RANSAC operates through repeated random sampling and hypothesis testing. Here's the complete procedure:
Input:
- data: the observed data points
- s: the minimum number of points needed to fit the model
- k: the maximum number of iterations
- t: the residual threshold for classifying a point as an inlier
- d: the minimum number of inliers required to accept a model

Algorithm:

```
best_model ← null
best_inliers ← ∅
best_error ← ∞

for iteration = 1 to k:
    # Step 1: Randomly select s points
    sample ← random_sample(data, size=s)

    # Step 2: Fit model to the sample
    candidate_model ← fit_model(sample)

    # Step 3: Find all inliers (points consistent with the model)
    inliers ← {i : |y_i - predict(candidate_model, x_i)| < t}

    # Step 4: If enough inliers, refit and evaluate
    if |inliers| ≥ d:
        refined_model ← fit_model(data[inliers])
        error ← compute_error(refined_model, data[inliers])
        if |inliers| > |best_inliers| or error < best_error:
            best_model ← refined_model
            best_inliers ← inliers
            best_error ← error

return best_model, best_inliers
```
Intuition:
Each iteration places a bet: "Maybe these $s$ points are all inliers." If they are, the fitted model will be correct, and checking it against all points will reveal many more inliers. If the sample included an outlier, the model will be wrong, few points will be consistent, and we move on.
By trying many random samples, we maximize the probability of eventually finding a purely inlier sample.
If 60% of data are inliers and we sample 2 points, there's a (0.6)² = 36% chance both are inliers. After 10 attempts: 1 - (1-0.36)^10 = 99% chance of finding at least one good sample. Randomness provides probabilistic guarantees.
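The arithmetic above can be checked in a couple of lines:

```python
# Probability of drawing at least one all-inlier sample of size s
# in k attempts, given inlier ratio w (numbers from the text).
w, s, k = 0.6, 2, 10
p_clean_sample = w ** s                     # 0.36: both sampled points are inliers
p_success = 1 - (1 - p_clean_sample) ** k   # ≈ 0.988 after 10 attempts
print(p_clean_sample, p_success)
```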
The most important parameter in RANSAC is $k$, the number of iterations. Too few iterations and you might never draw an all-inlier sample; too many and computation is wasted.
Probabilistic derivation:
Let $w$ be the probability that any randomly selected point is an inlier (the "inlier ratio"). For a sample of size $s$, the probability that all $s$ points are inliers is
$$P(\text{all inliers}) = w^s$$
so a single sample fails (contains at least one outlier) with probability $1 - w^s$. After $k$ independent iterations, the probability that every sample failed, i.e., that RANSAC never saw a clean sample, is $(1 - w^s)^k$.
We want this failure probability to be at most $p$ (typically $p = 0.01$ or $p = 0.001$):
$$(1 - w^s)^k \leq p$$
Solving for $k$:
$$k \geq \frac{\log(p)}{\log(1 - w^s)}$$
The formula in practice:
$$k = \left\lceil \frac{\log(p)}{\log(1 - w^s)} \right\rceil$$
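The formula translates directly into a small helper (the function name is ours, with $p = 0.01$ as the default):

```python
import math

def required_iterations(w, s, p=0.01):
    """Smallest k such that the probability of never drawing an
    all-inlier sample of size s is at most p, given inlier ratio w."""
    return math.ceil(math.log(p) / math.log(1 - w ** s))

print(required_iterations(0.9, 2))  # 3
print(required_iterations(0.5, 2))  # 17
print(required_iterations(0.5, 8))  # 1177
```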
Required iterations $k$ (using $p = 0.01$):

| Inlier Ratio (w) | s = 2 | s = 3 | s = 4 | s = 5 | s = 8 |
|---|---|---|---|---|---|
| 90% | 3 | 4 | 5 | 6 | 9 |
| 80% | 5 | 7 | 9 | 12 | 26 |
| 70% | 7 | 11 | 17 | 26 | 78 |
| 60% | 11 | 19 | 34 | 57 | 272 |
| 50% | 17 | 35 | 72 | 146 | 1,177 |
| 40% | 27 | 70 | 178 | 448 | 7,025 |
| 30% | 49 | 169 | 567 | 1,893 | 70,188 |
Notice how iterations explode as s increases or w decreases. With 50% outliers and s=8, you need over 1,000 iterations. This is why RANSAC is most efficient when s is small (minimal sample size) and the inlier ratio is reasonably high (>40%).
Adaptive iteration count:
In practice, we don't know $w$ in advance. A standard adaptation estimates $w$ on the fly: initialize $k$ to a pessimistic (worst-case) value, and whenever a candidate model yields the largest consensus set found so far, set $\hat{w} = |\text{inliers}| / n$ and recompute $k = \lceil \log(p) / \log(1 - \hat{w}^s) \rceil$. The loop terminates as soon as the number of completed trials reaches the current $k$.
This adaptive approach often terminates much faster than worst-case bounds suggest.
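A minimal sketch of this adaptive loop, using hypothetical `fit` and `count_inliers` callables and a toy 1D location problem (all names here are illustrative, not part of any library):

```python
import math
import random

def adaptive_ransac_trials(data, fit, count_inliers, s, p=0.99, max_trials=10000):
    """Sketch of adaptive stopping: re-estimate the inlier ratio w from the
    best consensus set so far and shrink the remaining trial budget."""
    n = len(data)
    n_trials = max_trials
    best = 0
    trial = 0
    while trial < n_trials:
        model = fit(random.sample(data, s))      # hypothesis from s random points
        n_in = count_inliers(model)
        if n_in > best:                          # larger consensus set found
            best = n_in
            fail = 1 - (best / n) ** s           # P(sample contains an outlier)
            if 0 < fail < 1:
                n_trials = min(max_trials,
                               math.ceil(math.log(1 - p) / math.log(fail)))
        trial += 1
    return best, trial

# Toy example: 70 inliers at exactly 5.0, 30 gross outliers far away
random.seed(0)
data = [5.0] * 70 + [100.0 * (i + 1) for i in range(30)]
best, trials = adaptive_ransac_trials(
    data,
    fit=lambda smp: sum(smp) / len(smp),                  # model = sample mean
    count_inliers=lambda m: sum(abs(x - m) < 1 for x in data),
    s=2,
)
print(best, trials)   # terminates after far fewer than max_trials iterations
```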
The threshold $t$ determines which points are classified as inliers. This parameter critically affects RANSAC performance:
Statistical approach:
Under a Gaussian noise model with standard deviation $\sigma$, inlier residuals follow $r_i \sim \mathcal{N}(0, \sigma^2)$, so a threshold of the form $t = c\sigma$ retains a fraction $2\Phi(c) - 1$ of the true inliers, where $\Phi$ is the standard normal CDF.
A common choice is $t = 2.5\sigma$ (covers ~99% of inliers under Gaussian noise).
When σ is unknown:
Estimate $\sigma$ from:
- the scaled median absolute deviation of residuals from a preliminary fit, $\hat{\sigma} = 1.4826 \cdot \text{MAD}$;
- repeated measurements of the same quantity, when available;
- domain knowledge of the measurement process (e.g., sensor precision).
Chi-squared approach (for multi-dimensional errors):
When errors are multi-dimensional (e.g., 2D point distances), use the Mahalanobis distance:
$$d^2 = \mathbf{e}^\top \Sigma^{-1} \mathbf{e}$$
Under Gaussian errors, $d^2 \sim \chi^2_m$ where $m$ is the dimension. Set the threshold on $d^2$ using $\chi^2$ quantiles:
$$t^2 = \chi^2_m(\alpha), \quad \text{e.g., } \alpha = 0.95$$
For $m = 2$ this gives $t^2 \approx 5.99$, i.e., $t \approx 2.45$ in covariance-normalized units.
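The $\chi^2$ quantiles can be looked up with `scipy.stats.chi2` (here assuming covariance-normalized errors, i.e., unit $\sigma$):

```python
import numpy as np
from scipy.stats import chi2

# Threshold on the squared Mahalanobis distance that keeps ~95% of
# Gaussian inliers, for error dimension m.
for m in (1, 2, 3):
    t_sq = chi2.ppf(0.95, df=m)
    print(m, t_sq, np.sqrt(t_sq))
```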
When in doubt, use a slightly larger threshold. Including a few outliers in the inlier set is usually less harmful than excluding many true inliers: the final model is refit to all identified inliers, so a few mild outliers have only limited influence on that fit.
```python
import numpy as np
from scipy import linalg

def ransac_regression(X, y, min_samples=None, residual_threshold=None,
                      max_trials=1000, stop_probability=0.99,
                      min_inlier_fraction=0.5, random_state=None):
    """
    RANSAC for robust linear regression.

    Parameters
    ----------
    X : array, shape (n_samples, n_features)
        Feature matrix
    y : array, shape (n_samples,)
        Target values
    min_samples : int, optional
        Minimum number of samples to fit model (default: n_features)
    residual_threshold : float, optional
        Maximum residual for inlier classification (default: MAD-based)
    max_trials : int
        Maximum number of RANSAC iterations
    stop_probability : float
        Probability of finding a good model (for adaptive stopping)
    min_inlier_fraction : float
        Minimum fraction of inliers required
    random_state : int, optional
        Random seed for reproducibility

    Returns
    -------
    best_coef : array
        Estimated coefficients
    inlier_mask : array, bool
        Mask indicating inliers
    info : dict
        Algorithm information (iterations, inlier count, etc.)
    """
    rng = np.random.RandomState(random_state)
    n_samples, n_features = X.shape

    # Default minimum samples
    if min_samples is None:
        min_samples = n_features

    # Initial threshold estimate using robust scale
    if residual_threshold is None:
        # Use median absolute deviation for a robust scale estimate,
        # starting from an OLS fit for initial residuals
        try:
            init_coef = linalg.lstsq(X, y)[0]
            init_residuals = y - X @ init_coef
            mad = np.median(np.abs(init_residuals - np.median(init_residuals)))
            residual_threshold = 2.5 * 1.4826 * mad if mad > 0 else 1.0
        except Exception:
            residual_threshold = 1.0

    best_coef = None
    best_inlier_mask = np.zeros(n_samples, dtype=bool)
    best_n_inliers = 0

    # Adaptive iteration count
    n_trials = max_trials
    trial = 0
    min_inliers = int(min_inlier_fraction * n_samples)

    info = {
        'n_trials': 0,
        'converged': False,
        'threshold': residual_threshold,
    }

    while trial < n_trials:
        # Step 1: Random sample
        sample_idx = rng.choice(n_samples, size=min_samples, replace=False)
        X_sample = X[sample_idx]
        y_sample = y[sample_idx]

        # Step 2: Fit model to sample
        try:
            sample_coef = linalg.lstsq(X_sample, y_sample)[0]
        except Exception:
            trial += 1
            continue

        # Step 3: Compute residuals and find inliers
        residuals = np.abs(y - X @ sample_coef)
        inlier_mask = residuals < residual_threshold
        n_inliers = np.sum(inlier_mask)

        # Step 4: Check whether this is a good model
        if n_inliers > best_n_inliers and n_inliers >= min_inliers:
            # Refit to all inliers
            X_inliers = X[inlier_mask]
            y_inliers = y[inlier_mask]
            try:
                refined_coef = linalg.lstsq(X_inliers, y_inliers)[0]
                # Recompute inliers with the refined model
                refined_residuals = np.abs(y - X @ refined_coef)
                refined_inlier_mask = refined_residuals < residual_threshold
                refined_n_inliers = np.sum(refined_inlier_mask)
                if refined_n_inliers > best_n_inliers:
                    best_coef = refined_coef
                    best_inlier_mask = refined_inlier_mask
                    best_n_inliers = refined_n_inliers
                    # Update the adaptive iteration count
                    inlier_ratio = best_n_inliers / n_samples
                    if inlier_ratio > 0:
                        p_outlier = 1 - inlier_ratio ** min_samples
                        if 0 < p_outlier < 1:
                            n_trials = min(
                                max_trials,
                                int(np.ceil(np.log(1 - stop_probability)
                                            / np.log(p_outlier)))
                            )
            except Exception:
                pass
        trial += 1

    info['n_trials'] = trial
    info['n_inliers'] = best_n_inliers
    info['inlier_ratio'] = best_n_inliers / n_samples if n_samples > 0 else 0
    info['converged'] = best_n_inliers >= min_inliers

    return best_coef, best_inlier_mask, info

# Example with extreme outliers
np.random.seed(42)
n = 100

# Generate data: 60% inliers, 40% outliers
n_inliers = 60
n_outliers = 40

# Inliers: clean linear relationship
X_inliers = np.random.randn(n_inliers, 1)
y_inliers = 2.0 + 3.0 * X_inliers.flatten() + 0.2 * np.random.randn(n_inliers)

# Outliers: random points (no relationship)
X_outliers = np.random.randn(n_outliers, 1) * 3
y_outliers = 10 * np.random.randn(n_outliers)

# Combine
X = np.vstack([X_inliers, X_outliers])
y = np.concatenate([y_inliers, y_outliers])
true_inliers = np.concatenate([np.ones(n_inliers),
                               np.zeros(n_outliers)]).astype(bool)

# Add intercept
X_with_intercept = np.column_stack([np.ones(n), X])

# Compare methods
ols_coef = linalg.lstsq(X_with_intercept, y)[0]
ransac_coef, inlier_mask, info = ransac_regression(X_with_intercept, y,
                                                   random_state=42)

print("True coefficients: [2.0, 3.0]")
print(f"OLS:    [{ols_coef[0]:.3f}, {ols_coef[1]:.3f}] (corrupted by outliers)")
print(f"RANSAC: [{ransac_coef[0]:.3f}, {ransac_coef[1]:.3f}]")
print(f"RANSAC found {info['n_inliers']} inliers ({info['inlier_ratio']*100:.1f}%)")
print(f"Used {info['n_trials']} iterations")
print(f"Correctly identified inliers: {np.sum(inlier_mask & true_inliers)}/{n_inliers}")
```

RANSAC and M-estimators represent fundamentally different philosophies for handling outliers. Understanding when to use each is crucial for effective robust estimation.
| Aspect | M-Estimators | RANSAC |
|---|---|---|
| Philosophy | Down-weight outliers | Identify and remove outliers |
| Breakdown point | 0-50% (type dependent) | ~50% achievable |
| Deterministic | Yes (with fixed initialization) | No (randomized algorithm) |
| Standard errors | Well-characterized | Harder to compute |
| Computational cost | O(n) per iteration | O(k × n) where k can be large |
| High outlier rates | May fail | Handles well |
| Small samples | Efficient | May not have enough inliers |
A powerful strategy: use RANSAC to identify inliers, then apply M-estimation to the inliers for refined estimates with proper uncertainty quantification. This combines RANSAC's outlier detection with M-estimation's statistical framework.
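One way to sketch this two-stage pipeline with scikit-learn, using `RANSACRegressor` for outlier identification and `HuberRegressor` as the M-estimator (the synthetic data here is illustrative):

```python
import numpy as np
from sklearn.linear_model import RANSACRegressor, HuberRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1))
y = 2.0 + 3.0 * X[:, 0] + 0.2 * rng.normal(size=100)
y[:30] = rng.normal(scale=10, size=30)      # replace 30% with gross outliers

# Stage 1: RANSAC flags the consensus set
ransac = RANSACRegressor(random_state=0).fit(X, y)
mask = ransac.inlier_mask_

# Stage 2: M-estimation (Huber) on the inliers for a refined fit
huber = HuberRegressor().fit(X[mask], y[mask])
print(huber.intercept_, huber.coef_[0])     # close to the true (2.0, 3.0)
```

The Huber stage also provides the familiar M-estimation machinery (e.g., its scale estimate) on data that RANSAC has already cleaned of gross contamination.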
The basic RANSAC algorithm has inspired numerous variants addressing its limitations:
1. MSAC (M-estimator Sample Consensus)
Instead of counting inliers (hard threshold), use a robust score:
$$\text{Score} = \sum_{i=1}^n \rho(r_i)$$
where $\rho$ is a truncated quadratic: $$\rho(r) = \min(r^2, t^2)$$
Inliers contribute their squared residual; outliers contribute $t^2$. This creates smoother score landscapes.
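The truncated-quadratic score is a one-liner; a quick sketch:

```python
import numpy as np

def msac_score(residuals, t):
    """MSAC score: inliers contribute r^2, outliers contribute the
    constant t^2 (lower is better)."""
    r2 = np.asarray(residuals) ** 2
    return np.sum(np.minimum(r2, t ** 2))

r = np.array([0.1, -0.2, 5.0, 0.05])
print(msac_score(r, t=1.0))   # 0.01 + 0.04 + 1.0 + 0.0025 = 1.0525
```

Unlike plain inlier counting, two models with the same inlier count are now distinguished by how tightly their inliers fit.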
2. MLESAC (Maximum Likelihood Estimation Sample Consensus)
Models data as mixture of inliers (Gaussian) and outliers (uniform):
$$P(r_i) = \gamma \cdot \mathcal{N}(r_i; 0, \sigma^2) + (1-\gamma) \cdot U(r_i; a, b)$$
Maximizes likelihood instead of counting inliers. Automatically estimates inlier fraction $\gamma$.
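A sketch of the mixture likelihood MLESAC evaluates, written as a negative log-likelihood for fixed parameters (in full MLESAC, $\gamma$ is itself estimated, e.g., by EM; the function name is ours):

```python
import numpy as np

def mlesac_nll(residuals, gamma, sigma, a, b):
    """Negative log-likelihood of the MLESAC mixture: Gaussian inliers
    (std sigma) plus uniform outliers on [a, b]."""
    r = np.asarray(residuals)
    gauss = np.exp(-r**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))
    unif = 1.0 / (b - a)
    return -np.sum(np.log(gamma * gauss + (1 - gamma) * unif))

# Lower NLL = better model; compare candidate models by this score
print(mlesac_nll([0.0, 0.1, 4.0], gamma=0.5, sigma=1.0, a=-10, b=10))
```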
3. LO-RANSAC (Locally Optimized RANSAC)
When a good model is found, perform local optimization on nearby samples before continuing. Finds better models with fewer iterations.
4. PROSAC (Progressive Sample Consensus)
Uses prior quality scores on points (e.g., feature match confidence). Samples high-quality points first, finding good models faster when quality information is available.
sklearn.linear_model.RANSACRegressor implements basic RANSAC with configurable estimators, loss functions, and stopping criteria. For most applications, this well-tested implementation is preferred over custom code.
What's next:
We've now covered the two main paradigms for robust estimation: M-estimators (which down-weight outliers) and RANSAC (which identifies and removes them). The next page introduces breakdown point—the formal measure of how much contamination an estimator can tolerate before completely failing. This theoretical concept provides rigorous foundations for comparing robustness across methods.
You now understand RANSAC as a fundamentally different approach to robust estimation—one that explicitly separates inliers from outliers through random sampling and consensus. This powerful technique handles extreme contamination that would defeat M-estimators.