M-estimators approach robustness by down-weighting outliers—reducing their influence on the final estimate. But what if you could simply ignore outliers entirely? What if, instead of trying to accommodate bad data, you could find the good data and estimate from that alone?
RANSAC (Random Sample Consensus), introduced by Martin Fischler and Robert Bolles in 1981, takes exactly this approach. It represents a fundamental paradigm shift: rather than treating all data as coming from one (possibly contaminated) distribution, RANSAC explicitly models data as a mixture of inliers (data that fits the model) and outliers (data that doesn't).
The key insight is beautifully simple: if you randomly select a small subset of data points, there's a reasonable probability that subset contains only inliers. From that clean subset, you can fit a perfect model—and then identify which other points are also inliers by checking consistency.
RANSAC has achieved legendary status in computer vision, where it solved problems that had resisted decades of traditional approaches. Fitting lines to edge detections, estimating camera transformations, matching features across images—tasks where 50% or more of the data might be spurious—became tractable through RANSAC.
This page provides complete coverage of RANSAC: the algorithm, its theoretical foundations, parameter selection, variants, and practical implementation considerations.
By the end of this page, you will understand the RANSAC algorithm and its theoretical justification, how to select the number of iterations and threshold parameters, the relationship between RANSAC and statistical robustness concepts, and when to use RANSAC versus M-estimators.
The Core Algorithm:
RANSAC operates through repeated random sampling and hypothesis testing. Here's the complete procedure:
Input:
- data: the observed data points
- s: the minimum number of points needed to fit the model
- k: the maximum number of iterations
- t: the residual threshold for classifying a point as an inlier
- d: the minimum number of inliers required to accept a model

Algorithm:

```
best_model ← null
best_inliers ← ∅
best_error ← ∞

for iteration = 1 to k:
    # Step 1: Randomly select s points
    sample ← random_sample(data, size=s)

    # Step 2: Fit model to the sample
    candidate_model ← fit_model(sample)

    # Step 3: Find all inliers (points consistent with the model)
    inliers ← {i : |y_i - predict(candidate_model, x_i)| < t}

    # Step 4: If enough inliers, refit and evaluate
    if |inliers| ≥ d:
        refined_model ← fit_model(data[inliers])
        error ← compute_error(refined_model, data[inliers])
        if |inliers| > |best_inliers| or error < best_error:
            best_model ← refined_model
            best_inliers ← inliers
            best_error ← error

return best_model, best_inliers
```
Intuition:
Each iteration places a bet: "Maybe these $s$ points are all inliers." If they are, the fitted model will be correct, and checking it against all points will reveal many more inliers. If the sample included an outlier, the model will be wrong, few points will be consistent, and we move on.
By trying many random samples, we maximize the probability of eventually finding a purely inlier sample.
If 60% of data are inliers and we sample 2 points, there's a (0.6)² = 36% chance both are inliers. After 10 attempts: 1 - (1-0.36)^10 = 99% chance of finding at least one good sample. Randomness provides probabilistic guarantees.
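The arithmetic above can be checked in a couple of lines:

```python
# Probability of drawing at least one all-inlier sample of size s
# in k attempts, given inlier ratio w (numbers from the text).
w, s, k = 0.6, 2, 10
p_clean_sample = w ** s                     # 0.36: both sampled points are inliers
p_success = 1 - (1 - p_clean_sample) ** k   # ≈ 0.988 after 10 attempts
print(p_clean_sample, p_success)
```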
The most important parameter in RANSAC is $k$, the number of iterations. Too few iterations and you might never draw an all-inlier sample; too many and computation is wasted.
Probabilistic derivation:
Let $w$ be the probability that any randomly selected point is an inlier (the "inlier ratio"). For a sample of size $s$, the probability that all $s$ points are inliers is
$$P(\text{all inliers}) = w^s$$
so a single sample fails (contains at least one outlier) with probability $1 - w^s$. After $k$ independent iterations, the probability that every sample failed, i.e., that RANSAC never saw a clean sample, is $(1 - w^s)^k$.
We want this failure probability to be at most $p$ (typically $p = 0.01$ or $p = 0.001$):
$$(1 - w^s)^k \leq p$$
Solving for $k$:
$$k \geq \frac{\log(p)}{\log(1 - w^s)}$$
The formula in practice:
$$k = \left\lceil \frac{\log(p)}{\log(1 - w^s)} \right\rceil$$
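The formula translates directly into a small helper (the function name is ours, with $p = 0.01$ as the default):

```python
import math

def required_iterations(w, s, p=0.01):
    """Smallest k such that the probability of never drawing an
    all-inlier sample of size s is at most p, given inlier ratio w."""
    return math.ceil(math.log(p) / math.log(1 - w ** s))

print(required_iterations(0.9, 2))  # 3
print(required_iterations(0.5, 2))  # 17
print(required_iterations(0.5, 8))  # 1177
```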
Required iterations $k$ (using $p = 0.01$):

| Inlier Ratio (w) | s = 2 | s = 3 | s = 4 | s = 5 | s = 8 |
|---|---|---|---|---|---|
| 90% | 3 | 4 | 5 | 6 | 9 |
| 80% | 5 | 7 | 9 | 12 | 26 |
| 70% | 7 | 11 | 17 | 26 | 78 |
| 60% | 11 | 19 | 34 | 57 | 272 |
| 50% | 17 | 35 | 72 | 146 | 1,177 |
| 40% | 27 | 70 | 178 | 448 | 7,025 |
| 30% | 49 | 169 | 567 | 1,893 | 70,188 |
Notice how iterations explode as s increases or w decreases. With 50% outliers and s=8, you need over 1,000 iterations. This is why RANSAC is most efficient when s is small (minimal sample size) and the inlier ratio is reasonably high (>40%).
Adaptive iteration count:
In practice, we don't know $w$ in advance. A standard adaptation estimates $w$ on the fly: initialize $k$ to a pessimistic (worst-case) value, and whenever a candidate model yields the largest consensus set found so far, set $\hat{w} = |\text{inliers}| / n$ and recompute $k = \lceil \log(p) / \log(1 - \hat{w}^s) \rceil$. The loop terminates as soon as the number of completed trials reaches the current $k$.
This adaptive approach often terminates much faster than worst-case bounds suggest.
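A minimal sketch of this adaptive loop, using hypothetical `fit` and `count_inliers` callables and a toy 1D location problem (all names here are illustrative, not part of any library):

```python
import math
import random

def adaptive_ransac_trials(data, fit, count_inliers, s, p=0.99, max_trials=10000):
    """Sketch of adaptive stopping: re-estimate the inlier ratio w from the
    best consensus set so far and shrink the remaining trial budget."""
    n = len(data)
    n_trials = max_trials
    best = 0
    trial = 0
    while trial < n_trials:
        model = fit(random.sample(data, s))      # hypothesis from s random points
        n_in = count_inliers(model)
        if n_in > best:                          # larger consensus set found
            best = n_in
            fail = 1 - (best / n) ** s           # P(sample contains an outlier)
            if 0 < fail < 1:
                n_trials = min(max_trials,
                               math.ceil(math.log(1 - p) / math.log(fail)))
        trial += 1
    return best, trial

# Toy example: 70 inliers at exactly 5.0, 30 gross outliers far away
random.seed(0)
data = [5.0] * 70 + [100.0 * (i + 1) for i in range(30)]
best, trials = adaptive_ransac_trials(
    data,
    fit=lambda smp: sum(smp) / len(smp),                  # model = sample mean
    count_inliers=lambda m: sum(abs(x - m) < 1 for x in data),
    s=2,
)
print(best, trials)   # terminates after far fewer than max_trials iterations
```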
The threshold $t$ determines which points are classified as inliers. This parameter critically affects RANSAC performance:
Statistical approach:
Under a Gaussian noise model with standard deviation $\sigma$, inlier residuals follow $r_i \sim \mathcal{N}(0, \sigma^2)$, so a threshold of the form $t = c\sigma$ retains a fraction $2\Phi(c) - 1$ of the true inliers, where $\Phi$ is the standard normal CDF.
A common choice is $t = 2.5\sigma$ (covers ~99% of inliers under Gaussian noise).
When σ is unknown:
Estimate $\sigma$ from:
- the scaled median absolute deviation of residuals from a preliminary fit, $\hat{\sigma} = 1.4826 \cdot \text{MAD}$;
- repeated measurements of the same quantity, when available;
- domain knowledge of the measurement process (e.g., sensor precision).
Chi-squared approach (for multi-dimensional errors):
When errors are multi-dimensional (e.g., 2D point distances), use the Mahalanobis distance:
$$d^2 = \mathbf{e}^\top \Sigma^{-1} \mathbf{e}$$
Under Gaussian errors, $d^2 \sim \chi^2_m$ where $m$ is the dimension. Set the threshold on $d^2$ using $\chi^2$ quantiles:
$$t^2 = \chi^2_m(\alpha), \quad \text{e.g., } \alpha = 0.95$$
For $m = 2$ this gives $t^2 \approx 5.99$, i.e., $t \approx 2.45$ in covariance-normalized units.
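The $\chi^2$ quantiles can be looked up with `scipy.stats.chi2` (here assuming covariance-normalized errors, i.e., unit $\sigma$):

```python
import numpy as np
from scipy.stats import chi2

# Threshold on the squared Mahalanobis distance that keeps ~95% of
# Gaussian inliers, for error dimension m.
for m in (1, 2, 3):
    t_sq = chi2.ppf(0.95, df=m)
    print(m, t_sq, np.sqrt(t_sq))
```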
When in doubt, use a slightly larger threshold. Including a few outliers in the inlier set is usually less harmful than excluding many true inliers: the final model is refit to all identified inliers, so a few mild outliers have only limited influence on that fit.
```python
import numpy as np
from scipy import linalg

def ransac_regression(X, y, min_samples=None, residual_threshold=None,
                      max_trials=1000, stop_probability=0.99,
                      min_inlier_fraction=0.5, random_state=None):
    """
    RANSAC for robust linear regression.

    Parameters
    ----------
    X : array, shape (n_samples, n_features)
        Feature matrix
    y : array, shape (n_samples,)
        Target values
    min_samples : int, optional
        Minimum number of samples to fit model (default: n_features)
    residual_threshold : float, optional
        Maximum residual for inlier classification (default: MAD-based)
    max_trials : int
        Maximum number of RANSAC iterations
    stop_probability : float
        Probability of finding a good model (for adaptive stopping)
    min_inlier_fraction : float
        Minimum fraction of inliers required
    random_state : int, optional
        Random seed for reproducibility

    Returns
    -------
    best_coef : array
        Estimated coefficients
    inlier_mask : array, bool
        Mask indicating inliers
    info : dict
        Algorithm information (iterations, inlier count, etc.)
    """
    rng = np.random.RandomState(random_state)
    n_samples, n_features = X.shape

    # Default minimum samples
    if min_samples is None:
        min_samples = n_features

    # Initial threshold estimate using robust scale
    if residual_threshold is None:
        # Use median absolute deviation for a robust scale estimate,
        # starting from an OLS fit for initial residuals
        try:
            init_coef = linalg.lstsq(X, y)[0]
            init_residuals = y - X @ init_coef
            mad = np.median(np.abs(init_residuals - np.median(init_residuals)))
            residual_threshold = 2.5 * 1.4826 * mad if mad > 0 else 1.0
        except Exception:
            residual_threshold = 1.0

    best_coef = None
    best_inlier_mask = np.zeros(n_samples, dtype=bool)
    best_n_inliers = 0

    # Adaptive iteration count
    n_trials = max_trials
    trial = 0
    min_inliers = int(min_inlier_fraction * n_samples)

    info = {
        'n_trials': 0,
        'converged': False,
        'threshold': residual_threshold,
    }

    while trial < n_trials:
        # Step 1: Random sample
        sample_idx = rng.choice(n_samples, size=min_samples, replace=False)
        X_sample = X[sample_idx]
        y_sample = y[sample_idx]

        # Step 2: Fit model to sample
        try:
            sample_coef = linalg.lstsq(X_sample, y_sample)[0]
        except Exception:
            trial += 1
            continue

        # Step 3: Compute residuals and find inliers
        residuals = np.abs(y - X @ sample_coef)
        inlier_mask = residuals < residual_threshold
        n_inliers = np.sum(inlier_mask)

        # Step 4: Check whether this is a good model
        if n_inliers > best_n_inliers and n_inliers >= min_inliers:
            # Refit to all inliers
            X_inliers = X[inlier_mask]
            y_inliers = y[inlier_mask]
            try:
                refined_coef = linalg.lstsq(X_inliers, y_inliers)[0]
                # Recompute inliers with the refined model
                refined_residuals = np.abs(y - X @ refined_coef)
                refined_inlier_mask = refined_residuals < residual_threshold
                refined_n_inliers = np.sum(refined_inlier_mask)
                if refined_n_inliers > best_n_inliers:
                    best_coef = refined_coef
                    best_inlier_mask = refined_inlier_mask
                    best_n_inliers = refined_n_inliers
                    # Update the adaptive iteration count
                    inlier_ratio = best_n_inliers / n_samples
                    if inlier_ratio > 0:
                        p_outlier = 1 - inlier_ratio ** min_samples
                        if 0 < p_outlier < 1:
                            n_trials = min(
                                max_trials,
                                int(np.ceil(np.log(1 - stop_probability)
                                            / np.log(p_outlier)))
                            )
            except Exception:
                pass
        trial += 1

    info['n_trials'] = trial
    info['n_inliers'] = best_n_inliers
    info['inlier_ratio'] = best_n_inliers / n_samples if n_samples > 0 else 0
    info['converged'] = best_n_inliers >= min_inliers

    return best_coef, best_inlier_mask, info

# Example with extreme outliers
np.random.seed(42)
n = 100

# Generate data: 60% inliers, 40% outliers
n_inliers = 60
n_outliers = 40

# Inliers: clean linear relationship
X_inliers = np.random.randn(n_inliers, 1)
y_inliers = 2.0 + 3.0 * X_inliers.flatten() + 0.2 * np.random.randn(n_inliers)

# Outliers: random points (no relationship)
X_outliers = np.random.randn(n_outliers, 1) * 3
y_outliers = 10 * np.random.randn(n_outliers)

# Combine
X = np.vstack([X_inliers, X_outliers])
y = np.concatenate([y_inliers, y_outliers])
true_inliers = np.concatenate([np.ones(n_inliers),
                               np.zeros(n_outliers)]).astype(bool)

# Add intercept
X_with_intercept = np.column_stack([np.ones(n), X])

# Compare methods
ols_coef = linalg.lstsq(X_with_intercept, y)[0]
ransac_coef, inlier_mask, info = ransac_regression(X_with_intercept, y,
                                                   random_state=42)

print("True coefficients: [2.0, 3.0]")
print(f"OLS:    [{ols_coef[0]:.3f}, {ols_coef[1]:.3f}] (corrupted by outliers)")
print(f"RANSAC: [{ransac_coef[0]:.3f}, {ransac_coef[1]:.3f}]")
print(f"RANSAC found {info['n_inliers']} inliers ({info['inlier_ratio']*100:.1f}%)")
print(f"Used {info['n_trials']} iterations")
print(f"Correctly identified inliers: {np.sum(inlier_mask & true_inliers)}/{n_inliers}")
```

RANSAC and M-estimators represent fundamentally different philosophies for handling outliers. Understanding when to use each is crucial for effective robust estimation.
| Aspect | M-Estimators | RANSAC |
|---|---|---|
| Philosophy | Down-weight outliers | Identify and remove outliers |
| Breakdown point | 0-50% (type dependent) | ~50% achievable |
| Deterministic | Yes (with fixed initialization) | No (randomized algorithm) |
| Standard errors | Well-characterized | Harder to compute |
| Computational cost | O(n) per iteration | O(k × n) where k can be large |
| High outlier rates | May fail | Handles well |
| Small samples | Efficient | May not have enough inliers |
A powerful strategy: use RANSAC to identify inliers, then apply M-estimation to the inliers for refined estimates with proper uncertainty quantification. This combines RANSAC's outlier detection with M-estimation's statistical framework.
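One way to sketch this two-stage pipeline with scikit-learn, using `RANSACRegressor` for outlier identification and `HuberRegressor` as the M-estimator (the synthetic data here is illustrative):

```python
import numpy as np
from sklearn.linear_model import RANSACRegressor, HuberRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1))
y = 2.0 + 3.0 * X[:, 0] + 0.2 * rng.normal(size=100)
y[:30] = rng.normal(scale=10, size=30)      # replace 30% with gross outliers

# Stage 1: RANSAC flags the consensus set
ransac = RANSACRegressor(random_state=0).fit(X, y)
mask = ransac.inlier_mask_

# Stage 2: M-estimation (Huber) on the inliers for a refined fit
huber = HuberRegressor().fit(X[mask], y[mask])
print(huber.intercept_, huber.coef_[0])     # close to the true (2.0, 3.0)
```

The Huber stage also provides the familiar M-estimation machinery (e.g., its scale estimate) on data that RANSAC has already cleaned of gross contamination.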
The basic RANSAC algorithm has inspired numerous variants addressing its limitations:
1. MSAC (M-estimator Sample Consensus)
Instead of counting inliers (hard threshold), use a robust score:
$$\text{Score} = \sum_{i=1}^n \rho(r_i)$$
where $\rho$ is a truncated quadratic: $$\rho(r) = \min(r^2, t^2)$$
Inliers contribute their squared residual; outliers contribute $t^2$. This creates smoother score landscapes.
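The truncated-quadratic score is a one-liner; a quick sketch:

```python
import numpy as np

def msac_score(residuals, t):
    """MSAC score: inliers contribute r^2, outliers contribute the
    constant t^2 (lower is better)."""
    r2 = np.asarray(residuals) ** 2
    return np.sum(np.minimum(r2, t ** 2))

r = np.array([0.1, -0.2, 5.0, 0.05])
print(msac_score(r, t=1.0))   # 0.01 + 0.04 + 1.0 + 0.0025 = 1.0525
```

Unlike plain inlier counting, two models with the same inlier count are now distinguished by how tightly their inliers fit.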
2. MLESAC (Maximum Likelihood Estimation Sample Consensus)
Models data as mixture of inliers (Gaussian) and outliers (uniform):
$$P(r_i) = \gamma \cdot \mathcal{N}(r_i; 0, \sigma^2) + (1-\gamma) \cdot U(r_i; a, b)$$
Maximizes likelihood instead of counting inliers. Automatically estimates inlier fraction $\gamma$.
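A sketch of the mixture likelihood MLESAC evaluates, written as a negative log-likelihood for fixed parameters (in full MLESAC, $\gamma$ is itself estimated, e.g., by EM; the function name is ours):

```python
import numpy as np

def mlesac_nll(residuals, gamma, sigma, a, b):
    """Negative log-likelihood of the MLESAC mixture: Gaussian inliers
    (std sigma) plus uniform outliers on [a, b]."""
    r = np.asarray(residuals)
    gauss = np.exp(-r**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))
    unif = 1.0 / (b - a)
    return -np.sum(np.log(gamma * gauss + (1 - gamma) * unif))

# Lower NLL = better model; compare candidate models by this score
print(mlesac_nll([0.0, 0.1, 4.0], gamma=0.5, sigma=1.0, a=-10, b=10))
```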
3. LO-RANSAC (Locally Optimized RANSAC)
When a good model is found, perform local optimization on nearby samples before continuing. Finds better models with fewer iterations.
4. PROSAC (Progressive Sample Consensus)
Uses prior quality scores on points (e.g., feature match confidence). Samples high-quality points first, finding good models faster when quality information is available.
sklearn.linear_model.RANSACRegressor implements basic RANSAC with configurable estimators, loss functions, and stopping criteria. For most applications, this well-tested implementation is preferred over custom code.
What's next:
We've now covered the two main paradigms for robust estimation: M-estimators (which down-weight outliers) and RANSAC (which identifies and removes them). The next page introduces breakdown point—the formal measure of how much contamination an estimator can tolerate before completely failing. This theoretical concept provides rigorous foundations for comparing robustness across methods.
You now understand RANSAC as a fundamentally different approach to robust estimation—one that explicitly separates inliers from outliers through random sampling and consensus. This powerful technique handles extreme contamination that would defeat M-estimators.