Standard KNN (even with sophisticated weighting) makes a strong assumption: within each neighborhood, the target function is approximately constant. The prediction is a weighted average—essentially assuming the function is flat locally.
But what if the function has a clear local trend? A linear slope within the neighborhood? Even curvature?
Consider predicting house prices: within a neighborhood of similar homes, prices might increase with square footage at a consistent rate. A weighted average ignores this trend, introducing bias. A local linear model captures the trend, dramatically improving predictions.
This insight leads to local polynomial regression—fitting parametric models (constant, linear, quadratic) within each neighborhood using weighted least squares. The result combines the flexibility of nonparametric methods with the efficiency of parametric modeling.
By the end of this page, you will understand local constant models (Nadaraya-Watson as a special case), local linear regression and its bias-reduction properties, the famous LOESS/LOWESS algorithm, local polynomial regression of arbitrary degree, and the tradeoffs between local model complexity and variance.
Let's formalize what standard weighted KNN regression actually computes.
The Local Constant Model:
For a query point $\mathbf{x}$, we fit a constant model $\hat{f}(\mathbf{x}) = \beta_0$ by minimizing weighted squared error:
$$\hat{\beta}_0 = \arg\min_{\beta_0} \sum_{i=1}^{n} w_i(\mathbf{x}) \cdot (y_i - \beta_0)^2$$
where $w_i(\mathbf{x}) = K_h(\mathbf{x} - \mathbf{x}_i)$ is the kernel weight of training point $i$.
Solution:
Taking the derivative and setting to zero:
$$\hat{\beta}_0 = \frac{\sum_i w_i(\mathbf{x}) \cdot y_i}{\sum_i w_i(\mathbf{x})}$$
This is exactly the Nadaraya-Watson estimator! Weighted KNN regression is equivalent to fitting a local constant model.
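The closed-form solution is simple enough to compute directly. Here is a minimal sketch (assuming a Gaussian kernel and Euclidean distances; the function name is illustrative):

```python
import numpy as np

def nadaraya_watson(X_train, y_train, x_query, h):
    """Local constant fit: a kernel-weighted average of the targets."""
    distances = np.linalg.norm(X_train - x_query, axis=1)
    weights = np.exp(-0.5 * (distances / h) ** 2)  # Gaussian kernel weights w_i(x)
    return np.sum(weights * y_train) / np.sum(weights)

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.0, 1.0, 2.0, 3.0])
# Data is symmetric around the query, so the weighted average is 1.5
pred = nadaraya_watson(X, y, np.array([1.5]), h=1.0)
```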
Geometric Interpretation:
We're fitting a flat surface (a degree-0 polynomial) to the weighted data, and using its height as our prediction.
Local constant models suffer bias driven by the first-order terms in the Taylor expansion of the true function: if f(x) has a slope at the query point, averaging nearby points pulls the estimate toward the side with more data (typically toward the interior of the distribution). In the interior this contributes an O(h²) bias term that depends on both the slope and the design density; at boundaries, where the neighborhood is one-sided, the bias degrades to O(h).
Where Local Constant Fails:
These problems motivate fitting richer local models that can capture trends rather than just location.
Local linear regression fits a linear model (intercept + slopes) at each query point using weighted least squares.
The Local Linear Model:
For query $\mathbf{x}$, we fit: $$\hat{f}(\mathbf{z}) = \beta_0 + \boldsymbol{\beta}_1^T (\mathbf{z} - \mathbf{x})$$
by minimizing: $$\sum_{i=1}^{n} w_i(\mathbf{x}) \cdot \left(y_i - \beta_0 - \boldsymbol{\beta}_1^T (\mathbf{x}_i - \mathbf{x})\right)^2$$
The prediction at $\mathbf{x}$ is $\hat{f}(\mathbf{x}) = \hat{\beta}_0$ (since $(\mathbf{x} - \mathbf{x}) = \mathbf{0}$).
Key Insight: We fit a local hyperplane but only use the intercept for prediction. The slopes are estimated but discarded—their role is to remove bias from the intercept estimate.
Weighted Least Squares Solution:
Define the design matrix and weights: $$\mathbf{X} = \begin{bmatrix} 1 & (\mathbf{x}_1 - \mathbf{x})^T \\ \vdots & \vdots \\ 1 & (\mathbf{x}_n - \mathbf{x})^T \end{bmatrix}, \quad \mathbf{W} = \text{diag}(w_1(\mathbf{x}), ..., w_n(\mathbf{x}))$$
The solution is: $$\hat{\boldsymbol{\beta}} = (\mathbf{X}^T \mathbf{W} \mathbf{X})^{-1} \mathbf{X}^T \mathbf{W} \mathbf{y}$$
Prediction: $\hat{f}(\mathbf{x}) = \hat{\beta}_0 = [\hat{\boldsymbol{\beta}}]_0$
```python
import numpy as np
from typing import Tuple


class LocalLinearRegression:
    """
    Local linear regression using weighted least squares.

    At each query point, fits a linear model to nearby points using
    kernel weights, then uses the intercept as the prediction.
    """

    def __init__(self, bandwidth: float = None, k: int = None):
        """
        Parameters
        ----------
        bandwidth : fixed bandwidth h for kernel weights
        k : if set, use k-NN adaptive bandwidth (overrides bandwidth)
        """
        if bandwidth is None and k is None:
            raise ValueError("Must specify either bandwidth or k")
        self.bandwidth = bandwidth
        self.k = k
        self.X_train = None
        self.y_train = None

    def fit(self, X: np.ndarray, y: np.ndarray):
        """Store training data."""
        self.X_train = np.array(X)
        self.y_train = np.array(y)
        return self

    def _get_weights(self, x_query: np.ndarray) -> Tuple[np.ndarray, float]:
        """Compute kernel weights for a query point."""
        distances = np.linalg.norm(self.X_train - x_query, axis=1)

        # Determine bandwidth
        if self.k is not None:
            k_idx = np.argsort(distances)[:self.k]
            h = distances[k_idx[-1]] + 1e-10
        else:
            h = self.bandwidth

        # Tricube kernel (common in LOESS)
        u = distances / h
        weights = np.where(u <= 1, (1 - u**3)**3, 0.0)
        return weights, h

    def _predict_one(self, x_query: np.ndarray) -> Tuple[float, dict]:
        """Predict for a single query point using local linear model."""
        n, d = self.X_train.shape
        weights, h = self._get_weights(x_query)

        # Filter to points with non-zero weight for efficiency
        active = weights > 1e-10
        if active.sum() < d + 1:
            # Not enough points for linear fit, fall back to weighted average
            if active.sum() == 0:
                return self.y_train.mean(), {'method': 'global_mean'}
            else:
                pred = np.sum(weights[active] * self.y_train[active]) / weights[active].sum()
                return pred, {'method': 'weighted_average'}

        X_active = self.X_train[active]
        y_active = self.y_train[active]
        w_active = weights[active]

        # Build design matrix: [1, (x_i - x_query)]
        X_centered = X_active - x_query
        design = np.column_stack([np.ones(len(X_active)), X_centered])

        # Weighted least squares: (X^T W X)^{-1} X^T W y
        W = np.diag(w_active)
        XtWX = design.T @ W @ design
        XtWy = design.T @ W @ y_active

        # Regularize for numerical stability
        XtWX += 1e-8 * np.eye(d + 1)

        try:
            beta = np.linalg.solve(XtWX, XtWy)
            prediction = beta[0]  # Intercept is our prediction
            metadata = {
                'method': 'local_linear',
                'bandwidth': h,
                'n_active': active.sum(),
                'slope': beta[1:],  # The estimated local slopes
            }
        except np.linalg.LinAlgError:
            # Singular matrix - fall back to weighted average
            prediction = np.sum(w_active * y_active) / w_active.sum()
            metadata = {'method': 'fallback_weighted_average'}

        return prediction, metadata

    def predict(self, X: np.ndarray, return_slopes: bool = False):
        """Predict for all query points."""
        X = np.atleast_2d(X)
        predictions = []
        all_meta = []
        for x in X:
            pred, meta = self._predict_one(x)
            predictions.append(pred)
            all_meta.append(meta)
        if return_slopes:
            return np.array(predictions), all_meta
        return np.array(predictions)


# Demonstration: Compare local constant vs local linear
np.random.seed(42)

# Generate data with clear linear trend
X_train = np.random.uniform(0, 10, (100, 1))
true_func = lambda x: 2 * x + np.sin(x)
y_train = true_func(X_train.flatten()) + 0.5 * np.random.randn(100)

# Query at boundary (where local constant is biased)
X_query = np.array([[0.5], [9.5], [5.0]])

print("Local Constant vs Local Linear Regression:")
print("=" * 60)

# Local constant (Nadaraya-Watson)
def local_constant(X_train, y_train, x_query, k):
    distances = np.linalg.norm(X_train - x_query, axis=1)
    k_idx = np.argsort(distances)[:k]
    h = distances[k_idx[-1]] + 1e-10
    u = distances[k_idx] / h
    weights = (1 - u**3)**3  # Tricube
    return np.sum(weights * y_train[k_idx]) / weights.sum()

# Local linear
model_linear = LocalLinearRegression(k=20)
model_linear.fit(X_train, y_train)

for x in X_query:
    truth = true_func(x[0])
    pred_const = local_constant(X_train, y_train, x, k=20)
    pred_linear, meta = model_linear.predict(x.reshape(1, -1), return_slopes=True)
    print(f"Query x = {x[0]:.1f}:")
    print(f"  True value:     {truth:.4f}")
    print(f"  Local constant: {pred_const:.4f} (error: {abs(pred_const - truth):.4f})")
    print(f"  Local linear:   {pred_linear[0]:.4f} (error: {abs(pred_linear[0] - truth):.4f})")
    if 'slope' in meta[0]:
        print(f"  Estimated slope: {meta[0]['slope'][0]:.4f} (true: ~2.0)")
```

Local linear regression automatically corrects for first-order bias. Its variance is of the same asymptotic order as the local constant estimator's, but its O(h²) bias comes from second-order terms only (the first-order terms cancel). At boundaries this is particularly dramatic: local linear doesn't suffer the boundary bias that plagues local constant.
LOESS (Locally Estimated Scatterplot Smoothing) and LOWESS (Locally Weighted Scatterplot Smoothing) are the most widely used local polynomial methods in practice, developed by Cleveland (1979) and Cleveland & Devlin (1988).
Key Features of LOESS:
Tricube kernel: Uses the tricube weight function: $$w(u) = (1 - |u|^3)^3 \text{ for } |u| \leq 1$$ This is very smooth at boundaries (zero derivative at $u = \pm 1$).
Adaptive bandwidth: Uses fraction of data ($\alpha$) rather than fixed k or h: $$k = \lceil \alpha \cdot n \rceil$$ Typical values: $\alpha \in [0.2, 0.8]$
Optional polynomial degree: Usually degree 1 (linear) or 2 (quadratic)
Robustness iteration: Optionally re-weights to downweight outliers
The LOWESS Algorithm:
for each query point x:
1. Find k = ⌈α·n⌉ nearest neighbors
2. Compute bandwidth h = distance to k-th neighbor
3. Compute tricube weights w_i = (1 - (d_i/h)³)³
4. Fit weighted polynomial (degree 1 or 2)
5. Predict using fitted polynomial at x
```python
import numpy as np


class LOESS:
    """
    LOESS (Locally Estimated Scatterplot Smoothing).

    Implements Cleveland's classic algorithm with:
    - Tricube weighting
    - Fraction-based neighborhood (span parameter)
    - Optional robustness iterations
    - Polynomial degree 1 or 2
    """

    def __init__(self, span: float = 0.3, degree: int = 1, robust_iters: int = 0):
        """
        Parameters
        ----------
        span : fraction of data to use in each local fit (α)
        degree : polynomial degree (1 = linear, 2 = quadratic)
        robust_iters : number of robustness iterations (0 = none)
        """
        if not 0 < span <= 1:
            raise ValueError("span must be in (0, 1]")
        if degree not in [1, 2]:
            raise ValueError("degree must be 1 or 2")
        self.span = span
        self.degree = degree
        self.robust_iters = robust_iters
        self.X_train = None
        self.y_train = None
        self.robustness_weights = None

    def _tricube(self, u: np.ndarray) -> np.ndarray:
        """Tricube weight function."""
        return np.where(u < 1, (1 - u**3)**3, 0.0)

    def _bisquare(self, u: np.ndarray) -> np.ndarray:
        """Bisquare weight for robustness."""
        return np.where(u < 1, (1 - u**2)**2, 0.0)

    def fit(self, X: np.ndarray, y: np.ndarray):
        """Fit LOESS model (store data and optionally compute robustness weights)."""
        self.X_train = np.array(X).reshape(-1, 1) if X.ndim == 1 else np.array(X)
        self.y_train = np.array(y)
        n = len(self.y_train)
        self.robustness_weights = np.ones(n)

        if self.robust_iters > 0:
            # Perform robustness iterations
            for iteration in range(self.robust_iters):
                # Get fitted values at training points
                fitted = self._predict_many(self.X_train, self.robustness_weights)
                residuals = self.y_train - fitted

                # Median absolute deviation
                mad = np.median(np.abs(residuals))
                if mad < 1e-10:
                    break

                # Normalized residuals
                u = np.abs(residuals) / (6 * mad)

                # Update robustness weights using bisquare
                self.robustness_weights = self._bisquare(u)

        return self

    def _predict_one(self, x_query: np.ndarray, robustness_weights: np.ndarray) -> float:
        """Predict for a single point."""
        n, d = self.X_train.shape
        k = max(int(np.ceil(self.span * n)), d + 2)  # At least enough for fit

        # Distances
        distances = np.linalg.norm(self.X_train - x_query, axis=1)

        # Find k nearest
        k_idx = np.argsort(distances)[:k]
        h = distances[k_idx[-1]] + 1e-10

        # Tricube weights
        u = distances[k_idx] / h
        weights = self._tricube(u) * robustness_weights[k_idx]

        # Build design matrix
        X_local = self.X_train[k_idx]
        y_local = self.y_train[k_idx]
        X_centered = X_local - x_query

        if self.degree == 1:
            design = np.column_stack([np.ones(k), X_centered])
        else:  # degree == 2
            # Add quadratic terms
            if d == 1:
                design = np.column_stack([np.ones(k), X_centered, X_centered**2])
            else:
                # For multivariate: include all quadratic terms
                quad_terms = []
                for i in range(d):
                    for j in range(i, d):
                        quad_terms.append(X_centered[:, i] * X_centered[:, j])
                design = np.column_stack([np.ones(k), X_centered,
                                          np.column_stack(quad_terms)])

        # Weighted least squares
        W = np.diag(weights)
        XtWX = design.T @ W @ design + 1e-8 * np.eye(design.shape[1])
        XtWy = design.T @ W @ y_local

        try:
            beta = np.linalg.solve(XtWX, XtWy)
            return beta[0]  # Intercept at query point
        except np.linalg.LinAlgError:
            return np.sum(weights * y_local) / (weights.sum() + 1e-10)

    def _predict_many(self, X: np.ndarray, robustness_weights: np.ndarray) -> np.ndarray:
        """Predict for multiple points."""
        X = np.atleast_2d(X)
        return np.array([self._predict_one(x, robustness_weights) for x in X])

    def predict(self, X: np.ndarray) -> np.ndarray:
        """Predict for query points using fitted model."""
        X = np.atleast_2d(X)
        return self._predict_many(X, self.robustness_weights)


# Demonstration: LOESS with different spans and degrees
np.random.seed(42)

# Generate noisy data with nonlinear pattern
X_train = np.linspace(0, 2*np.pi, 50).reshape(-1, 1)
y_train = np.sin(X_train.flatten()) + 0.3 * np.random.randn(50)

# Add outliers
y_train[10] += 3  # Positive outlier
y_train[40] -= 3  # Negative outlier

X_query = np.linspace(0, 2*np.pi, 100).reshape(-1, 1)
y_true = np.sin(X_query.flatten())

print("LOESS Comparison:")
print("=" * 60)

configs = [
    {'span': 0.2, 'degree': 1, 'robust_iters': 0},
    {'span': 0.4, 'degree': 1, 'robust_iters': 0},
    {'span': 0.4, 'degree': 2, 'robust_iters': 0},
    {'span': 0.4, 'degree': 1, 'robust_iters': 3},
]

for cfg in configs:
    model = LOESS(**cfg)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_query)
    mse = np.mean((y_pred - y_true)**2)
    print(f"span={cfg['span']}, degree={cfg['degree']}, robust_iters={cfg['robust_iters']}")
    print(f"  MSE: {mse:.4f}")
```

The span parameter controls the bias-variance tradeoff just like k or h. Use smaller spans (0.1-0.3) for capturing fine detail, larger spans (0.5-0.8) for overall trends. Cross-validation can select the optimal span. For data exploration, try multiple spans to understand the signal at different scales.
Local polynomial regression generalizes to degree $p$ polynomials, fitting:
$$\hat{f}(\mathbf{z}) = \sum_{|\alpha| \leq p} \beta_\alpha (\mathbf{z} - \mathbf{x})^\alpha$$
at each query point $\mathbf{x}$ using weighted least squares.
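In 1D, a degree-$p$ local fit can be sketched with NumPy's weighted `polyfit`. Note that `np.polyfit` minimizes the sum of squared *weighted* residuals $\sum_i (w_i r_i)^2$, so we pass the square root of the kernel weights; the function name and fallback are illustrative:

```python
import numpy as np

def local_poly_predict(x_train, y_train, x_query, h, degree):
    """Degree-p local polynomial prediction at a single 1-D query point."""
    u = np.abs(x_train - x_query) / h
    weights = np.where(u < 1, (1 - u**3)**3, 0.0)  # tricube kernel
    mask = weights > 0
    if mask.sum() < degree + 1:
        # Not enough points in the window: fall back to a weighted average
        return np.average(y_train, weights=weights + 1e-12)
    # polyfit minimizes sum((w_i * residual_i)^2), so pass sqrt of kernel weights
    coeffs = np.polyfit(x_train[mask] - x_query, y_train[mask],
                        deg=degree, w=np.sqrt(weights[mask]))
    return np.polyval(coeffs, 0.0)  # intercept = prediction at x_query

x = np.linspace(0, 10, 200)
y = 2 * x  # noiseless linear function: degree-1 fit recovers it exactly
pred = local_poly_predict(x, y, x_query=5.0, h=2.0, degree=1)  # 10.0 up to float error
```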
Bias-Variance Tradeoffs Across Degrees:
| Degree $p$ | Bias Order | Variance | Boundary Bias |
|---|---|---|---|
| 0 (constant) | $O(h^2)$ | Lowest | Severe |
| 1 (linear) | $O(h^2)$ | Low | Eliminated |
| 2 (quadratic) | $O(h^4)$ | Moderate | Eliminated |
| 3 (cubic) | $O(h^4)$ | Higher | Eliminated |
Key Theoretical Results:
Odd vs Even Degrees: Local polynomial estimators of degree $p$ have bias of order $O(h^{p+1})$ when $p$ is odd, and $O(h^{p+2})$ when $p$ is even (an extra cancellation occurs in the interior). This gives odd degrees an advantage: degree $p+1$ (odd) attains the same bias order as degree $p$ (even) without increasing the order of the variance, and does so at boundaries as well.
Design Adaptive: Local polynomial estimators automatically adapt to design (clustering of X values) without requiring explicit handling.
Boundary Correction: Polynomials of degree ≥ 1 automatically correct boundary bias without needing special boundary kernels.
Minimax Optimality: Local linear estimators are asymptotically minimax optimal in a certain function class.
Beyond degree 2, the variance increase typically outweighs bias reduction benefits. Local cubic or higher fits become numerically unstable and overfit the local data. In practice, degree 1 (linear) is the sweet spot for most applications—degree 2 is used only when you have evidence of significant local curvature and sufficient data.
Equivalent Kernels:
A remarkable result: every local polynomial estimator can be expressed as a linear smoother with an effective kernel:
$$\hat{f}(\mathbf{x}) = \sum_{i=1}^{n} K^*_i(\mathbf{x}) \cdot y_i$$
where $K^*_i(\mathbf{x})$ is the equivalent kernel weight (different from the original kernel $K$).
For local linear regression: $$K^*_i(\mathbf{x}) = \frac{K_h(\mathbf{x} - \mathbf{x}_i)}{\sum_j K_h(\mathbf{x} - \mathbf{x}_j)} \cdot \left(1 + \frac{(\mathbf{x} - \bar{\mathbf{x}}_w)^T (\mathbf{x}_i - \bar{\mathbf{x}}_w)}{\text{var}_w(\mathbf{X})}\right)$$
where $\bar{\mathbf{x}}_w$ and $\text{var}_w$ are weighted mean and variance.
Interpretation: Local linear regression implicitly constructs an asymmetric equivalent kernel that tilts toward the direction of the query point, correcting for bias.
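This identity is easy to check numerically: the first row of the local linear smoother matrix, $\mathbf{e}_1^T (\mathbf{X}^T \mathbf{W} \mathbf{X})^{-1} \mathbf{X}^T \mathbf{W}$, gives the equivalent kernel weights, which sum to one and reproduce linear functions exactly. A 1-D sketch, assuming a Gaussian kernel (names illustrative):

```python
import numpy as np

def equivalent_kernel_weights(x_train, x_query, h):
    """Effective weights K*_i such that the local linear fit is sum_i K*_i y_i."""
    w = np.exp(-0.5 * ((x_train - x_query) / h) ** 2)  # Gaussian kernel weights
    design = np.column_stack([np.ones_like(x_train), x_train - x_query])
    XtW = design.T * w  # broadcast: scale each column of X^T by w_j
    # First row of the smoother matrix: e_1^T (X^T W X)^{-1} X^T W
    return np.linalg.solve(XtW @ design, XtW)[0]

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 10.0, 50)
k_star = equivalent_kernel_weights(x, x_query=2.0, h=1.0)

# K* reproduces constants (weights sum to 1) and exact linear trends:
pred = k_star @ (3 * x + 1)  # equals 3*2 + 1 = 7 up to floating point
```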
Local polynomial regression is computationally more demanding than simple weighted averages. Let's analyze the costs.
Per-Query Computation:

Let $q$ be the number of local polynomial coefficients ($q = d + 1$ for degree 1; $q = O(d^2)$ for degree 2). Forming the weighted normal equations costs $O(k q^2)$ and solving them costs $O(q^3)$:

- Degree 1 in 1D: $O(k)$ per query
- Degree 1 in $d$ dimensions: $O(k d^2 + d^3)$ per query
- Degree 2 in $d$ dimensions: $O(k d^4 + d^6)$ per query

Total for $m$ queries: $O(m \cdot (k q^2 + q^3))$, assuming the $k$ neighbors are pre-found.
Comparison to Simple Weighted KNN:
- Simple weighted KNN: $O(m \cdot k)$
- Local linear: $O(m \cdot k)$ in low dimensions (the linear system is small)
- Local quadratic: $O(m \cdot (k d^4 + d^6))$
The overhead is minimal for low-dimensional data but grows rapidly with dimension.
```python
import numpy as np
from scipy.spatial import KDTree
import time


class EfficientLocalLinear:
    """
    Efficient local linear regression using KD-tree for neighbor lookup.
    """

    def __init__(self, k: int = 20):
        self.k = k
        self.tree = None
        self.X_train = None
        self.y_train = None

    def fit(self, X: np.ndarray, y: np.ndarray):
        """Build KD-tree index."""
        self.X_train = np.array(X)
        self.y_train = np.array(y)
        self.tree = KDTree(self.X_train)
        return self

    def predict(self, X: np.ndarray) -> np.ndarray:
        """Predict using KD-tree accelerated local linear regression."""
        X = np.atleast_2d(X)
        predictions = np.zeros(len(X))

        # Batch query for all nearest neighbors at once
        distances, indices = self.tree.query(X, k=self.k)

        for i, (x_query, dists, idx) in enumerate(zip(X, distances, indices)):
            # Tricube weights
            h = dists[-1] + 1e-10
            u = dists / h
            weights = np.where(u < 1, (1 - u**3)**3, 0.0)

            # Get local data
            X_local = self.X_train[idx]
            y_local = self.y_train[idx]

            # Build design matrix
            X_centered = X_local - x_query
            n_local, d = X_centered.shape
            design = np.column_stack([np.ones(n_local), X_centered])

            # Weighted least squares (optimized for small system)
            W_diag = weights
            XtW = design.T * W_diag  # Broadcast multiplication
            XtWX = XtW @ design
            XtWy = XtW @ y_local

            # Solve (with regularization)
            XtWX += 1e-8 * np.eye(d + 1)
            try:
                beta = np.linalg.solve(XtWX, XtWy)
                predictions[i] = beta[0]
            except np.linalg.LinAlgError:
                predictions[i] = np.sum(weights * y_local) / (weights.sum() + 1e-10)

        return predictions


# Benchmark: efficient vs naive implementation
np.random.seed(42)

for n_train in [1000, 5000, 10000]:
    for d in [2, 5]:
        X_train = np.random.randn(n_train, d)
        y_train = X_train.sum(axis=1) + 0.5 * np.random.randn(n_train)
        X_test = np.random.randn(100, d)

        model = EfficientLocalLinear(k=30)
        model.fit(X_train, y_train)

        start = time.time()
        predictions = model.predict(X_test)
        elapsed = time.time() - start

        print(f"n={n_train}, d={d}: {elapsed*1000:.2f} ms for 100 predictions")
```

Local polynomial regression connects to numerous other methods in machine learning and statistics.
Connection to Splines:
Smoothing splines can be viewed as a global version of local polynomial regression: while local methods fit polynomials in sliding windows, splines fit piecewise polynomials with continuity constraints.
Connection to Gaussian Processes:
Gaussian Process (GP) regression with a polynomial mean function and local covariance structure resembles local polynomial regression.
Connection to Random Forests:
Random forest predictions are local averages over tree partitions.
| Method | Local Model | Neighborhood | Strengths |
|---|---|---|---|
| Nadaraya-Watson | Constant | Kernel bandwidth | Simple, low variance |
| Local Linear | Linear | Kernel bandwidth | Boundary correction, low bias |
| LOESS | Linear/Quadratic | Span-based | Robust, well-studied |
| Smoothing Splines | Cubic pieces | Global with local influence | Smooth, automatic |
| Gaussian Process | Infinite-dimensional | Covariance-based | Uncertainty quantification |
| Random Forest | Constant in leaves | Tree partitions | Feature selection, scalable |
Use local polynomial regression when: (1) you need interpretable local structure (slopes, curvature), (2) boundary behavior matters, (3) the data is small-to-medium sized (thousands to hundreds of thousands), (4) you want direct control over smoothness through span/bandwidth.
Local polynomial ideas extend to classification through local logistic regression or locally weighted learning.
Local Logistic Regression:
For binary classification, fit a logistic model locally:
$$\hat{p}_i = \hat{P}(Y = 1 \mid \mathbf{x}_i) = \sigma\left(\beta_0 + \boldsymbol{\beta}_1^T (\mathbf{x}_i - \mathbf{x})\right)$$
where $\sigma$ is the sigmoid function.
Minimize weighted negative log-likelihood: $$\sum_i w_i(\mathbf{x}) \cdot \left[-y_i \log \hat{p}_i - (1-y_i) \log(1-\hat{p}_i)\right]$$
Predict: $\hat{P}(Y=1 | \mathbf{x}) = \sigma(\hat{\beta}_0)$
Benefits of Local Logistic:
Implementation Complexity:
Local logistic regression requires iterative optimization (Newton-Raphson or gradient descent) at each query point, making it significantly more expensive than local linear regression. In practice, it is often approximated by local linear regression on log-transformed probabilities.
```python
import numpy as np
from scipy.optimize import minimize


class LocalLogisticClassifier:
    """
    Local logistic regression classifier.

    Fits a weighted logistic regression at each query point.
    """

    def __init__(self, k: int = 30):
        self.k = k
        self.X_train = None
        self.y_train = None

    def fit(self, X: np.ndarray, y: np.ndarray):
        """Store training data."""
        self.X_train = np.array(X)
        self.y_train = np.array(y).astype(float)
        return self

    def _sigmoid(self, z):
        return 1 / (1 + np.exp(-np.clip(z, -500, 500)))

    def _predict_one(self, x_query: np.ndarray) -> float:
        """Predict probability for one query point."""
        n, d = self.X_train.shape

        # Find k neighbors
        distances = np.linalg.norm(self.X_train - x_query, axis=1)
        k_idx = np.argsort(distances)[:self.k]
        h = distances[k_idx[-1]] + 1e-10

        # Tricube weights
        u = distances[k_idx] / h
        weights = np.where(u < 1, (1 - u**3)**3, 0.0)

        X_local = self.X_train[k_idx]
        y_local = self.y_train[k_idx]

        # Centered design matrix
        X_centered = X_local - x_query

        # Weighted negative log-likelihood
        def neg_log_likelihood(beta):
            linear = beta[0] + X_centered @ beta[1:]
            p = self._sigmoid(linear)
            p = np.clip(p, 1e-10, 1 - 1e-10)
            ll = weights @ (y_local * np.log(p) + (1 - y_local) * np.log(1 - p))
            # Add small L2 regularization for stability
            reg = 0.01 * np.sum(beta[1:]**2)
            return -ll + reg

        # Optimize
        beta_init = np.zeros(d + 1)
        result = minimize(neg_log_likelihood, beta_init, method='BFGS')

        # Prediction at query point (where x - x_query = 0)
        prob = self._sigmoid(result.x[0])
        return prob

    def predict_proba(self, X: np.ndarray) -> np.ndarray:
        """Predict probabilities."""
        X = np.atleast_2d(X)
        probs = np.array([self._predict_one(x) for x in X])
        return np.column_stack([1 - probs, probs])

    def predict(self, X: np.ndarray) -> np.ndarray:
        """Predict class labels."""
        proba = self.predict_proba(X)
        return (proba[:, 1] > 0.5).astype(int)


# Demonstration: XOR-like pattern
np.random.seed(42)

# Generate XOR pattern
n = 200
X_train = np.random.uniform(-2, 2, (n, 2))
y_train = ((X_train[:, 0] * X_train[:, 1]) > 0).astype(int)

# Add noise
flip_idx = np.random.choice(n, size=20, replace=False)
y_train[flip_idx] = 1 - y_train[flip_idx]

# Test points
X_test = np.array([
    [0.5, 0.5],    # Class 1 region
    [-0.5, -0.5],  # Class 1 region
    [0.5, -0.5],   # Class 0 region
    [0, 0],        # Boundary
])

model = LocalLogisticClassifier(k=30)
model.fit(X_train, y_train)
probas = model.predict_proba(X_test)

print("Local Logistic Classification (XOR pattern):")
print("=" * 50)
for x, p in zip(X_test, probas):
    print(f"x = {x}: P(class=0) = {p[0]:.3f}, P(class=1) = {p[1]:.3f}")
```

You now understand how local models elevate KNN from simple averaging to principled local modeling with bias-variance control. The final page explores Regression with KNN—examining how all these techniques come together for regression tasks and comparing them to other regression approaches.