Logistic regression appears nonlinear—its predictions follow an S-shaped curve, and the relationship between features and predicted probabilities is decidedly non-linear. Yet beneath this curved surface lies a perfectly linear model.
The key insight is that logistic regression is linear on the log-odds scale. While probabilities range from 0 to 1 and odds range from 0 to infinity, log-odds span all real numbers—exactly the domain where linear models are natural.
This log-odds perspective transforms logistic regression from a 'magic squashing function' into an intuitive, interpretable model. Coefficients gain clear meaning: each unit increase in a feature multiplies the odds by a specific factor. Decision boundaries become linear hyperplanes. The entire framework suddenly makes sense.
By the end of this page, you will understand: (1) the mathematical relationships among probabilities, odds, and log-odds, (2) why log-odds linearize the logistic model, (3) how to interpret logistic regression coefficients as log-odds changes and odds multipliers, (4) the geometric meaning of odds ratios, and (5) why this interpretation matters for practical applications.
Before diving into log-odds, we need a solid understanding of odds themselves—a concept that predates probability theory and remains central to gambling, epidemiology, and statistical inference.
Definition of Odds
Given a probability $p$ of an event occurring (where $0 < p < 1$), the odds of that event are defined as:
$$\text{odds} = \frac{p}{1-p}$$
This is the ratio of the probability of success to the probability of failure. If an event has probability 0.75, its odds are:
$$\text{odds} = \frac{0.75}{0.25} = 3$$
We often express this as "3 to 1 odds" or "odds of 3:1 in favor."
Intuitive Interpretation
Odds express how many times more likely success is than failure, and they take values in $(0, \infty)$:
| Probability (p) | Odds (p/(1-p)) | Odds Notation | Interpretation |
|---|---|---|---|
| 0.01 | 0.0101 | 1:99 | Highly unlikely (failure 99× more likely) |
| 0.10 | 0.111 | 1:9 | Unlikely (failure 9× more likely) |
| 0.25 | 0.333 | 1:3 | Unlikely (failure 3× more likely) |
| 0.50 | 1.000 | 1:1 | Equally likely (no preference) |
| 0.75 | 3.000 | 3:1 | Likely (success 3× more likely) |
| 0.90 | 9.000 | 9:1 | Very likely (success 9× more likely) |
| 0.99 | 99.00 | 99:1 | Near certain (success 99× more likely) |
Converting Back from Odds to Probability
Given odds $\omega$, we can recover the probability:
$$p = \frac{\omega}{1 + \omega}$$
This is exactly the sigmoid function applied to $\log(\omega)$! The relationships form a complete system:
$$p \to \frac{p}{1-p} = \omega \to \log(\omega) = z \to \sigma(z) = p$$
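For example, starting from $p = 0.75$ and going once around the loop:

$$0.75 \;\to\; \frac{0.75}{0.25} = 3 \;\to\; \log 3 \approx 1.099 \;\to\; \sigma(1.099) = \frac{1}{1 + e^{-1.099}} \approx 0.75$$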
Why Odds Matter
Odds have several advantages over probabilities in certain contexts:
Multiplicative effects are natural: Doubling the odds has a clear meaning; doubling a probability often doesn't make sense (what if $p = 0.75$?).
Symmetric treatment of outcomes: Odds of $k$ and odds of $1/k$ are equally 'extreme' in opposite directions.
Range matches linear predictors better: Probabilities are bounded; odds can take any positive value, making them easier to model linearly after a log transform.
The odds formulation comes from gambling, where payouts are calculated based on odds rather than probabilities. A fair bet at 3:1 odds against means you risk $1 to potentially win $3, which is appropriate when the probability of winning is 0.25 (1 in 4 outcomes favorable). Understanding this historical context helps demystify the seemingly odd choice to work with odds.
While odds improve upon probabilities for modeling purposes, their range $(0, \infty)$ is still asymmetric around the neutral value of 1 (even odds). Taking the logarithm of odds—the log-odds or logit—creates a symmetric scale spanning all real numbers.
Definition of Log-Odds (Logit)
$$\text{logit}(p) = \log\left(\frac{p}{1-p}\right) = \log(p) - \log(1-p)$$
The logit function is the inverse of the sigmoid:
$$\sigma(z) = p \iff z = \text{logit}(p)$$
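To see why, solve $p = \sigma(z)$ for $z$:

$$p = \frac{1}{1 + e^{-z}} \;\;\Longrightarrow\;\; e^{-z} = \frac{1-p}{p} \;\;\Longrightarrow\;\; z = \log\left(\frac{p}{1-p}\right) = \text{logit}(p)$$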
Properties of the Log-Odds Scale
Domain and Range: logit maps $(0, 1) \to (-\infty, +\infty)$
Symmetry around zero: $\text{logit}(1-p) = -\text{logit}(p)$
Zero at neutrality: $\text{logit}(0.5) = \log(1) = 0$
Additive structure: Multiplicative changes in odds become additive changes in log-odds
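The symmetry property follows directly from the rules of logarithms:

$$\text{logit}(1-p) = \log\!\left(\frac{1-p}{p}\right) = -\log\!\left(\frac{p}{1-p}\right) = -\,\text{logit}(p)$$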
```python
import numpy as np

def odds(p):
    """Convert probability to odds."""
    return p / (1 - p)

def logit(p):
    """Convert probability to log-odds (logit)."""
    return np.log(p / (1 - p))

def sigmoid(z):
    """Convert log-odds back to probability."""
    return 1 / (1 + np.exp(-z))

# Explore the transformations
probabilities = np.array([0.01, 0.05, 0.10, 0.25, 0.50, 0.75, 0.90, 0.95, 0.99])

print("Complete Probability ↔ Odds ↔ Log-Odds Table")
print("=" * 70)
print(f"{'Probability':>12} | {'Odds':>12} | {'Log-Odds':>12} | {'Back to Prob':>14}")
print("-" * 70)

for p in probabilities:
    o = odds(p)
    lo = logit(p)
    p_back = sigmoid(lo)
    print(f"{p:>12.4f} | {o:>12.4f} | {lo:>12.4f} | {p_back:>14.4f}")

# Demonstrate additive property of log-odds
print("\n" + "=" * 70)
print("Log-Odds Additive Property: Doubling the Odds")
print("=" * 70)

p1 = 0.25
odds1 = odds(p1)
odds2 = odds1 * 2  # double the odds
p2 = odds2 / (1 + odds2)

lo1 = logit(p1)
lo2 = logit(p2)

print(f"Initial: p = {p1:.4f}, odds = {odds1:.4f}, log-odds = {lo1:.4f}")
print(f"After doubling odds: p = {p2:.4f}, odds = {odds2:.4f}, log-odds = {lo2:.4f}")
print(f"Change in log-odds: {lo2 - lo1:.4f}")
print(f"log(2) = {np.log(2):.4f}")
print("→ Doubling odds adds log(2) ≈ 0.693 to log-odds")
```

The log-odds scale is where logistic regression 'lives.' When we say a coefficient β = 0.7, we mean that a one-unit increase in the corresponding feature adds 0.7 to the log-odds, which is equivalent to multiplying the odds by e^0.7 ≈ 2. This multiplicative interpretation of odds is the key to understanding logistic regression coefficients.
We can now state the logistic regression model equation in its most illuminating form.
The Core Equation: Linear in Log-Odds
$$\log\left(\frac{P(Y=1|X)}{1-P(Y=1|X)}\right) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p = \mathbf{w}^T \mathbf{x} + b$$
Or equivalently:
$$\text{logit}(P(Y=1|X)) = \mathbf{w}^T \mathbf{x} + b$$
This reveals the essential structure: logistic regression assumes log-odds are a linear function of the features. The nonlinearity comes entirely from inverting this relationship to obtain probabilities.
Three Equivalent Forms
The same model can be written in three equivalent ways, each revealing different aspects:
Form 1: Log-Odds (Logit) Form $$\log\left(\frac{p}{1-p}\right) = \mathbf{w}^T \mathbf{x} + b$$
Shows the linear structure directly. Coefficients are log-odds changes.
Form 2: Odds Form $$\frac{p}{1-p} = e^{\mathbf{w}^T \mathbf{x} + b} = e^{b} \cdot e^{w_1 x_1} \cdot e^{w_2 x_2} \cdots$$
Shows multiplicative structure. Coefficients are exponentiated to get odds multipliers.
Form 3: Probability Form $$p = P(Y=1|X) = \frac{e^{\mathbf{w}^T \mathbf{x} + b}}{1 + e^{\mathbf{w}^T \mathbf{x} + b}} = \sigma(\mathbf{w}^T \mathbf{x} + b)$$
The form we use for predictions. Shows the sigmoid transformation.
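To confirm that the three forms describe the same model, here is a small numeric check using hypothetical weights, bias, and feature values (any numbers would do):

```python
import numpy as np

# Hypothetical weights, bias, and feature vector (for illustration only)
w = np.array([0.7, -1.2])
b = 0.3
x = np.array([1.5, 0.4])

z = w @ x + b  # linear predictor

# Form 1: log-odds are linear in x
log_odds = z

# Form 2: odds are a product of per-feature multipliers
odds = np.exp(b) * np.prod(np.exp(w * x))

# Form 3: probability via the sigmoid
p = np.exp(z) / (1 + np.exp(z))

# Consistency checks: all three forms agree
print(np.isclose(np.log(p / (1 - p)), log_odds))  # True
print(np.isclose(p / (1 - p), odds))              # True
```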
Why This Matters
The log-odds formulation explains several key properties of logistic regression:
Bounded predictions: No matter how extreme the linear predictor, probabilities stay in (0, 1).
Interpretable coefficients: Each coefficient has a clear meaning in terms of odds.
Proper uncertainty: Near the decision boundary ($z \approx 0$), small changes in features produce noticeable probability changes. Far from the boundary (saturated regions), even large feature changes barely affect the probability—encoding appropriate confidence.
Multiplicative odds model: Effects multiply rather than add. If drug A doubles survival odds and drug B triples them (independently), together they multiply to 6× the odds.
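On the log-odds scale this is just addition before exponentiating:

$$\log(2) + \log(3) = 0.693 + 1.099 = 1.792, \qquad e^{1.792} = e^{\log 6} = 6$$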
This structure—linear predictor transformed by a link function—is the essence of Generalized Linear Models (GLMs). The logit link maps from probabilities to a scale where linear models make sense. Other link functions (probit, log, identity) serve similar purposes for different response types.
The log-odds formulation provides the foundation for interpreting logistic regression coefficients. Each coefficient $\beta_j$ tells us how the log-odds change per unit increase in $X_j$, holding other features constant.
The Coefficient as Log-Odds Change
Consider a single feature model: $\text{logit}(p) = \beta_0 + \beta_1 X$
When $X$ increases by 1 unit:
$$\text{logit}(p_{new}) = \beta_0 + \beta_1(X+1) = \text{logit}(p_{old}) + \beta_1$$
The log-odds increase by exactly $\beta_1$.
The Odds Ratio Interpretation
Exponentiating both sides:
$$\text{odds}_{\text{new}} = e^{\beta_1} \times \text{odds}_{\text{old}}$$
The quantity $e^{\beta_1}$ is called the odds ratio (OR). A one-unit increase in $X$ multiplies the odds by $e^{\beta_1}$.
Examples:
```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Create synthetic data with known coefficients
np.random.seed(42)
n_samples = 10000

# Single feature example
X_single = np.random.randn(n_samples, 1)
true_beta = 1.0  # true coefficient
log_odds = -0.5 + true_beta * X_single.ravel()
probs = 1 / (1 + np.exp(-log_odds))
y = (np.random.rand(n_samples) < probs).astype(int)

# Fit model
model = LogisticRegression(fit_intercept=True, solver='lbfgs')
model.fit(X_single, y)

print("Coefficient Interpretation Demo")
print("=" * 60)
print(f"True coefficient: β₁ = {true_beta:.4f}")
print(f"Estimated coefficient: β̂₁ = {model.coef_[0][0]:.4f}")
print(f"Estimated intercept: β̂₀ = {model.intercept_[0]:.4f}")
print()

# Odds ratio interpretation
beta_hat = model.coef_[0][0]
odds_ratio = np.exp(beta_hat)
print(f"Odds Ratio: e^β̂₁ = e^{beta_hat:.4f} = {odds_ratio:.4f}")
print(f"Interpretation: A 1-unit increase in X multiplies odds by {odds_ratio:.4f}")
print()

# Demonstrate with concrete predictions
X_test = np.array([[0], [1], [2]])
probs_predicted = model.predict_proba(X_test)[:, 1]
odds_predicted = probs_predicted / (1 - probs_predicted)

print("Predictions at different X values:")
print("-" * 60)
print(f"{'X':>5} | {'P(Y=1)':>10} | {'Odds':>12} | {'Odds Ratio':>12}")
print("-" * 60)
for i in range(len(X_test)):
    x_val = X_test[i, 0]
    p = probs_predicted[i]
    o = odds_predicted[i]
    ratio = odds_predicted[i] / odds_predicted[0] if i > 0 else 1.0
    print(f"{x_val:>5} | {p:>10.4f} | {o:>12.4f} | {ratio:>12.4f}")

print(f"\nNote: Each unit increase multiplies odds by ~{odds_ratio:.2f}")
```

| Coefficient (β) | Odds Ratio (e^β) | Effect on Odds | Interpretation |
|---|---|---|---|
| -2.0 | 0.14 | ÷7.4 | Strongly decreases odds (86% reduction) |
| -1.0 | 0.37 | ÷2.7 | Decreases odds (63% reduction) |
| -0.5 | 0.61 | ÷1.6 | Moderately decreases odds (39% reduction) |
| 0.0 | 1.00 | ×1.0 | No effect on odds |
| 0.5 | 1.65 | ×1.6 | Moderately increases odds (65% increase) |
| 1.0 | 2.72 | ×2.7 | Increases odds (172% increase) |
| 2.0 | 7.39 | ×7.4 | Strongly increases odds (639% increase) |
The coefficient does NOT tell you how much the probability changes per unit of X. Probability changes depend on where you start. A coefficient of 0.5 means the odds multiply by 1.65, but the probability might increase by about 0.12 (starting at p = 0.5) or by only about 0.02 (starting at p = 0.95). Always interpret coefficients on the odds or log-odds scale.
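A quick numerical check of this caveat, using a hypothetical coefficient of 0.5 and the model's own logit and sigmoid:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def logit(p):
    return np.log(p / (1 - p))

beta1 = 0.5  # one unit of X adds 0.5 to the log-odds

for p_start in [0.50, 0.95]:
    # probability after a one-unit increase in X, starting from p_start
    p_new = sigmoid(logit(p_start) + beta1)
    print(f"start p = {p_start:.2f} -> new p = {p_new:.3f} (change = {p_new - p_start:+.3f})")
```

The same odds multiplier of 1.65 produces a change of about +0.12 from p = 0.50 but only about +0.02 from p = 0.95.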
While log-odds effects are constant (linear), probability effects are not. The marginal effect of a feature on probability depends on where we are in the probability space.
Mathematical Derivation
Starting from $p = \sigma(\mathbf{w}^T \mathbf{x} + b)$, the marginal effect of $X_j$ on $p$ is:
$$\frac{\partial p}{\partial X_j} = \frac{\partial \sigma(z)}{\partial z} \cdot \frac{\partial z}{\partial X_j} = \sigma(z)(1-\sigma(z)) \cdot w_j = p(1-p) \cdot w_j$$
This reveals the critical insight: the marginal effect of feature $j$ on probability is $w_j \cdot p(1-p)$.
The factor $p(1-p)$ is maximized at $p = 0.5$ (where it equals 0.25) and approaches zero as $p \to 0$ or $p \to 1$.
Implications
Near certainty, effects are small: If $p = 0.99$, then $p(1-p) = 0.0099$. Even a large coefficient $w_j = 2$ produces only a marginal effect of $\approx 0.02$.
At maximum uncertainty, effects are largest: If $p = 0.5$, then $p(1-p) = 0.25$. The same $w_j = 2$ produces a marginal effect of $0.5$.
Symmetry around 0.5: The marginal effect at $p = 0.1$ equals that at $p = 0.9$ (both have $p(1-p) = 0.09$).
```python
import numpy as np
import matplotlib.pyplot as plt

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Consider a model with coefficient w = 1
w = 1.0

# The marginal effect on probability depends on where we are
z_values = np.linspace(-5, 5, 100)
p_values = sigmoid(z_values)
marginal_effects = p_values * (1 - p_values) * w

# Visualization
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Left: Probability as function of linear predictor
ax1 = axes[0]
ax1.plot(z_values, p_values, 'b-', linewidth=2)
ax1.axhline(y=0.5, color='gray', linestyle='--', alpha=0.5)
ax1.axvline(x=0, color='gray', linestyle='--', alpha=0.5)

# Add tangent lines at different points to show varying slopes
for z_point in [-3, 0, 3]:
    p_point = sigmoid(z_point)
    slope = p_point * (1 - p_point) * w
    z_line = np.linspace(z_point - 1.5, z_point + 1.5, 50)
    p_line = p_point + slope * (z_line - z_point)
    ax1.plot(z_line, p_line, 'r--', alpha=0.7)
    ax1.scatter([z_point], [p_point], color='red', s=50, zorder=5)

ax1.set_xlabel('Linear Predictor (z = wx + b)')
ax1.set_ylabel('Probability P(Y=1)')
ax1.set_title('Probability Curve with Tangent Lines')
ax1.set_ylim(-0.1, 1.1)
ax1.grid(True, alpha=0.3)

# Right: Marginal effect as function of probability
ax2 = axes[1]
ax2.plot(p_values, marginal_effects, 'r-', linewidth=2)
ax2.fill_between(p_values, 0, marginal_effects, alpha=0.2, color='red')
ax2.axvline(x=0.5, color='gray', linestyle='--', alpha=0.5)
ax2.set_xlabel('Current Probability')
ax2.set_ylabel('Marginal Effect = dp/dz')
ax2.set_title(f'Marginal Effect on Probability (w={w})')
ax2.set_xlim(0, 1)
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('marginal_effects.png', dpi=150)
plt.show()

# Print specific values
print("Marginal Effects at Different Probability Levels")
print("-" * 50)
probs_example = [0.01, 0.1, 0.25, 0.5, 0.75, 0.9, 0.99]
for p in probs_example:
    me = p * (1 - p) * w
    print(f"At p = {p:.2f}: marginal effect = {me:.4f}")
```

In practice, researchers often report the Average Marginal Effect (AME)—the average of dp/dX across all observations in the dataset. This provides a single summary number while acknowledging that true effects vary. Modern software like statsmodels and R's margins package can compute AMEs automatically.
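A minimal sketch of the AME computed by hand, using the identity $\partial p / \partial X_j = w_j \cdot p(1-p)$ averaged over observations. The synthetic dataset here is hypothetical, purely for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical synthetic data (for illustration only)
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
z = 1.0 * X[:, 0] - 0.5 * X[:, 1]
y = (rng.random(500) < 1 / (1 + np.exp(-z))).astype(int)

model = LogisticRegression().fit(X, y)
p_hat = model.predict_proba(X)[:, 1]  # fitted probability for each observation
w = model.coef_[0]                    # estimated log-odds coefficients

# AME_j = average over observations of w_j * p_i * (1 - p_i)
ame = w * np.mean(p_hat * (1 - p_hat))
print("Average marginal effects:", np.round(ame, 4))
```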
Let's work through a complete example to solidify our understanding of log-odds interpretation.
Scenario: Predicting Heart Disease
A logistic regression model predicts heart disease probability from three features: Age (in years), Cholesterol level, and Smoker status (1 = smoker, 0 = non-smoker).
Fitted model (hypothetical): $$\text{logit}(p) = -6.0 + 0.05 \cdot \text{Age} + 0.01 \cdot \text{Cholesterol} + 0.8 \cdot \text{Smoker}$$
Interpreting Each Coefficient
Intercept ($\beta_0 = -6.0$): the log-odds of heart disease for a non-smoker with Age = 0 and Cholesterol = 0. This baseline lies far outside any realistic data range, so it anchors the model rather than describing an actual patient.
Age ($\beta_{Age} = 0.05$): each additional year adds 0.05 to the log-odds, multiplying the odds by $e^{0.05} \approx 1.05$ (about a 5% increase per year). Ten additional years multiply the odds by $e^{0.5} \approx 1.65$.
Cholesterol ($\beta_{Chol} = 0.01$): each additional unit of cholesterol adds 0.01 to the log-odds, multiplying the odds by $e^{0.01} \approx 1.01$. A 50-unit increase multiplies the odds by $e^{0.5} \approx 1.65$.
Smoker ($\beta_{Smoker} = 0.8$): being a smoker adds 0.8 to the log-odds, multiplying the odds by $e^{0.8} \approx 2.23$ relative to an otherwise identical non-smoker.
```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Model coefficients
beta_0 = -6.0
beta_age = 0.05
beta_chol = 0.01
beta_smoker = 0.8

def predict_prob(age, cholesterol, smoker):
    """Predict heart disease probability."""
    z = beta_0 + beta_age * age + beta_chol * cholesterol + beta_smoker * smoker
    return sigmoid(z)

def predict_odds(age, cholesterol, smoker):
    """Predict odds of heart disease."""
    p = predict_prob(age, cholesterol, smoker)
    return p / (1 - p)

# Calculate odds ratios
print("Odds Ratios for Each Feature")
print("=" * 50)
print(f"Age (per year): e^{beta_age:.2f} = {np.exp(beta_age):.4f}")
print(f"Cholesterol (per unit): e^{beta_chol:.2f} = {np.exp(beta_chol):.4f}")
print(f"Smoker (yes vs no): e^{beta_smoker:.2f} = {np.exp(beta_smoker):.4f}")

# Example predictions
print("\nExample Predictions")
print("=" * 50)

# Person 1: 50-year-old non-smoker with cholesterol 200
p1 = predict_prob(50, 200, 0)
o1 = predict_odds(50, 200, 0)
print(f"50yo, non-smoker, chol=200: P = {p1:.4f}, odds = {o1:.4f}")

# Person 2: Same but smoker
p2 = predict_prob(50, 200, 1)
o2 = predict_odds(50, 200, 1)
print(f"50yo, smoker, chol=200: P = {p2:.4f}, odds = {o2:.4f}")
print(f"Odds ratio (smoker/non): {o2/o1:.4f} ≈ e^0.8 = {np.exp(0.8):.4f}")

# Person 3: 10 years older
p3 = predict_prob(60, 200, 0)
o3 = predict_odds(60, 200, 0)
print(f"60yo, non-smoker, chol=200: P = {p3:.4f}, odds = {o3:.4f}")
print(f"Odds ratio (60yo/50yo): {o3/o1:.4f} ≈ e^0.5 = {np.exp(0.5):.4f}")

# Combine effects
print("\nCombined Effects")
print("=" * 50)
p4 = predict_prob(60, 200, 1)
o4 = predict_odds(60, 200, 1)
print(f"60yo, smoker, chol=200: P = {p4:.4f}, odds = {o4:.4f}")
print(f"Combined odds ratio: {o4/o1:.4f}")
print(f"Expected (1.65 × 2.23): {1.65 * 2.23:.4f}")
```

Notice that combined effects multiply on the odds scale. A 60-year-old smoker has 1.65 × 2.23 ≈ 3.7× the odds of a 50-year-old non-smoker. This multiplicative structure is one of the elegant properties of logistic regression—effects combine independently on the log-odds (additively) or odds (multiplicatively) scale.
In practice, we need uncertainty estimates for our odds ratios. Since maximum likelihood estimation gives us standard errors on the log-odds scale, we construct confidence intervals there and then exponentiate.
The Procedure
Estimate coefficient: $\hat{\beta}$ with standard error $SE(\hat{\beta})$
Construct CI for log-odds coefficient: $$\hat{\beta} \pm z_{\alpha/2} \cdot SE(\hat{\beta})$$
For 95% CI with $z_{0.025} = 1.96$: $$[\hat{\beta} - 1.96 \cdot SE(\hat{\beta}), \hat{\beta} + 1.96 \cdot SE(\hat{\beta})]$$
Exponentiate to get CI for odds ratio: $$[e^{\hat{\beta} - 1.96 \cdot SE(\hat{\beta})}, e^{\hat{\beta} + 1.96 \cdot SE(\hat{\beta})}]$$
Why Exponentiate the Entire Interval?
Because the exponential function is monotonic, the confidence interval transforms correctly: $$P(L \leq \beta \leq U) = P(e^L \leq e^\beta \leq e^U)$$
Note that the odds ratio CI is asymmetric around the point estimate (because exponentiation is nonlinear), but the coverage probability is preserved.
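For example, with a hypothetical estimate $\hat{\beta} = 0.8$ and $SE(\hat{\beta}) = 0.1$:

$$\text{log-odds CI: } [0.8 - 1.96(0.1),\; 0.8 + 1.96(0.1)] = [0.604,\, 0.996], \qquad \text{OR CI: } [e^{0.604},\, e^{0.996}] \approx [1.83,\, 2.71]$$

The point estimate is $e^{0.8} \approx 2.23$, so the interval extends about 0.40 below it but 0.48 above it: asymmetric, yet with the same 95% coverage.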
```python
import numpy as np
import statsmodels.api as sm

# Generate sample data
np.random.seed(42)
n = 1000
X = np.random.randn(n, 2)
true_beta = np.array([0.8, -0.5])
z = X @ true_beta + 0.2
p = 1 / (1 + np.exp(-z))
y = (np.random.rand(n) < p).astype(int)

# Fit model using statsmodels for proper inference
X_with_const = sm.add_constant(X)
model = sm.Logit(y, X_with_const)
result = model.fit(disp=0)

print("Logistic Regression with Confidence Intervals")
print("=" * 70)

# Extract coefficients and standard errors
coef_names = ['intercept', 'X1', 'X2']
for i, name in enumerate(coef_names):
    beta = result.params[i]
    se = result.bse[i]
    p_val = result.pvalues[i]

    # 95% CI for coefficient (log-odds)
    ci_low_logodds = beta - 1.96 * se
    ci_high_logodds = beta + 1.96 * se

    # 95% CI for odds ratio
    or_point = np.exp(beta)
    ci_low_or = np.exp(ci_low_logodds)
    ci_high_or = np.exp(ci_high_logodds)

    print(f"\n{name}:")
    print(f"  Coefficient (log-odds): {beta:.4f} (SE: {se:.4f})")
    print(f"  95% CI (log-odds): [{ci_low_logodds:.4f}, {ci_high_logodds:.4f}]")
    print(f"  Odds Ratio: {or_point:.4f}")
    print(f"  95% CI (OR): [{ci_low_or:.4f}, {ci_high_or:.4f}]")
    print(f"  p-value: {p_val:.4f} {'***' if p_val < 0.001 else '**' if p_val < 0.01 else '*' if p_val < 0.05 else ''}")

print("\n" + "=" * 70)
print("Statistical Significance:")
print("An odds ratio CI that does NOT contain 1.0 indicates")
print("statistically significant association at the chosen α level.")
```

If the 95% CI for an odds ratio contains 1.0, we cannot reject the null hypothesis that the feature has no effect on odds. If the entire CI is above 1.0, the effect is significantly positive; if entirely below 1.0, significantly negative. The width of the CI indicates precision—narrower intervals mean more precise estimates.
We've explored the log-odds interpretation of logistic regression in depth. This perspective transforms logistic regression from seemingly nonlinear magic into a principled, interpretable linear model. The essential insights:
Linear in log-odds: logistic regression models the logit of the probability as a linear function of the features; the sigmoid merely converts log-odds back to probabilities.
Coefficients as odds multipliers: each $\beta_j$ is the change in log-odds per unit increase in $X_j$; exponentiating gives the odds ratio $e^{\beta_j}$.
Nonconstant probability effects: the marginal effect on probability is $w_j \cdot p(1-p)$, largest at $p = 0.5$ and vanishing as $p$ approaches 0 or 1.
Inference on the log-odds scale: confidence intervals are constructed for coefficients and then exponentiated, yielding asymmetric but valid intervals for odds ratios.
What's Next:
With the log-odds interpretation mastered, we turn to the model parameters page—examining the roles of weights and bias, how they're estimated via maximum likelihood, and their geometric interpretation as defining a separating hyperplane in feature space.
You now understand the log-odds interpretation of logistic regression—the key that unlocks coefficient interpretation and reveals the linear structure hidden beneath the S-curve. This understanding is essential for proper model interpretation and communication of results.