Factor analysis posits a generative model: observed data arise from latent factors through the equation x = Λz + ε. But how do we estimate the parameters—the factor loadings Λ and uniquenesses Ψ—from observed data?
Maximum Likelihood (ML) estimation is the gold standard approach. It finds parameter values that maximize the probability of observing the data we actually observed, given the factor model. ML estimation provides consistent and asymptotically efficient estimates, standard errors for inference, and a formal test of model fit.
This page develops the ML estimation framework from first principles, covering the likelihood function, optimization procedures, inference, and model assessment. We'll also compare ML with alternative estimation methods.
By the end of this page, you will understand:
• The log-likelihood function for the factor model under normality
• How ML estimation works: optimization and identification constraints
• Standard errors for loadings and how to use them
• Goodness-of-fit assessment: χ², RMSEA, and information criteria
• Heywood cases and improper solutions
• Comparison with Principal Axis Factoring and other methods
• Practical implementation with modern software
Under the factor model with Gaussian assumptions, the marginal distribution of observed x is:
$$\mathbf{x} \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$$
where the model-implied covariance is:
$$\boldsymbol{\Sigma} = \mathbf{\Lambda \Lambda}' + \boldsymbol{\Psi}$$
Given n independent observations x₁, x₂, ..., xₙ, the likelihood is the product of individual densities:
$$L(\boldsymbol{\mu}, \mathbf{\Lambda}, \boldsymbol{\Psi} | \mathbf{X}) = \prod_{i=1}^{n} (2\pi)^{-p/2} |\boldsymbol{\Sigma}|^{-1/2} \exp\left( -\frac{1}{2} (\mathbf{x}_i - \boldsymbol{\mu})' \boldsymbol{\Sigma}^{-1} (\mathbf{x}_i - \boldsymbol{\mu}) \right)$$
Taking the natural logarithm (and ignoring constants):
$$\ell(\boldsymbol{\mu}, \mathbf{\Lambda}, \boldsymbol{\Psi}) = -\frac{n}{2} \left[ \ln |\boldsymbol{\Sigma}| + \text{tr}(\boldsymbol{\Sigma}^{-1} \mathbf{S}) + (\bar{\mathbf{x}} - \boldsymbol{\mu})' \boldsymbol{\Sigma}^{-1} (\bar{\mathbf{x}} - \boldsymbol{\mu}) \right]$$
where $\mathbf{S} = \frac{1}{n}\sum_{i=1}^{n}(\mathbf{x}_i - \bar{\mathbf{x}})(\mathbf{x}_i - \bar{\mathbf{x}})'$ is the sample covariance matrix and $\bar{\mathbf{x}}$ is the sample mean vector.
The ML estimate of μ is simply $\hat{\boldsymbol{\mu}} = \bar{\mathbf{x}}$ (the sample mean). For covariance structure, we can work with centered data.
Maximum likelihood is often expressed as minimizing a fitting function F rather than maximizing the log-likelihood. The ML fitting function is:
$$F_{ML} = \ln|\boldsymbol{\Sigma}| + \text{tr}(\mathbf{S}\boldsymbol{\Sigma}^{-1}) - \ln|\mathbf{S}| - p$$
This equals zero when Σ = S (perfect fit). It's minimized when the model-implied covariance best matches the sample covariance.
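To make the "zero at perfect fit" property concrete, here is a tiny check (a sketch, not library code) that evaluates F_ML and confirms it vanishes when the model-implied covariance equals S:

```python
import numpy as np

def f_ml(Sigma, S):
    """ML fitting function: ln|Sigma| + tr(S Sigma^{-1}) - ln|S| - p."""
    p = S.shape[0]
    Sigma_inv = np.linalg.inv(Sigma)
    return (np.linalg.slogdet(Sigma)[1] + np.trace(S @ Sigma_inv)
            - np.linalg.slogdet(S)[1] - p)

S = np.array([[1.0, 0.5],
              [0.5, 1.0]])
print(f_ml(S, S))          # ~0: perfect fit when Sigma equals S
print(f_ml(np.eye(2), S))  # > 0: misfit when the implied covariance ignores the correlation
```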
Since $\hat{\boldsymbol{\mu}} = \bar{\mathbf{x}}$ regardless of Λ and Ψ, we can work with the concentrated log-likelihood that depends only on the covariance parameters:
$$\ell^*(\mathbf{\Lambda}, \boldsymbol{\Psi}) = -\frac{n}{2} \left[ \ln |\boldsymbol{\Sigma}| + \text{tr}(\boldsymbol{\Sigma}^{-1} \mathbf{S}) \right]$$
This focuses on matching the model-implied covariance Σ = ΛΛ' + Ψ to the sample covariance S.
Free parameters: the loading matrix Λ contributes pk parameters and the diagonal uniqueness matrix Ψ contributes p more, for pk + p in total.
However, the model is not identified without constraints because of rotational indeterminacy. Standard constraint: require Λ'Ψ⁻¹Λ to be diagonal. This fixes k(k-1)/2 parameters, giving:
Net free parameters: pk + p - k(k-1)/2
Degrees of freedom for the model:
$$df = \frac{(p)(p+1)}{2} - \left[ pk + p - \frac{k(k-1)}{2} \right] = \frac{(p-k)^2 - (p+k)}{2}$$
For the model to be testable, we need df > 0.
| Variables (p) | Factors (k) | Free Parameters | df |
|---|---|---|---|
| 5 | 1 | 10 | 5 |
| 5 | 2 | 14 | 1 |
| 8 | 2 | 23 | 13 |
| 10 | 3 | 37 | 18 |
| 20 | 5 | 110 | 100 |
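A quick helper (a sketch applying the formulas above) reproduces the parameter counts and degrees of freedom in the table:

```python
def fa_dof(p, k):
    """Net free parameters (pk + p - k(k-1)/2) and model df for p variables, k factors."""
    free = p * k + p - k * (k - 1) // 2
    df = ((p - k) ** 2 - (p + k)) // 2
    return free, df

for p, k in [(5, 1), (5, 2), (8, 2), (10, 3), (20, 5)]:
    free, df = fa_dof(p, k)
    print(f"p={p:2d}, k={k}: free parameters = {free:3d}, df = {df:3d}")
```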
The ML estimates $\hat{\mathbf{\Lambda}}$ and $\hat{\boldsymbol{\Psi}}$ are found by maximizing the log-likelihood iteratively. There is no closed-form solution; numerical optimization is required.
Common algorithms:
Newton-Raphson: Uses first (gradient) and second (Hessian) derivatives
Fisher Scoring: Similar to Newton-Raphson but uses expected Hessian
EM Algorithm: Treats factors as missing data
Quasi-Newton (BFGS, L-BFGS): Approximates Hessian
Starting values matter. Poor initialization can lead to:
• Convergence to a local rather than the global maximum
• Slow convergence or outright failure to converge
• Heywood cases (improper solutions)

Common initialization strategies:
• Principal component loadings scaled by the square roots of the leading eigenvalues (used in the implementation below)
• A principal axis factoring solution as a warm start
• Initial uniquenesses based on squared multiple correlations
Convergence failure can occur due to:
• Too few observations (n ≥ 3p is a common minimum)
• Too many factors for the data
• Model misspecification
• Linear dependencies among variables
If convergence fails, try: fewer factors, simplify the model, check for constant or nearly constant variables, or increase sample size.
The EM algorithm treats the latent factors z as "missing data":
E-step: Given current Λ⁽ᵗ⁾, Ψ⁽ᵗ⁾, compute the conditional expectations E[zᵢ | xᵢ] and E[zᵢzᵢ′ | xᵢ] for each observation; these are the expected sufficient statistics (both steps are sketched in code at the end of this subsection).
M-step: Given expected sufficient statistics, update:
$$\mathbf{\Lambda}^{(t+1)} = \mathbf{S} \boldsymbol{\beta}' \left( \boldsymbol{\beta} \mathbf{S} \boldsymbol{\beta}' + \mathbf{I} - \boldsymbol{\beta} \mathbf{\Lambda}^{(t)} \right)^{-1}$$
where β = Λ'Σ⁻¹ (factor score regression weights).
$$\psi_j^{(t+1)} = s_{jj} - \sum_m \lambda_{jm}^{(t+1)} (\boldsymbol{\beta}\mathbf{S})_{mj}$$
Iterate until convergence (change in log-likelihood < tolerance).
EM advantages: Monotonic likelihood increase, handles some missing data naturally. EM disadvantages: Can be slow; may converge to saddle points or local maxima.
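Here is a minimal EM sketch of the updates above, written against a sample covariance matrix of centered data (assumptions: crude random starting values, convergence judged on the concentrated log-likelihood; the fuller numerical-optimization implementation appears in the next code block):

```python
import numpy as np

def em_factor_analysis(S, k, n_iter=1000, tol=1e-8, seed=0):
    """EM for the factor model Sigma = Lambda Lambda' + Psi (a minimal sketch).
    S is the sample covariance matrix of centered data, k the number of factors."""
    p = S.shape[0]
    rng = np.random.default_rng(seed)
    Lambda = rng.normal(scale=0.1, size=(p, k))   # crude random start
    psi = np.diag(S).copy()                       # start uniquenesses at the observed variances
    prev_ll = -np.inf
    for _ in range(n_iter):
        Sigma = Lambda @ Lambda.T + np.diag(psi)
        Sigma_inv = np.linalg.inv(Sigma)
        # Concentrated log-likelihood (up to constants), used only to monitor convergence
        ll = -0.5 * (np.linalg.slogdet(Sigma)[1] + np.trace(Sigma_inv @ S))
        if np.isfinite(prev_ll) and ll - prev_ll < tol:
            break
        prev_ll = ll
        # E-step: regression weights and expected sufficient statistics
        beta = Lambda.T @ Sigma_inv                          # beta = Lambda' Sigma^{-1}
        Ezz = np.eye(k) - beta @ Lambda + beta @ S @ beta.T  # average of E[z z' | x]
        # M-step: update loadings, then uniquenesses
        Lambda = S @ beta.T @ np.linalg.inv(Ezz)
        psi = np.diag(S - Lambda @ beta @ S)
    return Lambda, psi
```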
```python
import numpy as np
from scipy.optimize import minimize
from scipy import linalg


class MaximumLikelihoodFactorAnalysis:
    """
    Maximum Likelihood Factor Analysis implementation.

    Minimizes the ML fitting function to estimate factor loadings
    and uniquenesses from a sample covariance matrix.
    """

    def __init__(self, n_factors, max_iter=1000, tol=1e-8):
        self.n_factors = n_factors
        self.max_iter = max_iter
        self.tol = tol

    def _fitting_function(self, params, S, p, k):
        """
        ML fitting function: F = log|Σ| + tr(SΣ⁻¹) - log|S| - p

        params: flattened [Λ, log(ψ)] to ensure ψ > 0
        """
        # Unpack parameters
        Lambda = params[:p*k].reshape(p, k)
        log_psi = params[p*k:]
        Psi = np.diag(np.exp(log_psi))  # Ensure positivity

        # Model-implied covariance
        Sigma = Lambda @ Lambda.T + Psi

        try:
            # Compute fitting function
            sign, logdet_Sigma = np.linalg.slogdet(Sigma)
            if sign <= 0:
                return 1e10  # Invalid covariance
            Sigma_inv = linalg.inv(Sigma)
            _, logdet_S = np.linalg.slogdet(S)
            F = logdet_Sigma + np.trace(S @ Sigma_inv) - logdet_S - p
            return F
        except linalg.LinAlgError:
            return 1e10  # Singular matrix

    def fit(self, X):
        """
        Fit factor analysis model using ML estimation.

        Parameters
        ----------
        X : ndarray of shape (n, p)
            Observed data matrix

        Returns
        -------
        self
        """
        n, p = X.shape
        k = self.n_factors

        # Sample covariance matrix
        X_centered = X - X.mean(axis=0)
        S = (X_centered.T @ X_centered) / n

        # Initialize with PCA
        eigenvalues, eigenvectors = linalg.eigh(S)
        idx = np.argsort(eigenvalues)[::-1]
        eigenvalues = eigenvalues[idx]
        eigenvectors = eigenvectors[:, idx]
        Lambda_init = eigenvectors[:, :k] * np.sqrt(np.maximum(eigenvalues[:k], 0.01))
        psi_init = np.maximum(np.diag(S) - np.sum(Lambda_init**2, axis=1), 0.01)

        # Pack parameters (use log(psi) for positivity)
        params_init = np.concatenate([Lambda_init.ravel(), np.log(psi_init)])

        # Optimize
        result = minimize(
            self._fitting_function,
            params_init,
            args=(S, p, k),
            method='L-BFGS-B',
            options={'maxiter': self.max_iter, 'ftol': self.tol}
        )

        # Extract results
        self.Lambda_ = result.x[:p*k].reshape(p, k)
        self.psi_ = np.exp(result.x[p*k:])
        self.Psi_ = np.diag(self.psi_)

        # Implied covariance
        self.Sigma_ = self.Lambda_ @ self.Lambda_.T + self.Psi_

        # Communalities
        self.communalities_ = np.sum(self.Lambda_**2, axis=1)

        # Fitting function value and test statistic
        self.F_ = result.fun
        self.chi2_ = n * self.F_
        self.df_ = ((p - k)**2 - (p + k)) // 2

        # Store sample info
        self.n_samples_ = n
        self.n_features_ = p
        self.S_ = S

        return self

    def get_standard_errors(self):
        """
        Compute approximate standard errors for loadings.

        Based on asymptotic covariance of ML estimates.
""" # This is a simplified version; exact SE computation # requires the information matrix, which is complex # For production use, employ specialized SE computation # Rough approximation: SE ≈ sqrt((1-h²)/(n-1)) n = self.n_samples_ h2 = self.communalities_ se_approx = np.sqrt((1 - h2) / (n - 1)) return se_approx # Example usagenp.random.seed(42) # Generate data from a 2-factor modeln, p, k = 200, 6, 2z = np.random.randn(n, k)Lambda_true = np.array([ [0.8, 0.1], [0.7, 0.2], [0.75, 0.1], [0.1, 0.8], [0.2, 0.7], [0.1, 0.75]])psi_true = np.array([0.35, 0.47, 0.42, 0.35, 0.47, 0.42])X = z @ Lambda_true.T + np.random.randn(n, p) * np.sqrt(psi_true) # Fit modelmlfa = MaximumLikelihoodFactorAnalysis(n_factors=2)mlfa.fit(X) print("Estimated Loadings:")print(np.round(mlfa.Lambda_, 3))print("\nTrue Loadings:")print(Lambda_true)print("\nEstimated Uniquenesses:", np.round(mlfa.psi_, 3))print("True Uniquenesses:", psi_true)print(f"\nChi-square: {mlfa.chi2_:.2f}, df: {mlfa.df_}")Under regularity conditions, ML estimates are:
The expected information matrix I is: $$\mathbf{I} = -E\left[ \frac{\partial^2 \ell}{\partial \theta \partial \theta'} \right]$$
The asymptotic covariance of estimates is I⁻¹/n.
For each loading λᵢⱼ, the standard error allows construction of confidence intervals and z-tests:
$$SE(\hat{\lambda}_{ij}) = \sqrt{[\mathbf{I}^{-1}]_{(i,j),(i,j)} / n}$$
Practical computation: Modern software computes the observed or expected information matrix and extracts standard errors from its inverse.
To test H₀: λᵢⱼ = 0:
z-statistic: $$z = \frac{\hat{\lambda}_{ij}}{SE(\hat{\lambda}_{ij})}$$
Under H₀, z ~ N(0,1) asymptotically. A loading is "significant" if |z| > 1.96 (α = 0.05).
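As a small worked example (the loading and SE values below are hypothetical), the z-test and a 95% confidence interval look like this:

```python
from scipy import stats

lam_hat, se = 0.72, 0.06                      # hypothetical loading estimate and its SE
z = lam_hat / se                              # z = 12.0
p_value = 2 * stats.norm.sf(abs(z))           # two-sided p-value for H0: lambda_ij = 0
ci_low, ci_high = lam_hat - 1.96 * se, lam_hat + 1.96 * se
print(f"z = {z:.2f}, p = {p_value:.2g}, 95% CI = ({ci_low:.2f}, {ci_high:.2f})")
```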
In exploratory FA, testing individual loadings is problematic:
• Multiple testing inflates Type I error
• The hypothesis that a specific loading equals zero is often not meaningful
• Loading values depend on rotation
Loading significance tests are more appropriate in confirmatory FA where specific loadings are constrained to zero by hypothesis.
The normality assumption often fails. Robust (sandwich) standard errors account for non-normality:
$$\hat{\mathbf{V}}_{robust} = \mathbf{I}^{-1} \mathbf{B} \mathbf{I}^{-1}$$
where B is estimated from the outer product of score vectors, capturing the actual variability of the data.
When to use robust SEs: when the data show marked skewness or excess kurtosis, when indicators are coarse (few ordered categories), or whenever the normality assumption behind ordinary ML standard errors is in doubt.
Software support: Most modern SEM software (Mplus, lavaan) offers robust SEs as an option (e.g., MLR, MLM estimators).
An alternative to analytic SEs is bootstrapping: resample the n rows of the data with replacement, refit the model on each resample, and take the spread of the estimates across replicates as the standard error (a minimal sketch follows this list).

Advantages:
• No distributional assumptions
• Applicable to quantities whose analytic SEs are awkward (communalities, rotated loadings)

Disadvantages:
• Computationally intensive (hundreds of refits)
• Requires the raw data, not just a covariance matrix
• Replicates must be aligned for sign and rotational indeterminacy before summarizing
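A minimal bootstrap sketch, reusing the MaximumLikelihoodFactorAnalysis class from the code above (the alignment step here handles only column sign flips; full rotational alignment is more involved):

```python
import numpy as np

def bootstrap_loading_se(X, n_factors, n_boot=200, seed=0):
    """Nonparametric bootstrap SEs for loadings (a sketch)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    fa_ref = MaximumLikelihoodFactorAnalysis(n_factors=n_factors).fit(X)
    boots = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)          # resample rows with replacement
        L = MaximumLikelihoodFactorAnalysis(n_factors=n_factors).fit(X[idx]).Lambda_
        # Crude sign alignment against the full-sample solution
        signs = np.sign(np.sum(L * fa_ref.Lambda_, axis=0))
        signs[signs == 0] = 1.0
        boots.append(L * signs)
    return np.std(np.stack(boots), axis=0)        # elementwise SE of each loading
```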
| Method | Distributional Assumption | Computational Cost | When to Use |
|---|---|---|---|
| ML (expected information) | Multivariate normal | Low | Default if normality holds |
| ML (observed information) | Multivariate normal | Low | Slightly more robust to model misspec |
| Robust (sandwich) | Relaxed normality | Moderate | Non-normal data, cautious practice |
| Bootstrap | None (nonparametric) | High | Complex statistics, outliers |
Unlike PCA, factor analysis is a model that can be right or wrong. Fit assessment asks: does the hypothesized factor structure adequately explain the observed covariances?
The fundamental test compares the factor model to a saturated model (which exactly reproduces S):
Test statistic: $$\chi^2 = n \cdot F_{ML}$$
where F_ML is the minimized fitting function.
Under H₀ (k factors are sufficient): $\chi^2 \sim \chi^2_{df}$
Degrees of freedom: df = [(p - k)² - (p + k)] / 2
Interpretation: a significant χ² (small p-value) indicates that the k-factor model does not reproduce the observed covariances exactly, so H₀ is rejected; a non-significant χ² means the model cannot be rejected at the chosen α.
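With scipy, the test is a one-liner once χ² and df are in hand (the F value below is hypothetical):

```python
from scipy import stats

n, p, k = 200, 6, 2
F_ml = 0.04                                   # hypothetical minimized fitting function
chi2 = n * F_ml                               # some software uses (n - 1) instead of n
df = ((p - k) ** 2 - (p + k)) // 2            # df = 4 here
p_value = stats.chi2.sf(chi2, df)             # upper-tail probability under H0
print(f"chi2 = {chi2:.2f}, df = {df}, p = {p_value:.3f}")
```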
The χ² test has well-known limitations:
• With large n, even trivial misfit becomes statistically significant
• With small n, the test has little power to detect real misfit
• The statistic is sensitive to violations of multivariate normality
Thus, we supplement χ² with descriptive fit indices.
No single index is definitive. Best practice:
• Report χ², df, p-value (even if limited)
• Report RMSEA with 90% CI
• Report CFI and/or TLI
• Report SRMR
• Consider residuals (unexplained correlations)
A model with RMSEA < 0.06, CFI > 0.95, and SRMR < 0.08 has good fit by conventional standards.
For comparing models with different numbers of factors:
AIC (Akaike Information Criterion): $$AIC = -2 \ell + 2q$$
where q is the number of free parameters and ℓ is the maximized log-likelihood.
BIC (Bayesian Information Criterion): $$BIC = -2 \ell + q \ln(n)$$
Using AIC/BIC: fit models with different numbers of factors and prefer the one with the lowest value; differences matter, not absolute magnitudes. AIC tends to favor somewhat richer models, while BIC's ln(n) penalty favors parsimony (see the sketch below).
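Since −2ℓ equals n·F_ML plus a term that is constant across models fit to the same S, relative AIC/BIC values can be computed directly from the fitting function (the F values below are hypothetical):

```python
import numpy as np

def relative_ic(F_ml, q, n):
    """AIC and BIC up to an additive constant shared by all models fit to the same S."""
    m2ll = n * F_ml                                # -2*loglik, dropping the shared constant
    return m2ll + 2 * q, m2ll + q * np.log(n)

n, p = 200, 6
for k, F in [(1, 0.45), (2, 0.04)]:                # hypothetical fitted F_ML values
    q = p * k + p - k * (k - 1) // 2               # net free parameters
    aic, bic = relative_ic(F, q, n)
    print(f"k={k}: AIC* = {aic:.1f}, BIC* = {bic:.1f}")
```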
Residual covariance matrix: $$\mathbf{R} = \mathbf{S} - \hat{\boldsymbol{\Sigma}}$$
where $\hat{\boldsymbol{\Sigma}} = \hat{\mathbf{\Lambda}}\hat{\mathbf{\Lambda}}' + \hat{\boldsymbol{\Psi}}$.
Standardized residuals: $r_{ij}^* = r_{ij} / \sqrt{var(r_{ij})}$
This is analogous to regression residual analysis.
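Assuming mlfa is the fitted model from the implementation above, residual correlations can be inspected directly; off-diagonal residuals larger than about 0.10 in absolute value are worth a closer look:

```python
import numpy as np

R_resid = mlfa.S_ - mlfa.Sigma_                       # residual covariance matrix
d = np.sqrt(np.diag(mlfa.S_))
resid_corr = R_resid / np.outer(d, d)                 # rescale to correlation metric
rows, cols = np.where(np.abs(np.triu(resid_corr, k=1)) > 0.10)
for i, j in zip(rows, cols):
    print(f"Variables {i} and {j}: residual correlation {resid_corr[i, j]:+.3f}")
```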
| Index | Excellent Fit | Acceptable Fit | Poor Fit |
|---|---|---|---|
| χ²/df | < 2 | 2 - 3 | > 3 |
| RMSEA | < 0.05 | 0.05 - 0.08 | > 0.10 |
| CFI | > 0.97 | 0.95 - 0.97 | < 0.90 |
| TLI | > 0.97 | 0.95 - 0.97 | < 0.90 |
| SRMR | < 0.05 | 0.05 - 0.08 | > 0.10 |
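RMSEA has a closed-form point estimate from χ², df, and n; one common version (some software divides by n rather than n − 1) is sketched here:

```python
import numpy as np

def rmsea(chi2, df, n):
    """Point estimate of RMSEA; equals 0 whenever chi2 <= df."""
    return np.sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))

print(rmsea(chi2=25.3, df=13, n=200))   # hypothetical values -> about 0.069
```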
A Heywood case occurs when:
• An estimated uniqueness is zero or negative (ψⱼ ≤ 0)
• Equivalently, an estimated communality reaches or exceeds 1.0 (hⱼ² ≥ 1)
This is an improper solution—parameter estimates that violate constraints inherent to the model.
1. Model misspecification: extracting the wrong number of factors, or omitting needed cross-loadings or correlated errors, can force a uniqueness estimate to the boundary.
2. Sampling error: in small samples, a true uniqueness that is small but positive may be estimated as zero or negative.
3. Insufficient indicator variables: factors measured by only two (or weakly correlated) indicators are poorly determined and prone to boundary estimates.
4. Near-zero true uniqueness: a variable that is almost entirely explained by the factors legitimately has ψⱼ near zero, and sampling noise can push the estimate past the boundary.
Heywood cases are often flagged by software:
• "Negative variance estimate"
• "Communality > 1.0"
• "Improper solution"
• Uniqueness "constrained at boundary"
Always check communalities and uniquenesses after estimation. Any h² > 0.99 is suspicious.
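Assuming mlfa is the fitted model from the code above, a quick post-estimation check might look like this:

```python
import numpy as np

h2 = mlfa.communalities_
psi = mlfa.psi_
suspect = (h2 > 0.99) | (psi < 0.01)                  # at or near the boundary
if suspect.any():
    print("Possible Heywood case(s) for variable(s):", np.where(suspect)[0])
else:
    print("No communalities above 0.99; no uniquenesses near zero.")
```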
1. Constrain uniquenesses to be positive: Most software allows fixing ψᵢ ≥ ε (small positive number). This yields a proper solution but may mask underlying problems.
2. Try different number of factors: Add more factors: the Heywood case may disappear when the model has sufficient factors. Remove factors: if you've overfactored, simpler models may converge properly.
3. Examine the offending variable: check for coding errors, extreme outliers, or near-duplicate items; consider dropping the variable or combining it with a close neighbor.
4. Increase sample size: If sampling error is the cause, more data stabilizes estimates.
5. Use ridge FA or regularization: Adding a small constant to the diagonal of S can prevent ill-conditioning.
Sometimes loadings become extremely large (implausible values like λ = 5). This signals severe model problems, such as empirical underidentification or extreme collinearity among indicators.
Action: Do not interpret such solutions. Revisit model specification fundamentally.
While ML is the gold standard, other estimation methods are available, each with advantages and drawbacks.
Also called Principal Factor Analysis or Iterated Principal Factor.
Procedure (a minimal sketch in code follows this subsection):
1. Place initial communality estimates (commonly squared multiple correlations) on the diagonal of the correlation matrix
2. Eigendecompose this reduced correlation matrix and take the first k components as loadings
3. Recompute communalities from the loadings and update the diagonal
4. Repeat until the communalities stabilize

Advantages:
• No distributional assumptions
• Simple, fast, and available in virtually all software

Disadvantages:
• No standard errors and no formal fit test
• Less efficient than ML when normality actually holds
When to use: Exploratory analysis when normality is doubtful and formal testing is not needed.
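A minimal iterated principal axis sketch (assumptions: R is a correlation matrix, initial communalities are squared multiple correlations):

```python
import numpy as np

def principal_axis_factoring(R, k, n_iter=100, tol=1e-6):
    """Iterated principal axis factoring on a correlation matrix R (a sketch)."""
    # SMC initial communalities: 1 - 1/diag(R^{-1})
    h2 = 1.0 - 1.0 / np.diag(np.linalg.inv(R))
    for _ in range(n_iter):
        R_reduced = R.copy()
        np.fill_diagonal(R_reduced, h2)            # reduced correlation matrix
        vals, vecs = np.linalg.eigh(R_reduced)
        idx = np.argsort(vals)[::-1][:k]           # top-k eigenpairs
        Lambda = vecs[:, idx] * np.sqrt(np.maximum(vals[idx], 0))
        h2_new = np.sum(Lambda**2, axis=1)         # updated communalities
        if np.max(np.abs(h2_new - h2)) < tol:
            h2 = h2_new
            break
        h2 = h2_new
    return Lambda, 1.0 - h2                        # loadings and uniquenesses
```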
Minimizes the sum of squared off-diagonal residuals:
$$\min_{\Lambda, \Psi} \sum_{i \neq j} (s_{ij} - \sigma_{ij})^2$$
Advantages:
• No distributional assumptions
• Handles ill-conditioned correlation matrices; it is the default extraction method (fm = "minres") in R's psych::fa

Disadvantages:
• No likelihood, so no χ² test and no ML-style standard errors
• The diagonal (uniquenesses) is not part of the criterion being minimized
WLS methods minimize a weighted sum of residuals:

$$F_{WLS} = (\mathbf{s} - \boldsymbol{\sigma})' \mathbf{W}^{-1} (\mathbf{s} - \boldsymbol{\sigma})$$

where s and σ are vectorized covariances and W is a weight matrix.
Options include:
• GLS (Generalized LS): W = Σ⁻¹ (asymptotically equivalent to ML under normality)
• ULS (Unweighted LS): W = I (MINRES is essentially ULS)
• DWLS (Diagonally Weighted LS): common for ordinal data
| Method | Distributional Assumptions | Standard Errors | Fit Test | Efficiency | Robustness |
|---|---|---|---|---|---|
| ML | Multivariate normal | Yes | Yes (χ²) | Best under normality | Poor under non-normality |
| PAF | None | No | No | Lower | More robust |
| MINRES | None | No | No | Lower | More robust |
| GLS | Multivariate normal | Yes | Yes | Similar to ML | Poor under non-normality |
| ULS | None | Limited | Limited | Lower | Most robust |
Use ML when: data are approximately multivariate normal, the sample is reasonably large, and you want standard errors, the χ² test, and likelihood-based fit indices.
Use PAF/MINRES when: normality is clearly violated, the sample is small, the goal is purely exploratory, or ML runs into convergence problems or Heywood cases.
Use robust ML (MLR, MLM) when: data are continuous but non-normal and you still want ML point estimates with corrected standard errors and test statistics.
| Software | ML | PAF | MINRES | ULS | Robust ML |
|---|---|---|---|---|---|
| SPSS FACTOR | Yes | Yes | No | Yes (ULS) | No |
| SAS PROC FACTOR | Yes | Yes | No | Yes | No |
| R psych | Yes | Yes | Yes | No | No |
| R lavaan | Yes | No | No | Yes (ULSMV) | Yes (MLR, MLM) |
| Mplus | Yes | No | No | Yes (ULSMV) | Yes (MLR) |
| Python sklearn | Yes | No | No | No | No |
1. Data Preparation: check sample size (n ≥ 3p as a rough minimum), screen for outliers, missing data, and near-constant variables, and verify approximate normality if ML is planned.
2. Determine Number of Factors: combine scree plot, parallel analysis, interpretability, and theory; fit several candidate values of k rather than committing in advance.
3. Run Initial ML Estimation: fit the model, confirm convergence, and inspect communalities and uniquenesses for Heywood cases.
4. Rotate Solution: apply an oblique rotation (e.g., oblimin) unless there is a strong reason to force orthogonal factors; interpret the rotated loadings.
5. Assess Model Fit: report χ² with df and p-value, RMSEA with its 90% CI, CFI/TLI, SRMR, and examine residual correlations.
6. Refine If Needed: adjust the number of factors, drop or revise problem variables, re-estimate, and document every change.
Always document:
• Software and version used
• Estimation method (ML, PAF, etc.)
• How the number of factors was determined
• Rotation method and parameters
• Starting values (if non-default)
• Handling of convergence/estimation issues
This enables replication and demonstrates methodological rigor.
```r
library(lavaan)

# Specify 3-factor EFA model
efa_model <- efa(
  data = mydata,
  nfactors = 3,
  rotation = "oblimin"
)

# Summary with fit indices
summary(efa_model, fit.measures = TRUE, standardized = TRUE)

# Get loadings
inspect(efa_model, what = "std")$lambda

# Get factor correlations
lavInspect(efa_model, what = "cov.lv")
```
```python
from factor_analyzer import FactorAnalyzer

# Fit model
fa = FactorAnalyzer(n_factors=3, rotation='oblimin', method='ml')
fa.fit(data)

# Loadings
print(fa.loadings_)

# Communalities
print(fa.get_communalities())

# Eigenvalues
print(fa.get_eigenvalues())
```
Maximum likelihood estimation provides a principled, inferentially rich framework for factor analysis. Let's consolidate the key concepts:
• Under normality, the factor model implies x ~ N(μ, ΛΛ′ + Ψ); ML minimizes the fitting function F_ML = ln|Σ| + tr(SΣ⁻¹) − ln|S| − p
• Identification requires constraints (e.g., Λ′Ψ⁻¹Λ diagonal), and the model is testable only when df = [(p − k)² − (p + k)]/2 > 0
• Estimation is iterative (Newton-type methods, EM, quasi-Newton); starting values and convergence checks matter
• ML delivers standard errors, the χ² test of exact fit, descriptive indices (RMSEA, CFI/TLI, SRMR), and AIC/BIC for model comparison
• Heywood cases and other improper solutions signal misspecification, sampling error, or too few indicators
• PAF, MINRES, and the WLS family trade ML's inferential machinery for robustness when normality is doubtful
With estimation and model assessment covered, the final page addresses Interpretation—how to make sense of factor solutions, name factors, communicate findings, and connect factor analysis to broader research contexts.
You now understand how ML estimation works in factor analysis, from the likelihood function through optimization, inference, and fit assessment. This technical foundation enables both proper application and critical evaluation of factor analyses you encounter.