Machine LearningDimensionality Reduction

Factor Analysis

LevelAdvanced

Duration90 mins

TopicDimensionality Reduction

4 / 5

Maximum Likelihood Estimation

Principled Parameter Estimation

Factor analysis posits a generative model: observed data arise from latent factors through the equation x = Λz + ε. But how do we estimate the parameters—the factor loadings Λ and uniquenesses Ψ—from observed data?

Maximum Likelihood (ML) estimation is the gold standard approach. It finds parameter values that maximize the probability of observing the data we actually observed, given the factor model. ML estimation provides:

Asymptotically efficient estimates: Among consistent estimators, ML has the smallest variance as sample size grows
Standard errors: Allowing inference about whether loadings differ from zero
Goodness-of-fit tests: Chi-square statistics to assess model adequacy
Model comparison criteria: AIC, BIC for selecting number of factors

This page develops the ML estimation framework from first principles, covering the likelihood function, optimization procedures, inference, and model assessment. We'll also compare ML with alternative estimation methods.

What You Will Learn

By the end of this page, you will understand: • The log-likelihood function for the factor model under normality • How ML estimation works: optimization and identification constraints • Standard errors for loadings and how to use them • Goodness-of-fit assessment: χ², RMSEA, and information criteria • Heywood cases and improper solutions • Comparison with Principal Axis Factoring and other methods • Practical implementation with modern software

The Likelihood Function

The Multivariate Normal Likelihood

Under the factor model with Gaussian assumptions, the marginal distribution of observed x is:

$$\mathbf{x} \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$$

where the model-implied covariance is:

$$\boldsymbol{\Sigma} = \mathbf{\Lambda \Lambda}' + \boldsymbol{\Psi}$$

Given n independent observations x₁, x₂, ..., xₙ, the likelihood is the product of individual densities:

$$L(\boldsymbol{\mu}, \mathbf{\Lambda}, \boldsymbol{\Psi} | \mathbf{X}) = \prod_{i=1}^{n} (2\pi)^{-p/2} |\boldsymbol{\Sigma}|^{-1/2} \exp\left( -\frac{1}{2} (\mathbf{x}_i - \boldsymbol{\mu})' \boldsymbol{\Sigma}^{-1} (\mathbf{x}_i - \boldsymbol{\mu}) \right)$$

The Log-Likelihood

Taking the natural logarithm (and ignoring constants):

$$\ell(\boldsymbol{\mu}, \mathbf{\Lambda}, \boldsymbol{\Psi}) = -\frac{n}{2} \left[ \ln |\boldsymbol{\Sigma}| + \text{tr}(\boldsymbol{\Sigma}^{-1} \mathbf{S}) + (\bar{\mathbf{x}} - \boldsymbol{\mu})' \boldsymbol{\Sigma}^{-1} (\bar{\mathbf{x}} - \boldsymbol{\mu}) \right]$$

where:

S is the sample covariance matrix: $\mathbf{S} = \frac{1}{n} \sum_{i=1}^{n} (\mathbf{x}_i - \bar{\mathbf{x}})(\mathbf{x}_i - \bar{\mathbf{x}})'$
$\bar{\mathbf{x}}$ is the sample mean

The ML estimate of μ is simply $\hat{\boldsymbol{\mu}} = \bar{\mathbf{x}}$ (the sample mean). For covariance structure, we can work with centered data.

The Fitting Function

Maximum likelihood is often expressed as minimizing a fitting function F rather than maximizing the log-likelihood. The ML fitting function is:

F_ML = ln|Σ| + tr(SΣ⁻¹) - ln|S| - p

This equals zero when Σ = S (perfect fit). It's minimized when the model-implied covariance best matches the sample covariance.

Concentrating Out the Mean

Since $\hat{\boldsymbol{\mu}} = \bar{\mathbf{x}}$ regardless of Λ and Ψ, we can work with the concentrated log-likelihood that depends only on the covariance parameters:

$$\ell^*(\mathbf{\Lambda}, \boldsymbol{\Psi}) = -\frac{n}{2} \left[ \ln |\boldsymbol{\Sigma}| + \text{tr}(\boldsymbol{\Sigma}^{-1} \mathbf{S}) \right]$$

This focuses on matching the model-implied covariance Σ = ΛΛ' + Ψ to the sample covariance S.

Parameter Count and Identification

Free parameters:

Λ: p × k loadings
Ψ: p uniquenesses (diagonal elements)
Total: pk + p parameters

However, the model is not identified without constraints because of rotational indeterminacy. Standard constraint: require Λ'Ψ⁻¹Λ to be diagonal. This fixes k(k-1)/2 parameters, giving:

Net free parameters: pk + p - k(k-1)/2

Degrees of freedom for the model:

$$df = \frac{(p)(p+1)}{2} - \left[ pk + p - \frac{k(k-1)}{2} \right] = \frac{(p-k)^2 - (p+k)}{2}$$

For the model to be testable, we need df > 0.

Degrees of Freedom for Common Configurations
Variables (p)	Factors (k)	Free Parameters	df
5	1	10	5
5	2	13	1
8	2	22	13
10	3	40	18
20	5	130	100

The Optimization Process

Iterative Optimization

The ML estimates $\hat{\mathbf{\Lambda}}$ and $\hat{\boldsymbol{\Psi}}$ are found by maximizing the log-likelihood iteratively. There is no closed-form solution; numerical optimization is required.

Common algorithms:

Newton-Raphson: Uses first (gradient) and second (Hessian) derivatives
- Fast convergence near the optimum
- Requires Hessian computation (expensive)
Fisher Scoring: Similar to Newton-Raphson but uses expected Hessian
- More stable in some cases
- Standard in many FA implementations
EM Algorithm: Treats factors as missing data
- E-step: Compute expected factor scores
- M-step: Update Λ and Ψ
- Guaranteed to increase likelihood each iteration
- May converge slowly
Quasi-Newton (BFGS, L-BFGS): Approximates Hessian
- Good for high-dimensional problems
- Less per-iteration computation

Initialization and Convergence

Starting values matter. Poor initialization can lead to:

Slow convergence
Convergence to a local (not global) maximum
Non-convergence

Common initialization strategies:

Principal components (use PCA loadings scaled by eigenvalue)
Prior reliable communality estimates
Random starts with selection of best solution

Convergence Issues

Convergence failure can occur due to: • Too few observations (n < 3p recommended) • Too many factors for the data • Model misspecification • Linear dependencies among variables

If convergence fails, try: fewer factors, simplify the model, check for constant or nearly constant variables, or increase sample size.

The EM Algorithm for Factor Analysis

The EM algorithm treats the latent factors z as "missing data":

E-step: Given current Λ⁽ᵗ⁾, Ψ⁽ᵗ⁾, compute:

Expected factor scores: E[z|x] = Λ'(ΛΛ' + Ψ)⁻¹(x - μ)
Expected factor covariance: E[zz'|x]

M-step: Given expected sufficient statistics, update:

$$\mathbf{\Lambda}^{(t+1)} = \mathbf{S} \cdot \mathbf{\Beta}' \cdot (\mathbf{\Beta} \mathbf{S} \mathbf{\Beta}' + \mathbf{I} - \mathbf{\Beta} \mathbf{\Lambda}^{(t)})^{-1}$$

where β = Λ'Σ⁻¹ (factor score regression weights).

$$\psi_j^{(t+1)} = s_{jj} - \sum_m \lambda_{jm}^{(t+1)} \beta_{mj}$$

Iterate until convergence (change in log-likelihood < tolerance).

EM advantages: Monotonic likelihood increase, handles some missing data naturally. EM disadvantages: Can be slow; may converge to saddle points or local maxima.

ml_factor_analysis.py
Python
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
import numpy as np
from scipy.optimize import minimize
from scipy import linalg
 
class MaximumLikelihoodFactorAnalysis:
    """
    Maximum Likelihood Factor Analysis implementation.
    
    Minimizes the ML fitting function to estimate factor loadings
    and uniquenesses from a sample covariance matrix.
    """
    
    def __init__(self, n_factors, max_iter=1000, tol=1e-8):
        self.n_factors = n_factors
        self.max_iter = max_iter
        self.tol = tol
        
    def _fitting_function(self, params, S, p, k):
        """
        ML fitting function: F = log|Σ| + tr(SΣ⁻¹) - log|S| - p
        
        params: flattened [Λ, log(ψ)] to ensure ψ > 0
        """
        # Unpack parameters
        Lambda = params[:p*k].reshape(p, k)
        log_psi = params[p*k:]
        Psi = np.diag(np.exp(log_psi))  # Ensure positivity
        
        # Model-implied covariance
        Sigma = Lambda @ Lambda.T + Psi
        
        try:
            # Compute fitting function
            sign, logdet_Sigma = np.linalg.slogdet(Sigma)
            if sign <= 0:
                return 1e10  # Invalid covariance
            
            Sigma_inv = linalg.inv(Sigma)
            
            _, logdet_S = np.linalg.slogdet(S)
            
            F = logdet_Sigma + np.trace(S @ Sigma_inv) - logdet_S - p
            
            return F
            
        except linalg.LinAlgError:
            return 1e10  # Singular matrix
    
    def fit(self, X):
        """
        Fit factor analysis model using ML estimation.
        
        Parameters
        ----------
        X : ndarray of shape (n, p)
            Observed data matrix
            
        Returns
        -------
        self
        """
        n, p = X.shape
        k = self.n_factors
        
        # Sample covariance matrix
        X_centered = X - X.mean(axis=0)
        S = (X_centered.T @ X_centered) / n
        
        # Initialize with PCA
        eigenvalues, eigenvectors = linalg.eigh(S)
        idx = np.argsort(eigenvalues)[::-1]
        eigenvalues = eigenvalues[idx]
        eigenvectors = eigenvectors[:, idx]
        
        Lambda_init = eigenvectors[:, :k] * np.sqrt(np.maximum(eigenvalues[:k], 0.01))
        psi_init = np.maximum(np.diag(S) - np.sum(Lambda_init**2, axis=1), 0.01)
        
        # Pack parameters (use log(psi) for positivity)
        params_init = np.concatenate([Lambda_init.ravel(), np.log(psi_init)])
        
        # Optimize
        result = minimize(
            self._fitting_function,
            params_init,
            args=(S, p, k),
            method='L-BFGS-B',
            options={'maxiter': self.max_iter, 'ftol': self.tol}
        )
        
        # Extract results
        self.Lambda_ = result.x[:p*k].reshape(p, k)
        self.psi_ = np.exp(result.x[p*k:])
        self.Psi_ = np.diag(self.psi_)
        
        # Implied covariance
        self.Sigma_ = self.Lambda_ @ self.Lambda_.T + self.Psi_
        
        # Communalities
        self.communalities_ = np.sum(self.Lambda_**2, axis=1)
        
        # Fitting function value and test statistic
        self.F_ = result.fun
        self.chi2_ = n * self.F_
        self.df_ = ((p - k)**2 - (p + k)) // 2
        
        # Store sample info
        self.n_samples_ = n
        self.n_features_ = p
        self.S_ = S
        
        return self
    
    def get_standard_errors(self):
        """
        Compute approximate standard errors for loadings.
        Based on asymptotic covariance of ML estimates.
        """
        # This is a simplified version; exact SE computation
        # requires the information matrix, which is complex
        # For production use, employ specialized SE computation
        
        # Rough approximation: SE ≈ sqrt((1-h²)/(n-1))
        n = self.n_samples_
        h2 = self.communalities_
        se_approx = np.sqrt((1 - h2) / (n - 1))
        
        return se_approx
 
# Example usage
np.random.seed(42)
 
# Generate data from a 2-factor model
n, p, k = 200, 6, 2
z = np.random.randn(n, k)
Lambda_true = np.array([
    [0.8, 0.1], [0.7, 0.2], [0.75, 0.1],
    [0.1, 0.8], [0.2, 0.7], [0.1, 0.75]
])
psi_true = np.array([0.35, 0.47, 0.42, 0.35, 0.47, 0.42])
X = z @ Lambda_true.T + np.random.randn(n, p) * np.sqrt(psi_true)
 
# Fit model
mlfa = MaximumLikelihoodFactorAnalysis(n_factors=2)
mlfa.fit(X)
 
print("Estimated Loadings:")
print(np.round(mlfa.Lambda_, 3))
print("\nTrue Loadings:")
print(Lambda_true)
print("\nEstimated Uniquenesses:", np.round(mlfa.psi_, 3))
print("True Uniquenesses:", psi_true)
print(f"\nChi-square: {mlfa.chi2_:.2f}, df: {mlfa.df_}")

Standard Errors and Inference

Asymptotic Theory

Under regularity conditions, ML estimates are:

Consistent: $\hat{\theta} \xrightarrow{p} \theta$ as n → ∞
Asymptotically normal: $\sqrt{n}(\hat{\theta} - \theta) \xrightarrow{d} \mathcal{N}(0, \mathbf{I}^{-1})$
Asymptotically efficient: Achieve the Cramér-Rao lower bound

The expected information matrix I is: $$\mathbf{I} = -E\left[ \frac{\partial^2 \ell}{\partial \theta \partial \theta'} \right]$$

The asymptotic covariance of estimates is I⁻¹/n.

Standard Errors for Loadings

For each loading λᵢⱼ, the standard error allows construction of confidence intervals and z-tests:

$$SE(\hat{\lambda}{ij}) = \sqrt{[\mathbf{I}^{-1}]{(i,j),(i,j)} / n}$$

Practical computation: Modern software computes the observed or expected information matrix and extracts standard errors from its inverse.

Testing Individual Loadings

To test H₀: λᵢⱼ = 0:

z-statistic: $$z = \frac{\hat{\lambda}{ij}}{SE(\hat{\lambda}{ij})}$$

Under H₀, z ~ N(0,1) asymptotically. A loading is "significant" if |z| > 1.96 (α = 0.05).

Caution with Loading Tests

In exploratory FA, testing individual loadings is problematic: • Multiple testing inflates Type I error • The hypothesis that a specific loading equals zero is often not meaningful • Loading values depend on rotation

Loading significance tests are more appropriate in confirmatory FA where specific loadings are constrained to zero by hypothesis.

Robust Standard Errors

The normality assumption often fails. Robust (sandwich) standard errors account for non-normality:

$$\hat{\mathbf{V}}_{robust} = \mathbf{I}^{-1} \mathbf{B} \mathbf{I}^{-1}$$

where B is estimated from the outer product of score vectors, capturing the actual variability of the data.

When to use robust SEs:

Multivariate non-normality (skewness, kurtosis)
Violation of distributional assumptions
Cautious practice in applied research

Software support: Most modern SEM software (Mplus, lavaan) offers robust SEs as an option (e.g., MLR, MLM estimators).

Bootstrap Standard Errors

An alternative to analytic SEs is bootstrapping:

Resample n observations with replacement
Re-estimate the factor model on the bootstrap sample
Repeat B times (e.g., B = 1000)
Standard error = Standard deviation of bootstrap estimates

Advantages:

No distributional assumptions
Works for complex statistics (factor scores, factor correlations after rotation)

Disadvantages:

Computationally intensive
Requires convergence on each bootstrap sample
Confidence interval coverage may not be exact

Standard Error Methods Comparison
Method	Distributional Assumption	Computational Cost	When to Use
ML (expected information)	Multivariate normal	Low	Default if normality holds
ML (observed information)	Multivariate normal	Low	Slightly more robust to model misspec
Robust (sandwich)	Relaxed normality	Moderate	Non-normal data, cautious practice
Bootstrap	None (nonparametric)	High	Complex statistics, outliers

Goodness-of-Fit Assessment

Unlike PCA, factor analysis is a model that can be right or wrong. Fit assessment asks: does the hypothesized factor structure adequately explain the observed covariances?

The Likelihood Ratio Test

The fundamental test compares the factor model to a saturated model (which exactly reproduces S):

Test statistic: $$\chi^2 = n \cdot F_{ML}$$

where F_ML is the minimized fitting function.

Under H₀ (k factors are sufficient): $\chi^2 \sim \chi^2_{df}$

Degrees of freedom: df = [(p - k)² - (p + k)] / 2

Interpretation:

Significant χ² (p < 0.05): Reject the k-factor model
Non-significant χ² (p > 0.05): Do not reject; k factors may be sufficient

The Problem with χ²

The χ² test has well-known limitations:

Sensitive to sample size: With large n, even trivial misfits are significant
Sensitive to normality: Non-normality inflates χ²
All models are wrong: Rejecting a useful model because it's not perfect is counterproductive

Thus, we supplement χ² with descriptive fit indices.

Common Fit Indices

•RMSEA (Root Mean Square Error of Approximation): √[(χ² - df)/(df × n)]. Good fit: < 0.05; Acceptable: < 0.08; Poor: > 0.10
•CFI (Comparative Fit Index): Compares to null model. Good fit: > 0.95; Acceptable: > 0.90
•TLI (Tucker-Lewis Index): Like CFI but penalizes complexity. Good fit: > 0.95
•SRMR (Standardized Root Mean Residual): Average absolute residual correlation. Good fit: < 0.08

Fit Index Guidelines

No single index is definitive. Best practice: • Report χ², df, p-value (even if limited) • Report RMSEA with 90% CI • Report CFI and/or TLI • Report SRMR • Consider residuals (unexplained correlations)

A model with RMSEA < 0.06, CFI > 0.95, and SRMR < 0.08 has good fit by conventional standards.

Information Criteria

For comparing models with different numbers of factors:

AIC (Akaike Information Criterion): $$AIC = -2 \ell + 2q$$

where q is the number of free parameters and ℓ is the maximized log-likelihood.

BIC (Bayesian Information Criterion): $$BIC = -2 \ell + q \ln(n)$$

Using AIC/BIC:

Fit models with k = 1, 2, 3, ... factors
Select the model with lowest AIC or BIC
BIC penalizes complexity more heavily, tends to favor simpler models

Residual Analysis

Residual covariance matrix: $$\mathbf{R} = \mathbf{S} - \hat{\boldsymbol{\Sigma}}$$

where $\hat{\boldsymbol{\Sigma}} = \hat{\mathbf{\Lambda}}\hat{\mathbf{\Lambda}}' + \hat{\boldsymbol{\Psi}}$.

Standardized residuals: $r_{ij}^* = r_{ij} / \sqrt{var(r_{ij})}$

Large residuals (|r*| > 2-3) indicate specific misfits
Patterns in residuals suggest missing factors or local dependencies

This is analogous to regression residual analysis.

Fit Index Thresholds (Conventional Guidelines)
Index	Excellent Fit	Acceptable Fit	Poor Fit
χ²/df	< 2	< 3	3
RMSEA	< 0.05	0.05 - 0.08	0.10
CFI	0.97	0.95 - 0.97	< 0.90
TLI	0.97	0.95 - 0.97	< 0.90
SRMR	< 0.05	0.05 - 0.08	0.10

Improper Solutions and Heywood Cases

The Heywood Case Problem

A Heywood case occurs when:

An estimated uniqueness ψᵢ ≤ 0 (impossible—variance cannot be negative)
Equivalently, communality h²ᵢ ≥ 1 (impossible—cannot explain more than 100%)

This is an improper solution—parameter estimates that violate constraints inherent to the model.

Why Heywood Cases Occur

1. Model misspecification:

Wrong number of factors (often too few)
Missing factors that would account for the variance

2. Sampling error:

Small samples produce unstable estimates
Random fluctuation can push estimates to boundaries

3. Insufficient indicator variables:

Factors defined by too few variables (1-2 indicators per factor)
Identification problems

4. Near-zero true uniqueness:

Some variables may genuinely have near-zero unique variance
Slight sampling error pushes the estimate negative

Detecting Heywood Cases

Heywood cases are often flagged by software: • "Negative variance estimate" • "Communality > 1.0" • "Improper solution" • Uniqueness "constrained at boundary"

Always check communalities and uniquenesses after estimation. Any h² > 0.99 is suspicious.

Remedies for Heywood Cases

1. Constrain uniquenesses to be positive: Most software allows fixing ψᵢ ≥ ε (small positive number). This yields a proper solution but may mask underlying problems.

2. Try different number of factors: Add more factors: the Heywood case may disappear when the model has sufficient factors. Remove factors: if you've overfactored, simpler models may converge properly.

3. Examine the offending variable:

Is the variable well-measured?
Does it correlate extremely highly with others?
Consider removing it from the analysis

4. Increase sample size: If sampling error is the cause, more data stabilizes estimates.

5. Use ridge FA or regularization: Adding a small constant to the diagonal of S can prevent ill-conditioning.

Ultra-Heywood Cases

Sometimes loadings become extremely large (implausible values like λ = 5). This is a sign of severe model problems:

Complete model misspecification
Collinearity issues
Numerical instability

Action: Do not interpret such solutions. Revisit model specification fundamentally.

Warning Signs of Improper Solutions

•Communality > 0.99 or = 1.00: Very unusual in practice; suggests estimation boundary
•Negative uniqueness: Definitional impossibility
•Loading > 1.0: Can occur with oblique rotation, but loadings >> 1 are suspect
•Standard errors = 0 or NA: Often indicates parameters at boundaries
•Non-convergence with 'solution found': Algorithm hit bounds, not true optimum

Alternative Estimation Methods

While ML is the gold standard, other estimation methods are available, each with advantages and drawbacks.

Principal Axis Factoring (PAF)

Also called Principal Factor Analysis or Iterated Principal Factor.

Procedure:

Replace diagonal of R with estimated communalities (initially 1 or SMC)
Extract eigenvalues and eigenvectors of the reduced matrix
Update communalities as sum of squared loadings
Repeat until communalities converge

Advantages:

No distributional assumptions
Computationally simpler than ML
Sometimes converges when ML fails

Disadvantages:

No standard errors
No formal fit test
May converge to different solution than ML

When to use: Exploratory analysis when normality is doubtful and formal testing is not needed.

Minimum Residual (MINRES)

Minimizes the sum of squared off-diagonal residuals:

$$\min_{\Lambda, \Psi} \sum_{i \neq j} (s_{ij} - \sigma_{ij})^2$$

Advantages:

No distributional assumptions
Focuses on reproducing correlations (off-diagonal)
Robust to some types of misspecification

Disadvantages:

Less efficient than ML under correct model
No formal inference

Weighted Least Squares (WLS)

WLS methods minimize a weighted sum of residuals: F_WLS = (s - σ)'W⁻¹(s - σ) where s and σ are vectorized covariances and W is a weight matrix.

Options include: • GLS (Generalized LS): W = Σ⁻¹ (asymptotically equivalent to ML under normality) • ULS (Unweighted LS): W = I (MINRES is essentially ULS) • DWLS (Diagonally Weighted LS): Common for ordinal data

Comparison of Estimation Methods

Method	Distributional Assumptions	Standard Errors	Fit Test	Efficiency	Robustness
ML	Multivariate normal	Yes	Yes (χ²)	Best under normality	Poor under non-normality
PAF	None	No	No	Lower	More robust
MINRES	None	No	No	Lower	More robust
GLS	Multivariate normal	Yes	Yes	Similar to ML	Poor under non-normality
ULS	None	Limited	Limited	Lower	Most robust

Which to Choose?

Use ML when:

Distributional assumptions are approximately met
You need standard errors and fit tests
Sample size is adequate (n > 200 or so)

Use PAF/MINRES when:

Data are clearly non-normal
Exploratory analysis without formal testing
ML doesn't converge
Quick approximate solution is sufficient

Use robust ML (MLR, MLM) when:

Some non-normality but you want inferential results
Available in Mplus, lavaan, and other SEM software

Software Support for Estimation Methods
Software	ML	PAF	MINRES	ULS	Robust ML
SPSS FACTOR	Yes	Yes	No	Yes (ULS)	No
SAS PROC FACTOR	Yes	Yes	No	Yes	No
R psych	Yes	Yes	Yes	No	No
R lavaan	Yes	No	No	Yes (ULSMV)	Yes (MLR, MLM)
Mplus	Yes	No	No	Yes (ULSMV)	Yes (MLR)
Python sklearn	Yes	No	No	No	No

Practical ML Estimation Workflow

Step-by-Step Procedure

1. Data Preparation:

Check for missing data (consider imputation or FIML)
Examine univariate distributions (skewness, kurtosis)
Check multivariate normality (Mardia's test)
Standardize if using correlation matrix

2. Determine Number of Factors:

Eigenvalue analysis (Kaiser rule as starting point)
Scree plot
Parallel analysis (more reliable)
Theory and interpretability

3. Run Initial ML Estimation:

Extract k factors
Check convergence
Check for Heywood cases

4. Rotate Solution:

Apply varimax or oblimin rotation
Examine loading patterns

5. Assess Model Fit:

χ² test (with caution)
RMSEA, CFI, TLI, SRMR
Examine residuals

6. Refine If Needed:

Try k-1 or k+1 factors
Consider removing poorly fitting items
Re-estimate and compare fit

Checklist Before Reporting Results

•✓ Model converged without warnings
•✓ No Heywood cases or boundary solutions
•✓ All communalities are in (0, 1) range
•✓ Fit indices are acceptable
•✓ Loading pattern is interpretable
•✓ Factors make substantive sense
•✓ Solution is not radically different from theory

Documentation Is Key

Always document: • Software and version used • Estimation method (ML, PAF, etc.) • How number of factors was determined • Rotation method and parameters • Starting values (if non-default) • Handling of convergence/estimation issues

This enables replication and demonstrates methodological rigor.

Common R Code (lavaan)

library(lavaan)

# Specify 3-factor EFA model
efa_model <- efa(
  data = mydata,
  nfactors = 3,
  rotation = "oblimin"
)

# Summary with fit indices
summary(efa_model, fit.measures = TRUE, standardized = TRUE)

# Get loadings
inspect(efa_model, what = "std")$lambda

# Get factor correlations
lavInspect(efa_model, what = "cov.lv")

Common Python Code (factor_analyzer)

from factor_analyzer import FactorAnalyzer

# Fit model
fa = FactorAnalyzer(n_factors=3, rotation='oblimin', method='ml')
fa.fit(data)

# Loadings
print(fa.loadings_)

# Communalities
print(fa.get_communalities())

# Eigenvalues
print(fa.get_eigenvalues())

Summary: Maximum Likelihood Estimation

Maximum likelihood estimation provides a principled, inferentially rich framework for factor analysis. Let's consolidate the key concepts:

Key Takeaways

•The ML fitting function minimizes discrepancy between model-implied and sample covariance: minimize ln|Σ| + tr(SΣ⁻¹) - ln|S| - p.
•Iterative optimization is required: no closed-form solution exists. EM, Newton-Raphson, and quasi-Newton methods are used.
•Standard errors are available from the information matrix, enabling inference about loadings (with caveats in EFA).
•Goodness-of-fit is assessed via χ² test (with known limitations) and descriptive indices (RMSEA, CFI, SRMR).
•Heywood cases (negative uniquenesses) indicate model problems—investigate root causes, don't just constrain.
•Alternative methods (PAF, MINRES) provide options when normality fails or ML doesn't converge.
•Practical workflow involves careful data prep, factor number determination, estimation, rotation, and fit assessment.

What's Next

With estimation and model assessment covered, the final page addresses Interpretation—how to make sense of factor solutions, name factors, communicate findings, and connect factor analysis to broader research contexts.

Estimation Foundations Complete

You now understand how ML estimation works in factor analysis, from the likelihood function through optimization, inference, and fit assessment. This technical foundation enables both proper application and critical evaluation of factor analyses you encounter.

4 / 5

Loading learning content...

Machine LearningDimensionality Reduction

Factor Analysis

LevelAdvanced

Duration90 mins

TopicDimensionality Reduction

4 / 5

Maximum Likelihood Estimation

Principled Parameter Estimation

Asymptotically efficient estimates: Among consistent estimators, ML has the smallest variance as sample size grows
Standard errors: Allowing inference about whether loadings differ from zero
Goodness-of-fit tests: Chi-square statistics to assess model adequacy
Model comparison criteria: AIC, BIC for selecting number of factors

What You Will Learn

The Likelihood Function

The Multivariate Normal Likelihood

Under the factor model with Gaussian assumptions, the marginal distribution of observed x is:

$$\mathbf{x} \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$$

where the model-implied covariance is:

$$\boldsymbol{\Sigma} = \mathbf{\Lambda \Lambda}' + \boldsymbol{\Psi}$$

Given n independent observations x₁, x₂, ..., xₙ, the likelihood is the product of individual densities:

The Log-Likelihood

Taking the natural logarithm (and ignoring constants):

where:

S is the sample covariance matrix: $\mathbf{S} = \frac{1}{n} \sum_{i=1}^{n} (\mathbf{x}_i - \bar{\mathbf{x}})(\mathbf{x}_i - \bar{\mathbf{x}})'$
$\bar{\mathbf{x}}$ is the sample mean

The ML estimate of μ is simply $\hat{\boldsymbol{\mu}} = \bar{\mathbf{x}}$ (the sample mean). For covariance structure, we can work with centered data.

The Fitting Function

Maximum likelihood is often expressed as minimizing a fitting function F rather than maximizing the log-likelihood. The ML fitting function is:

F_ML = ln|Σ| + tr(SΣ⁻¹) - ln|S| - p

This equals zero when Σ = S (perfect fit). It's minimized when the model-implied covariance best matches the sample covariance.

Concentrating Out the Mean

Since $\hat{\boldsymbol{\mu}} = \bar{\mathbf{x}}$ regardless of Λ and Ψ, we can work with the concentrated log-likelihood that depends only on the covariance parameters:

$$\ell^*(\mathbf{\Lambda}, \boldsymbol{\Psi}) = -\frac{n}{2} \left[ \ln |\boldsymbol{\Sigma}| + \text{tr}(\boldsymbol{\Sigma}^{-1} \mathbf{S}) \right]$$

This focuses on matching the model-implied covariance Σ = ΛΛ' + Ψ to the sample covariance S.

Parameter Count and Identification

Free parameters:

Λ: p × k loadings
Ψ: p uniquenesses (diagonal elements)
Total: pk + p parameters

However, the model is not identified without constraints because of rotational indeterminacy. Standard constraint: require Λ'Ψ⁻¹Λ to be diagonal. This fixes k(k-1)/2 parameters, giving:

Net free parameters: pk + p - k(k-1)/2

Degrees of freedom for the model:

$$df = \frac{(p)(p+1)}{2} - \left[ pk + p - \frac{k(k-1)}{2} \right] = \frac{(p-k)^2 - (p+k)}{2}$$

For the model to be testable, we need df > 0.

Degrees of Freedom for Common Configurations
Variables (p)	Factors (k)	Free Parameters	df
5	1	10	5
5	2	13	1
8	2	22	13
10	3	40	18
20	5	130	100

The Optimization Process

Iterative Optimization

The ML estimates $\hat{\mathbf{\Lambda}}$ and $\hat{\boldsymbol{\Psi}}$ are found by maximizing the log-likelihood iteratively. There is no closed-form solution; numerical optimization is required.

Common algorithms:

Newton-Raphson: Uses first (gradient) and second (Hessian) derivatives
- Fast convergence near the optimum
- Requires Hessian computation (expensive)
Fisher Scoring: Similar to Newton-Raphson but uses expected Hessian
- More stable in some cases
- Standard in many FA implementations
EM Algorithm: Treats factors as missing data
- E-step: Compute expected factor scores
- M-step: Update Λ and Ψ
- Guaranteed to increase likelihood each iteration
- May converge slowly
Quasi-Newton (BFGS, L-BFGS): Approximates Hessian
- Good for high-dimensional problems
- Less per-iteration computation

Initialization and Convergence

Starting values matter. Poor initialization can lead to:

Slow convergence
Convergence to a local (not global) maximum
Non-convergence

Common initialization strategies:

Principal components (use PCA loadings scaled by eigenvalue)
Prior reliable communality estimates
Random starts with selection of best solution

Convergence Issues

Convergence failure can occur due to: • Too few observations (n < 3p recommended) • Too many factors for the data • Model misspecification • Linear dependencies among variables

If convergence fails, try: fewer factors, simplify the model, check for constant or nearly constant variables, or increase sample size.

The EM Algorithm for Factor Analysis

The EM algorithm treats the latent factors z as "missing data":

E-step: Given current Λ⁽ᵗ⁾, Ψ⁽ᵗ⁾, compute:

Expected factor scores: E[z|x] = Λ'(ΛΛ' + Ψ)⁻¹(x - μ)
Expected factor covariance: E[zz'|x]

M-step: Given expected sufficient statistics, update:

$$\mathbf{\Lambda}^{(t+1)} = \mathbf{S} \cdot \mathbf{\Beta}' \cdot (\mathbf{\Beta} \mathbf{S} \mathbf{\Beta}' + \mathbf{I} - \mathbf{\Beta} \mathbf{\Lambda}^{(t)})^{-1}$$

where β = Λ'Σ⁻¹ (factor score regression weights).

$$\psi_j^{(t+1)} = s_{jj} - \sum_m \lambda_{jm}^{(t+1)} \beta_{mj}$$

Iterate until convergence (change in log-likelihood < tolerance).

EM advantages: Monotonic likelihood increase, handles some missing data naturally. EM disadvantages: Can be slow; may converge to saddle points or local maxima.

ml_factor_analysis.py
Python
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
import numpy as np
from scipy.optimize import minimize
from scipy import linalg
 
class MaximumLikelihoodFactorAnalysis:
    """
    Maximum Likelihood Factor Analysis implementation.
    
    Minimizes the ML fitting function to estimate factor loadings
    and uniquenesses from a sample covariance matrix.
    """
    
    def __init__(self, n_factors, max_iter=1000, tol=1e-8):
        self.n_factors = n_factors
        self.max_iter = max_iter
        self.tol = tol
        
    def _fitting_function(self, params, S, p, k):
        """
        ML fitting function: F = log|Σ| + tr(SΣ⁻¹) - log|S| - p
        
        params: flattened [Λ, log(ψ)] to ensure ψ > 0
        """
        # Unpack parameters
        Lambda = params[:p*k].reshape(p, k)
        log_psi = params[p*k:]
        Psi = np.diag(np.exp(log_psi))  # Ensure positivity
        
        # Model-implied covariance
        Sigma = Lambda @ Lambda.T + Psi
        
        try:
            # Compute fitting function
            sign, logdet_Sigma = np.linalg.slogdet(Sigma)
            if sign <= 0:
                return 1e10  # Invalid covariance
            
            Sigma_inv = linalg.inv(Sigma)
            
            _, logdet_S = np.linalg.slogdet(S)
            
            F = logdet_Sigma + np.trace(S @ Sigma_inv) - logdet_S - p
            
            return F
            
        except linalg.LinAlgError:
            return 1e10  # Singular matrix
    
    def fit(self, X):
        """
        Fit factor analysis model using ML estimation.
        
        Parameters
        ----------
        X : ndarray of shape (n, p)
            Observed data matrix
            
        Returns
        -------
        self
        """
        n, p = X.shape
        k = self.n_factors
        
        # Sample covariance matrix
        X_centered = X - X.mean(axis=0)
        S = (X_centered.T @ X_centered) / n
        
        # Initialize with PCA
        eigenvalues, eigenvectors = linalg.eigh(S)
        idx = np.argsort(eigenvalues)[::-1]
        eigenvalues = eigenvalues[idx]
        eigenvectors = eigenvectors[:, idx]
        
        Lambda_init = eigenvectors[:, :k] * np.sqrt(np.maximum(eigenvalues[:k], 0.01))
        psi_init = np.maximum(np.diag(S) - np.sum(Lambda_init**2, axis=1), 0.01)
        
        # Pack parameters (use log(psi) for positivity)
        params_init = np.concatenate([Lambda_init.ravel(), np.log(psi_init)])
        
        # Optimize
        result = minimize(
            self._fitting_function,
            params_init,
            args=(S, p, k),
            method='L-BFGS-B',
            options={'maxiter': self.max_iter, 'ftol': self.tol}
        )
        
        # Extract results
        self.Lambda_ = result.x[:p*k].reshape(p, k)
        self.psi_ = np.exp(result.x[p*k:])
        self.Psi_ = np.diag(self.psi_)
        
        # Implied covariance
        self.Sigma_ = self.Lambda_ @ self.Lambda_.T + self.Psi_
        
        # Communalities
        self.communalities_ = np.sum(self.Lambda_**2, axis=1)
        
        # Fitting function value and test statistic
        self.F_ = result.fun
        self.chi2_ = n * self.F_
        self.df_ = ((p - k)**2 - (p + k)) // 2
        
        # Store sample info
        self.n_samples_ = n
        self.n_features_ = p
        self.S_ = S
        
        return self
    
    def get_standard_errors(self):
        """
        Compute approximate standard errors for loadings.
        Based on asymptotic covariance of ML estimates.
        """
        # This is a simplified version; exact SE computation
        # requires the information matrix, which is complex
        # For production use, employ specialized SE computation
        
        # Rough approximation: SE ≈ sqrt((1-h²)/(n-1))
        n = self.n_samples_
        h2 = self.communalities_
        se_approx = np.sqrt((1 - h2) / (n - 1))
        
        return se_approx
 
# Example usage
np.random.seed(42)
 
# Generate data from a 2-factor model
n, p, k = 200, 6, 2
z = np.random.randn(n, k)
Lambda_true = np.array([
    [0.8, 0.1], [0.7, 0.2], [0.75, 0.1],
    [0.1, 0.8], [0.2, 0.7], [0.1, 0.75]
])
psi_true = np.array([0.35, 0.47, 0.42, 0.35, 0.47, 0.42])
X = z @ Lambda_true.T + np.random.randn(n, p) * np.sqrt(psi_true)
 
# Fit model
mlfa = MaximumLikelihoodFactorAnalysis(n_factors=2)
mlfa.fit(X)
 
print("Estimated Loadings:")
print(np.round(mlfa.Lambda_, 3))
print("\nTrue Loadings:")
print(Lambda_true)
print("\nEstimated Uniquenesses:", np.round(mlfa.psi_, 3))
print("True Uniquenesses:", psi_true)
print(f"\nChi-square: {mlfa.chi2_:.2f}, df: {mlfa.df_}")

Standard Errors and Inference

Asymptotic Theory

Under regularity conditions, ML estimates are:

Consistent: $\hat{\theta} \xrightarrow{p} \theta$ as n → ∞
Asymptotically normal: $\sqrt{n}(\hat{\theta} - \theta) \xrightarrow{d} \mathcal{N}(0, \mathbf{I}^{-1})$
Asymptotically efficient: Achieve the Cramér-Rao lower bound

The expected information matrix I is: $$\mathbf{I} = -E\left[ \frac{\partial^2 \ell}{\partial \theta \partial \theta'} \right]$$

The asymptotic covariance of estimates is I⁻¹/n.

Standard Errors for Loadings

For each loading λᵢⱼ, the standard error allows construction of confidence intervals and z-tests:

$$SE(\hat{\lambda}{ij}) = \sqrt{[\mathbf{I}^{-1}]{(i,j),(i,j)} / n}$$

Practical computation: Modern software computes the observed or expected information matrix and extracts standard errors from its inverse.

Testing Individual Loadings

To test H₀: λᵢⱼ = 0:

z-statistic: $$z = \frac{\hat{\lambda}{ij}}{SE(\hat{\lambda}{ij})}$$

Under H₀, z ~ N(0,1) asymptotically. A loading is "significant" if |z| > 1.96 (α = 0.05).

Caution with Loading Tests

Loading significance tests are more appropriate in confirmatory FA where specific loadings are constrained to zero by hypothesis.

Robust Standard Errors

The normality assumption often fails. Robust (sandwich) standard errors account for non-normality:

$$\hat{\mathbf{V}}_{robust} = \mathbf{I}^{-1} \mathbf{B} \mathbf{I}^{-1}$$

where B is estimated from the outer product of score vectors, capturing the actual variability of the data.

When to use robust SEs:

Multivariate non-normality (skewness, kurtosis)
Violation of distributional assumptions
Cautious practice in applied research

Software support: Most modern SEM software (Mplus, lavaan) offers robust SEs as an option (e.g., MLR, MLM estimators).

Bootstrap Standard Errors

An alternative to analytic SEs is bootstrapping:

Resample n observations with replacement
Re-estimate the factor model on the bootstrap sample
Repeat B times (e.g., B = 1000)
Standard error = Standard deviation of bootstrap estimates

Advantages:

No distributional assumptions
Works for complex statistics (factor scores, factor correlations after rotation)

Disadvantages:

Computationally intensive
Requires convergence on each bootstrap sample
Confidence interval coverage may not be exact

Standard Error Methods Comparison
Method	Distributional Assumption	Computational Cost	When to Use
ML (expected information)	Multivariate normal	Low	Default if normality holds
ML (observed information)	Multivariate normal	Low	Slightly more robust to model misspec
Robust (sandwich)	Relaxed normality	Moderate	Non-normal data, cautious practice
Bootstrap	None (nonparametric)	High	Complex statistics, outliers

Goodness-of-Fit Assessment

Unlike PCA, factor analysis is a model that can be right or wrong. Fit assessment asks: does the hypothesized factor structure adequately explain the observed covariances?

The Likelihood Ratio Test

The fundamental test compares the factor model to a saturated model (which exactly reproduces S):

Test statistic: $$\chi^2 = n \cdot F_{ML}$$

where F_ML is the minimized fitting function.

Under H₀ (k factors are sufficient): $\chi^2 \sim \chi^2_{df}$

Degrees of freedom: df = [(p - k)² - (p + k)] / 2

Interpretation:

Significant χ² (p < 0.05): Reject the k-factor model
Non-significant χ² (p > 0.05): Do not reject; k factors may be sufficient

The Problem with χ²

The χ² test has well-known limitations:

Sensitive to sample size: With large n, even trivial misfits are significant
Sensitive to normality: Non-normality inflates χ²
All models are wrong: Rejecting a useful model because it's not perfect is counterproductive

Thus, we supplement χ² with descriptive fit indices.

Common Fit Indices

•RMSEA (Root Mean Square Error of Approximation): √[(χ² - df)/(df × n)]. Good fit: < 0.05; Acceptable: < 0.08; Poor: > 0.10
•CFI (Comparative Fit Index): Compares to null model. Good fit: > 0.95; Acceptable: > 0.90
•TLI (Tucker-Lewis Index): Like CFI but penalizes complexity. Good fit: > 0.95
•SRMR (Standardized Root Mean Residual): Average absolute residual correlation. Good fit: < 0.08

Fit Index Guidelines

A model with RMSEA < 0.06, CFI > 0.95, and SRMR < 0.08 has good fit by conventional standards.

Information Criteria

For comparing models with different numbers of factors:

AIC (Akaike Information Criterion): $$AIC = -2 \ell + 2q$$

where q is the number of free parameters and ℓ is the maximized log-likelihood.

BIC (Bayesian Information Criterion): $$BIC = -2 \ell + q \ln(n)$$

Using AIC/BIC:

Fit models with k = 1, 2, 3, ... factors
Select the model with lowest AIC or BIC
BIC penalizes complexity more heavily, tends to favor simpler models

Residual Analysis

Residual covariance matrix: $$\mathbf{R} = \mathbf{S} - \hat{\boldsymbol{\Sigma}}$$

where $\hat{\boldsymbol{\Sigma}} = \hat{\mathbf{\Lambda}}\hat{\mathbf{\Lambda}}' + \hat{\boldsymbol{\Psi}}$.

Standardized residuals: $r_{ij}^* = r_{ij} / \sqrt{var(r_{ij})}$

Large residuals (|r*| > 2-3) indicate specific misfits
Patterns in residuals suggest missing factors or local dependencies

This is analogous to regression residual analysis.

Fit Index Thresholds (Conventional Guidelines)
Index	Excellent Fit	Acceptable Fit	Poor Fit
χ²/df	< 2	< 3	3
RMSEA	< 0.05	0.05 - 0.08	0.10
CFI	0.97	0.95 - 0.97	< 0.90
TLI	0.97	0.95 - 0.97	< 0.90
SRMR	< 0.05	0.05 - 0.08	0.10

Improper Solutions and Heywood Cases

The Heywood Case Problem

A Heywood case occurs when:

An estimated uniqueness ψᵢ ≤ 0 (impossible—variance cannot be negative)
Equivalently, communality h²ᵢ ≥ 1 (impossible—cannot explain more than 100%)

This is an improper solution—parameter estimates that violate constraints inherent to the model.

Why Heywood Cases Occur

1. Model misspecification:

Wrong number of factors (often too few)
Missing factors that would account for the variance

2. Sampling error:

Small samples produce unstable estimates
Random fluctuation can push estimates to boundaries

3. Insufficient indicator variables:

Factors defined by too few variables (1-2 indicators per factor)
Identification problems

4. Near-zero true uniqueness:

Some variables may genuinely have near-zero unique variance
Slight sampling error pushes the estimate negative

Detecting Heywood Cases

Heywood cases are often flagged by software: • "Negative variance estimate" • "Communality > 1.0" • "Improper solution" • Uniqueness "constrained at boundary"

Always check communalities and uniquenesses after estimation. Any h² > 0.99 is suspicious.

Remedies for Heywood Cases

1. Constrain uniquenesses to be positive: Most software allows fixing ψᵢ ≥ ε (small positive number). This yields a proper solution but may mask underlying problems.

3. Examine the offending variable:

Is the variable well-measured?
Does it correlate extremely highly with others?
Consider removing it from the analysis

4. Increase sample size: If sampling error is the cause, more data stabilizes estimates.

5. Use ridge FA or regularization: Adding a small constant to the diagonal of S can prevent ill-conditioning.

Ultra-Heywood Cases

Sometimes loadings become extremely large (implausible values like λ = 5). This is a sign of severe model problems:

Complete model misspecification
Collinearity issues
Numerical instability

Action: Do not interpret such solutions. Revisit model specification fundamentally.

Warning Signs of Improper Solutions

•Communality > 0.99 or = 1.00: Very unusual in practice; suggests estimation boundary
•Negative uniqueness: Definitional impossibility
•Loading > 1.0: Can occur with oblique rotation, but loadings >> 1 are suspect
•Standard errors = 0 or NA: Often indicates parameters at boundaries
•Non-convergence with 'solution found': Algorithm hit bounds, not true optimum

Alternative Estimation Methods

While ML is the gold standard, other estimation methods are available, each with advantages and drawbacks.

Principal Axis Factoring (PAF)

Also called Principal Factor Analysis or Iterated Principal Factor.

Procedure:

Replace diagonal of R with estimated communalities (initially 1 or SMC)
Extract eigenvalues and eigenvectors of the reduced matrix
Update communalities as sum of squared loadings
Repeat until communalities converge

Advantages:

No distributional assumptions
Computationally simpler than ML
Sometimes converges when ML fails

Disadvantages:

No standard errors
No formal fit test
May converge to different solution than ML

When to use: Exploratory analysis when normality is doubtful and formal testing is not needed.

Minimum Residual (MINRES)

Minimizes the sum of squared off-diagonal residuals:

$$\min_{\Lambda, \Psi} \sum_{i \neq j} (s_{ij} - \sigma_{ij})^2$$

Advantages:

No distributional assumptions
Focuses on reproducing correlations (off-diagonal)
Robust to some types of misspecification

Disadvantages:

Less efficient than ML under correct model
No formal inference

Weighted Least Squares (WLS)

WLS methods minimize a weighted sum of residuals: F_WLS = (s - σ)'W⁻¹(s - σ) where s and σ are vectorized covariances and W is a weight matrix.

Comparison of Estimation Methods

Method	Distributional Assumptions	Standard Errors	Fit Test	Efficiency	Robustness
ML	Multivariate normal	Yes	Yes (χ²)	Best under normality	Poor under non-normality
PAF	None	No	No	Lower	More robust
MINRES	None	No	No	Lower	More robust
GLS	Multivariate normal	Yes	Yes	Similar to ML	Poor under non-normality
ULS	None	Limited	Limited	Lower	Most robust

Which to Choose?

Use ML when:

Distributional assumptions are approximately met
You need standard errors and fit tests
Sample size is adequate (n > 200 or so)

Use PAF/MINRES when:

Data are clearly non-normal
Exploratory analysis without formal testing
ML doesn't converge
Quick approximate solution is sufficient

Use robust ML (MLR, MLM) when:

Some non-normality but you want inferential results
Available in Mplus, lavaan, and other SEM software

Software Support for Estimation Methods
Software	ML	PAF	MINRES	ULS	Robust ML
SPSS FACTOR	Yes	Yes	No	Yes (ULS)	No
SAS PROC FACTOR	Yes	Yes	No	Yes	No
R psych	Yes	Yes	Yes	No	No
R lavaan	Yes	No	No	Yes (ULSMV)	Yes (MLR, MLM)
Mplus	Yes	No	No	Yes (ULSMV)	Yes (MLR)
Python sklearn	Yes	No	No	No	No

Practical ML Estimation Workflow

Step-by-Step Procedure

1. Data Preparation:

Check for missing data (consider imputation or FIML)
Examine univariate distributions (skewness, kurtosis)
Check multivariate normality (Mardia's test)
Standardize if using correlation matrix

2. Determine Number of Factors:

Eigenvalue analysis (Kaiser rule as starting point)
Scree plot
Parallel analysis (more reliable)
Theory and interpretability

3. Run Initial ML Estimation:

Extract k factors
Check convergence
Check for Heywood cases

4. Rotate Solution:

Apply varimax or oblimin rotation
Examine loading patterns

5. Assess Model Fit:

χ² test (with caution)
RMSEA, CFI, TLI, SRMR
Examine residuals

6. Refine If Needed:

Try k-1 or k+1 factors
Consider removing poorly fitting items
Re-estimate and compare fit

Checklist Before Reporting Results

•✓ Model converged without warnings
•✓ No Heywood cases or boundary solutions
•✓ All communalities are in (0, 1) range
•✓ Fit indices are acceptable
•✓ Loading pattern is interpretable
•✓ Factors make substantive sense
•✓ Solution is not radically different from theory

Documentation Is Key

This enables replication and demonstrates methodological rigor.

Common R Code (lavaan)

library(lavaan)

# Specify 3-factor EFA model
efa_model <- efa(
  data = mydata,
  nfactors = 3,
  rotation = "oblimin"
)

# Summary with fit indices
summary(efa_model, fit.measures = TRUE, standardized = TRUE)

# Get loadings
inspect(efa_model, what = "std")$lambda

# Get factor correlations
lavInspect(efa_model, what = "cov.lv")

Common Python Code (factor_analyzer)

from factor_analyzer import FactorAnalyzer

# Fit model
fa = FactorAnalyzer(n_factors=3, rotation='oblimin', method='ml')
fa.fit(data)

# Loadings
print(fa.loadings_)

# Communalities
print(fa.get_communalities())

# Eigenvalues
print(fa.get_eigenvalues())

Summary: Maximum Likelihood Estimation

Maximum likelihood estimation provides a principled, inferentially rich framework for factor analysis. Let's consolidate the key concepts:

Key Takeaways

•The ML fitting function minimizes discrepancy between model-implied and sample covariance: minimize ln|Σ| + tr(SΣ⁻¹) - ln|S| - p.
•Iterative optimization is required: no closed-form solution exists. EM, Newton-Raphson, and quasi-Newton methods are used.
•Standard errors are available from the information matrix, enabling inference about loadings (with caveats in EFA).
•Goodness-of-fit is assessed via χ² test (with known limitations) and descriptive indices (RMSEA, CFI, SRMR).
•Heywood cases (negative uniquenesses) indicate model problems—investigate root causes, don't just constrain.
•Alternative methods (PAF, MINRES) provide options when normality fails or ML doesn't converge.
•Practical workflow involves careful data prep, factor number determination, estimation, rotation, and fit assessment.

What's Next

Estimation Foundations Complete

4 / 5