The theoretical foundations of ICA—independence, non-Gaussianity, and negentropy—lead to a natural question: how do we actually compute the demixing matrix in practice? Many algorithms have been proposed, but one stands out for its elegance, efficiency, and robustness: FastICA.
Developed by Aapo Hyvärinen and Erkki Oja in the late 1990s, FastICA has become the de facto standard for Independent Component Analysis. It achieves source separation through a remarkably simple fixed-point iteration that converges rapidly to components that maximize non-Gaussianity (approximated negentropy).
FastICA's appeal comes from several properties: there is no step-size parameter to tune, convergence near a solution is very fast (cubic, or at least quadratic), and components can be extracted either one at a time or all at once.
This page develops FastICA from first principles, deriving the fixed-point iteration, explaining deflation and symmetric extraction strategies, and providing complete implementation details.
By the end of this page, you will understand the mathematical derivation of the FastICA fixed-point iteration, be able to implement both deflation and symmetric variants, understand convergence properties and stopping criteria, and be equipped to apply FastICA to real source separation problems.
FastICA works on whitened data. Recall from our preprocessing discussion that whitening transforms the observed data $\mathbf{x}$ into $\mathbf{z}$ with identity covariance:
$$\mathbf{z} = \mathbf{V}\mathbf{x}, \quad E[\mathbf{z}\mathbf{z}^T] = \mathbf{I}$$
After whitening, the effective mixing matrix $\tilde{\mathbf{A}} = \mathbf{V}\mathbf{A}$ becomes orthogonal. The ICA problem reduces to finding an orthogonal demixing matrix $\tilde{\mathbf{W}}$.
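As a quick numerical illustration of this preprocessing step (the 2×2 mixing matrix and uniform sources below are hypothetical), we can verify that whitening drives the covariance to the identity:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical example: mix two unit-variance, non-Gaussian (uniform) sources
S = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(2, 10_000))
A = np.array([[1.0, 0.5],
              [0.3, 1.0]])
X = A @ S

# Whitening: V = D^{-1/2} E^T from the eigendecomposition of the covariance
X = X - X.mean(axis=1, keepdims=True)
C = (X @ X.T) / X.shape[1]
d, E = np.linalg.eigh(C)
V = np.diag(1.0 / np.sqrt(d)) @ E.T
Z = V @ X

# After whitening, the sample covariance is (numerically) the identity
print(np.allclose((Z @ Z.T) / Z.shape[1], np.eye(2), atol=1e-8))
```

Because $\mathbf{V}$ is built from the eigendecomposition of the sample covariance itself, the whitened covariance equals the identity up to floating-point error, regardless of the mixing matrix.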
Single Component Extraction
To find one independent component, we seek a vector $\mathbf{w}$ (a row of $\tilde{\mathbf{W}}$) such that:
$$y = \mathbf{w}^T\mathbf{z}$$
is maximally non-Gaussian, subject to the constraint $|\mathbf{w}| = 1$.
The Negentropy Objective
Using the negentropy approximation from the previous page:
$$J(y) \approx [E[G(y)] - E[G(\nu)]]^2$$
where $\nu \sim N(0,1)$ and $G$ is a non-quadratic function (e.g., $G(u) = \log\cosh(u)$).
Since $E[G(\nu)]$ is a constant, maximizing $J(y)$ is equivalent to:
$$\max_{\mathbf{w}: |\mathbf{w}|=1} |E[G(\mathbf{w}^T\mathbf{z})]|$$
For symmetric distributions (typical case), we can drop the absolute value and maximize $E[G(\mathbf{w}^T\mathbf{z})]$ for super-Gaussian sources or minimize for sub-Gaussian.
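A small numerical illustration of this objective, comparing a unit-variance Laplacian (super-Gaussian) sample against a Gaussian one (sample size and seed are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(42)
N = 200_000

G = lambda u: np.log(np.cosh(u))
E_G_gauss = np.mean(G(rng.standard_normal(N)))  # Monte Carlo estimate of E[G(nu)]

# Unit-variance samples: Laplace(scale=1/sqrt(2)) has variance 1
y_lap = rng.laplace(scale=1 / np.sqrt(2), size=N)
y_gauss = rng.standard_normal(N)

J = lambda y: (np.mean(G(y)) - E_G_gauss) ** 2

# The non-Gaussian signal scores much higher; a fresh Gaussian sample scores near 0
print(J(y_lap), J(y_gauss))
```

Note that for a super-Gaussian source with $G = \log\cosh$, $E[G(y)]$ falls *below* the Gaussian reference, which is why the objective squares the difference (or keeps the absolute value) rather than maximizing $E[G(y)]$ directly.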
After whitening, the constraint set becomes the unit sphere $|\mathbf{w}| = 1$ rather than a more complex manifold. This simplifies the optimization and ensures that all directions have equal "scale" in terms of variance, so non-Gaussianity can be compared directly across directions without variance confounds.
The Lagrangian Formulation
The constrained optimization problem:
$$\max_{\mathbf{w}} E[G(\mathbf{w}^T\mathbf{z})] \quad \text{subject to } |\mathbf{w}|^2 = 1$$
has Lagrangian:
$$\mathcal{L}(\mathbf{w}, \lambda) = E[G(\mathbf{w}^T\mathbf{z})] - \lambda(|\mathbf{w}|^2 - 1)$$
Gradient and Stationarity
Taking the gradient with respect to $\mathbf{w}$:
$$\nabla_{\mathbf{w}} \mathcal{L} = E[\mathbf{z}g(\mathbf{w}^T\mathbf{z})] - 2\lambda\mathbf{w}$$
where $g(u) = G'(u)$ is the derivative of $G$.
At a stationary point:
$$E[\mathbf{z}g(\mathbf{w}^T\mathbf{z})] = \lambda\mathbf{w}$$
(absorbing the factor of 2 into $\lambda$).
This is a nonlinear equation in $\mathbf{w}$—the expectation on the left depends on $\mathbf{w}$ through the nonlinearity $g$.
Newton's Method Approach
FastICA uses an approximate Newton method to solve this equation. The key insight is that for whitened data, the Hessian simplifies, leading to a particularly efficient fixed-point iteration.
The FastICA iteration emerges from applying Newton's method to the stationarity condition, with a crucial simplification enabled by the whitening assumption.
Newton's Method for the KKT Condition
We want to find $\mathbf{w}$ satisfying:
$$\mathbf{F}(\mathbf{w}) = E[\mathbf{z}g(\mathbf{w}^T\mathbf{z})] - \lambda\mathbf{w} = \mathbf{0}$$
Newton's method would update:
$$\mathbf{w}_{\text{new}} = \mathbf{w} - [\mathbf{J}_F(\mathbf{w})]^{-1}\mathbf{F}(\mathbf{w})$$
where $\mathbf{J}_F$ is the Jacobian of $\mathbf{F}$.
Computing the Jacobian
The Jacobian is:
$$\mathbf{J}_F(\mathbf{w}) = E[\mathbf{z}\mathbf{z}^T g'(\mathbf{w}^T\mathbf{z})] - \lambda\mathbf{I}$$
For whitened data, $E[\mathbf{z}\mathbf{z}^T] = \mathbf{I}$. A key approximation (valid for large samples and well-behaved $g'$):
$$E[\mathbf{z}\mathbf{z}^T g'(\mathbf{w}^T\mathbf{z})] \approx E[g'(\mathbf{w}^T\mathbf{z})] \cdot E[\mathbf{z}\mathbf{z}^T] = E[g'(\mathbf{w}^T\mathbf{z})] \cdot \mathbf{I}$$
This decorrelation approximation treats $\mathbf{z}\mathbf{z}^T$ as roughly uncorrelated with the scalar $g'(\mathbf{w}^T\mathbf{z})$, so the expectation of their product factorizes. It does not hold exactly, but it is accurate enough to preserve Newton-like behavior in practice.
With this approximation:
$$\mathbf{J}_F(\mathbf{w}) \approx (E[g'(\mathbf{w}^T\mathbf{z})] - \lambda)\mathbf{I}$$
The approximation makes the Jacobian proportional to the identity matrix—a scalar times identity! This dramatically simplifies Newton's method: instead of inverting a matrix, we just divide by a scalar. This is the key to FastICA's speed.
The FastICA Update
Substituting back into Newton's method:
$$\mathbf{w}_{\text{new}} = \mathbf{w} - \frac{E[\mathbf{z}g(\mathbf{w}^T\mathbf{z})] - \lambda\mathbf{w}}{E[g'(\mathbf{w}^T\mathbf{z})] - \lambda}$$
Multiplying through by the scalar $E[g'(\mathbf{w}^T\mathbf{z})] - \lambda$ and simplifying (scalar factors, including the overall sign, are harmless because of the subsequent normalization):
$$\mathbf{w}_{\text{new}} = E[\mathbf{z}g(\mathbf{w}^T\mathbf{z})] - E[g'(\mathbf{w}^T\mathbf{z})]\mathbf{w}$$
After this update, we renormalize to maintain unit length:
$$\mathbf{w}_{\text{new}} \leftarrow \frac{\mathbf{w}_{\text{new}}}{|\mathbf{w}_{\text{new}}|}$$
The Complete FastICA Fixed-Point Rule
$$\boxed{\mathbf{w} \leftarrow E[\mathbf{z}g(\mathbf{w}^T\mathbf{z})] - E[g'(\mathbf{w}^T\mathbf{z})]\mathbf{w}}$$ $$\boxed{\mathbf{w} \leftarrow \frac{\mathbf{w}}{|\mathbf{w}|}}$$
Repeat until convergence.
Intuition
The first term $E[\mathbf{z}g(\mathbf{w}^T\mathbf{z})]$ pushes $\mathbf{w}$ toward directions of high non-Gaussianity (where $g$ outputs large values).
The second term $-E[g'(\mathbf{w}^T\mathbf{z})]\mathbf{w}$ is a stabilizing correction that prevents divergence and ensures convergence to a fixed point.
Normalization ensures the constraint $|\mathbf{w}| = 1$ is maintained.
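As a sanity check, the boxed update rule can be run on a tiny synthetic problem. The mixing matrix, seed, and uniform sources below are illustrative assumptions, not part of the derivation:

```python
import numpy as np

rng = np.random.default_rng(1)

# Two unit-variance uniform sources, a hypothetical mixing matrix, then whitening
S = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(2, 20_000))
X = np.array([[1.0, 0.6], [0.4, 1.0]]) @ S
X = X - X.mean(axis=1, keepdims=True)
d, E = np.linalg.eigh(X @ X.T / X.shape[1])
Z = np.diag(d ** -0.5) @ E.T @ X

g = np.tanh
g_prime = lambda u: 1 - np.tanh(u) ** 2

w = rng.standard_normal(2)
w /= np.linalg.norm(w)
for _ in range(200):
    w_old = w
    y = w @ Z
    w = np.mean(Z * g(y), axis=1) - np.mean(g_prime(y)) * w  # fixed-point step
    w /= np.linalg.norm(w)                                    # renormalize
    if abs(w @ w_old) > 1 - 1e-9:                             # sign-invariant check
        break

# The extracted signal should match one true source up to sign
s_hat = w @ Z
corr = max(abs(np.corrcoef(s_hat, S[0])[0, 1]),
           abs(np.corrcoef(s_hat, S[1])[0, 1]))
print(corr)  # close to 1
```

With clean, strongly non-Gaussian sources the loop typically terminates in well under a dozen iterations, which is the Newton-like speed at work.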
| Name | $G(u)$ | $g(u) = G'(u)$ | $g'(u)$ | Use Case |
|---|---|---|---|---|
| logcosh | $\log\cosh(u)$ | $\tanh(u)$ | $1 - \tanh^2(u)$ | General purpose, robust |
| exp | $-\exp(-u^2/2)$ | $u\exp(-u^2/2)$ | $(1-u^2)\exp(-u^2/2)$ | Outlier-robust |
| cube | $u^4/4$ | $u^3$ | $3u^2$ | Super-Gaussian, fast |
| skew | $u^3/3$ | $u^2$ | $2u$ | Asymmetric sources |
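A quick numerical check (via central finite differences) that the table's $g$ and $g'$ columns really are the successive derivatives of each $G$:

```python
import numpy as np

u = np.linspace(-3, 3, 601)
h = 1e-5

# (G, g = G', g' = G'') triples from the table
triples = {
    'logcosh': (lambda u: np.log(np.cosh(u)),
                np.tanh,
                lambda u: 1 - np.tanh(u) ** 2),
    'exp':     (lambda u: -np.exp(-u**2 / 2),
                lambda u: u * np.exp(-u**2 / 2),
                lambda u: (1 - u**2) * np.exp(-u**2 / 2)),
    'cube':    (lambda u: u**4 / 4,
                lambda u: u**3,
                lambda u: 3 * u**2),
}

for name, (G, g, gp) in triples.items():
    num_g = (G(u + h) - G(u - h)) / (2 * h)   # numerical G'
    num_gp = (g(u + h) - g(u - h)) / (2 * h)  # numerical g'
    assert np.allclose(num_g, g(u), atol=1e-6)
    assert np.allclose(num_gp, gp(u), atol=1e-6)

print("table entries consistent")
```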
The fixed-point iteration finds one independent component. To find all $n$ components, we need a strategy. Deflation extracts components sequentially, removing the influence of already-found components before finding the next.
The Deflation Strategy
Gram-Schmidt Orthogonalization
After each FastICA update step, we orthogonalize against previously found components:
$$\mathbf{w}_p \leftarrow \mathbf{w}_p - \sum_{j=1}^{p-1} (\mathbf{w}_p^T \mathbf{w}_j) \mathbf{w}_j$$ $$\mathbf{w}_p \leftarrow \frac{\mathbf{w}_p}{|\mathbf{w}_p|}$$
This ensures that the $p$-th component is orthogonal to all previously found components.
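A minimal sketch of this projection step, with two hypothetical orthonormal rows standing in for previously found components:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two orthonormal "previously found" directions (placeholders for w_1, w_2)
W_prev = np.linalg.qr(rng.standard_normal((4, 4)))[0][:, :2].T

w = rng.standard_normal(4)
for w_j in W_prev:                 # subtract projections onto earlier components
    w = w - (w @ w_j) * w_j
w = w / np.linalg.norm(w)          # renormalize

print(np.max(np.abs(W_prev @ w)))  # ~0: orthogonal to all previous components
```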
Why Orthogonality? (On Whitened Data)
For whitened data, the demixing matrix should be orthogonal. If we've found components $\mathbf{w}_1, \ldots, \mathbf{w}_{p-1}$ corresponding to sources $s_1, \ldots, s_{p-1}$, the remaining sources live in the orthogonal complement. Gram-Schmidt ensures we search in this subspace.
Deflation propagates errors: if $\mathbf{w}_1$ is slightly wrong, subsequent components inherit this error. The first few components are most accurate; later components may degrade. For applications requiring uniform accuracy across all components, symmetric approaches are preferred.
Deflation FastICA Algorithm
Input: Whitened data Z ∈ ℝⁿˣᵀ, number of components n, contrast function g
Output: Demixing matrix W ∈ ℝⁿˣⁿ
1. Initialize W = empty matrix
2. For p = 1 to n:
a. Initialize wₚ randomly, normalize: wₚ ← wₚ / ||wₚ||
b. Repeat:
i. wₚ ← (1/T) Σₜ z(t) g(wₚᵀz(t)) - (1/T) Σₜ g'(wₚᵀz(t)) · wₚ
ii. Orthogonalize: wₚ ← wₚ - Σⱼ₌₁ᵖ⁻¹ (wₚᵀwⱼ) wⱼ
iii. Normalize: wₚ ← wₚ / ||wₚ||
iv. Check convergence: if |wₚᵀwₚ_old| > 1 - ε, break
c. Add wₚ to W
3. Return W
Convergence Criterion
We converge when $\mathbf{w}$ stops changing. Since $\mathbf{w}$ and $-\mathbf{w}$ are equivalent (sign ambiguity), we check:
$$|\mathbf{w}_{\text{new}}^T \mathbf{w}_{\text{old}}| > 1 - \epsilon$$
where $\epsilon \approx 10^{-6}$ is a tolerance. If the angle between successive iterates is nearly zero (dot product near ±1), we've converged.
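A quick illustration that this criterion tolerates sign flips but not genuine rotations (the unit vectors and angle are arbitrary):

```python
import numpy as np

w_old = np.array([0.6, 0.8])   # an arbitrary unit vector
eps = 1e-6

# A pure sign flip gives |w_new . w_old| = 1: counted as converged
w_new = -w_old
print(abs(w_new @ w_old) > 1 - eps)

# A genuine rotation (0.1 rad) gives dot product cos(0.1) ~ 0.995: not converged
theta = 0.1
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
print(abs((R @ w_old) @ w_old) > 1 - eps)
```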
Properties of Deflation
| Advantage | Disadvantage |
|---|---|
| Simple to implement | Errors accumulate in later components |
| Can stop after $k < n$ components | Sequential, less parallelizable |
| Fast convergence per component | Sensitive to initialization for later components |
| Intuitive one-at-a-time approach | Ordering of components is arbitrary |
```python
import numpy as np

def fastica_deflation(Z, n_components, g, g_prime, tol=1e-6, max_iter=200):
    """
    FastICA using deflation approach.

    Parameters
    ----------
    Z : ndarray (n_features, n_samples)
        Whitened data matrix
    n_components : int
        Number of independent components to extract
    g : callable
        Contrast function g(u) = G'(u)
    g_prime : callable
        Derivative of contrast function g'(u) = G''(u)
    tol : float
        Convergence tolerance
    max_iter : int
        Maximum iterations per component

    Returns
    -------
    W : ndarray (n_components, n_features)
        Demixing matrix (rows are component directions)
    """
    n_features, n_samples = Z.shape
    W = np.zeros((n_components, n_features))

    for p in range(n_components):
        # Random initialization
        w = np.random.randn(n_features)
        w = w / np.linalg.norm(w)

        for iteration in range(max_iter):
            w_old = w.copy()

            # FastICA fixed-point update: y = w^T z for all samples
            y = w @ Z  # (n_samples,)

            # Update: E[z*g(y)] - E[g'(y)] * w
            w = np.mean(Z * g(y), axis=1) - np.mean(g_prime(y)) * w

            # Gram-Schmidt orthogonalization against previous components
            for j in range(p):
                w = w - np.dot(w, W[j]) * W[j]

            # Normalize
            w = w / np.linalg.norm(w)

            # Check convergence (absolute value handles the sign ambiguity)
            if np.abs(np.dot(w, w_old)) > 1 - tol:
                break

        W[p] = w

    return W


# Contrast functions
def g_logcosh(u, alpha=1.0):
    return np.tanh(alpha * u)

def g_prime_logcosh(u, alpha=1.0):
    return alpha * (1 - np.tanh(alpha * u) ** 2)

def g_exp(u):
    return u * np.exp(-u**2 / 2)

def g_prime_exp(u):
    return (1 - u**2) * np.exp(-u**2 / 2)
```

The symmetric approach finds all independent components simultaneously, avoiding the error accumulation problem of deflation. All component vectors are estimated in parallel and orthogonalized as a batch.
The Symmetric Update
Let $\mathbf{W} = [\mathbf{w}_1, \ldots, \mathbf{w}_n]^T$ be the demixing matrix (rows are component directions). The symmetric FastICA update is:
Apply fixed-point update to all rows: $$\mathbf{w}_i \leftarrow E[\mathbf{z}g(\mathbf{w}_i^T\mathbf{z})] - E[g'(\mathbf{w}_i^T\mathbf{z})]\mathbf{w}_i, \quad i = 1, \ldots, n$$
Symmetric orthogonalization: $$\mathbf{W} \leftarrow (\mathbf{W}\mathbf{W}^T)^{-1/2}\mathbf{W}$$
The symmetric orthogonalization ensures $\mathbf{W}\mathbf{W}^T = \mathbf{I}$ (orthonormal rows).
Computing the Matrix Square Root Inverse
Given $\mathbf{W}\mathbf{W}^T$, compute its eigen-decomposition: $$\mathbf{W}\mathbf{W}^T = \mathbf{U}\mathbf{\Lambda}\mathbf{U}^T$$
Then: $$(\mathbf{W}\mathbf{W}^T)^{-1/2} = \mathbf{U}\mathbf{\Lambda}^{-1/2}\mathbf{U}^T$$
Unlike Gram-Schmidt (which privileges early components), symmetric orthogonalization treats all components equally. The transformation $(\mathbf{W}\mathbf{W}^T)^{-1/2}\mathbf{W}$ finds the closest orthonormal matrix to $\mathbf{W}$ in a symmetric sense—no component is "more orthogonalized" than another.
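The square-root computation can be verified in a few lines (using NumPy's `eigh`; the 4×4 matrix is an arbitrary example):

```python
import numpy as np

rng = np.random.default_rng(3)
W = rng.standard_normal((4, 4))   # arbitrary non-orthogonal matrix

# (W W^T)^{-1/2} W via the eigendecomposition of W W^T
d, U = np.linalg.eigh(W @ W.T)
W_orth = U @ np.diag(d ** -0.5) @ U.T @ W

# The rows are now orthonormal: W_orth @ W_orth.T = I
print(np.allclose(W_orth @ W_orth.T, np.eye(4), atol=1e-8))
```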
Symmetric FastICA Algorithm
Input: Whitened data Z ∈ ℝⁿˣᵀ, contrast function g
Output: Demixing matrix W ∈ ℝⁿˣⁿ
1. Initialize W randomly, apply symmetric orthogonalization
2. Repeat:
a. For each row wᵢ of W:
wᵢ ← (1/T) Σₜ z(t) g(wᵢᵀz(t)) - (1/T) Σₜ g'(wᵢᵀz(t)) · wᵢ
b. Symmetric orthogonalization:
W ← (WWᵀ)^(-1/2) W
c. Check convergence: if ||W - W_old||_∞ < ε or ||W + W_old||_∞ < ε, break
3. Return W
Advantages of Symmetric FastICA
| Property | Benefit |
|---|---|
| All components estimated equally | No error accumulation |
| Parallel updates | Faster on parallel hardware |
| More stable convergence | Less sensitive to initialization |
| Global orthogonality | Better overall solution quality |
Disadvantages
| Property | Drawback |
|---|---|
| Must estimate all $n$ components | Cannot stop early |
| Matrix square root | Slightly more computation per iteration |
| May converge to different local minimum | Depends on initialization |
Convergence Check for Symmetric
Since all rows might flip signs, we check: $$\max_i \min(|\mathbf{w}_i^{\text{new}} - \mathbf{w}_i^{\text{old}}|, |\mathbf{w}_i^{\text{new}} + \mathbf{w}_i^{\text{old}}|) < \epsilon$$
or equivalently, that $|\mathbf{w}_i^{\text{new}} \cdot \mathbf{w}_i^{\text{old}}| \approx 1$ for all $i$.
```python
import numpy as np
from scipy import linalg

def fastica_symmetric(Z, g, g_prime, tol=1e-6, max_iter=200):
    """
    FastICA using symmetric approach.

    Parameters
    ----------
    Z : ndarray (n_features, n_samples)
        Whitened data matrix
    g : callable
        Contrast function g(u) = G'(u)
    g_prime : callable
        Derivative of contrast function g'(u)
    tol : float
        Convergence tolerance
    max_iter : int
        Maximum iterations

    Returns
    -------
    W : ndarray (n_features, n_features)
        Demixing matrix
    """
    n_features, n_samples = Z.shape

    # Random initialization
    W = np.random.randn(n_features, n_features)
    W = symmetric_orthogonalize(W)

    for iteration in range(max_iter):
        W_old = W.copy()

        # Fixed-point update for all components simultaneously
        # Y = W @ Z, shape (n_features, n_samples)
        Y = W @ Z

        # Update each row: E[z*g(y)] - E[g'(y)] * w, vectorized over all rows
        gY = g(Y)              # (n_features, n_samples)
        g_primeY = g_prime(Y)  # (n_features, n_samples)

        # E[z * g(w^T z)] for each row
        W_new = (Z @ gY.T) / n_samples  # (n_features, n_features)

        # Subtract E[g'(w^T z)] * w for each row
        W_new = W_new.T - np.diag(np.mean(g_primeY, axis=1)) @ W

        # Symmetric orthogonalization
        W = symmetric_orthogonalize(W_new)

        # Check convergence: each row's correlation with its previous value
        # should be near 1 (absolute value accounts for sign flips)
        converged = True
        for i in range(n_features):
            correlation = np.abs(np.dot(W[i], W_old[i]))
            if correlation < 1 - tol:
                converged = False
                break

        if converged:
            break

    return W


def symmetric_orthogonalize(W):
    """
    Symmetric orthogonalization: W = (W @ W.T)^(-1/2) @ W

    Finds the closest orthogonal matrix to W in the Frobenius norm.
    """
    # Eigen-decomposition of W @ W.T
    eigenvalues, eigenvectors = linalg.eigh(W @ W.T)

    # Compute (W @ W.T)^(-1/2); eigenvalues should be positive,
    # so clip at a small epsilon for numerical stability
    eigenvalues = np.maximum(eigenvalues, 1e-10)
    sqrt_inv = eigenvectors @ np.diag(1.0 / np.sqrt(eigenvalues)) @ eigenvectors.T

    return sqrt_inv @ W
```

A production FastICA implementation includes preprocessing, the core algorithm, and post-processing. Here we present the complete pipeline.
Step 1: Centering
Remove the mean from each observed signal: $$\tilde{\mathbf{x}}(t) = \mathbf{x}(t) - \frac{1}{T}\sum_{\tau=1}^{T}\mathbf{x}(\tau)$$
Step 2: Whitening
Compute the covariance matrix and its eigendecomposition: $$\mathbf{C} = \frac{1}{T}\sum_t \tilde{\mathbf{x}}(t)\tilde{\mathbf{x}}(t)^T = \mathbf{E}\mathbf{D}\mathbf{E}^T$$
Construct the whitening matrix: $$\mathbf{V} = \mathbf{D}^{-1/2}\mathbf{E}^T$$
Apply whitening: $$\mathbf{z}(t) = \mathbf{V}\tilde{\mathbf{x}}(t)$$
Step 3: FastICA
Apply deflation or symmetric FastICA to find demixing matrix $\tilde{\mathbf{W}}$ on whitened data.
Step 4: Recover Original-Space Demixing
The complete demixing matrix (for original centered data): $$\mathbf{W} = \tilde{\mathbf{W}}\mathbf{V}$$
Step 5: Recover Sources
$$\mathbf{s}(t) = \mathbf{W}\tilde{\mathbf{x}}(t)$$
Step 6: Recover Mixing Matrix (Optional)
$$\mathbf{A} = \mathbf{W}^{-1}$$
The columns of $\mathbf{A}$ show how each source contributes to each observation.
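The six steps above can be sketched end to end on synthetic data. The two uniform sources, the mixing matrix `A_true`, and the seed below are hypothetical choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(7)
T = 20_000

# Hypothetical ground truth: two unit-variance uniform sources, arbitrary mixing
S = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(2, T))
A_true = np.array([[1.0, 0.7], [0.5, 1.0]])
X = A_true @ S

# Steps 1-2: center and whiten
X = X - X.mean(axis=1, keepdims=True)
d, E = np.linalg.eigh(X @ X.T / T)
V = np.diag(d ** -0.5) @ E.T
Z = V @ X

# Step 3: symmetric FastICA with the tanh nonlinearity
g = np.tanh
gp = lambda u: 1 - np.tanh(u) ** 2
W = np.linalg.qr(rng.standard_normal((2, 2)))[0]   # random orthogonal start
for _ in range(200):
    W_old = W
    Y = W @ Z
    W_new = (Z @ g(Y).T / T).T - np.diag(gp(Y).mean(axis=1)) @ W
    dd, U = np.linalg.eigh(W_new @ W_new.T)         # symmetric orthogonalization
    W = U @ np.diag(dd ** -0.5) @ U.T @ W_new
    if np.all(np.abs(np.sum(W * W_old, axis=1)) > 1 - 1e-9):
        break

# Steps 4-5: demixing matrix in the original space and recovered sources
W_full = W @ V
S_hat = W_full @ X

# Each recovered row should match one true source up to sign and permutation
C = np.abs(np.corrcoef(np.vstack([S_hat, S]))[:2, 2:])
print(C.max(axis=1))  # each near 1
```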
Often we work with more observations than sources (overcomplete). A common approach: apply PCA first to reduce to $k$ dimensions, then apply ICA to find $k$ independent components. This is ICA with dimensionality reduction: $\mathbf{z} = \mathbf{V}_k \tilde{\mathbf{x}}$ where $\mathbf{V}_k$ is the whitening matrix using only top $k$ eigenvalues.
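A minimal sketch of this reduce-then-whiten step, assuming a hypothetical setup of 5 sensors observing 2 sources:

```python
import numpy as np

rng = np.random.default_rng(5)
T = 5_000

# Overcomplete setting: 5 sensors, 2 non-Gaussian sources, tiny sensor noise
S = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(2, T))
A = rng.standard_normal((5, 2))
X = A @ S + 1e-3 * rng.standard_normal((5, T))

X = X - X.mean(axis=1, keepdims=True)
d, E = np.linalg.eigh(X @ X.T / T)
idx = np.argsort(d)[::-1][:2]              # keep the top k=2 eigenpairs
V_k = np.diag(d[idx] ** -0.5) @ E[:, idx].T

Z = V_k @ X                                # reduced, whitened data
print(Z.shape)                             # (2, 5000)
print(np.allclose(Z @ Z.T / T, np.eye(2), atol=1e-5))
```

ICA then runs on the 2-dimensional `Z` exactly as before; the PCA step has simultaneously reduced dimension and whitened.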
```python
import numpy as np
from scipy import linalg

class FastICA:
    """
    Complete FastICA implementation with preprocessing and postprocessing.
    """

    def __init__(self, n_components=None, algorithm='symmetric',
                 fun='logcosh', tol=1e-4, max_iter=200, random_state=None):
        """
        Parameters
        ----------
        n_components : int or None
            Number of components. If None, use all.
        algorithm : str, 'symmetric' or 'deflation'
            ICA algorithm variant
        fun : str, 'logcosh', 'exp', or 'cube'
            Contrast function
        tol : float
            Convergence tolerance
        max_iter : int
            Maximum iterations
        random_state : int or None
            Random seed for reproducibility
        """
        self.n_components = n_components
        self.algorithm = algorithm
        self.fun = fun
        self.tol = tol
        self.max_iter = max_iter
        self.random_state = random_state

        # Set contrast function and derivative
        if fun == 'logcosh':
            self.g = lambda u: np.tanh(u)
            self.g_prime = lambda u: 1 - np.tanh(u)**2
        elif fun == 'exp':
            self.g = lambda u: u * np.exp(-u**2 / 2)
            self.g_prime = lambda u: (1 - u**2) * np.exp(-u**2 / 2)
        elif fun == 'cube':
            self.g = lambda u: u**3
            self.g_prime = lambda u: 3 * u**2
        else:
            raise ValueError(f"Unknown contrast function: {fun}")

    def fit(self, X):
        """
        Fit the ICA model to data X.

        Parameters
        ----------
        X : ndarray (n_samples, n_features)
            Observed data matrix
        """
        if self.random_state is not None:
            np.random.seed(self.random_state)

        n_samples, n_features = X.shape
        n_components = self.n_components or n_features

        # Step 1: Centering
        self.mean_ = np.mean(X, axis=0)
        X_centered = X - self.mean_

        # Step 2: Whitening (via PCA)
        cov = np.cov(X_centered.T)
        eigenvalues, eigenvectors = linalg.eigh(cov)

        # Sort by eigenvalue descending
        idx = np.argsort(eigenvalues)[::-1]
        eigenvalues = eigenvalues[idx][:n_components]
        eigenvectors = eigenvectors[:, idx][:, :n_components]

        # Whitening matrix: V = D^(-1/2) @ E.T
        self.whitening_ = np.diag(1.0 / np.sqrt(eigenvalues + 1e-10)) @ eigenvectors.T

        # Whitened data: Z = V @ X_centered.T, shape (n_components, n_samples)
        Z = self.whitening_ @ X_centered.T

        # Step 3: FastICA
        if self.algorithm == 'symmetric':
            W_ica = self._fastica_symmetric(Z)
        else:
            W_ica = self._fastica_deflation(Z, n_components)

        # Step 4: Complete demixing matrix
        # Sources = W_ica @ Z = W_ica @ whitening @ X_centered.T
        self.components_ = W_ica @ self.whitening_  # (n_components, n_features)

        # Mixing matrix (pseudo-inverse of components)
        self.mixing_ = linalg.pinv(self.components_)

        return self

    def transform(self, X):
        """Project X onto independent components."""
        X_centered = X - self.mean_
        return X_centered @ self.components_.T

    def fit_transform(self, X):
        """Fit and transform in one step."""
        self.fit(X)
        return self.transform(X)

    def _fastica_symmetric(self, Z):
        """Symmetric FastICA on whitened data."""
        n_components, n_samples = Z.shape
        W = np.random.randn(n_components, n_components)
        W = self._symmetric_orthogonalize(W)

        for _ in range(self.max_iter):
            W_old = W.copy()
            Y = W @ Z
            gY = self.g(Y)
            g_primeY = self.g_prime(Y)
            W_new = (Z @ gY.T) / n_samples
            W_new = W_new.T - np.diag(np.mean(g_primeY, axis=1)) @ W
            W = self._symmetric_orthogonalize(W_new)
            if self._converged(W, W_old):
                break

        return W

    def _fastica_deflation(self, Z, n_components):
        """Deflation FastICA on whitened data."""
        n_features, n_samples = Z.shape
        W = np.zeros((n_components, n_features))

        for p in range(n_components):
            w = np.random.randn(n_features)
            w = w / np.linalg.norm(w)

            for _ in range(self.max_iter):
                w_old = w.copy()
                y = w @ Z
                w = np.mean(Z * self.g(y), axis=1) - np.mean(self.g_prime(y)) * w
                for j in range(p):
                    w = w - np.dot(w, W[j]) * W[j]
                w = w / np.linalg.norm(w)
                if np.abs(np.dot(w, w_old)) > 1 - self.tol:
                    break

            W[p] = w

        return W

    def _symmetric_orthogonalize(self, W):
        """Symmetric orthogonalization."""
        eigenvalues, eigenvectors = linalg.eigh(W @ W.T)
        eigenvalues = np.maximum(eigenvalues, 1e-10)
        sqrt_inv = eigenvectors @ np.diag(1.0 / np.sqrt(eigenvalues)) @ eigenvectors.T
        return sqrt_inv @ W

    def _converged(self, W, W_old):
        """Check convergence."""
        for i in range(W.shape[0]):
            if np.abs(np.dot(W[i], W_old[i])) < 1 - self.tol:
                return False
        return True
```

Understanding FastICA's convergence properties is essential for both theoretical insight and practical troubleshooting.
Local Convergence
FastICA is a fixed-point iteration derived from Newton's method, and near a solution it inherits Newton-like speed: convergence is cubic for sources with symmetric distributions, and at least quadratic in general.
This is faster than gradient descent methods, which converge linearly and require careful learning rate tuning.
Global Convergence
Global convergence (from any starting point) is not guaranteed. As with most nonlinear optimization, the iteration can land on different stationary points depending on the starting vector, and which component is found first cannot be controlled in advance.
Multiple Fixed Points
The optimization landscape has multiple stationary points: each true component direction is a fixed point (along with its negation $-\mathbf{w}$), and certain mixtures of components can appear as saddle points.
For well-separated, strongly non-Gaussian sources, the true IC directions are typically the only stable fixed points.
FastICA may fail to converge or converge to spurious solutions when: (1) sources are near-Gaussian (weak non-Gaussianity signal), (2) the sample size is too small, (3) sources are correlated or strongly nonstationary, (4) the contrast function is mismatched to the source type. If convergence fails, try different initializations, more samples, or alternative contrast functions.
Reproducibility and Initialization
Due to the non-convex optimization landscape, different random initializations can return the components in a different order, with flipped signs, and occasionally at different local solutions.
For reproducibility, fix the random seed, and when a canonical ordering matters, sort the recovered components by a deterministic criterion such as explained variance or estimated non-Gaussianity.
Practical Tips for Robust Convergence
| Issue | Solution |
|---|---|
| Slow convergence | Try different contrast function |
| Non-convergence | Increase max iterations; check data quality |
| Inconsistent results | Multiple random restarts |
| Noisy components | More samples; regularization |
| One component dominates | Check for outliers; use robust contrast |
Monitoring Convergence
Beyond the standard dot-product criterion, monitor the contrast value $E[G(\mathbf{w}^T\mathbf{z})]$ across iterations and the number of iterations each component requires: a stagnating contrast or components that hit the iteration cap are warning signs.
This page has provided a complete development of the FastICA algorithm, from theoretical derivation through practical implementation.
You now have complete understanding of the FastICA algorithm—its derivation, implementation, and practical usage. You can implement FastICA from scratch and understand the design choices in production implementations. In the next page, we'll explore the fascinating applications of ICA, particularly blind source separation.
What's Next:
The next page explores applications of ICA, focusing on the celebrated blind source separation problem. We'll see how ICA recovers speech signals from cocktail party mixtures, extracts artifacts from EEG recordings, and finds independent features in images. These applications demonstrate the power and versatility of the theoretical framework we've developed.