The theoretical foundations of ICA—independence, non-Gaussianity, and negentropy—lead to a natural question: how do we actually compute the demixing matrix in practice? Many algorithms have been proposed, but one stands out for its elegance, efficiency, and robustness: FastICA.
Developed by Aapo Hyvärinen and Erkki Oja in the late 1990s, FastICA has become the de facto standard for Independent Component Analysis. It achieves source separation through a remarkably simple fixed-point iteration that converges rapidly to components that maximize non-Gaussianity (approximated negentropy).
FastICA's appeal comes from several properties: there is no step-size parameter to tune, convergence near a solution is very fast (cubic, or at least quadratic), and components can be extracted either one at a time or all at once.
This page develops FastICA from first principles, deriving the fixed-point iteration, explaining deflation and symmetric extraction strategies, and providing complete implementation details.
By the end of this page, you will understand the mathematical derivation of the FastICA fixed-point iteration, be able to implement both deflation and symmetric variants, understand convergence properties and stopping criteria, and be equipped to apply FastICA to real source separation problems.
FastICA works on whitened data. Recall from our preprocessing discussion that whitening transforms the observed data $\mathbf{x}$ into $\mathbf{z}$ with identity covariance:
$$\mathbf{z} = \mathbf{V}\mathbf{x}, \quad E[\mathbf{z}\mathbf{z}^T] = \mathbf{I}$$
After whitening, the effective mixing matrix $\tilde{\mathbf{A}} = \mathbf{V}\mathbf{A}$ becomes orthogonal. The ICA problem reduces to finding an orthogonal demixing matrix $\tilde{\mathbf{W}}$.
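As a quick numerical illustration of this preprocessing step (the 2×2 mixing matrix and uniform sources below are hypothetical), we can verify that whitening drives the covariance to the identity:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical example: mix two unit-variance, non-Gaussian (uniform) sources
S = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(2, 10_000))
A = np.array([[1.0, 0.5],
              [0.3, 1.0]])
X = A @ S

# Whitening: V = D^{-1/2} E^T from the eigendecomposition of the covariance
X = X - X.mean(axis=1, keepdims=True)
C = (X @ X.T) / X.shape[1]
d, E = np.linalg.eigh(C)
V = np.diag(1.0 / np.sqrt(d)) @ E.T
Z = V @ X

# After whitening, the sample covariance is (numerically) the identity
print(np.allclose((Z @ Z.T) / Z.shape[1], np.eye(2), atol=1e-8))
```

Because $\mathbf{V}$ is built from the eigendecomposition of the sample covariance itself, the whitened covariance equals the identity up to floating-point error, regardless of the mixing matrix.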
Single Component Extraction
To find one independent component, we seek a vector $\mathbf{w}$ (a row of $\tilde{\mathbf{W}}$) such that:
$$y = \mathbf{w}^T\mathbf{z}$$
is maximally non-Gaussian, subject to the constraint $|\mathbf{w}| = 1$.
The Negentropy Objective
Using the negentropy approximation from the previous page:
$$J(y) \approx [E[G(y)] - E[G(\nu)]]^2$$
where $\nu \sim N(0,1)$ and $G$ is a non-quadratic function (e.g., $G(u) = \log\cosh(u)$).
Since $E[G(\nu)]$ is a constant, maximizing $J(y)$ is equivalent to:
$$\max_{\mathbf{w}: |\mathbf{w}|=1} |E[G(\mathbf{w}^T\mathbf{z})]|$$
For symmetric distributions (typical case), we can drop the absolute value and maximize $E[G(\mathbf{w}^T\mathbf{z})]$ for super-Gaussian sources or minimize for sub-Gaussian.
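A small numerical illustration of this objective, comparing a unit-variance Laplacian (super-Gaussian) sample against a Gaussian one (sample size and seed are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(42)
N = 200_000

G = lambda u: np.log(np.cosh(u))
E_G_gauss = np.mean(G(rng.standard_normal(N)))  # Monte Carlo estimate of E[G(nu)]

# Unit-variance samples: Laplace(scale=1/sqrt(2)) has variance 1
y_lap = rng.laplace(scale=1 / np.sqrt(2), size=N)
y_gauss = rng.standard_normal(N)

J = lambda y: (np.mean(G(y)) - E_G_gauss) ** 2

# The non-Gaussian signal scores much higher; a fresh Gaussian sample scores near 0
print(J(y_lap), J(y_gauss))
```

Note that for a super-Gaussian source with $G = \log\cosh$, $E[G(y)]$ falls *below* the Gaussian reference, which is why the objective squares the difference (or keeps the absolute value) rather than maximizing $E[G(y)]$ directly.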
After whitening, the constraint set becomes the unit sphere $|\mathbf{w}| = 1$ rather than a more complex manifold. This simplifies the optimization and ensures that all directions have equal "scale" in terms of variance, so non-Gaussianity can be compared directly across directions without variance confounds.
The Lagrangian Formulation
The constrained optimization problem:
$$\max_{\mathbf{w}} E[G(\mathbf{w}^T\mathbf{z})] \quad \text{subject to } |\mathbf{w}|^2 = 1$$
has Lagrangian:
$$\mathcal{L}(\mathbf{w}, \lambda) = E[G(\mathbf{w}^T\mathbf{z})] - \lambda(|\mathbf{w}|^2 - 1)$$
Gradient and Stationarity
Taking the gradient with respect to $\mathbf{w}$:
$$\nabla_{\mathbf{w}} \mathcal{L} = E[\mathbf{z}g(\mathbf{w}^T\mathbf{z})] - 2\lambda\mathbf{w}$$
where $g(u) = G'(u)$ is the derivative of $G$.
At a stationary point:
$$E[\mathbf{z}g(\mathbf{w}^T\mathbf{z})] = \lambda\mathbf{w}$$
(absorbing the factor of 2 into $\lambda$).
This is a nonlinear equation in $\mathbf{w}$—the expectation on the left depends on $\mathbf{w}$ through the nonlinearity $g$.
Newton's Method Approach
FastICA uses an approximate Newton method to solve this equation. The key insight is that for whitened data, the Hessian simplifies, leading to a particularly efficient fixed-point iteration.
The FastICA iteration emerges from applying Newton's method to the stationarity condition, with a crucial simplification enabled by the whitening assumption.
Newton's Method for the KKT Condition
We want to find $\mathbf{w}$ satisfying:
$$\mathbf{F}(\mathbf{w}) = E[\mathbf{z}g(\mathbf{w}^T\mathbf{z})] - \lambda\mathbf{w} = \mathbf{0}$$
Newton's method would update:
$$\mathbf{w}_{\text{new}} = \mathbf{w} - [\mathbf{J}_F(\mathbf{w})]^{-1}\mathbf{F}(\mathbf{w})$$
where $\mathbf{J}_F$ is the Jacobian of $\mathbf{F}$.
Computing the Jacobian
The Jacobian is:
$$\mathbf{J}_F(\mathbf{w}) = E[\mathbf{z}\mathbf{z}^T g'(\mathbf{w}^T\mathbf{z})] - \lambda\mathbf{I}$$
For whitened data, $E[\mathbf{z}\mathbf{z}^T] = \mathbf{I}$. A key approximation (valid for large samples and well-behaved $g'$):
$$E[\mathbf{z}\mathbf{z}^T g'(\mathbf{w}^T\mathbf{z})] \approx E[g'(\mathbf{w}^T\mathbf{z})] \cdot E[\mathbf{z}\mathbf{z}^T] = E[g'(\mathbf{w}^T\mathbf{z})] \cdot \mathbf{I}$$
This decorrelation approximation treats $\mathbf{z}\mathbf{z}^T$ as roughly uncorrelated with the scalar $g'(\mathbf{w}^T\mathbf{z})$, so the expectation of their product factorizes. It does not hold exactly, but it is accurate enough to preserve Newton-like behavior in practice.
With this approximation:
$$\mathbf{J}_F(\mathbf{w}) \approx (E[g'(\mathbf{w}^T\mathbf{z})] - \lambda)\mathbf{I}$$
The approximation makes the Jacobian proportional to the identity matrix—a scalar times identity! This dramatically simplifies Newton's method: instead of inverting a matrix, we just divide by a scalar. This is the key to FastICA's speed.
The FastICA Update
Substituting back into Newton's method:
$$\mathbf{w}_{\text{new}} = \mathbf{w} - \frac{E[\mathbf{z}g(\mathbf{w}^T\mathbf{z})] - \lambda\mathbf{w}}{E[g'(\mathbf{w}^T\mathbf{z})] - \lambda}$$
Multiplying through by the scalar $E[g'(\mathbf{w}^T\mathbf{z})] - \lambda$ and simplifying (scalar factors, including the overall sign, are harmless because of the subsequent normalization):
$$\mathbf{w}_{\text{new}} = E[\mathbf{z}g(\mathbf{w}^T\mathbf{z})] - E[g'(\mathbf{w}^T\mathbf{z})]\mathbf{w}$$
After this update, we renormalize to maintain unit length:
$$\mathbf{w}_{\text{new}} \leftarrow \frac{\mathbf{w}_{\text{new}}}{|\mathbf{w}_{\text{new}}|}$$
The Complete FastICA Fixed-Point Rule
$$\boxed{\mathbf{w} \leftarrow E[\mathbf{z}g(\mathbf{w}^T\mathbf{z})] - E[g'(\mathbf{w}^T\mathbf{z})]\mathbf{w}}$$ $$\boxed{\mathbf{w} \leftarrow \frac{\mathbf{w}}{|\mathbf{w}|}}$$
Repeat until convergence.
Intuition
The first term $E[\mathbf{z}g(\mathbf{w}^T\mathbf{z})]$ pushes $\mathbf{w}$ toward directions of high non-Gaussianity (where $g$ outputs large values).
The second term $-E[g'(\mathbf{w}^T\mathbf{z})]\mathbf{w}$ is a stabilizing correction that prevents divergence and ensures convergence to a fixed point.
Normalization ensures the constraint $|\mathbf{w}| = 1$ is maintained.
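As a sanity check, the boxed update rule can be run on a tiny synthetic problem. The mixing matrix, seed, and uniform sources below are illustrative assumptions, not part of the derivation:

```python
import numpy as np

rng = np.random.default_rng(1)

# Two unit-variance uniform sources, a hypothetical mixing matrix, then whitening
S = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(2, 20_000))
X = np.array([[1.0, 0.6], [0.4, 1.0]]) @ S
X = X - X.mean(axis=1, keepdims=True)
d, E = np.linalg.eigh(X @ X.T / X.shape[1])
Z = np.diag(d ** -0.5) @ E.T @ X

g = np.tanh
g_prime = lambda u: 1 - np.tanh(u) ** 2

w = rng.standard_normal(2)
w /= np.linalg.norm(w)
for _ in range(200):
    w_old = w
    y = w @ Z
    w = np.mean(Z * g(y), axis=1) - np.mean(g_prime(y)) * w  # fixed-point step
    w /= np.linalg.norm(w)                                    # renormalize
    if abs(w @ w_old) > 1 - 1e-9:                             # sign-invariant check
        break

# The extracted signal should match one true source up to sign
s_hat = w @ Z
corr = max(abs(np.corrcoef(s_hat, S[0])[0, 1]),
           abs(np.corrcoef(s_hat, S[1])[0, 1]))
print(corr)  # close to 1
```

With clean, strongly non-Gaussian sources the loop typically terminates in well under a dozen iterations, which is the Newton-like speed at work.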
| Name | $G(u)$ | $g(u) = G'(u)$ | $g'(u)$ | Use Case |
|---|---|---|---|---|
| logcosh | $\log\cosh(u)$ | $\tanh(u)$ | $1 - \tanh^2(u)$ | General purpose, robust |
| exp | $-\exp(-u^2/2)$ | $u\exp(-u^2/2)$ | $(1-u^2)\exp(-u^2/2)$ | Outlier-robust |
| cube | $u^4/4$ | $u^3$ | $3u^2$ | Super-Gaussian, fast |
| skew | $u^3/3$ | $u^2$ | $2u$ | Asymmetric sources |
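A quick numerical check (via central finite differences) that the table's $g$ and $g'$ columns really are the successive derivatives of each $G$:

```python
import numpy as np

u = np.linspace(-3, 3, 601)
h = 1e-5

# (G, g = G', g' = G'') triples from the table
triples = {
    'logcosh': (lambda u: np.log(np.cosh(u)),
                np.tanh,
                lambda u: 1 - np.tanh(u) ** 2),
    'exp':     (lambda u: -np.exp(-u**2 / 2),
                lambda u: u * np.exp(-u**2 / 2),
                lambda u: (1 - u**2) * np.exp(-u**2 / 2)),
    'cube':    (lambda u: u**4 / 4,
                lambda u: u**3,
                lambda u: 3 * u**2),
}

for name, (G, g, gp) in triples.items():
    num_g = (G(u + h) - G(u - h)) / (2 * h)   # numerical G'
    num_gp = (g(u + h) - g(u - h)) / (2 * h)  # numerical g'
    assert np.allclose(num_g, g(u), atol=1e-6)
    assert np.allclose(num_gp, gp(u), atol=1e-6)

print("table entries consistent")
```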
The fixed-point iteration finds one independent component. To find all $n$ components, we need a strategy. Deflation extracts components sequentially, removing the influence of already-found components before finding the next.
The Deflation Strategy
Gram-Schmidt Orthogonalization
After each FastICA update step, we orthogonalize against previously found components:
$$\mathbf{w}_p \leftarrow \mathbf{w}_p - \sum_{j=1}^{p-1} (\mathbf{w}_p^T \mathbf{w}_j) \mathbf{w}_j$$ $$\mathbf{w}_p \leftarrow \frac{\mathbf{w}_p}{|\mathbf{w}_p|}$$
This ensures that the $p$-th component is orthogonal to all previously found components.
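A minimal sketch of this projection step, with two hypothetical orthonormal rows standing in for previously found components:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two orthonormal "previously found" directions (placeholders for w_1, w_2)
W_prev = np.linalg.qr(rng.standard_normal((4, 4)))[0][:, :2].T

w = rng.standard_normal(4)
for w_j in W_prev:                 # subtract projections onto earlier components
    w = w - (w @ w_j) * w_j
w = w / np.linalg.norm(w)          # renormalize

print(np.max(np.abs(W_prev @ w)))  # ~0: orthogonal to all previous components
```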
Why Orthogonality? (On Whitened Data)
For whitened data, the demixing matrix should be orthogonal. If we've found components $\mathbf{w}_1, \ldots, \mathbf{w}_{p-1}$ corresponding to sources $s_1, \ldots, s_{p-1}$, the remaining sources live in the orthogonal complement. Gram-Schmidt ensures we search in this subspace.
Deflation propagates errors: if $\mathbf{w}_1$ is slightly wrong, subsequent components inherit this error. The first few components are most accurate; later components may degrade. For applications requiring uniform accuracy across all components, symmetric approaches are preferred.
Deflation FastICA Algorithm
Input: Whitened data Z ∈ ℝⁿˣᵀ, number of components n, contrast function g
Output: Demixing matrix W ∈ ℝⁿˣⁿ
1. Initialize W = empty matrix
2. For p = 1 to n:
a. Initialize wₚ randomly, normalize: wₚ ← wₚ / ||wₚ||
b. Repeat:
i. wₚ ← (1/T) Σₜ z(t) g(wₚᵀz(t)) - (1/T) Σₜ g'(wₚᵀz(t)) · wₚ
ii. Orthogonalize: wₚ ← wₚ - Σⱼ₌₁ᵖ⁻¹ (wₚᵀwⱼ) wⱼ
iii. Normalize: wₚ ← wₚ / ||wₚ||
iv. Check convergence: if |wₚᵀwₚ_old| > 1 - ε, break
c. Add wₚ to W
3. Return W
Convergence Criterion
We converge when $\mathbf{w}$ stops changing. Since $\mathbf{w}$ and $-\mathbf{w}$ are equivalent (sign ambiguity), we check:
$$|\mathbf{w}_{\text{new}}^T \mathbf{w}_{\text{old}}| > 1 - \epsilon$$
where $\epsilon \approx 10^{-6}$ is a tolerance. If the angle between successive iterates is nearly zero (dot product near ±1), we've converged.
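A quick illustration that this criterion tolerates sign flips but not genuine rotations (the unit vectors and angle are arbitrary):

```python
import numpy as np

w_old = np.array([0.6, 0.8])   # an arbitrary unit vector
eps = 1e-6

# A pure sign flip gives |w_new . w_old| = 1: counted as converged
w_new = -w_old
print(abs(w_new @ w_old) > 1 - eps)

# A genuine rotation (0.1 rad) gives dot product cos(0.1) ~ 0.995: not converged
theta = 0.1
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
print(abs((R @ w_old) @ w_old) > 1 - eps)
```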
Properties of Deflation
| Advantage | Disadvantage |
|---|---|
| Simple to implement | Errors accumulate in later components |
| Can stop after $k < n$ components | Sequential, less parallelizable |
| Fast convergence per component | Sensitive to initialization for later components |
| Intuitive one-at-a-time approach | Ordering of components is arbitrary |
```python
import numpy as np

def fastica_deflation(Z, n_components, g, g_prime, tol=1e-6, max_iter=200):
    """
    FastICA using deflation approach.

    Parameters
    ----------
    Z : ndarray (n_features, n_samples)
        Whitened data matrix
    n_components : int
        Number of independent components to extract
    g : callable
        Contrast function g(u) = G'(u)
    g_prime : callable
        Derivative of contrast function g'(u) = G''(u)
    tol : float
        Convergence tolerance
    max_iter : int
        Maximum iterations per component

    Returns
    -------
    W : ndarray (n_components, n_features)
        Demixing matrix (rows are component directions)
    """
    n_features, n_samples = Z.shape
    W = np.zeros((n_components, n_features))

    for p in range(n_components):
        # Random initialization
        w = np.random.randn(n_features)
        w = w / np.linalg.norm(w)

        for iteration in range(max_iter):
            w_old = w.copy()

            # FastICA fixed-point update: y = w^T z for all samples
            y = w @ Z  # (n_samples,)

            # Update: E[z*g(y)] - E[g'(y)] * w
            w = np.mean(Z * g(y), axis=1) - np.mean(g_prime(y)) * w

            # Gram-Schmidt orthogonalization against previous components
            for j in range(p):
                w = w - np.dot(w, W[j]) * W[j]

            # Normalize
            w = w / np.linalg.norm(w)

            # Check convergence (absolute value handles the sign ambiguity)
            if np.abs(np.dot(w, w_old)) > 1 - tol:
                break

        W[p] = w

    return W


# Contrast functions
def g_logcosh(u, alpha=1.0):
    return np.tanh(alpha * u)

def g_prime_logcosh(u, alpha=1.0):
    return alpha * (1 - np.tanh(alpha * u) ** 2)

def g_exp(u):
    return u * np.exp(-u**2 / 2)

def g_prime_exp(u):
    return (1 - u**2) * np.exp(-u**2 / 2)
```

The symmetric approach finds all independent components simultaneously, avoiding the error accumulation problem of deflation. All component vectors are estimated in parallel and orthogonalized as a batch.
The Symmetric Update
Let $\mathbf{W} = [\mathbf{w}_1, \ldots, \mathbf{w}_n]^T$ be the demixing matrix (rows are component directions). The symmetric FastICA update is:
Apply fixed-point update to all rows: $$\mathbf{w}_i \leftarrow E[\mathbf{z}g(\mathbf{w}_i^T\mathbf{z})] - E[g'(\mathbf{w}_i^T\mathbf{z})]\mathbf{w}_i, \quad i = 1, \ldots, n$$
Symmetric orthogonalization: $$\mathbf{W} \leftarrow (\mathbf{W}\mathbf{W}^T)^{-1/2}\mathbf{W}$$
The symmetric orthogonalization ensures $\mathbf{W}\mathbf{W}^T = \mathbf{I}$ (orthonormal rows).
Computing the Matrix Square Root Inverse
Given $\mathbf{W}\mathbf{W}^T$, compute its eigen-decomposition: $$\mathbf{W}\mathbf{W}^T = \mathbf{U}\mathbf{\Lambda}\mathbf{U}^T$$
Then: $$(\mathbf{W}\mathbf{W}^T)^{-1/2} = \mathbf{U}\mathbf{\Lambda}^{-1/2}\mathbf{U}^T$$
Unlike Gram-Schmidt (which privileges early components), symmetric orthogonalization treats all components equally. The transformation $(\mathbf{W}\mathbf{W}^T)^{-1/2}\mathbf{W}$ finds the closest orthonormal matrix to $\mathbf{W}$ in a symmetric sense—no component is "more orthogonalized" than another.
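The square-root computation can be verified in a few lines (using NumPy's `eigh`; the 4×4 matrix is an arbitrary example):

```python
import numpy as np

rng = np.random.default_rng(3)
W = rng.standard_normal((4, 4))   # arbitrary non-orthogonal matrix

# (W W^T)^{-1/2} W via the eigendecomposition of W W^T
d, U = np.linalg.eigh(W @ W.T)
W_orth = U @ np.diag(d ** -0.5) @ U.T @ W

# The rows are now orthonormal: W_orth @ W_orth.T = I
print(np.allclose(W_orth @ W_orth.T, np.eye(4), atol=1e-8))
```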
Symmetric FastICA Algorithm
Input: Whitened data Z ∈ ℝⁿˣᵀ, contrast function g
Output: Demixing matrix W ∈ ℝⁿˣⁿ
1. Initialize W randomly, apply symmetric orthogonalization
2. Repeat:
a. For each row wᵢ of W:
wᵢ ← (1/T) Σₜ z(t) g(wᵢᵀz(t)) - (1/T) Σₜ g'(wᵢᵀz(t)) · wᵢ
b. Symmetric orthogonalization:
W ← (WWᵀ)^(-1/2) W
c. Check convergence: if ||W - W_old||_∞ < ε or ||W + W_old||_∞ < ε, break
3. Return W
Advantages of Symmetric FastICA
| Property | Benefit |
|---|---|
| All components estimated equally | No error accumulation |
| Parallel updates | Faster on parallel hardware |
| More stable convergence | Less sensitive to initialization |
| Global orthogonality | Better overall solution quality |
Disadvantages
| Property | Drawback |
|---|---|
| Must estimate all $n$ components | Cannot stop early |
| Matrix square root | Slightly more computation per iteration |
| May converge to different local minimum | Depends on initialization |
Convergence Check for Symmetric
Since all rows might flip signs, we check: $$\max_i \min(|\mathbf{w}_i^{\text{new}} - \mathbf{w}_i^{\text{old}}|, |\mathbf{w}_i^{\text{new}} + \mathbf{w}_i^{\text{old}}|) < \epsilon$$
or equivalently, that $|\mathbf{w}_i^{\text{new}} \cdot \mathbf{w}_i^{\text{old}}| \approx 1$ for all $i$.
```python
import numpy as np
from scipy import linalg

def fastica_symmetric(Z, g, g_prime, tol=1e-6, max_iter=200):
    """
    FastICA using symmetric approach.

    Parameters
    ----------
    Z : ndarray (n_features, n_samples)
        Whitened data matrix
    g : callable
        Contrast function g(u) = G'(u)
    g_prime : callable
        Derivative of contrast function g'(u)
    tol : float
        Convergence tolerance
    max_iter : int
        Maximum iterations

    Returns
    -------
    W : ndarray (n_features, n_features)
        Demixing matrix
    """
    n_features, n_samples = Z.shape

    # Random initialization
    W = np.random.randn(n_features, n_features)
    W = symmetric_orthogonalize(W)

    for iteration in range(max_iter):
        W_old = W.copy()

        # Fixed-point update for all components simultaneously
        # Y = W @ Z, shape (n_features, n_samples)
        Y = W @ Z

        # Update each row: E[z*g(y)] - E[g'(y)] * w, vectorized over all rows
        gY = g(Y)              # (n_features, n_samples)
        g_primeY = g_prime(Y)  # (n_features, n_samples)

        # E[z * g(w^T z)] for each row
        W_new = (Z @ gY.T) / n_samples  # (n_features, n_features)

        # Subtract E[g'(w^T z)] * w for each row
        W_new = W_new.T - np.diag(np.mean(g_primeY, axis=1)) @ W

        # Symmetric orthogonalization
        W = symmetric_orthogonalize(W_new)

        # Check convergence: each row's correlation with its previous value
        # should be near 1 (absolute value accounts for sign flips)
        converged = True
        for i in range(n_features):
            correlation = np.abs(np.dot(W[i], W_old[i]))
            if correlation < 1 - tol:
                converged = False
                break

        if converged:
            break

    return W


def symmetric_orthogonalize(W):
    """
    Symmetric orthogonalization: W = (W @ W.T)^(-1/2) @ W

    Finds the closest orthogonal matrix to W in the Frobenius norm.
    """
    # Eigen-decomposition of W @ W.T
    eigenvalues, eigenvectors = linalg.eigh(W @ W.T)

    # Compute (W @ W.T)^(-1/2); eigenvalues should be positive,
    # so clip at a small epsilon for numerical stability
    eigenvalues = np.maximum(eigenvalues, 1e-10)
    sqrt_inv = eigenvectors @ np.diag(1.0 / np.sqrt(eigenvalues)) @ eigenvectors.T

    return sqrt_inv @ W
```

A production FastICA implementation includes preprocessing, the core algorithm, and post-processing. Here we present the complete pipeline.
Step 1: Centering
Remove the mean from each observed signal: $$\tilde{\mathbf{x}}(t) = \mathbf{x}(t) - \frac{1}{T}\sum_{\tau=1}^{T}\mathbf{x}(\tau)$$
Step 2: Whitening
Compute the covariance matrix and its eigendecomposition: $$\mathbf{C} = \frac{1}{T}\sum_t \tilde{\mathbf{x}}(t)\tilde{\mathbf{x}}(t)^T = \mathbf{E}\mathbf{D}\mathbf{E}^T$$
Construct the whitening matrix: $$\mathbf{V} = \mathbf{D}^{-1/2}\mathbf{E}^T$$
Apply whitening: $$\mathbf{z}(t) = \mathbf{V}\tilde{\mathbf{x}}(t)$$
Step 3: FastICA
Apply deflation or symmetric FastICA to find demixing matrix $\tilde{\mathbf{W}}$ on whitened data.
Step 4: Recover Original-Space Demixing
The complete demixing matrix (for original centered data): $$\mathbf{W} = \tilde{\mathbf{W}}\mathbf{V}$$
Step 5: Recover Sources
$$\mathbf{s}(t) = \mathbf{W}\tilde{\mathbf{x}}(t)$$
Step 6: Recover Mixing Matrix (Optional)
$$\mathbf{A} = \mathbf{W}^{-1}$$
The columns of $\mathbf{A}$ show how each source contributes to each observation.
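The six steps above can be sketched end to end on synthetic data. The two uniform sources, the mixing matrix `A_true`, and the seed below are hypothetical choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(7)
T = 20_000

# Hypothetical ground truth: two unit-variance uniform sources, arbitrary mixing
S = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(2, T))
A_true = np.array([[1.0, 0.7], [0.5, 1.0]])
X = A_true @ S

# Steps 1-2: center and whiten
X = X - X.mean(axis=1, keepdims=True)
d, E = np.linalg.eigh(X @ X.T / T)
V = np.diag(d ** -0.5) @ E.T
Z = V @ X

# Step 3: symmetric FastICA with the tanh nonlinearity
g = np.tanh
gp = lambda u: 1 - np.tanh(u) ** 2
W = np.linalg.qr(rng.standard_normal((2, 2)))[0]   # random orthogonal start
for _ in range(200):
    W_old = W
    Y = W @ Z
    W_new = (Z @ g(Y).T / T).T - np.diag(gp(Y).mean(axis=1)) @ W
    dd, U = np.linalg.eigh(W_new @ W_new.T)         # symmetric orthogonalization
    W = U @ np.diag(dd ** -0.5) @ U.T @ W_new
    if np.all(np.abs(np.sum(W * W_old, axis=1)) > 1 - 1e-9):
        break

# Steps 4-5: demixing matrix in the original space and recovered sources
W_full = W @ V
S_hat = W_full @ X

# Each recovered row should match one true source up to sign and permutation
C = np.abs(np.corrcoef(np.vstack([S_hat, S]))[:2, 2:])
print(C.max(axis=1))  # each near 1
```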
Often we work with more observations than sources (overcomplete). A common approach: apply PCA first to reduce to $k$ dimensions, then apply ICA to find $k$ independent components. This is ICA with dimensionality reduction: $\mathbf{z} = \mathbf{V}_k \tilde{\mathbf{x}}$ where $\mathbf{V}_k$ is the whitening matrix using only top $k$ eigenvalues.
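A minimal sketch of this reduce-then-whiten step, assuming a hypothetical setup of 5 sensors observing 2 sources:

```python
import numpy as np

rng = np.random.default_rng(5)
T = 5_000

# Overcomplete setting: 5 sensors, 2 non-Gaussian sources, tiny sensor noise
S = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(2, T))
A = rng.standard_normal((5, 2))
X = A @ S + 1e-3 * rng.standard_normal((5, T))

X = X - X.mean(axis=1, keepdims=True)
d, E = np.linalg.eigh(X @ X.T / T)
idx = np.argsort(d)[::-1][:2]              # keep the top k=2 eigenpairs
V_k = np.diag(d[idx] ** -0.5) @ E[:, idx].T

Z = V_k @ X                                # reduced, whitened data
print(Z.shape)                             # (2, 5000)
print(np.allclose(Z @ Z.T / T, np.eye(2), atol=1e-5))
```

ICA then runs on the 2-dimensional `Z` exactly as before; the PCA step has simultaneously reduced dimension and whitened.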
```python
import numpy as np
from scipy import linalg

class FastICA:
    """
    Complete FastICA implementation with preprocessing and postprocessing.
    """

    def __init__(self, n_components=None, algorithm='symmetric',
                 fun='logcosh', tol=1e-4, max_iter=200, random_state=None):
        """
        Parameters
        ----------
        n_components : int or None
            Number of components. If None, use all.
        algorithm : str, 'symmetric' or 'deflation'
            ICA algorithm variant
        fun : str, 'logcosh', 'exp', or 'cube'
            Contrast function
        tol : float
            Convergence tolerance
        max_iter : int
            Maximum iterations
        random_state : int or None
            Random seed for reproducibility
        """
        self.n_components = n_components
        self.algorithm = algorithm
        self.fun = fun
        self.tol = tol
        self.max_iter = max_iter
        self.random_state = random_state

        # Set contrast function and derivative
        if fun == 'logcosh':
            self.g = lambda u: np.tanh(u)
            self.g_prime = lambda u: 1 - np.tanh(u)**2
        elif fun == 'exp':
            self.g = lambda u: u * np.exp(-u**2 / 2)
            self.g_prime = lambda u: (1 - u**2) * np.exp(-u**2 / 2)
        elif fun == 'cube':
            self.g = lambda u: u**3
            self.g_prime = lambda u: 3 * u**2
        else:
            raise ValueError(f"Unknown contrast function: {fun}")

    def fit(self, X):
        """
        Fit the ICA model to data X.

        Parameters
        ----------
        X : ndarray (n_samples, n_features)
            Observed data matrix
        """
        if self.random_state is not None:
            np.random.seed(self.random_state)

        n_samples, n_features = X.shape
        n_components = self.n_components or n_features

        # Step 1: Centering
        self.mean_ = np.mean(X, axis=0)
        X_centered = X - self.mean_

        # Step 2: Whitening (via PCA)
        cov = np.cov(X_centered.T)
        eigenvalues, eigenvectors = linalg.eigh(cov)

        # Sort by eigenvalue descending
        idx = np.argsort(eigenvalues)[::-1]
        eigenvalues = eigenvalues[idx][:n_components]
        eigenvectors = eigenvectors[:, idx][:, :n_components]

        # Whitening matrix: V = D^(-1/2) @ E.T
        self.whitening_ = np.diag(1.0 / np.sqrt(eigenvalues + 1e-10)) @ eigenvectors.T

        # Whitened data: Z = V @ X_centered.T, shape (n_components, n_samples)
        Z = self.whitening_ @ X_centered.T

        # Step 3: FastICA
        if self.algorithm == 'symmetric':
            W_ica = self._fastica_symmetric(Z)
        else:
            W_ica = self._fastica_deflation(Z, n_components)

        # Step 4: Complete demixing matrix
        # Sources = W_ica @ Z = W_ica @ whitening @ X_centered.T
        self.components_ = W_ica @ self.whitening_  # (n_components, n_features)

        # Mixing matrix (pseudo-inverse of components)
        self.mixing_ = linalg.pinv(self.components_)

        return self

    def transform(self, X):
        """Project X onto independent components."""
        X_centered = X - self.mean_
        return X_centered @ self.components_.T

    def fit_transform(self, X):
        """Fit and transform in one step."""
        self.fit(X)
        return self.transform(X)

    def _fastica_symmetric(self, Z):
        """Symmetric FastICA on whitened data."""
        n_components, n_samples = Z.shape
        W = np.random.randn(n_components, n_components)
        W = self._symmetric_orthogonalize(W)

        for _ in range(self.max_iter):
            W_old = W.copy()
            Y = W @ Z
            gY = self.g(Y)
            g_primeY = self.g_prime(Y)
            W_new = (Z @ gY.T) / n_samples
            W_new = W_new.T - np.diag(np.mean(g_primeY, axis=1)) @ W
            W = self._symmetric_orthogonalize(W_new)
            if self._converged(W, W_old):
                break

        return W

    def _fastica_deflation(self, Z, n_components):
        """Deflation FastICA on whitened data."""
        n_features, n_samples = Z.shape
        W = np.zeros((n_components, n_features))

        for p in range(n_components):
            w = np.random.randn(n_features)
            w = w / np.linalg.norm(w)

            for _ in range(self.max_iter):
                w_old = w.copy()
                y = w @ Z
                w = np.mean(Z * self.g(y), axis=1) - np.mean(self.g_prime(y)) * w
                for j in range(p):
                    w = w - np.dot(w, W[j]) * W[j]
                w = w / np.linalg.norm(w)
                if np.abs(np.dot(w, w_old)) > 1 - self.tol:
                    break

            W[p] = w

        return W

    def _symmetric_orthogonalize(self, W):
        """Symmetric orthogonalization."""
        eigenvalues, eigenvectors = linalg.eigh(W @ W.T)
        eigenvalues = np.maximum(eigenvalues, 1e-10)
        sqrt_inv = eigenvectors @ np.diag(1.0 / np.sqrt(eigenvalues)) @ eigenvectors.T
        return sqrt_inv @ W

    def _converged(self, W, W_old):
        """Check convergence."""
        for i in range(W.shape[0]):
            if np.abs(np.dot(W[i], W_old[i])) < 1 - self.tol:
                return False
        return True
```

Understanding FastICA's convergence properties is essential for both theoretical insight and practical troubleshooting.
Local Convergence
FastICA is a fixed-point iteration derived from Newton's method, and near a solution it inherits Newton-like speed: convergence is cubic for sources with symmetric distributions, and at least quadratic in general.
This is faster than gradient descent methods, which converge linearly and require careful learning rate tuning.
Global Convergence
Global convergence (from any starting point) is not guaranteed. As with most nonlinear optimization, the iteration can land on different stationary points depending on the starting vector, and which component is found first cannot be controlled in advance.
Multiple Fixed Points
The optimization landscape has multiple stationary points: each true component direction is a fixed point (along with its negation $-\mathbf{w}$), and certain mixtures of components can appear as saddle points.
For well-separated, strongly non-Gaussian sources, the true IC directions are typically the only stable fixed points.
FastICA may fail to converge or converge to spurious solutions when: (1) sources are near-Gaussian (weak non-Gaussianity signal), (2) the sample size is too small, (3) sources are correlated or strongly nonstationary, (4) the contrast function is mismatched to the source type. If convergence fails, try different initializations, more samples, or alternative contrast functions.
Reproducibility and Initialization
Due to the non-convex optimization landscape, different random initializations can return the components in a different order, with flipped signs, and occasionally at different local solutions.
For reproducibility, fix the random seed, and when a canonical ordering matters, sort the recovered components by a deterministic criterion such as explained variance or estimated non-Gaussianity.
Practical Tips for Robust Convergence
| Issue | Solution |
|---|---|
| Slow convergence | Try different contrast function |
| Non-convergence | Increase max iterations; check data quality |
| Inconsistent results | Multiple random restarts |
| Noisy components | More samples; regularization |
| One component dominates | Check for outliers; use robust contrast |
Monitoring Convergence
Beyond the standard dot-product criterion, monitor the contrast value $E[G(\mathbf{w}^T\mathbf{z})]$ across iterations and the number of iterations each component requires: a stagnating contrast or components that hit the iteration cap are warning signs.
This page has provided a complete development of the FastICA algorithm, from theoretical derivation through practical implementation.
You now have complete understanding of the FastICA algorithm—its derivation, implementation, and practical usage. You can implement FastICA from scratch and understand the design choices in production implementations. In the next page, we'll explore the fascinating applications of ICA, particularly blind source separation.
What's Next:
The next page explores applications of ICA, focusing on the celebrated blind source separation problem. We'll see how ICA recovers speech signals from cocktail party mixtures, extracts artifacts from EEG recordings, and finds independent features in images. These applications demonstrate the power and versatility of the theoretical framework we've developed.