Imagine you are at a cocktail party. Dozens of conversations happen simultaneously, music plays in the background, and glasses clink. Multiple microphones scattered around the room each record a different mixture of all these sounds. Given only these mixed recordings—where every microphone captures overlapping voices and noises—can you recover the original, individual sound sources? Can you isolate the voice of a single speaker from the cacophony?
This is the cocktail party problem, and it represents one of the most elegant and practically important challenges in signal processing. What makes it remarkable is that the problem seems fundamentally underdetermined: we observe mixtures without knowing how the sources were combined, and we seek to recover sources we've never heard in isolation.
Independent Component Analysis (ICA) provides a mathematically principled solution to this seemingly impossible problem. ICA recognizes that if the original sources are statistically independent—and crucially, if they are non-Gaussian—then the mixing process can be inverted using only the observed mixtures. No knowledge of the mixing process is required. No templates or training examples of the sources are needed. The statistical structure of independence, combined with non-Gaussianity, provides enough information to untangle the mixture.
This module develops ICA from its mathematical foundations through practical implementation and application. We begin here by establishing the ICA model: what we assume, what we can recover, and why these assumptions enable source separation.
By the end of this page, you will understand the complete mathematical formulation of the ICA model, including the generative process, the statistical independence assumption, the critical role of non-Gaussianity, identifiability conditions, and the precise relationship between ICA and related techniques like PCA. You will be equipped to formulate ICA problems and understand what the algorithm can and cannot recover.
ICA is built upon a generative model—a mathematical description of how the observed data is assumed to have been produced. Understanding this model precisely is essential, as every aspect of ICA derives from it.
The Linear Mixing Model
We assume that our observations arise from a linear combination of underlying source signals. Formally:
$$\mathbf{x} = \mathbf{A}\mathbf{s}$$
where:
- $\mathbf{x} = (x_1, \ldots, x_n)^T$ is the vector of observed (mixed) signals,
- $\mathbf{s} = (s_1, \ldots, s_n)^T$ is the vector of unknown, statistically independent source signals,
- $\mathbf{A}$ is the unknown $n \times n$ mixing matrix.
Each observation $x_i$ is a weighted sum of all sources:
$$x_i = a_{i1}s_1 + a_{i2}s_2 + \cdots + a_{in}s_n$$
In the cocktail party analogy:
- each source $s_j$ is one underlying sound (a speaker's voice, the music, the clinking glasses),
- each observation $x_i$ is the recording captured by one microphone,
- each coefficient $a_{ij}$ describes how strongly source $j$ reaches microphone $i$, determined by distances and room acoustics.
In the basic ICA formulation, we assume $\mathbf{A}$ is square ($n \times n$) and invertible. This means the number of observations equals the number of sources, and no information is lost in mixing. Extensions exist for overcomplete (more sources than observations) and undercomplete (fewer sources) cases, but they require additional assumptions.
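To make the model concrete, here is a minimal numerical sketch of the mixing step, assuming Python with NumPy; the sources, sample size, and mixing matrix are invented purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_sources, n_samples = 3, 10_000

# Three independent, non-Gaussian sources stacked as rows of s (shape (n, T)).
s = np.vstack([
    rng.uniform(-1, 1, n_samples),             # sub-Gaussian source
    rng.laplace(0, 1, n_samples),              # super-Gaussian source
    np.sign(rng.standard_normal(n_samples)),   # binary +/-1 source
])

A = rng.standard_normal((n_sources, n_sources))  # the (normally unknown) mixing matrix
x = A @ s                                        # observed mixtures, shape (3, 10000)
print(x.shape)
```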
The Inverse Problem
Given only the observed mixtures $\mathbf{x}$, our goal is to find a demixing matrix $\mathbf{W}$ such that:
$$\mathbf{y} = \mathbf{W}\mathbf{x} = \mathbf{W}\mathbf{A}\mathbf{s}$$
recovering the original sources (up to certain ambiguities we'll discuss).
If we find $\mathbf{W} = \mathbf{A}^{-1}$, then: $$\mathbf{y} = \mathbf{A}^{-1}\mathbf{A}\mathbf{s} = \mathbf{s}$$
But here's the remarkable aspect: we never observe $\mathbf{A}$ or $\mathbf{s}$ directly. We only see $\mathbf{x}$. Yet, under the right conditions, we can recover both the mixing matrix and the sources from the mixtures alone.
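A small self-contained sketch (again with an invented mixing matrix and invented sources) shows that when $\mathbf{A}$ happens to be known, inverting it recovers the sources exactly; ICA's task is to achieve the same result without ever seeing $\mathbf{A}$.

```python
import numpy as np

rng = np.random.default_rng(0)
s = np.vstack([rng.uniform(-1, 1, 10_000),
               rng.laplace(0, 1, 10_000)])   # independent sources
A = np.array([[1.0, 0.5], [0.3, 1.0]])       # made-up mixing matrix
x = A @ s                                    # the only thing ICA gets to see

W = np.linalg.inv(A)                         # "cheating": in practice A is unknown
y = W @ x
print(np.allclose(y, s))                     # True: exact recovery when A is known
```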
Time Series Extension
In practice, we typically have multiple observations over time:
$$\mathbf{x}(t) = \mathbf{A}\mathbf{s}(t), \quad t = 1, 2, \ldots, T$$
where $t$ indexes time (or any other sample index). We observe $T$ samples of the $n$-dimensional mixture vector and seek to recover $T$ samples of the $n$-dimensional source vector. The mixing matrix $\mathbf{A}$ is assumed constant across time—this is the instantaneous mixing assumption (no delays or convolutions).
| Symbol | Name | Dimensions | Role in Model |
|---|---|---|---|
| $\mathbf{s}(t)$ | Source signals | $n \times 1$ | Unknown independent latent variables to recover |
| $\mathbf{A}$ | Mixing matrix | $n \times n$ | Unknown linear transformation combining sources |
| $\mathbf{x}(t)$ | Observed signals | $n \times 1$ | Known mixed measurements (our data) |
| $\mathbf{W}$ | Demixing matrix | $n \times n$ | To be estimated; ideally $\mathbf{W} = \mathbf{A}^{-1}$ |
| $\mathbf{y}(t)$ | Estimated sources | $n \times 1$ | $\mathbf{y} = \mathbf{W}\mathbf{x}$, our estimate of $\mathbf{s}$ |
The cornerstone of ICA is the assumption that source signals are mutually statistically independent. This is a much stronger condition than uncorrelatedness, and understanding this distinction is essential.
Independence vs. Uncorrelatedness
Two random variables $X$ and $Y$ are uncorrelated if: $$\text{Cov}(X, Y) = E[XY] - E[X]E[Y] = 0$$
Uncorrelatedness means there is no linear relationship between the variables. Their covariance vanishes.
Two random variables are statistically independent if their joint probability density factorizes: $$p(x, y) = p(x) \cdot p(y)$$
For all values of $x$ and $y$. Independence means there is no relationship—linear or nonlinear—between the variables. Knowledge of one tells you nothing about the other.
Key Insight: Independence implies uncorrelatedness, but uncorrelatedness does not imply independence.
Consider two variables $X$ and $Y$ where $X \sim \text{Uniform}(-1, 1)$ and $Y = X^2$. These variables are perfectly dependent (knowing $X$ determines $Y$ exactly), yet they are uncorrelated: $E[XY] = E[X \cdot X^2] = E[X^3] = 0$, since $X$ is symmetric about zero. PCA exploits uncorrelatedness; ICA requires and exploits full independence.
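A quick numerical check of this example, assuming NumPy; the sample size is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 100_000)
y = x**2                                   # fully determined by x

print(f"corr(X, Y)   = {np.corrcoef(x, y)[0, 1]:+.4f}")     # ~0: uncorrelated
print(f"corr(X^2, Y) = {np.corrcoef(x**2, y)[0, 1]:+.4f}")  # 1.0: clearly dependent
```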
Mathematical Formulation of Independence
For ICA with $n$ sources, we require that the source components $s_1, s_2, \ldots, s_n$ are mutually independent:
$$p(s_1, s_2, \ldots, s_n) = \prod_{i=1}^{n} p_i(s_i)$$
The joint density equals the product of marginals. This means:
- knowing the value of any source provides no information about any other source,
- all cross-statistics factorize, e.g. $E[g(s_i)h(s_j)] = E[g(s_i)]\,E[h(s_j)]$ for any functions $g, h$ and $i \neq j$,
- not just correlations but all higher-order dependencies between sources vanish.
Why Independence Enables Source Separation
The linear mixing $\mathbf{x} = \mathbf{A}\mathbf{s}$ introduces dependencies among the observed signals. Even if sources are independent, the mixtures $x_i$ are generally not—each mixture contains contributions from multiple sources, creating correlations.
ICA works by finding the demixing matrix $\mathbf{W}$ that restores independence. Among all possible linear transformations of $\mathbf{x}$, only $\mathbf{W} = \mathbf{A}^{-1}$ (or equivalents) produces outputs that are statistically independent.
The ICA Objective (Conceptual)
Find $\mathbf{W}$ such that the components of $\mathbf{y} = \mathbf{W}\mathbf{x}$ are as independent as possible:
$$\mathbf{W}^* = \arg\max_{\mathbf{W}} \text{Independence}(\mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_n)$$
Different ICA algorithms differ in how they measure and optimize independence—topics we'll explore in subsequent pages.
Statistical independence is necessary but not sufficient for ICA. A second, equally critical assumption is that the sources must be non-Gaussian. This requirement is not arbitrary—it is mathematically fundamental to the identifiability of the ICA model.
Why Gaussianity Breaks ICA
For Gaussian distributions, uncorrelatedness is equivalent to independence. This seemingly positive property is actually catastrophic for ICA.
Consider two independent Gaussian sources: $$s_1 \sim N(0, 1), \quad s_2 \sim N(0, 1)$$
Their joint distribution is: $$p(s_1, s_2) = \frac{1}{2\pi} \exp\left(-\frac{s_1^2 + s_2^2}{2}\right)$$
This is a spherically symmetric 2D Gaussian—it looks identical from every angle. Now apply any orthogonal transformation $\mathbf{Q}$ (rotation):
$$\mathbf{y} = \mathbf{Q}\mathbf{s}$$
The distribution of $\mathbf{y}$ is still the same spherically symmetric Gaussian: $$p(y_1, y_2) = \frac{1}{2\pi} \exp\left(-\frac{y_1^2 + y_2^2}{2}\right)$$
The components $y_1$ and $y_2$ are still independent! Any rotation preserves independence for Gaussian sources.
If sources are Gaussian, there are infinitely many demixing matrices that produce independent outputs. Given mixing $\mathbf{A}$, any matrix $\mathbf{W} = \mathbf{Q}\mathbf{A}^{-1}$ where $\mathbf{Q}$ is orthogonal produces independent Gaussian outputs. The original sources cannot be uniquely identified—all rotations are equally valid solutions.
Non-Gaussianity Breaks the Symmetry
Non-Gaussian distributions do not have spherical symmetry. Consider two independent sources with uniform distributions:
$$s_1 \sim U(-1, 1), \quad s_2 \sim U(-1, 1)$$
Their joint distribution fills a square in the $(s_1, s_2)$ plane—clearly not rotationally symmetric! If we rotate this distribution by 45°, the support becomes a diamond and the marginal distributions are no longer uniform. The rotated components are still uncorrelated, but they are no longer independent.
This asymmetry is what makes ICA possible. For non-Gaussian sources:
- rotating the joint distribution changes the marginals and generally destroys independence,
- only the correct demixing (up to sign, scale, and permutation) restores fully independent components,
- the shape of the joint distribution therefore pins down the mixing directions, as the sketch below illustrates.
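The sketch below (illustrative, assuming NumPy) makes the symmetry argument numerical: a 45° rotation leaves independent Gaussian components independent, but makes independent uniform components dependent, as measured by the correlation between their squares.

```python
import numpy as np

rng = np.random.default_rng(2)
T = 200_000
theta = np.pi / 4                      # a 45-degree rotation
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

def sq_corr(z):
    """Correlation between squared components: ~0 for independent pairs."""
    return np.corrcoef(z[0]**2, z[1]**2)[0, 1]

gauss = rng.standard_normal((2, T))    # independent Gaussian sources
unif = rng.uniform(-1, 1, (2, T))      # independent uniform sources

print(f"rotated Gaussian: {sq_corr(Q @ gauss):+.3f}")  # ~0: still independent
print(f"rotated uniform : {sq_corr(Q @ unif):+.3f}")   # clearly nonzero: dependence created
```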
The Central Limit Theorem Connection
The Central Limit Theorem states that sums of independent random variables tend toward Gaussianity. This provides a powerful intuition:
- Each mixture $x_i = \sum_j a_{ij} s_j$ is a sum of independent sources, so it is typically more Gaussian than the individual sources it contains.
- Conversely, a linear combination $\mathbf{w}^T\mathbf{x}$ of the mixtures is least Gaussian precisely when it isolates (a scaled copy of) a single original source.
This is why many ICA algorithms work by maximizing non-Gaussianity (measured by kurtosis, negentropy, etc.)—a topic we'll explore in detail later.
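A small sketch of this intuition, assuming NumPy: excess kurtosis (zero for a Gaussian) shrinks toward zero when independent sources are mixed.

```python
import numpy as np

rng = np.random.default_rng(3)
T = 500_000

def excess_kurtosis(v):
    z = (v - v.mean()) / v.std()
    return (z**4).mean() - 3.0          # 0 for a Gaussian

u1, u2 = rng.uniform(-1, 1, T), rng.uniform(-1, 1, T)
l1, l2 = rng.laplace(0, 1, T), rng.laplace(0, 1, T)

print(f"uniform source : {excess_kurtosis(u1):+.2f}")                  # ~ -1.2
print(f"uniform mixture: {excess_kurtosis((u1 + u2) / 2**0.5):+.2f}")  # ~ -0.6, closer to 0
print(f"laplace source : {excess_kurtosis(l1):+.2f}")                  # ~ +3.0
print(f"laplace mixture: {excess_kurtosis((l1 + l2) / 2**0.5):+.2f}")  # ~ +1.5, closer to 0
```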
| Property | Gaussian Sources | Non-Gaussian Sources |
|---|---|---|
| Orthogonal transformation effect | Preserves independence | Destroys independence (generally) |
| Uniqueness of solution | Infinitely many valid demixings | Unique (up to scale/permutation) |
| Identifiability | Not identifiable | Identifiable |
| Distribution symmetry | Spherically symmetric | Non-symmetric shapes |
| ICA applicability | Cannot apply standard ICA | Standard ICA works |
At Most One Gaussian Source
A weaker condition suffices: ICA works as long as at most one source is Gaussian. If exactly one source is Gaussian, the non-Gaussian sources can still be recovered, and the single Gaussian component is determined as whatever remains once they have been removed.
Intuitively: the Gaussian component contributes a "sphere" to the joint distribution, which can be rotated freely. But the non-Gaussian components contribute distinctive, non-spherical shapes that pin down their directions.
Common Non-Gaussian Distributions in Practice
Many real-world signals are naturally non-Gaussian:
- Speech and audio signals are super-Gaussian (heavy-tailed and sparse): mostly near zero, with occasional large amplitudes.
- Natural images have sparse, heavy-tailed statistics when passed through local filters (e.g., edge detectors).
- Biomedical signals such as EEG rhythms, eye-blink and cardiac artifacts have markedly non-Gaussian amplitude distributions.
- Many communication signals (e.g., binary or uniformly distributed symbol streams) are sub-Gaussian.
This natural non-Gaussianity of real signals is why ICA has found such broad applicability.
Even with independence and non-Gaussianity, ICA cannot recover sources with perfect uniqueness. Certain inherent ambiguities exist that no algorithm can resolve. Understanding these ambiguities is essential for interpreting ICA results correctly.
Ambiguity 1: Sign (Polarity) Ambiguity
If $s_i$ is a source, then $-s_i$ is equally valid. From the model: $$\mathbf{x} = \mathbf{A}\mathbf{s}$$
we can write: $$\mathbf{x} = (\mathbf{A}\mathbf{D})(\mathbf{D}^{-1}\mathbf{s})$$
where $\mathbf{D}$ is a diagonal matrix with entries $\pm 1$. The transformation $\mathbf{D}^{-1}\mathbf{s}$ simply flips the sign of some sources, and $\mathbf{A}\mathbf{D}$ is an equally valid mixing matrix.
There is no way to determine the "true" sign of sources from the observed mixtures alone—both $s_i$ and $-s_i$ produce the same statistical relationships.
In audio separation, inverting the sign of a recovered speech signal is inaudible—we hear the same sound. In brain imaging, the sign of a component is typically assigned by convention (e.g., positive values represent activation). The sign ambiguity is usually benign in applications.
Ambiguity 2: Scale (Amplitude) Ambiguity
If $s_i$ is a source with some variance, scaling it by any non-zero constant $c_i$ produces: $$\mathbf{x} = (\mathbf{A}\mathbf{D}_{\text{scale}})(\mathbf{D}_{\text{scale}}^{-1}\mathbf{s})$$
where $\mathbf{D}_{\text{scale}}$ is diagonal with entries $c_i$. The scaling of sources can be absorbed into the mixing matrix.
This means we can only recover sources up to arbitrary scaling. Typically, we adopt a convention like unit variance: $$\text{Var}(s_i) = 1 \quad \text{for all } i$$
and correspondingly adjust the mixing matrix columns.
Ambiguity 3: Order (Permutation) Ambiguity
The labeling of sources as $s_1, s_2, \ldots, s_n$ is arbitrary. Any permutation of the sources corresponds to a permutation of the columns of $\mathbf{A}$: $$\mathbf{x} = (\mathbf{A}\mathbf{P})(\mathbf{P}^T\mathbf{s})$$
where $\mathbf{P}$ is a permutation matrix. There is no intrinsic ordering of independent sources.
The Identifiability Theorem
Combining these observations, we have the fundamental ICA identifiability result:
Theorem: Under the ICA model with at most one Gaussian source, the mixing matrix $\mathbf{A}$ and sources $\mathbf{s}$ are identifiable up to:
- Permutation of sources (reordering)
- Scaling of sources (absorbed into mixing matrix)
- Sign flips of sources (polarity)
Formally, if $\mathbf{A}$ is the true mixing matrix, then $\mathbf{A}'$ is also a valid solution if and only if $\mathbf{A}' = \mathbf{A}\mathbf{P}\mathbf{D}$ where $\mathbf{P}$ is a permutation matrix and $\mathbf{D}$ is a diagonal scaling/sign matrix.
What These Ambiguities Mean Practically
The ambiguities are typically inconsequential:
- Scale and sign are fixed by convention, e.g. unit-variance components with a chosen polarity.
- The ordering is resolved after the fact, e.g. by ranking components by kurtosis, explained variance, or domain knowledge.
- Applications usually care about the waveform or spatial pattern of each component, not its label, amplitude, or polarity.
What matters is that the subspace structure and independence relationships are uniquely determined. We recover the true independent sources, just without fixed labels, scales, or signs.
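As a hedged illustration of the identifiability result, the sketch below runs scikit-learn's FastICA (an algorithm discussed later in this module) on synthetic mixtures; the sources and mixing matrix are invented. The recovered components should match the true sources only up to permutation, sign, and scale, which shows up as a single large entry per row in the absolute correlation matrix.

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(4)
T = 20_000
s = np.vstack([rng.uniform(-1, 1, T),
               rng.laplace(0, 1, T)])           # true sources, shape (2, T)
A = np.array([[1.0, 0.6], [0.4, 1.0]])          # unknown mixing matrix
x = A @ s                                       # observed mixtures

ica = FastICA(n_components=2, random_state=0)
y = ica.fit_transform(x.T).T                    # estimated sources, shape (2, T)

# Absolute correlations between estimated and true sources:
# one entry near 1 per row -> a scaled, signed permutation of the truth.
C = np.abs(np.corrcoef(np.vstack([y, s]))[:2, 2:])
print(np.round(C, 2))
```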
Having established the model and assumptions, we can now precisely state the ICA problem.
The ICA Problem
Given: Observations $\mathbf{x}(1), \mathbf{x}(2), \ldots, \mathbf{x}(T)$ assumed to be generated by $\mathbf{x}(t) = \mathbf{A}\mathbf{s}(t)$ where:
- the source components $s_1, \ldots, s_n$ are mutually statistically independent,
- at most one source is Gaussian, and
- the mixing matrix $\mathbf{A}$ is square, invertible, and unknown.
Find: A demixing matrix $\mathbf{W}$ such that $\mathbf{y}(t) = \mathbf{W}\mathbf{x}(t)$ recovers the independent sources up to permutation, scaling, and sign.
Preprocessing: Centering
As with PCA, we typically center the data by subtracting the mean: $$\tilde{\mathbf{x}}(t) = \mathbf{x}(t) - E[\mathbf{x}]$$
This is equivalent to assuming $E[\mathbf{s}] = \mathbf{0}$ (zero-mean sources). The mean can be absorbed into the model if needed but is typically removed for simplicity.
A crucial preprocessing step is whitening (or sphering): transforming the data so that it has identity covariance matrix. Whitening removes second-order correlations and reduces the ICA problem to finding an orthogonal matrix. This dramatically simplifies optimization and is standard practice in ICA implementations.
Whitening: Reducing to Orthogonal ICA
Let $\mathbf{C}_x = E[\mathbf{x}\mathbf{x}^T]$ be the covariance matrix of the centered observations. We compute the whitening transformation:
$$\mathbf{V} = \mathbf{C}_x^{-1/2}$$
using eigendecomposition: if $\mathbf{C}_x = \mathbf{E}\mathbf{D}\mathbf{E}^T$, then $\mathbf{V} = \mathbf{E}\mathbf{D}^{-1/2}\mathbf{E}^T$.
Applying whitening: $$\mathbf{z} = \mathbf{V}\mathbf{x} = \mathbf{V}\mathbf{A}\mathbf{s}$$
The whitened data $\mathbf{z}$ has covariance: $$E[\mathbf{z}\mathbf{z}^T] = \mathbf{V}\mathbf{A}E[\mathbf{s}\mathbf{s}^T]\mathbf{A}^T\mathbf{V}^T = \mathbf{V}\mathbf{A}\mathbf{A}^T\mathbf{V}^T = \mathbf{I}$$
(assuming unit-variance sources: $E[\mathbf{s}\mathbf{s}^T] = \mathbf{I}$).
Now, the effective mixing matrix $\tilde{\mathbf{A}} = \mathbf{V}\mathbf{A}$ satisfies $\tilde{\mathbf{A}}\tilde{\mathbf{A}}^T = \mathbf{I}$, meaning $\tilde{\mathbf{A}}$ is orthogonal!
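A minimal sketch of centering and symmetric whitening via eigendecomposition, assuming NumPy; the unit-variance sources and the mixing matrix are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)
T = 50_000
s = np.vstack([rng.uniform(-1, 1, T), rng.laplace(0, 1, T)])
s /= s.std(axis=1, keepdims=True)          # unit-variance sources (the usual convention)
A = np.array([[2.0, 1.0], [1.0, 1.5]])
x = A @ s

x_c = x - x.mean(axis=1, keepdims=True)    # centering
C = np.cov(x_c)                            # sample covariance, shape (2, 2)
d, E = np.linalg.eigh(C)                   # C = E diag(d) E^T
V = E @ np.diag(d**-0.5) @ E.T             # V = C^{-1/2}  (symmetric whitening)
z = V @ x_c

print(np.round(np.cov(z), 2))              # ~ identity: whitened data
print(np.round((V @ A) @ (V @ A).T, 2))    # ~ identity (up to sampling error):
                                           # the effective mixing V @ A is orthogonal
```

Symmetric whitening $\mathbf{V} = \mathbf{E}\mathbf{D}^{-1/2}\mathbf{E}^T$ is used here to match the formula above; the PCA-style variant $\mathbf{D}^{-1/2}\mathbf{E}^T$ in the preprocessing table below whitens equally well, differing only by an orthogonal factor that the ICA step absorbs.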
The Simplified ICA Problem
After whitening, ICA reduces to finding an orthogonal demixing matrix:
$$\mathbf{z} = \tilde{\mathbf{A}}\mathbf{s}, \quad \mathbf{y} = \tilde{\mathbf{W}}\mathbf{z}$$
where $\tilde{\mathbf{W}} = \tilde{\mathbf{A}}^T$ is also orthogonal. The search space shrinks from all invertible matrices to orthogonal matrices—a much smaller manifold with nice geometric properties.
For $n$ dimensions, the space of orthogonal matrices has dimension $\frac{n(n-1)}{2}$ (compare to $n^2$ for general invertible matrices). For example:
- $n = 2$: 1 free parameter (a single rotation angle) instead of 4,
- $n = 10$: 45 free parameters instead of 100.
| Step | Operation | Purpose | Result |
|---|---|---|---|
| 1 | $\tilde{\mathbf{x}} = \mathbf{x} - E[\mathbf{x}]$ | Remove mean | Zero-mean observations |
| 2 | $\mathbf{C}_x = \frac{1}{T}\sum_t \tilde{\mathbf{x}}(t)\tilde{\mathbf{x}}(t)^T$ | Estimate second-order statistics | Sample covariance matrix |
| 3 | $\mathbf{C}_x = \mathbf{E}\mathbf{D}\mathbf{E}^T$ | Find principal directions | Eigenvalues and eigenvectors |
| 4 | $\mathbf{V} = \mathbf{D}^{-1/2}\mathbf{E}^T$ | Construct sphering transform | Decorrelating transformation |
| 5 | $\mathbf{z} = \mathbf{V}\tilde{\mathbf{x}}$ | Apply whitening | Unit covariance, uncorrelated data |
| 6 | Find orthogonal $\tilde{\mathbf{W}}$ | Maximize independence | Estimated independent components |
ICA is often confused with or compared to other dimensionality reduction and latent variable methods. Understanding the precise relationships clarifies when each method is appropriate.
ICA vs. PCA
PCA and ICA both seek linear transformations of data but with fundamentally different objectives:
| Aspect | PCA | ICA |
|---|---|---|
| Objective | Maximize variance | Maximize independence |
| Constraint | Orthogonality | Independence |
| Statistical order | Second-order (covariance) | Higher-order (beyond covariance) |
| Components | Uncorrelated | Independent |
| Ordering | Ranked by explained variance | Unordered (arbitrary permutation) |
| Gaussian data | Works as usual | Cannot separate sources (not identifiable) |
The Whitening-Then-Rotation View
An illuminating perspective connects PCA and ICA:
1. Whiten the data (a PCA-like step that removes all second-order, i.e. covariance, structure).
2. Rotate the whitened data to the orientation in which the components are independent (the step that is genuinely ICA).
After whitening, both PCA and ICA components are uncorrelated with unit variance. But PCA stops here (or ranks by original variance), while ICA continues to find the rotation that separates independent sources.
For Gaussian data: All rotations are equivalent (all produce independent components), so ICA is undefined. PCA's specific rotation (aligning with original variance) is as good as any other.
For non-Gaussian data: One rotation is special—the one that separates true independent sources. ICA finds this rotation.
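An illustrative comparison, assuming scikit-learn's PCA and FastICA: both produce uncorrelated, unit-variance components from the same non-Gaussian data, but only ICA's rotation also removes the higher-order dependence (measured here, as before, by the correlation between squared components).

```python
import numpy as np
from sklearn.decomposition import PCA, FastICA

rng = np.random.default_rng(6)
T = 100_000
s = rng.uniform(-1, 1, (2, T))                 # independent, non-Gaussian sources
A = np.array([[2.0, 1.0], [1.0, 1.5]])         # made-up mixing matrix
x = (A @ s).T                                  # (T, 2): samples in rows

def dependence(y):
    """Correlation of squared components: ~0 only if components are independent."""
    return np.corrcoef(y[:, 0]**2, y[:, 1]**2)[0, 1]

y_pca = PCA(whiten=True).fit_transform(x)
y_ica = FastICA(n_components=2, random_state=0).fit_transform(x)

print(f"PCA: {dependence(y_pca):+.3f}")        # generally nonzero: still dependent
print(f"ICA: {dependence(y_ica):+.3f}")        # ~0: independence restored
```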
ICA vs. Factor Analysis
Factor Analysis (FA) posits a generative model similar to ICA: $$\mathbf{x} = \mathbf{\Lambda}\mathbf{f} + \boldsymbol{\epsilon}$$
where $\mathbf{f}$ are latent factors, $\mathbf{\Lambda}$ is the loading matrix, and $\boldsymbol{\epsilon}$ is noise.
Key differences:
| Aspect | Factor Analysis | ICA |
|---|---|---|
| Factor distribution | Gaussian (typically) | Non-Gaussian (required) |
| Noise model | Diagonal-covariance Gaussian | Usually noise-free model |
| Uniqueness | Rotational indeterminacy | Unique (up to sign/permutation) |
| Estimation | Maximum likelihood | Independence maximization |
| Interpretation | Correlated factors (oblique rotation) | Strictly independent sources |
A useful mental model: PCA finds uncorrelated directions. Factor Analysis models correlations with Gaussian factors plus noise. ICA finds truly independent directions by exploiting non-Gaussianity. Each adds constraints/assumptions that enable stronger conclusions about the latent structure.
We have established the complete theoretical foundation of Independent Component Analysis. The ICA model is elegant in its simplicity yet powerful in its implications.
You now understand the complete mathematical framework of Independent Component Analysis. The generative model, the critical assumptions of independence and non-Gaussianity, the identifiability theorem, and the relationship to PCA form the foundation for everything that follows. In the next page, we'll explore why non-Gaussianity is the key to ICA and how it can be measured and maximized.
What's Next:
The next page develops the theory of non-Gaussianity in depth. We'll explore multiple ways to measure departure from Gaussianity—kurtosis, negentropy, and mutual information—and understand how these measures connect to the ICA objective. This will lead directly to the algorithmic approaches for solving ICA, including the celebrated FastICA algorithm.