In the previous page, we established that Independent Component Analysis requires sources to be non-Gaussian. This is not merely a technical assumption—it is the very mechanism that makes ICA possible. Without non-Gaussianity, the mixing matrix cannot be uniquely identified; with sufficient non-Gaussianity, a simple optimization can recover the original sources from their mixtures.
This page develops the theory of non-Gaussianity in depth. We will explore multiple ways to quantify how much a distribution deviates from Gaussian: kurtosis (based on fourth-order moments), negentropy (based on information theory), and connections to mutual information. These measures are not merely theoretical curiosities—they form the objective functions that ICA algorithms maximize.
Understanding non-Gaussianity deeply provides insight into both why ICA works at all and how its algorithms are designed.
By the end of this page, you will understand why the Gaussian distribution is the unique fixed point under linear mixing, how kurtosis and negentropy quantify non-Gaussianity, the mathematical relationship between independence maximization and non-Gaussianity maximization, and how these concepts lead to practical ICA algorithms.
The Gaussian distribution occupies a unique position in probability theory and statistics. For ICA, the critical property is encapsulated in the Central Limit Theorem (CLT).
The Central Limit Theorem
Let $X_1, X_2, \ldots, X_n$ be independent random variables with finite mean $\mu_i$ and variance $\sigma_i^2$. Under mild conditions, the sum:
$$S_n = \sum_{i=1}^{n} X_i$$
converges in distribution to a Gaussian as $n \to \infty$:
$$\frac{S_n - \sum_i \mu_i}{\sqrt{\sum_i \sigma_i^2}} \xrightarrow{d} N(0, 1)$$
The CLT tells us that sums of independent variables tend toward Gaussianity, regardless of the original distributions.
Implications for ICA
Each observed signal in ICA is a weighted sum of sources: $$x_j = \sum_{i} a_{ji} s_i$$
By the CLT intuition, a sum of independent random variables is generally closer to Gaussian than the individual terms, so each mixture $x_j$ is typically more Gaussian than any single source $s_i$.
The Key Insight: If mixtures are more Gaussian than sources, then to recover sources, we should undo the mixing by maximizing non-Gaussianity. The direction of maximum non-Gaussianity corresponds to an individual source rather than a mixture.
Think of mixing as "Gaussianizing." Sources are non-Gaussian; mixing (summing) makes them more Gaussian. ICA inverts this by finding directions (linear combinations) that are maximally non-Gaussian—these directions correspond to the original sources.
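To see this numerically, here is a minimal sketch (NumPy; the Laplace sources, sample size, equal mixing weights, and the helper `excess_kurtosis` are illustrative choices, not from the text). By the additivity property discussed below, an equal-weight mixture of two unit-variance Laplace sources has about half the excess kurtosis of either source:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 200_000

# Two independent, unit-variance Laplace (super-Gaussian) sources.
s1 = rng.laplace(scale=1 / np.sqrt(2), size=T)
s2 = rng.laplace(scale=1 / np.sqrt(2), size=T)

def excess_kurtosis(y):
    """Sample fourth cumulant E[y^4] - 3 E[y^2]^2 of a centered signal."""
    y = y - y.mean()
    return np.mean(y**4) - 3 * np.mean(y**2) ** 2

# An equal-weight mixture, rescaled to unit variance for a fair comparison.
x = (s1 + s2) / np.sqrt(2)

print(excess_kurtosis(s1))  # roughly 3 (Laplace source)
print(excess_kurtosis(x))   # roughly 1.5: the mixture is closer to Gaussian (kurtosis 0)
```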
Gaussians as Fixed Points
Another way to understand the Gaussian's special status: the Gaussian distribution is the unique fixed point of the mixing operation (up to scaling).
If $s_1, s_2 \sim N(0, 1)$ independently, then for any coefficients $a_1, a_2$: $$a_1 s_1 + a_2 s_2 \sim N(0, a_1^2 + a_2^2)$$
The result is still Gaussian. No matter how we mix Gaussians, we get Gaussians. The distribution shape is invariant under linear combination.
For non-Gaussian sources, this invariance breaks:
The non-Gaussian "signature" of sources degrades under mixing and is maximized when we recover the true sources.
Quantifying the Effect
For super-Gaussian (heavy-tailed) sources with large positive kurtosis, mixing pulls the kurtosis down toward zero. For sub-Gaussian (light-tailed) sources with negative kurtosis, mixing pulls the kurtosis up toward zero. In both cases, the magnitude of the kurtosis (the absolute departure from Gaussianity) decreases with mixing, as summarized in the table below.
| Property | Individual Source | Mixture of Sources | Gaussian Limit |
|---|---|---|---|
| Kurtosis (super-Gaussian) | Large positive | Reduced positive | Zero |
| Kurtosis (sub-Gaussian) | Negative | Less negative | Zero |
| Negentropy | Positive | Reduced | Zero |
| Sparsity | High | Reduced | None |
| Heavy tails | Present | Diminished | Light tails (exp decay) |
Kurtosis is the simplest and most classical measure of non-Gaussianity. It quantifies the "tailedness" and "peakedness" of a distribution using the fourth central moment.
Definition of Kurtosis
For a zero-mean random variable $Y$, the fourth cumulant (excess kurtosis) is:
$$\text{kurt}(Y) = E[Y^4] - 3(E[Y^2])^2$$
For a standardized variable with $E[Y] = 0$ and $\text{Var}(Y) = 1$:
$$\text{kurt}(Y) = E[Y^4] - 3$$
The subtraction of 3 normalizes kurtosis so that a Gaussian distribution has kurtosis = 0.
Properties of Kurtosis
Gaussian reference: $\text{kurt}(Y) = 0$ if and only if $Y$ matches the fourth moment of a Gaussian with the same variance
Sign indicates tail behavior: positive kurtosis indicates a super-Gaussian distribution (heavy tails, sharp peak), while negative kurtosis indicates a sub-Gaussian distribution (light tails, flat peak).
Scale behavior: For any constant $c \neq 0$, the fourth cumulant scales as: $$\text{kurt}(cY) = c^4\,\text{kurt}(Y)$$ The normalized kurtosis $\text{kurt}(Y)/\text{Var}(Y)^2$ is dimensionless and invariant to scaling, which is why kurtosis-based contrasts are evaluated on standardized (unit-variance) signals.
Additivity under independence: For independent $Y_1, Y_2$, the fourth cumulant is additive: $$\text{kurt}(Y_1 + Y_2) = \text{kurt}(Y_1) + \text{kurt}(Y_2)$$ Writing $\kappa_i = \text{kurt}(Y_i)/\sigma_i^4$ for the normalized kurtosis, the normalized kurtosis of the sum becomes $$\kappa(Y_1 + Y_2) = \frac{\kappa_1\,\sigma_1^4 + \kappa_2\,\sigma_2^4}{(\sigma_1^2 + \sigma_2^2)^2}$$
Since $\sigma_1^4 + \sigma_2^4 < (\sigma_1^2 + \sigma_2^2)^2$, the weights sum to less than one: when the sources' kurtoses share a sign, the sum's normalized kurtosis is smaller in magnitude than that of the more kurtotic source. Mixing regresses kurtosis toward zero.
Super-Gaussian distributions have heavier tails and/or sharper peaks than Gaussian. Examples: Laplace, Student's t, sparse distributions. Sub-Gaussian distributions have lighter tails and/or flatter peaks. Examples: Uniform, bounded distributions. Both are non-Gaussian and usable in ICA—the sign of kurtosis affects algorithmic details but not fundamental capability.
Common Distributions and Their Kurtosis
| Distribution | Kurtosis | Type |
|---|---|---|
| Laplace | 3 | Super-Gaussian |
| Student's t (ν=5) | 6 | Super-Gaussian |
| Exponential | 6 | Super-Gaussian |
| Gaussian | 0 | Reference |
| Uniform | -1.2 | Sub-Gaussian |
| Bernoulli (p=0.5) | -2 | Sub-Gaussian |
| Arcsine | -1.5 | Sub-Gaussian |
Kurtosis as ICA Objective
Since independent sources have higher |kurtosis| than their mixtures, we can formulate ICA as:
Maximize $|\text{kurt}(\mathbf{w}^T\mathbf{x})|$ over unit vectors $\mathbf{w}$
where $\mathbf{x}$ is the whitened observed data. The directions of maximum |kurtosis| correspond to the independent sources.
Caution: We maximize the absolute value because sources may be super- or sub-Gaussian. A super-Gaussian source has positive kurtosis; a sub-Gaussian source has negative kurtosis. Both are far from Gaussian.
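Here is a minimal sketch of this objective (NumPy only; the uniform sources, mixing matrix, and angle-grid resolution are arbitrary illustrative choices): after whitening a two-channel mixture, scanning unit vectors $\mathbf{w} = (\cos\theta, \sin\theta)$ for the largest $|\text{kurt}(\mathbf{w}^T\mathbf{x})|$ recovers a source direction.

```python
import numpy as np

rng = np.random.default_rng(1)
T = 100_000

# Two independent sub-Gaussian (uniform, unit-variance) sources and a fixed mixing matrix.
S = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(2, T))
A = np.array([[1.0, 0.6],
              [0.4, 1.0]])
X = A @ S

# Whiten: zero mean, identity covariance.
X = X - X.mean(axis=1, keepdims=True)
d, E = np.linalg.eigh(np.cov(X))
Z = E @ np.diag(d**-0.5) @ E.T @ X

def excess_kurtosis(y):
    return np.mean(y**4) - 3 * np.mean(y**2) ** 2

# Scan unit vectors w = (cos t, sin t); peaks of |kurtosis| mark source directions.
thetas = np.linspace(0, np.pi, 721)
scores = [abs(excess_kurtosis(np.cos(t) * Z[0] + np.sin(t) * Z[1])) for t in thetas]
best = thetas[int(np.argmax(scores))]
y = np.cos(best) * Z[0] + np.sin(best) * Z[1]

# The recovered component should correlate strongly with one source (up to sign).
print([abs(np.corrcoef(y, s)[0, 1]) for s in S])
```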
Estimating Kurtosis from Data
Given samples $y_1, y_2, \ldots, y_T$ (zero-mean):
$$\widehat{\text{kurt}}(Y) = \frac{1}{T}\sum_{t=1}^{T} y_t^4 - 3\left(\frac{1}{T}\sum_{t=1}^{T} y_t^2\right)^2$$
For standardized data (unit variance): $\widehat{\text{kurt}}(Y) = \frac{1}{T}\sum_{t=1}^{T} y_t^4 - 3$
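A direct NumPy translation of this estimator (a sketch; the Laplace and uniform test signals are arbitrary illustrative choices) also lets us sanity-check the scale and additivity properties stated above:

```python
import numpy as np

rng = np.random.default_rng(2)

def excess_kurtosis(y):
    """Sample fourth cumulant E[y^4] - 3 E[y^2]^2 of a centered signal."""
    y = np.asarray(y) - np.mean(y)
    return np.mean(y**4) - 3 * np.mean(y**2) ** 2

# Unit-variance test signals: theoretical kurtosis +3 (Laplace) and -1.2 (uniform).
y1 = rng.laplace(scale=1 / np.sqrt(2), size=500_000)
y2 = rng.uniform(-np.sqrt(3), np.sqrt(3), size=500_000)
print(excess_kurtosis(y1), excess_kurtosis(y2))

# Scale behavior of the fourth cumulant: kurt(c*y) = c^4 * kurt(y).
print(excess_kurtosis(2 * y1) / excess_kurtosis(y1))  # roughly 16

# Additivity for independent signals: kurt(y1 + y2) = kurt(y1) + kurt(y2).
print(excess_kurtosis(y1 + y2), excess_kurtosis(y1) + excess_kurtosis(y2))
```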
Negentropy provides a more principled, robust measure of non-Gaussianity based on information theory. It quantifies how much a distribution differs from the "most random" (maximum entropy) distribution with the same variance—the Gaussian.
Entropy and the Gaussian Maximum
The differential entropy of a random variable $Y$ with density $p(y)$ is:
$$H(Y) = -\int p(y) \log p(y) \, dy$$
Entropy measures uncertainty or "randomness." For a fixed variance $\sigma^2$, the Gaussian distribution has maximum entropy among all distributions. This is a fundamental mathematical result:
$$H(Y) \leq H(Y_{\text{Gauss}}) = \frac{1}{2}\log(2\pi e \sigma^2)$$
with equality if and only if $Y$ is Gaussian.
Definition of Negentropy
Negentropy is the gap between Gaussian entropy and actual entropy:
$$J(Y) = H(Y_{\text{Gauss}}) - H(Y)$$
where $Y_{\text{Gauss}}$ is a Gaussian with the same mean and variance as $Y$.
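As a quick worked example (natural logarithms throughout), take $Y$ uniform on $[-a, a]$, so $H(Y) = \log(2a)$ and $\sigma^2 = a^2/3$. Then
$$J(Y) = \frac{1}{2}\log\!\left(2\pi e \cdot \frac{a^2}{3}\right) - \log(2a) = \frac{1}{2}\log\!\left(\frac{\pi e}{6}\right) \approx 0.176 \text{ nats}.$$
The result does not depend on $a$, a first hint of the scale invariance noted below.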
Properties of Negentropy: $J(Y) \geq 0$ for every distribution, $J(Y) = 0$ if and only if $Y$ is Gaussian, and $J$ is unchanged by shifts and rescalings of $Y$.
Negentropy measures how much more "structured" or "predictable" a distribution is compared to maximum-entropy Gaussian noise. High negentropy means the signal has distinctive non-Gaussian structure—exactly what we seek in ICA. Negentropy is sometimes called "information-theoretic non-Gaussianity."
Why Negentropy is Optimal for ICA
Negentropy has several theoretical advantages:
Directly measures deviation from Gaussianity: Unlike kurtosis (fourth moment only), negentropy captures all aspects of non-Gaussianity.
Relates to independence via mutual information: For a random vector $\mathbf{Y} = (Y_1, \ldots, Y_n)$: $$I(Y_1; Y_2; \ldots; Y_n) = \sum_i H(Y_i) - H(\mathbf{Y})$$
Minimizing mutual information (maximizing independence) is equivalent to maximizing the sum of marginal negentropies under whitening constraints.
Scale invariance: Negentropy doesn't depend on the variance of the variable.
Non-negative: No sign ambiguity—always maximize, never need absolute value.
The Problem: Estimation
Negentropy requires knowing the true density $p(y)$, which we don't have. Directly estimating density is difficult and data-intensive. We need approximations.
Approximation via Polynomial Expansions
A classical approximation uses Gram-Charlier or Edgeworth expansions. For standardized $Y$ (zero mean, unit variance):
$$J(Y) \approx \frac{1}{12}E[Y^3]^2 + \frac{1}{48}\text{kurt}(Y)^2$$
This shows that, to leading order, negentropy is driven by skewness (the squared third moment) and kurtosis (the squared fourth cumulant); for symmetric distributions the skewness term vanishes, and maximizing the approximation reduces to maximizing squared kurtosis.
The polynomial approximation above uses only third and fourth cumulants. More accurate approximations include sixth cumulants and beyond. However, higher-order cumulants are increasingly difficult to estimate reliably from finite samples, leading to a bias-variance tradeoff.
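As a sketch (assuming standardized samples; the Laplace and Gaussian test signals and the helper `negentropy_poly` are illustrative choices), the polynomial approximation is straightforward to compute from sample moments:

```python
import numpy as np

rng = np.random.default_rng(3)

def negentropy_poly(y):
    """Cumulant approximation J(y) ~ E[y^3]^2/12 + kurt(y)^2/48 for standardized y."""
    y = (y - y.mean()) / y.std()
    skew_term = np.mean(y**3) ** 2 / 12
    kurt_term = (np.mean(y**4) - 3) ** 2 / 48
    return skew_term + kurt_term

print(negentropy_poly(rng.laplace(size=200_000)))  # clearly positive (super-Gaussian)
print(negentropy_poly(rng.normal(size=200_000)))   # close to zero (Gaussian)
```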
Modern Negentropy Approximations
The FastICA algorithm uses approximations based on smooth, non-polynomial functions. For standardized $Y$:
$$J(Y) \approx \left[\,E[G(Y)] - E[G(\nu)]\,\right]^2$$
where $\nu \sim N(0,1)$ is a standard Gaussian and $G$ is a carefully chosen non-quadratic function.
Common choices for $G$:
Logcosh: $G(u) = \log \cosh(u)$
Exponential: $G(u) = -\exp(-u^2/2)$
Quartic: $G(u) = u^4/4$ (its derivative $g(u) = u^3$ is cubic; equivalent to kurtosis)
These non-polynomial approximations are more robust to outliers than kurtosis while remaining cheap to compute from sample averages. The table below summarizes the standard choices; a short numerical sketch follows it.
| Function $G(u)$ | Derivative $g(u)$ | Best For | Properties |
|---|---|---|---|
| $\log\cosh(u)$ | $\tanh(u)$ | General purpose | Robust, smooth, bounded gradient |
| $-\exp(-u^2/2)$ | $u\exp(-u^2/2)$ | Super-Gaussian, outlier-robust | Very robust to extreme values |
| $u^4/4$ | $u^3$ | High-kurtosis sources | Fast, equivalent to kurtosis |
| $u^3/3$ | $u^2$ | Skewed sources | Captures asymmetry |
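As a sketch of the logcosh choice from the table (the approximation holds up to a positive constant; the Monte Carlo estimate of $E[G(\nu)]$, the test signals, and the helper `negentropy_logcosh` are illustrative assumptions), larger values indicate stronger non-Gaussianity:

```python
import numpy as np

rng = np.random.default_rng(4)

# Estimate E[G(nu)] for a standard Gaussian by Monte Carlo (a constant near 0.37).
_EG_GAUSS = np.mean(np.log(np.cosh(rng.normal(size=1_000_000))))

def negentropy_logcosh(y):
    """FastICA-style contrast (E[log cosh(y)] - E[log cosh(nu)])^2 for standardized y."""
    y = (y - y.mean()) / y.std()
    return (np.mean(np.log(np.cosh(y))) - _EG_GAUSS) ** 2

print(negentropy_logcosh(rng.laplace(size=200_000)))          # super-Gaussian: > 0
print(negentropy_logcosh(rng.uniform(-1, 1, size=200_000)))   # sub-Gaussian:   > 0
print(negentropy_logcosh(rng.normal(size=200_000)))           # Gaussian:       ~ 0
```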
The deepest theoretical foundation for ICA comes from mutual information—an information-theoretic measure of dependence that directly quantifies statistical independence.
Mutual Information Definition
For random variables $Y_1$ and $Y_2$, mutual information is:
$$I(Y_1; Y_2) = H(Y_1) + H(Y_2) - H(Y_1, Y_2)$$
$$I(Y_1; Y_2) = \int\!\!\int p(y_1, y_2) \log \frac{p(y_1, y_2)}{p(y_1)p(y_2)} \, dy_1 \, dy_2$$
Mutual information is always non-negative, equals zero if and only if the variables are statistically independent, and is invariant under invertible transformations applied to the individual variables.
Generalization to Multiple Variables
For $n$ variables, the total correlation (multi-information) is:
$$I(Y_1; Y_2; \ldots; Y_n) = \sum_{i=1}^{n} H(Y_i) - H(Y_1, Y_2, \ldots, Y_n)$$
This equals the KL divergence from the joint distribution to the product of marginals:
$$I(\mathbf{Y}) = D_{KL}\!\left(p(\mathbf{y}) \,\Big\|\, \prod_i p(y_i)\right)$$
ICA Objective: Minimize Mutual Information
Since mutual information is zero iff components are independent:
$$\mathbf{W}^* = \arg\min_{\mathbf{W}} I(y_1; y_2; \ldots; y_n)$$
where $\mathbf{y} = \mathbf{W}\mathbf{x}$. This directly seeks independence.
For whitened data with orthogonal demixing: $I(\mathbf{y}) = C - \sum_i J(y_i)$ where $C$ is a constant (depends only on the data, not on $\mathbf{W}$). Minimizing mutual information is equivalent to maximizing the sum of negentropies. This connects independence directly to non-Gaussianity!
Derivation of the Mutual Information-Negentropy Connection
Let $\mathbf{y} = \mathbf{W}\mathbf{x}$ where $\mathbf{x}$ is whitened and $\mathbf{W}$ is orthogonal.
The mutual information of $\mathbf{y}$ is: $$I(\mathbf{y}) = \sum_i H(y_i) - H(\mathbf{y})$$
For an orthogonal transformation: $$H(\mathbf{y}) = H(\mathbf{x}) + \log|\det(\mathbf{W})| = H(\mathbf{x})$$
(since $|\det(\mathbf{W})| = 1$ for orthogonal $\mathbf{W}$).
Also, each $y_i$ has unit variance (whitened + orthogonal), so: $$H(y_i) = H(\nu) - J(y_i)$$
where $\nu \sim N(0,1)$ and $J(y_i)$ is the negentropy of $y_i$.
Substituting: $$I(\mathbf{y}) = \sum_i [H(\nu) - J(y_i)] - H(\mathbf{x})$$ $$= n \cdot H(\nu) - H(\mathbf{x}) - \sum_i J(y_i)$$ $$= C - \sum_i J(y_i)$$
where $C = n \cdot H(\nu) - H(\mathbf{x})$ is constant for the given whitened data.
Therefore: Minimizing mutual information $\Leftrightarrow$ Maximizing sum of negentropies.
Algorithmic Implications
To find independent components, we can therefore either minimize the mutual information $I(\mathbf{y})$ directly, which requires estimating densities, or maximize the sum of marginal negentropies $\sum_i J(y_i)$ using the approximations developed above.
The second approach is computationally simpler and is the basis of the FastICA algorithm.
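Here is a minimal sketch of the second approach (the two-source setup, logcosh contrast, and grid search over rotation angles are illustrative; real implementations use the fixed-point iteration developed on the next page). Because the data are whitened, every candidate demixing matrix is a rotation, so in two dimensions a one-parameter search suffices:

```python
import numpy as np

rng = np.random.default_rng(5)
T = 100_000

# Two independent Laplace sources, mixed and then whitened.
S = rng.laplace(scale=1 / np.sqrt(2), size=(2, T))
A = np.array([[1.0, 0.5],
              [0.3, 1.0]])
X = A @ S
X = X - X.mean(axis=1, keepdims=True)
d, E = np.linalg.eigh(np.cov(X))
Z = E @ np.diag(d**-0.5) @ E.T @ X

EG_GAUSS = np.mean(np.log(np.cosh(rng.normal(size=1_000_000))))

def J_logcosh(y):
    y = (y - y.mean()) / y.std()
    return (np.mean(np.log(np.cosh(y))) - EG_GAUSS) ** 2

def rotation(t):
    return np.array([[np.cos(t), -np.sin(t)], [np.sin(t), np.cos(t)]])

# Pick the rotation maximizing the sum of marginal negentropies
# (equivalently, minimizing the mutual information of the outputs).
thetas = np.linspace(0, np.pi / 2, 181)
best = max(thetas, key=lambda t: sum(J_logcosh(y) for y in rotation(t) @ Z))
Y = rotation(best) @ Z

# Each recovered row should match one source up to sign and permutation.
print(np.abs(np.corrcoef(np.vstack([Y, S]))[0:2, 2:4]).round(2))
```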
Connection to Maximum Likelihood ICA
Another perspective: assume source densities $p_i(s_i)$ (typically unknown, approximated). The likelihood of observed data is:
$$L(\mathbf{W}) = \prod_{t} \left( |\det \mathbf{W}| \prod_i p_i\!\left([\mathbf{W}\mathbf{x}(t)]_i\right) \right)$$
Maximizing the (average) log-likelihood leads to gradient updates involving: $$\nabla_{\mathbf{W}} \left(\tfrac{1}{T}\log L\right) = \mathbf{W}^{-T} - \frac{1}{T}\sum_t \mathbf{g}(\mathbf{y}(t))\,\mathbf{x}(t)^T$$
where $\mathbf{g}(\mathbf{y}) = (g(y_1), \ldots, g(y_n))^T$ and $g(y) = -\frac{d}{dy}\log p(y) = -\frac{p'(y)}{p(y)}$.
This connects to negentropy approximation: the function $g$ in FastICA corresponds to the score function of an assumed source density, and maximizing negentropy is equivalent to maximum likelihood under a super-Gaussian prior.
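One common way to use this gradient is the natural-gradient (relative-gradient) form obtained by right-multiplying it with $\mathbf{W}^T\mathbf{W}$, which gives the update $\mathbf{W} \leftarrow \mathbf{W} + \eta\,(\mathbf{I} - E[\mathbf{g}(\mathbf{y})\mathbf{y}^T])\,\mathbf{W}$. The sketch below applies it with $g(u) = \tanh(u)$, i.e. assuming a super-Gaussian source model; the Laplace sources, mixing matrix, step size, and iteration count are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(6)
T = 50_000

# Super-Gaussian (Laplace) sources and an arbitrary mixing matrix.
S = rng.laplace(scale=1 / np.sqrt(2), size=(2, T))
A = np.array([[1.0, 0.7],
              [0.2, 1.0]])
X = A @ S
X = X - X.mean(axis=1, keepdims=True)

# Whitening is not required for this update, but it speeds up convergence.
d, E = np.linalg.eigh(np.cov(X))
Z = E @ np.diag(d**-0.5) @ E.T @ X

W = np.eye(2)
eta = 0.1
for _ in range(300):
    Y = W @ Z
    # Natural-gradient ML step: W <- W + eta * (I - E[g(y) y^T]) W, with g = tanh.
    W = W + eta * (np.eye(2) - (np.tanh(Y) @ Y.T) / T) @ W

# Each recovered row of Y should match one source up to scale, sign, and order.
Y = W @ Z
print(np.abs(np.corrcoef(np.vstack([Y, S]))[0:2, 2:4]).round(2))
```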
Different applications and signal types call for different non-Gaussianity measures. Here we provide practical guidance on selection and estimation.
When to Use Kurtosis
Kurtosis is simple and fast, but fragile: the fourth power lets a handful of outliers dominate the estimate, so it is best reserved for clean, bounded data with plenty of samples.
When to Use Negentropy Approximations
Modern ICA implementations favor negentropy approximations: the logcosh contrast is a robust, general-purpose default, while the exponential contrast offers additional resistance to outliers and heavy tails.
Matching Contrast to Source Type
For optimal performance, match the contrast function to expected source distributions:
| Source Type | Distribution Examples | Best Contrast | Reason |
|---|---|---|---|
| Super-Gaussian (sparse) | Speech, sparse features | $G(u) = \log\cosh(u)$ | Matches peaked shape |
| Super-Gaussian (heavy-tailed) | Financial, impulsive | $G(u) = -\exp(-u^2/2)$ | Robust to extremes |
| Sub-Gaussian | Uniform, bounded | $G(u) = u^4$ or logcosh | Captures flatness |
| Mixed types | Unknown | $G(u) = \log\cosh(u)$ | General-purpose |
Non-Gaussianity estimation requires sufficient samples. For kurtosis: at least 1000 samples for reliable estimates with moderate tails. For negentropy approximations: typically more stable, but still need hundreds to thousands of samples. With limited data, regularization or Bayesian approaches may help.
Symmetric vs. Asymmetric Non-Gaussianity
Kurtosis and many common negentropy approximations are symmetric—they don't distinguish between a distribution and its negation. This is appropriate for most ICA applications (sign ambiguity is inherent anyway).
However, if sources are known to be asymmetric (skewed), using odd moments (like skewness) or asymmetric contrast functions can improve performance:
$$G(u) = u^3/3 \quad \text{(captures skewness)}$$
For EEG/MEG analysis, some brain sources are asymmetric, motivating skewness-aware ICA.
Non-Gaussianity Spectrum
Different sources in the same problem may have different levels of non-Gaussianity, and components that are closer to Gaussian are intrinsically harder to recover reliably.
One can estimate the "separability" of a source mixture by checking the excess kurtosis or negentropy of the recovered components. Low non-Gaussianity suggests potential issues with separation quality.
| Measure | Computation | Robustness | Theory | Recommendation |
|---|---|---|---|---|
| Kurtosis | Very fast | Poor (outlier sensitive) | Simple, clear | Only for clean, bounded data |
| Negentropy (exact) | Infeasible (needs density) | N/A | Optimal | Not practical |
| Negentropy (logcosh) | Fast | Good | Approximation, general | Default choice |
| Negentropy (exp) | Fast | Excellent | Approximation, robust | When outliers present |
| Mutual Information | Slow (needs density) | Moderate | Optimal, direct | Research, not routine use |
This page has developed the theoretical foundation of non-Gaussianity and its central role in Independent Component Analysis.
You now understand why non-Gaussianity is the key to ICA, how to measure it, and how it connects to the fundamental goal of independence. This theoretical foundation directly motivates the FastICA algorithm—an efficient, fixed-point iteration for finding maximally non-Gaussian directions—which we'll develop in the next page.
What's Next:
The next page develops the FastICA algorithm—the most widely used ICA implementation. We'll derive the fixed-point iteration that maximizes negentropy, discuss deflation and symmetric approaches, analyze convergence properties, and provide complete algorithmic details for implementation.