What is ICA?
If we have two unique signals A and B, we can generate new signals by mixing them. These mixed signals (x) are what we usually measure. In order to unmix them again, we need to know with what weights the source signals have been mixed. If we can get the mixing matrix (A), we can invert it to get the unmixing matrix (\( A^{-1}\), sometimes called W), with which we can undo the mixing and obtain the underlying sources.
\( n\): number of samples
\( i \): number of sources
\( j \): number of recorded mixtures \(x\)
\( x \): signals / recorded mixtures, \([j * n]\)
\( s \): original underlying sources, \([i * n]\)
\( A \): mixing matrix, \([j * i]\)
\( A^{-1}\), also \( W \): unmixing matrix, \([i * j]\)
The ICA model: $$x = As$$
We are looking for the mixing matrix \( A \) that minimizes the dependence between the rows of the likewise unknown \( s \).

Here we show that \( A^{-1} \) is equal to the unmixing matrix:
$$\begin{aligned} x &= As \\ A^{-1}x &= A^{-1}As && \text{(with } M^{-1}M = I\text{)} \\ A^{-1}x &= s \end{aligned}$$
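As a quick sanity check, here is a minimal numpy sketch (our own illustration; the non-gaussian sources are just an assumption for the demo, and the mixing matrix is the one from the Arbitrary Case below) showing that multiplying the mixtures with \( A^{-1} \) recovers the sources:

```python
import numpy as np

rng = np.random.default_rng(0)
s = rng.laplace(size=(2, 1000))           # two hypothetical non-gaussian sources
A = np.array([[1.0, -2.0],
              [1.73, 3.41]])              # mixing matrix from the Arbitrary Case below

x = A @ s                                 # the observed mixtures: x = As
s_recovered = np.linalg.inv(A) @ x        # A^{-1} x = s

print(np.allclose(s, s_recovered))        # True: inverting A undoes the mixing
```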

Audio

Sounds are waves and can easily be recorded as digital signals with microphones. If you listen to two people speaking at the same time (the sources s), both source signals are mixed together (by A). If you now record these mixed signals (x) with two microphones, you can try to disentangle the two recorded mixtures x into their original sources s.

EEG

In EEG we assume that there are many different, synchronously active neuronal patches (the sources). But with electrodes you always record a summation of these sources and never one source alone. ICA can be used to try to recover those original sources.

Arbitrary Case

In a more abstract case we have two unique signals A and B. We can generate new signals by addition: $$ \begin{aligned} C &= 1*A - 2*B \\ D &= 1.73*A + 3.41*B \end{aligned}$$ We have two independent signals, \( s_{1}\) and \(s_{2}\). Mixing them results in two signals (\( x_{1} \) and \( x_{2}\)): $$ \begin{aligned} x_{1} &= \sum_{i=1}^{2}a_{1i}*s_{i} \\ x_{1} &= a_{11}*s_{1} + a_{12}*s_{2} \\ x_{1} &= 1*s_{1} - 2*s_{2} \\ x_{2} &= 1.73*s_{1} + 3.41*s_{2} \end{aligned} $$ In general we have several independent signals s, and linear mixing is described by \( x = As \). In our case: $$ A = \left(\begin{array}{rr}1 & -2 \\ 1.73 & 3.41 \end{array}\right)$$ Each mixture is a weighted sum of the sources: $$ x_j = \sum_{i}a_{ji}*s_{i} $$
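If you want to play with this, here is a small numpy sketch (our own, with arbitrary example sources) showing that the element-wise sum formula and the matrix form \( x = As \) give the same mixtures:

```python
import numpy as np

rng = np.random.default_rng(1)
s = rng.uniform(-1, 1, size=(2, 5))         # two hypothetical sources, shape [i * n]
A = np.array([[1.0, -2.0],
              [1.73, 3.41]])                # mixing matrix from the text, shape [j * i]

# element-wise form: x_j = sum_i a_ji * s_i
x_sum = np.array([sum(A[j, i] * s[i] for i in range(2)) for j in range(2)])

# matrix form: x = As
x_mat = A @ s

print(np.allclose(x_sum, x_mat))            # True: both describe the same mixing
```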
Step 1. Demeaning
In a first step we demean the signal, that is, we subtract the mean value of each mixed signal.
Demeaning (also called centering)
$$x_{m} = x-E[x] $$
The mean amplitude of one channel of a microphone signal is 65 dB. We simply subtract this value from the channel. The mean of this channel is now 0.
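A minimal numpy sketch of the demeaning step (the example values are made up):

```python
import numpy as np

def demean(x):
    """x has shape [channels, samples]; returns x_m = x - E[x], per channel."""
    return x - x.mean(axis=1, keepdims=True)

x = np.array([[66.0, 64.0, 65.0],     # hypothetical channel with mean 65
              [ 2.0,  4.0,  0.0]])
x_m = demean(x)
print(x_m.mean(axis=1))               # [0. 0.]
```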
Step 2. Whitening
Every mixing is part stretching (e.g. \(2*A\)) and part rotation of all points in the signal (see the interactive graphic and compare "mixture" and "original"). To unmix, we therefore have to undo the stretching and undo the rotation. In a first step, the "sphering" or "whitening", we try to undo the stretching. After whitening, the variables are uncorrelated (which does not imply independence!) and the mixing matrix is now orthogonal.
$$ \begin{aligned} Cov(N) &= \frac{1}{n-1}(N*N') \\ C &= Cov(x_{m}) \\ V &= C^{-\frac{1}{2}} \\ Cov(V*x_{m}) &= \frac{1}{n-1}(C^{-\frac{1}{2}}*x_{m})(C^{-\frac{1}{2}}*x_{m})' \\ &= C^{-\frac{1}{2}}*\left(\frac{1}{n-1}x_{m}*x_{m}'\right)*C^{-\frac{1}{2}} \\ &= C^{-\frac{1}{2}}*C*C^{-\frac{1}{2}} = I \end{aligned} $$ (Here we use that \( C^{-\frac{1}{2}} \) is symmetric, since \(C\) is a covariance matrix.) Multiplying \( V*x_{m} = C^{-\frac{1}{2}}*x_{m} \) gives us \( x_{w} \), the whitened signal, whose covariance is the identity.
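A small numpy sketch of the whitening step, computing \( C^{-\frac{1}{2}} \) via an eigendecomposition (one possible way to take the matrix square root; the function and variable names are our own):

```python
import numpy as np

def whiten(x_m):
    """Whiten a demeaned signal x_m of shape [channels, samples]."""
    n = x_m.shape[1]
    C = (x_m @ x_m.T) / (n - 1)                   # covariance matrix C = Cov(x_m)
    eigvals, E = np.linalg.eigh(C)                # C = E diag(eigvals) E'
    V = E @ np.diag(eigvals ** -0.5) @ E.T        # V = C^(-1/2)
    return V @ x_m                                # x_w, the whitened signal

rng = np.random.default_rng(2)
x = rng.standard_normal((2, 2000))
x[1] += 0.8 * x[0]                                # make the two channels correlated
x_m = x - x.mean(axis=1, keepdims=True)           # demean first (Step 1)
x_w = whiten(x_m)
print(np.round((x_w @ x_w.T) / (x_w.shape[1] - 1), 3))   # ~ identity matrix
```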

Rotation matrices

Rotating the point \((0|1)\) clockwise by 45° results in $$(\sqrt{1/2}\,|\,\sqrt{1/2})$$ which is equal to a matrix multiplication $$ \left(\begin{array}{rr} \sqrt{1/2} & \sqrt{1/2} \\ -\sqrt{1/2} & \sqrt{1/2} \end{array}\right) \left(\begin{array}{l} 0 \\ 1 \end{array}\right) = \left(\begin{array}{l} \sqrt{1/2} \\ \sqrt{1/2} \end{array}\right)$$
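The same rotation as a small numpy sketch (our own illustration):

```python
import numpy as np

theta = np.deg2rad(45)
R = np.array([[ np.cos(theta), np.sin(theta)],     # clockwise rotation matrix
              [-np.sin(theta), np.cos(theta)]])

p = np.array([0.0, 1.0])                           # the point (0|1)
print(R @ p)                                       # [0.7071 0.7071] = (sqrt(1/2) | sqrt(1/2))
```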

Relation of Covariance and 'Stretching'

It is very hard to get a better intuitive understanding of covariance than reading this and this.
Step 3. Rotation
There are two main approaches to finding the right rotation: maximizing statistical independence (e.g. infomax) and minimizing normality (e.g. fastICA).

Maximizing statistical independence

Mutual information is closely related to statistical independence. It is a measure of how much information you can get about the outcome of B if you already know A. Minimizing mutual information maximizes independence (c.f. examples).

Minimizing normality

The second approach exploits the central limit theorem (CLT). Simplified, the CLT states that a mixture (the signals) of two independent distributions (the sources) is usually more normally distributed than the underlying sources. Therefore, if we minimize the "normal-distributedness" of the projections (the histograms above), we may get more independence. There are different measures of normal-distributedness, or gaussianity. A simple approach is kurtosis, a non-robust measure used to assess normality. As you can see in the visualisation, the histograms in the end are not very gaussian at all; they seem to be uniform or reverse gaussians.
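As a small illustration (our own, using scipy's excess kurtosis and uniform sources as an assumption), a mixture of two independent sources is indeed "more gaussian" than a single source:

```python
import numpy as np
from scipy.stats import kurtosis

rng = np.random.default_rng(3)
s1 = rng.uniform(-1, 1, 100_000)        # a sub-gaussian (uniform) source
s2 = rng.uniform(-1, 1, 100_000)
mixture = s1 + s2                       # by the CLT the mixture looks more gaussian

# scipy's `kurtosis` returns excess kurtosis: 0 for a gaussian distribution.
print(kurtosis(s1))                     # ~ -1.2  (uniform: far from gaussian)
print(kurtosis(mixture))                # ~ -0.6  (closer to 0, i.e. more gaussian)
```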

Mutual information:

You draw two cards from a deck, the first a King and the second a 7. These events share some information, as drawing a King reduces the probability of drawing another King (from 4/52 to 3/51). Therefore the mutual information (I) between the first and the second draw is not 0: \[ I(Second;First) \not= 0 \]
You throw two dice. The first shows a 5, the second a 6. These events are completely independent; they do not share mutual information. Knowing what one die rolled does not give you any information about the other one. \[ I(Second;First) = 0 \]
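Here is a small numpy sketch (our own) that computes the mutual information for both examples from their joint probability tables:

```python
import numpy as np

def mutual_information(joint):
    """I(X;Y) in bits from a joint probability table (rows: X, cols: Y)."""
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return float(np.sum(joint[nz] * np.log2(joint[nz] / (px @ py)[nz])))

# Two dice: every pair of faces has probability 1/36, so I = 0.
dice = np.full((6, 6), 1 / 36)
print(mutual_information(dice))          # 0.0

# Two cards: "is a King" for the first and second draw (without replacement).
cards = np.array([[ 4/52 * 3/51,   4/52 * 48/51],
                  [48/52 * 4/51,  48/52 * 47/51]])
print(mutual_information(cards))         # > 0: the draws share information
```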
fastICA
The concept of negentropy can be used to estimate the non-gaussianity of a distribution. Intuitively, entropy is high if the probability density function is even, and low if there are clusters or spikes in the pdf. Negentropy in words: it is the difference between the entropy of a normal distribution with the variance and mean of X and the entropy of the distribution X itself. Of all probability density functions with the same variance, the gaussian has the highest entropy. Therefore negentropy is always positive and only zero if X is itself normal.
  1. \( w_{i} \) is the i-th column of the unmixing matrix - we estimate each component iteratively.
  2. We try to maximize negentropy, using the update formula below
  3. The weight vector is normalized (c.f. Limitations)
  4. If we want to estimate the 2nd, 3rd, ... source, we need to make sure the unmixing weights are orthogonal
  5. Make the weight-column vector orthogonal to the other vectors (why does this work?)
  6. Normalize again
  7. Start from the beginning, or finish
Negentropy: $$ J(X) = H(X_{Gauss}) - H(X) $$ where \( X_{Gauss} \) is a gaussian with the same covariance matrix as \(X\).
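To get a feeling for this, a small scipy sketch (our own illustration, not part of the fastICA algorithm) comparing the differential entropy of a gaussian and a uniform distribution with the same variance:

```python
import numpy as np
from scipy.stats import norm, uniform

# Differential entropy (in nats) of a gaussian and a uniform distribution,
# both with variance 1: the gaussian has the higher entropy.
gauss = norm(scale=1.0)
unif = uniform(loc=0, scale=np.sqrt(12))        # variance of U(0, c) is c^2/12 = 1

print(gauss.entropy())                          # ~ 1.419
print(unif.entropy())                           # ~ 1.242
print(gauss.entropy() - unif.entropy())         # negentropy of the uniform ~ 0.18 > 0
```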
  1. Initialize \( w_{i} \) randomly
  2. \( w_{i}^{+} = E(x\phi(w_{i}^{T}x)) - E(\phi^{'}(w_{i}^{T}x))w_{i} \)
  3. \( w_{i} = \frac{w_{i}^{+}}{||w_{i}^{+}||} \)
  4. For i=1 go to step 7, else step 5
  5. \( w_{i}^{+} = w_{i} - \sum_{j=1}^{i-1}w_{i}^{T}w_{j}w_{j} \)
  6. \( w_{i} = \frac{w_{i}^{+}}{||w_{i}^{+}||} \)
  7. If we extracted all sources, finish; else \(i = i+1\) and go back to step 1
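A minimal numpy sketch of this deflation scheme, assuming \( \phi = \tanh \) and an already whitened input (function and variable names are our own; this is a simplified illustration, not a reference implementation):

```python
import numpy as np

def fastica(x_w, n_sources, max_iter=200, tol=1e-6):
    """One-unit fastICA with deflation; x_w is the whitened, demeaned signal [channels, samples]."""
    rng = np.random.default_rng(0)
    W = np.zeros((n_sources, x_w.shape[0]))
    for i in range(n_sources):
        w = rng.standard_normal(x_w.shape[0])            # 1. initialize w_i randomly
        w /= np.linalg.norm(w)
        for _ in range(max_iter):
            wx = w @ x_w                                  # projections w_i^T x
            # 2. fixed-point update with phi = tanh, phi' = 1 - tanh^2
            w_new = (x_w * np.tanh(wx)).mean(axis=1) - (1 - np.tanh(wx) ** 2).mean() * w
            w_new -= W[:i].T @ (W[:i] @ w_new)            # 5. decorrelate from previous components
            w_new /= np.linalg.norm(w_new)                # 3./6. normalize
            if abs(w_new @ w) > 1 - tol:                  # converged (w stays, up to its sign)
                w = w_new
                break
            w = w_new
        W[i] = w
    return W                                              # unmixing matrix (up to sign and order)

# usage: s_est = fastica(x_w, n_sources=2) @ x_w
```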
infomax
The infomax algorithm is an iterative algorithm that tries to minimize mutual information. Mutual information is the amount of information that you get about the random variable X when you already know Y. If X and Y are independent, the mutual information is 0. Therefore minimizing mutual information is one way to solve the ICA problem.
  1. Initialize \( W(0) \) random
  2. \( W(t+1) = W(t) + \epsilon(I - f(Wx)(Wx)^T)W(t)\)
  3. If not converged, go back to 2

\( f(x) \) is a nonlinear function, here \( f(x) = \tanh(x)\)
\( I \) is the identity matrix.
\( x \) is the original mixed signal
When the algorithm has converged, \(I-f(Wx)(Wx)^T = 0\), which means \(f(Wx)(Wx)^T=I\), therefore \(f(Wx)\) is orthogonal to \((Wx)^T\).
The standard infomax algorithm with \(f(Y) = \tanh(Y)\) is only suitable for super-gaussian distributions, as for sub-gaussian distributions the function gets negative. Therefore an extension of the infomax algorithm, "extended infomax", is used, with the update $$W(t+1) = W(t) + \epsilon(I - K_{4}*\tanh(Y)*Y')W(t)$$ \(K_{4}\) is the sign of the kurtosis, thereby allowing to switch between sub-gaussian and super-gaussian source learning.
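A minimal numpy sketch of the (standard, non-extended) update rule above, with \( f = \tanh \) and the expectation taken as the mean over samples (our own transcription; the initialization, learning rate and stopping criterion are arbitrary choices):

```python
import numpy as np

def infomax(x_w, lr=0.01, n_iter=2000):
    """Transcription of the update rule above; x_w is the whitened signal [channels, samples]."""
    n_ch, n_samples = x_w.shape
    W = np.eye(n_ch)                                     # 1. initialize W (identity instead of random, for simplicity)
    I = np.eye(n_ch)
    for _ in range(n_iter):
        y = W @ x_w                                      # current estimate Wx
        # 2. W(t+1) = W(t) + eps * (I - f(Wx)(Wx)^T) W(t), with f = tanh
        delta = lr * (I - np.tanh(y) @ y.T / n_samples) @ W
        W += delta
        if np.linalg.norm(delta) < 1e-8:                 # 3. stop when converged
            break
    return W
```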
Ambiguities and Limitations
a) It is not possible to recover the original ordering of the sources. Every permutation of the order can be matched by the weight matrix.

b) It is not possible to recover the original scale of the sources. Every arbitrary scale \(k \) can be matched by dividing the mixing matrix by \(k \) and vice versa.

c) It is not possible to disentangle a mixture of two gaussian variables. An arbitrary rotation of such a mixture will not change the shape of the histograms.

a) Let \(P \) be an arbitrary permutation matrix. For the permuted sources \(Ps\) there always exists a new mixing matrix \(A_{p}=AP^{-1}\) that produces exactly the same mixtures: $$x = As = (AP^{-1})(Ps) = A_{p}Ps$$

b) Arbitrary scaling factor \(k \). $$ x = \frac{k}{k}As = (\frac{1}{k}A)(k*s) $$

c) The multivariate distribution of two gaussians is circular, so any arbitrary rotation does not change the projected histograms.
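A quick numpy check (our own, with arbitrary example sources and the mixing matrix from the Arbitrary Case) of the scaling ambiguity (b):

```python
import numpy as np

rng = np.random.default_rng(4)
s = rng.laplace(size=(2, 1000))                 # arbitrary example sources
A = np.array([[1.0, -2.0], [1.73, 3.41]])       # mixing matrix from the example above
k = 5.0

# scaling the sources by k and the mixing matrix by 1/k leaves x unchanged
print(np.allclose(A @ s, (A / k) @ (k * s)))    # True
```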
Assumptions
There are five main assumptions in ICA:
  1. Each source is statistically independent of the others
  2. The mixing matrix is square and of full rank, i.e. there must be as many recorded signals as sources and they must be linearly independent of each other
  3. There is no external noise - the model is noise free
  4. The data are centered
  5. There is at most one gaussian source.
Sources