Feeds:
Posts

## Where does the Sigmoid in Logistic Regression come from?

Note: The title of this post is circular. But I use/abuse it because of the post linked below.

I noticed on the Hacker News front page (and via multiple reshares on twitter), a discussion on why logistic regression uses a sigmoid. The article linked in the story talks about the log-odds ratio, and how it leads to the sigmoid (and gives a good intuitive plug on it).

However, I think that the more important question is – Why do you care about log-odds? Why do you use log-odds and not something else? The point of this quick post is to write out why using the log-odds is infact very well motivated in the first place, and once it is modeled by a linear function, what you get is the logistic function.

Beginning with log-odds would infact be begging the question, so let us try to understand.

____________________________

To motivate and in order to define the loss etc, suppose we had a linear classifier: $\hat{y} = h(\mathbf{x}) = sign(w_0 + \mathbf{w}^T\mathbf{x})$. This just means that for a vector input, we take the dot product with the parameters $\mathbf{w}$ and take the sign.

The learning problem would be to figure out a good direction $\mathbf{w}$ and a good location of the decision boundary $w_0$.

____________________________

We want to figure these out so as to minimize the expected 0-1 loss (or expected number of mistakes) for the classifier $h: \mathcal{X} \rightarrow \mathcal{Y}$. The 0-1 loss for a datapoint-label pair $\mathbf{x},y$ is simply:

$\displaystyle L(h(\mathbf{x}),y) = \begin{cases} 0, & \text{if } h(\mathbf{x}) = y \\ 1, & \text{if } h(\mathbf{x}) \neq y \end{cases}$

Now, the next question we would like to ask. What is the risk of this classifier that we want to minimize? The risk is the expected loss. That is, if we draw a random sample from the (unknown) distribution $p(\mathbf{x},y)$, what would be the expected error? More concretely:

$R(h) = \mathbb{E}_{\mathbf{x},y} [L(h(\mathbf{x}),y)]$

Writing out the expectation:

$\displaystyle R(h) =\int_x \sum_{c=1}^C L(h(\mathbf{x}),c) p(\mathbf{x}, y=c) d\mathbf{x}$

Using the chain rule this becomes:

$\displaystyle R(h) =\int_x \sum_{c=1}^C L(h(\mathbf{x}),c) p(y=c|\mathbf{x}) p(\mathbf{x})d\mathbf{x}$

It is important to understand this expression. This is not assuming anything about the data. However, it is this expression that we want to minimize if we want to get a good classifier. To minimize this expression, it suffices to simply minimize for the conditional risk for any point $\mathbf{x}$ (i.e. the middle part of the above expression):

$\displaystyle R(h|\mathbf{x}) =\sum_{c=1}^C L(h(\mathbf{x}),c) p(y=c|\mathbf{x})$

But this conditional risk can be written as:

$\displaystyle R(h|\mathbf{x}) =0 \times p(y=h(\mathbf{x})|\mathbf{x}) + 1 \times \sum_{c \neq h(\mathbf{x})} p(y=c|\mathbf{x})$

Note that, $\displaystyle \sum_{c \neq h(\mathbf{x})} p(y=c|\mathbf{x}) = 1 - p(y=h(\mathbf{x})|\mathbf{x})$

Therefore, the conditional risk is simply:

$\displaystyle R(h|\mathbf{x}) = 1 - p(y=h(\mathbf{x})|\mathbf{x})$

Now, it is this conditional risk that we want to minimize given a point $\mathbf{x}$. And in order to do so, looking at the expression above, the classifier must make the following decision:

$\displaystyle h(\mathbf{x}) = \arg\max_c p(y=c| \mathbf{x})$

It is again important to note that so far we have made absolutely no assumptions about the data. So the above classifier is the best classifier that we can have in terms of generalization, in the sense of what might be the expected loss on a new sample point. Such a classifier is called the Bayes Classifier or sometimes called the Plug-in classifier.

But the optimal decision rule mentioned above i.e. $\displaystyle h(\mathbf{x}) = \arg\max_c p(y=c| \mathbf{x})$ is equivalent to saying that:

$\displaystyle h(\mathbf{x}) = c^\ast \iff \frac{p(y = c^\ast|\mathbf{x})}{p(y=c|\mathbf{x})} \geq 1 \text{ }\forall c$

by taking log, this would be:

$\displaystyle h(\mathbf{x}) = c^\ast \iff \log \frac{p(y = c^\ast|\mathbf{x})}{p(y=c|\mathbf{x})} \geq 0 \text{ }\forall c$

If, we were only dealing with binary classification, this would imply:

$\displaystyle h(\mathbf{x}) = 1 \iff \log \frac{p(y = 1|\mathbf{x})}{p(y=0|\mathbf{x})} \geq 0$

Notice that by making no assumptions about the data, simply by writing out the conditional risk, the log-odds ratio has fallen out directly. This is not an accident, because the optimal bayes classifier has this form for binary classification. But the question still remains, how do we model this log-odds ratio? The simplest option is to consider a linear model (there is no reason to stick to a linear model, but due to some reasons, one being convexity, we stick to a linear model):

$\displaystyle log \frac{p(y = 1|\mathbf{x})}{p(y=0|\mathbf{x})} = w_0 + \mathbf{w}^T \mathbf{x} = 0$

Now, we know that $p(y=1|\mathbf{x}) = 1 - p(y=0|\mathbf{x})$, plugging this in the above, and exponentiating, we have:

$\displaystyle log \frac{p(y = 1|\mathbf{x})}{1 - p(y=0|\mathbf{x})} = \exp(w_0 + \mathbf{w}^T \mathbf{x}) = 1$

Rearranging, yields the familiar logistic model (and the sigmoid):

$\displaystyle p(y = 1|\mathbf{x}) = \frac{1}{1+ \exp(- w_0 - \mathbf{w}^T\mathbf{x})} = \frac{1}{2}$.

As noted in the post linked in the beginning, the logistic model, $\sigma(x) = \frac{1}{1+ e^{-x}}$, which for any $x$ is, $0 \leq \sigma(x) \leq 1$, and is monotonic $\sigma(-\inf) = 0, \sigma(+\inf) = 1$.

____________________________

This derivation shows that the log-odds is not an arbitrary choice, infact a very natural choice. The sigmoid is simply a consequence of modeling the log-odds with a linear function (infact logistic regression is arguably the simplest example of a log-linear model, if we had structured outputs, a natural extension of such a model would be the Conditional Random Field. The choice of using a linear function is simply to make the optimization convex, amongst some other favourable properties).

____________________________

Note: This post was inspired by some very succinct notes by Gregory Shakhnarovich from his Machine Learning class, that I both took and served as a TA for.

## The Jacobian Inner Product

This post may be considered an extension of the previous post.

The setup and notation is the same as in the previous post (linked above). But to summarize: Earlier we had an unknown smooth regression function $f: \mathbb{R}^d \to \mathbb{R}$. The idea was to estimate at each training point, the gradient of this unknown function $f$, and then taking the sample expectation of the outerproduct of the gradient. This quantity has some interesting properties and applications.

However it has its limitations, for one, the mapping $f: \mathbb{R}^d \to \mathbb{R}$ restricts the Gradient Outer Product being helpful for only regression and binary classification (since for binary classification the problem can be thought of as regression). It is not clear if a similar operator can be constructed when one is dealing with classification, that is the unknown smooth function is a vector valued function $f: \mathbb{R}^d \to \mathbb{R}^c$ where $c$ is the number of classes (let us say for the purpose of this discussion, that for each data point we have a probability distribution over the classes, a $c$ dimensional vector).

In the case of the gradient outer product since we were working with a real valued function, it was possible to define the gradient at each point, which is simply:

$\displaystyle \Bigg[ \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \dots, \frac{\partial f}{\partial x_d} \Bigg]$

For a vector valued function $f: \mathbb{R}^d \to \mathbb{R}^c$, we can’t have the gradient, but instead can define the Jacobian at each point:

$\displaystyle \mathbf{J} = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \frac{\partial f_1}{\partial x_2} & \dots & \frac{\partial f_1}{\partial x_d} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial f_c}{\partial x_1} & \frac{\partial f_c}{\partial x_2} & \dots & \frac{\partial f_c}{\partial x_d}\end{bmatrix}$

Note that $\mathbf{J}$ may be estimated in a similar manner as estimating gradients as in the previous posts. Which leads us to define the quantity $\mathbb{E}_X G(X) = \mathbb{E}_X ( \mathbf{J}^T \mathbf{J})$.

The first thing to note is that $\mathbb{E}_X G(X) = \mathbb{E}_X ( \nabla f(X)\nabla f(X)^T)$ defined in the previous post is simply the quantity for the special case when $f: \mathbb{R}^d \to \mathbb{R}$. Another note is also in order: The reason why we suffixed that quantity with “outer product” (as opposed to “inner product” here) is simply because we considered the gradient to be a column vector, otherwise they are similar in spirit.

Another thing to note is that it is easy to see that the quantity $\mathbb{E}_X G(X) = \mathbb{E}_X ( \mathbf{J}^T \mathbf{J})$ is a positive semi-definite matrix and hence is a Reimannian Metric, which is defined below:

Definition: A Reimannian Metric $G$ on a manifold $\mathcal{M}$ is a symmetric and positive semi-definite matrix, which defines a smoothly varying inner product in the tangent space $\mathbf{T}_x \mathcal{M}$, for each point $x \in \mathcal{M}$ and $a, b \in \mathbf{T}_x \mathcal{M}$. This associated p.s.d matrix is called the metric tensor. In the above case, since $\mathbb{E}_X G(X) = \mathbb{E}_X ( \mathbf{J}^T \mathbf{J})$ is p.s.d it defines a Reimannian metric:

$\langle a, b \rangle_x = a^T \mathbb{E}_X ( \mathbf{J}^T \mathbf{J}) b$

Thus, $\mathbb{E}_X ( \mathbf{J}^T \mathbf{J})$ is a specific metric (more general metrics are dealt with in areas such as metric learning).

Properties: We saw some properties of $\mathbb{E}_X G(X) = \mathbb{E}_X ( \nabla f(X)\nabla f(X)^T)$ in the previous post. In the same vein, does $\mathbb{E}_X G(X) = \mathbb{E}_X ( \mathbf{J}^T \mathbf{J})$ have similar properties? i.e. does the first eigenvector also correspond to the direction of highest average variation? What about the $k$-dimensional subspace? What difference does it make that we are looking at a vector valued function? Also what about the cases when $d > c$ and otherwise?

These are questions that I need to think about and should be the topic for a future post to be made soon, hopefully.

Recently, in course of a project that I had some involvement in, I came across an interesting quadratic form. It is called in the literature as the Gradient Outer Product. This operator, which has applications in supervised dimensionality reduction, inverse regression and metric learning can be motivated in two (related) ways, but before doing so, the following is the set up:

Setup: Suppose we have the usual set up as for nonparametric regression and (binary) classification i.e. let $Y \approx f(X)$ for some unknown smooth $f$, the input $X$ is $d$ dimensional $X = (X^i)_{i=1}^d$

1. Supervised Dimensionality Reduction: It is often the case that $f$ varies most along only some relevant coordinates. This is the main motivation behind variable selection.

The idea in variable selection is the following: That $f(X)$ may be written as $f(PX)$ where $P \in \{0,1\}^{k \times d}$. $P$ projects down the data to only $k$ relevant coordinates (i.e. some features are selected by $P$ while others are discarded).

This idea is generalized in Multi-Index Regression, where the goal is to recover a subspace most relevant to prediction. That is, now suppose the data varies significantly along all coordinates but it still depends on some subspace of smaller dimensionality. This might be achieved by letting $P$ from the above to be $P \in \mathbb{R}^{k \times d}$. It is important to note that $P$ is not any subspace, but rather the $k$-dimensional subspace to which if the data is projected, the regression error would be the least. This idea might be further generalized by means of mapping $X$ to some $P$ non-linearly, but for now we only stick to the relevant subspace.

How can we recover such a subspace?

________________

2. Average Variation of $f$: Another way to motivate this quantity is the following: Suppose we want to find the direction in which $f$ varies the most on average, or the direction in which $f$ varies the second fastest on average and so on. Or more generally, given any direction, we want to find the variation of $f$ along it. How can we recover these?

________________

The Expected Gradient Outer Product:  The expected gradient outer product of the unknown classification or regression function is the quantity: $\mathbb{E}_X G(X) = \mathbb{E}_X ( \nabla f(X)\nabla f(X)^T)$

The expected gradient outer product recovers the average variation of $f$ in all directions. This can be seen as follows: The directional derivative at $x$ along $v \in \mathbb{R}^d$ is given by $\displaystyle {f'}_v(x) = \nabla f(x)^T v$ or $\mathbb{E}_X |{f'}_v(X)|^2 = \mathbb{E}_X (v^T G(X) v) = v^T (\mathbb{E}_X G(X))v$.

From the above it follows that if $f$ does not vary along $v$ then $v$ must be in the null space of $\mathbb{E}_X (G(X))$.

Infact it is not hard to show that the relevant subspace $P$ as defined earlier can also be recovered from $\mathbb{E}_X (G(X))$. This fact is given in the following lemma.

Lemma: Under the assumed model i.e. $Y \approx f(PX)$, the gradient outer product matrix $\mathbb{E}_X (G(X))$ is of rank at most $k$. Let $\{v_1, v_2, \dots, v_k \}$ be the eigenvectors of $\mathbb{E}_X (G(X))$ corresponding to the top $k$ eigenvalues of $\mathbb{E}_X (G(X))$. Then the following is true:

$span(P) = span(v_1, v_2, \dots, v_k)$

This means that a spectral decomposition of $\mathbb{E}_X (G(X))$ recovers the relevant subspace. Also note that the Gradient Outer Product corresponds to a kind of a supervised version of Principal Component Analysis.

________________

Estimation: Ofcourse in real settings the function is unknown and we are only given points sampled from it. There are various estimators for $\mathbb{E}_X (G(X))$, which usually involve estimation of the derivatives. In one of them the idea is to estimate, at each point $x$ a linear approximation to $f$. The slope of this approximation approximates the gradient at that point. Repeating this at the $n$ sample points, gives a sample gradient outer product. There is some work that shows that some of these estimators are statistically consistent.

________________

Related: Gradient Based Diffusion Maps: The gradient outer product can not isolate local information or geometry and its spectral decomposition, as seen above, gives only a linear embedding. One way to obtain a non-linear dimensionality reduction would be to borrow from and extend the idea of diffusion maps, which are well established tools in semi supervised learning. The central quantity of interest for diffusion maps is the graph laplacian $L = I - D^{-\frac{1}{2}} W D^{-\frac{1}{2}}$, where $D$ is the degree matrix and $W$ the adjacency matrix of the nearest neighbor graph constructed on the data points. The non linear embedding is obtained by a spectral decomposition of the operator $L$ or its powers $L^t$.

As above, a similar diffusion operator may be constructed by using local gradient information. One such possible operator could be:

$\displaystyle W_{ij} = W_{f}(x_i, x_j) = exp \Big( - \frac{ \| x_i - x_j \| ^2}{\sigma_1} - \frac{ | \frac{1}{2} (\nabla f(x_i) + \nabla f(x_j)) (x_i - x_j) |^2 }{\sigma_2}\Big)$

Note that the first term is the same that is used in unsupervised dimension reduction techniques such as laplacian eigenmaps and diffusion maps. The second term can be interpreted as a diffusion on function values. This operator gives a way for non linear supervised dimension reduction using gradient information.

The above operator was defined here, however no consistency results for the same are provided.

Also see: The Jacobian Inner Product.

________________

## Blind Source Separation in Magnetic Resonance Images

I have been involved in a major project on contrast enhancement of Magnetic Resonance Images by using Independent Component Analysis (ICA) and Support Vector Machines (SVM) for the past couple of  months. It is an extremely exciting project and also something new for me, as I have worked on bio-medical images just once before. In the past, I have used ICA and SVM in face recognition/authentication, however this application is quite novel.

This post intends to introduce the problem, discuss a motivating example, some methods, expected work and some problems.

__________

A Simple Introduction and Motivating Example:

The simplest motivating example for this problem is the famous cocktail party problem:

You are at a cocktail party, and there are about 12 people present with each talking simultaneously. Add to that a music source. So that makes it 13.

Suppose you want to follow what each person was saying later and for doing so you place a number of tape recorders at different locations in the room (let’s not worry about the number of recorders right now). When you hear them later, the sounds would hardly be understandable as they would be mixed up.

Now you define an engineering problem : that using these recordings (which are basically mixtures), separate out the different sources with as little distortion as possible. In a real time cocktail party, the brain shows[1][2][3] a remarkable ability to follow one conversation. However such a problem has proved to be quite difficult in signal processing. Let’s just illustrate the cocktail party problem in a cartoon below :

The Cocktail Party Problem

Please listen to a demo of the cocktail party problem at the HUT ICA project page.

__________

The Logic Behind Constructing MR Images in Simple Terms:

Now, keeping the previous brief discussion in mind. Let’s introduce in simple words how MRI works. This is just a simplification to make the idea clearer, and not really how MRI works.  Discussing MRI in detail would divert the focus of the post. To look at how MRI works follow these highly recommended tutorials[4][5][6]:

Suppose your body is placed in a Magnetic Field (let’s not worry about specifics yet). Consider two contiguous tissues in your body – X and Y. When subject to a magnetic field, the particles (protons) in the tissues would get aligned according to the field. The amount of magnetization would depend on the tissue type. Now suppose we want to measure how much a tissue gets magnetized. One way to think about it is like this : First apply the magnetic field, after the application the particles would get excited. Once the field is removed, these particles would tend to relax to their ground state. By being able to measure the time it takes for the particles to return, we would get some measure of the magnetization of the tissue(s). This is because, the greater the time for relaxation, greater the magnetization.

An image is basically a measure of the energy distribution. Now suppose we have the measurements for tissues X and Y, and since they were of a different nature (composition, density of protons etc), their response to the field would have been different. Thus we would get some contrast between them and thus would get an image.

In very simplistic terms, this is how MRI scans are obtained. Though as mentioned above, please follow [4][5][6] for detailed tutorials on MRI.

__________

MRI scans of the Brain and the Cocktail Party Problem :

Now consider the above discussion in context of taking a MRI scan of the brain. The brain has a number of constituents. Some being : Gray Matter, White Matter, Cerebrospinal Fluid (CSF) Fat, Muscle/Skin, Glial Matter etc. Now since each is unique, they would exhibit unique characteristics under a magnetic field. However, while taking a scan, we get one MRI image of the entire brain.

These scans can be considered as an equivalent to the mixtures of the cocktail party example. If we apply blind source separation to these, we should be able to separate out the various constituents such as gray matter, white matter, CSF etc. These images of the independent sources can be used for better diagnosis. This would be something like this :

If suppose the Simulated MR scans (from the McGill Simulated brain Database) were as follows:

Simulated MR Scans

The “ground truth” images for these scans would be as follows :

Ground Truth Images of Different Brain Tissue Substances

__________

Restatement of the Broad Research Problem and Use of ICA and SVM:

Magnetic Resonance Imaging is superior to Computerised Tomography for brain imaging at least, for the reason that it can give much better soft tissue contrast (because even small changes in the proton density and composition in the tissue are well represented).

Like for most techniques, improvements to scans obtained by MRI are much desired to improve diagnosis. Blind source separation has been used to separate physiologically different components from EEG[7]/MEG[8] data (similar to the cocktail party problem), financial data[9] and even in fMRI[10][11]. But it has not received much attention for MRI. Nakai et al[12] used Independent Component Analysis for the purpose of separating physiologically independent components from MRI scans. They took MR images of 10 normal subjects, 3 subjects with brain tumour and 1 subject with multiple sclerosis and performed ICA on the data. They reported success in improving contrast for gray and white matter, which was beneficial for the diagnosis of brain tumour. The demylination in Multiple sclerosis cases was also enhanced in the images. They suggested that ICA could potentially separate out all the tissues which had different relaxation characteristics (different sources of the cocktail party example). This approach thus shows much promise.

In more technical terms : Consider a set of MR frames as a single multispectral image. Where each band is taken during a particular pulse sequence (will be discussed below). Then use ICA on the data to separate out the physiologically independent components. A classifier such as the SVM can improve the contrast further of the separated independent components.

However, using ICA for MRI has been tricky, something I would discuss towards the end of this post and also in future posts.

Before doing so, I intend to touch up on the basics for the sake of completeness.

__________

Magnetic Resonance Imaging:

I had been thinking of writing a detailed tutorial on MRI, mostly because it requires some basic physics. However I don’t think it is required. I would recommend [4][5][6] for a study of the same in sufficient depth. I have recently taken tutorials on MRI, and would be willing to write for the blog if there are requests.

__________

An Introduction to Independent Component Analysis:

Independent Component Analysis was developed initially to solve problems such as the cocktail party problem discussed above.

Let’s formalize a problem like the cocktail party example. For simplicity let us assume that there are only two sources and two mixtures (obtained by keeping two recorders at different locations in the party).

Let’s represent these two mixtures as $x_1$ and $x_2$, and let $s_1$ and $s_2$ be the two sources that were mixed. Since we are assuming that the two microphones were kept at different locations, the mixtures $x_1$ and $x_2$ would be different.

We could write this as:

$x_1 = a_{11}s_1 + a_{12}s_2 \quad \cdots \quad (1)$

$x_2 = a_{21}s_1 + a_{22}s_2 \quad \cdots \quad (2)$

The coefficients $a_{11}, a_{12}, a_{21}, a_{22}$ are basically some parameters that depend on the distance of the respective source from the microphones.

Let’s define our problem as : Using only the mixtures $x_i$ estimate the signal sources $s_i$. It is notable that you do not have any knowledge of the parameters $a_{ij}$.

This could be illustrated by this :

Consider three signals:

Suppose we have five mixtures obtained from these three signals.

Signals obtained by mixing source signals

If you only have the mixed signals available. And do not know how they were mixed (parameters $a_{ij}$ not known). And from these mixed signals ($x_{i}$) you have to estimate the source signals ($s_{i}$). This problem is of considerable difficulty.

One approach would be : Use the statistical properties of the signals ($s_i$) to estimate the parameters ($a_{ij}$). It is surprising that it is enough to assume that $s_1$ and $s_2$ are statistically independent. This assumption might not be valid in many scenarios. But works well in most situations.

We could write the above system of linear equations in matrix form as :

$x=As$

where, $A$ represents the mixing matrix, $x$ and $s$ represent the mixtures and the sources respectively.

The problem is to estimate $s$ from $x$ without knowing $A$. The assumption made is that the sources $s$ are statistically independent.

How we go about solving this problem is exciting and an area of active research.  ICA was originally developed for solving such problems. Please follow [12][13][14] for discussions on mutual information, measures of non-gaussianity such as Kurtosis and Negentropy and the fastICA algorithm.

__________

Why can ICA be used in MRI?

One limitation that ICA faces is that it can not work if more than one signal sources have a  Gaussian distribution. This can be illustrated as follows:

Again consider our equation for just two sources :

$\displaystyle \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix} \begin{bmatrix} s_1 \\ s_2 \end{bmatrix}$

Our problem was : We have to estimate $s$ from $x$ without any knowledge of $A$. We would first need to estimate the parameters $A$ from $x$, assuming statistical independence of $s$. And then we could find $s$ as :

$s = Wx$, where $W=A^{-1}$ , or the inverse of the estimated mixing matrix $A$.

To understand how a solution would become impossible if both the sources had a Gaussian distribution, consider this :

Consider two independent components having the following uniform distributions:

$P(s_i) = \begin{cases} \frac{1}{2 \sqrt{3}} & \text{if} \quad |s_i| \leq \sqrt{3} \\ 0 & \text{otherwise} \end{cases}$

The joint density of the two sources would then be uniform on a square. This follows from the fact that the joint density would be the product of the two marginal densities.

The joint distribution for Si

[ Image Source : Reference [12][13] ]

Now if $s_1$ and $s_2$ were mixed by a mixing matrix $A$

$A = \begin{bmatrix} 2 & 3 \\ 2 & 1 \end{bmatrix}$

The mixtures obtained are $x_1$ and $x_2$. Now since the original sources had a joint distribution on a square, and they were transformed by using a mixing matrix, the joint distribution of the mixtures $x_1$ and $x_2$ will be a parallelogram. These mixtures are no longer independent.

Joint Distribution of the mixtures

[ Image Source : Reference [12][13] ]

Now consider the problem once again : We have to estimate the mixing matrix $A$ from the mixtures $x_i$, and using this estimated $A$ we have to estimate the sources $s_i$.

From the above joint distribution we have a way to estimate $A$. The edges of the parallelogram are in a direction given by the columns of $A$. This is an intuitive way of estimating the mixing matrix : obtain the joint distributions of the mixtures, estimate the columns of the mixing matrix by finding the directions of the edges of the parallelogram. This solution gives a good intuitive feel of a in-principle solution of the problem( however, it isn’t practical).

However, now instead of two independent sources having a uniform distribution consider two independent sources having a Gaussian distribution. The joint distribution would be :

Joint Distribution when both Independent sources are Gaussian

[ Image Source : Reference [12][13] ]

Now going by the above discussion, because of the nature of the above joint distribution, it is not possible to estimate the mixing matrix from it.

Thus ICA fails when one or more independent components have a a gaussian distribution.

Noise in MRI is non-gaussian[16], therefore ICA is suited for MRI.

__________

Problems in Using ICA for MRI Blind Source Separation:

The application of ICA for MRI faces a number of problems. I would discuss these in later blog posts. I would only discuss one major problem – the problem of Over-Complete ICA.

Over-Complete ICA in MRI:

The problem of over complete ICA occurs when there are lesser sensors (tape recorders from our above discussion) than sources. This problem can be understood by the following discussion. Suppose you have 3 mixtures $x_1$, $x_2$ and $x_3$ (imagine you have collected 3 tape recordings in a cocktail party of 6). Therefore you now have to estimate 6 sources from 3 mixtures.

Now the problem becomes something like this :

$x_1 = a_{11}s_1 + a_{12}s_2 + a_{13}s_3 + a_{14}s_4 + a_{15}s_5 + a_{16}s_6$

$x_2 = a_{21}s_1 + a_{22}s_2 + a_{23}s_3 + a_{24}s_4 + a_{25}s_5 + a_{26}s_6$

$x_3 = a_{31}s_1 + a_{32}s_2 + a_{33}s_3 + a_{34}s_4 + a_{35}s_5 + a_{36}s_6$

Assume for a second we can still estimate $a_{ij}$, still we can not find all the signal sources. As the number of linear equations is just three, while the number of unknowns is 6. This is a considerably harder problem and has been discussed by many groups such as [19][20][21].

Now dropping our assumption, the estimation of $a_{ij}$ is also harder in such a case.

The Case in MRI:

The problem of over-complete ICA doesn’t arise when it comes to functional-MRI. However it is a problem when it comes to MRI[17].

In MRI, by varying the parameters used for imaging, the three kind of images that can be obtained are T1 weighted, T2 weighted and Proton Density images. Going by our discussion in the section on MRI above. These three can be treated as mixtures.

Therefore, we have 3 mixtures at our disposal.  However, as the ground truth images above show: The number of different tissues in the brain exceeds 9. Thus this becomes a considerably difficult problem : We have to estimate 9-10 independent components from just 3 mixtures.

I would discuss methods that can help do that in later blog posts.

If only three mixtures are used, 3 ICs can be estimated. Since the actual number of ICs exceeds 9. It is obvious that the each of 3 ICs have atleast 2 ICs mixed, which means that a certain tissue type is not enhanced as much as it could have been had there been one IC for it. This can be understood by looking at this example.

3 ICs obtained by Applying Fast-ICA on MR scans

[I used FastICA for obtaining these Independent Components ]

To get more ICs, in simple words, we need more mixtures. However we can obtain more mixtures from the existing mixtures itself by a process of Band-Expansion[18].

I would discuss this problem of OC-ICA and it’s possible solutions in later posts.

__________

To Conclude:

A basic idea related to application of ICA to MR scans was discussed. It is clear that even with just three ICs significant tissue contrast enhancement is achieved. Problems related to OC-ICA would be discussed in later posts one by one. I would also discuss quantifying the results obtained using the Tanimoto/Jaccard coefficient of similarity.

__________

References and Resources:

Cocktail Party Problem

[1] “Some Experiments on the Recognition of Speech, with One and with Two Ears“; E. Colin Cherry; The Journal of the Acoustical Society of America; September 1953. (PDF)

[2] “The Attentive Brain“; Stephen Grossberg; Department of Cognitive and Neural Systemss – Boston University; American Scientist, 1995. (PDF)

[3] “The Cocktail Party Problem : A Primer“; Josh H. McDermott; Current Biology Vol 19. No. 22. (PDF)

Magnetic Resonance Imaging

[4] “Magnetic Resonance ImagingTutorial“; H Panepucci and A Tannus; Technical Report; USP, 1994. (PDF)

[5] “10 Video lessons on MRI by Paul Callaghan” (~ an hour in total). (Videos)

[6] “MRI Tutorial for Neuroscience Boot Camp” Melissa Saenz. (PDF)

Sample ICA Applications Similar to The Cocktail Party Problem

[7] “Independent Component Analysis of Electroencephalographic Data“; Makieng, Bell, Jung, Sejnowski; Advances in Neural Information Processing Systems, 1996. (PDF)

[8] “Application of ICA to MEG noise Reduction“; Masaki Kawakatsu; 4th International Symposium on Independent Component Analysis and Blind Source Separation; 2003. (PDF)

[9] “Independent Component Analysis in Financial Data” from the book Computational Finance; Yasser S. Abu-Mostafa; The MIT Press; 2000. (Book Link)

[10] “ICA of functional MRI data : An overview“; Calhoun, Adali, Hansen, Larsen, Pekar; 4th International Symposium on Independent Component Analysis and Blind Source Separation; 2003. (PDF)

[11] “Independent Component Analysis of fMRI Data – Examining the Assumptions“; McKeown, Sejnowski; Human Brain Mapping; 1998. (PDF)

Independent Component Analysis : Tutorials/Books

[12] “Independent Component Analysis : Algorithms and Applications“; Aapo Hyvärinen, Erkki Oja; Neural Networks; 2000. (PDF)

[13] “Independent Component Analysis“; Aapo Hyvärinen, Juha Karhunen, Erkki Oja; John Wiley Publications; 2001. (Book Link)

[14] ICA Tutorial at videolectures.net by Aapo Hyvärinen. (Videos)

Independent Component Analysis for Magnetic Resonance Imaging

[15] “Application of of Independent Component Analysis to Magnetic Resonance Imaging for enhancing the Contrast of Gray and White Matter“; Nakai, Muraki, Bagarinao, Miki, Takehara, Matsuo, Kato, Sakahara, Isoda; NeuroImage; 2004. (Journal Link)

[16] “Noise in MRI“; Albert Macovski; Magnetic Resonance in Medicine; 1996. (PDF)

[17] “Independent Component Analysis in Magnetic Resonance Image Analysis“;  Ouyang, Chen, Chai, Clayton Chen, Poon, Yang, Lee; EURASIP journal on Advances in Signal Processing; 2008 (Journal Link)

[18] “Band Expansion Based Over-Complete Independent Component Analysis for Multispectral Processing of Magnetic Resonance Images “; Ouyang, Chen, Chai, Clayton Chen, Poon, Yang, Lee; IEEE Transactions on Biomedical Imaging; June 2008. (PDF)

Over-Complete ICA:

[19] “Blind Source Separation of More Sources Than Mixtures Using Over Complete Representations“; Lee, Lewicki, Girolami, Sejnowski; IEEE Signal Processing Letters; 1999. (PDF)

[20] “Learning Overcomplete Representations“; Lewicki, Sejnowski. (PDF)

[21] “A Fast Algorithm for estimating over-complete ICA bases for Image Windows “; Hyvarinen, Cristescu, Oja; International Joint Conference on Neural Networks; 1999. (IEEE Xplore link)

__________

Onionesque Reality Home >>