Archive for the ‘Signal Processing’ Category

Deep Learning reads Wikipedia and discovers the meaning of life – Geoff Hinton.

The above quote is from a very interesting talk by Geoffrey Hinton I had the chance to attend recently.

I have been at a summer school on Deep Neural Nets and Unsupervised Featured Learning at the Institute for Pure and Applied Mathematics at UCLA since July 9 (till July 27). It has been organized by Geoff Hinton, Yoshua Bengio, Stan Osher, Andrew Ng and Yann LeCun.

I have always been a “fan” of Neural Nets and the recent spike in interest in them has made me excited, thus the school happened at just the right time. The objective of the summer school is to give a broad overview of some of the recent work in Deep Learning and Unsupervised Feature Learning with emphasis on optimization, deep architectures and sparse representations. I must add that after getting here and looking at the peer group I would consider myself lucky to have obtained funding for the event!

[Click on the above image to see slides for the talks. Videos will be added at this location after July 27 Videos are now available]

That aside, if you are interested in Deep Learning or Neural Networks in general, the slides for the talks are being uploaded over here (or click on the image above), videos will be added at the same location some time after the summer school ends so you might like to bookmark this link.

The school has been interesting given the wide range of people who are here. The diversity of opinions about Deep Learning itself has given a good perspective on the subject and the issues and strengths of it. There are quite a few here who are somewhat skeptical of deep learning but are curious, while there are some who have been actively working on the same for a while. Also, it has been enlightening to see completely divergent views between some of the speakers on key ideas such as sparsity. For example Geoff Hinton had a completely different view of why sparsity was useful in classification tasks than compared to Stéphane Mallat, who gave a very interesting talk today even joking that “Hinton and Yann LeCun told you why sparsity is useful, I’ll tell you why sparsity is useless. “. See the above link for more details.

Indeed, such opinions do tell you that there is a lot of fecund ground for research in these areas.

I have been compiling a reading list on some of this stuff and will make a blog-post on the same soon.


Onionesque Reality Home >>

Read Full Post »

I thought I understood Spectral Clustering well enough till I came across these two paragraphs:

Graph Laplacians are interesting linear operators attached to graphs. They are the discrete analogues of the Laplace-Beltrami operators that appear in the Hodge theory of Riemannian manifolds, whose null spaces provide particularly nice representative forms for de Rham cohomology. In particular, their eigenfunctions produce functions on the vertex set of the graph. They can be used, for example, to produce cluster decompositions of data sets when the graph is the 1-skeleton of a Vietoris-Rips complex. We find that these eigenfunctions (again applied to the 1-skeleton of the Vietoris-Rips complex of a point cloud) also can produce useful filters in the Mapper analysis of data sets

– From Prof. Gunnar Carlsson’s survey Topology and Data. (More on this survey as a manifesto for “Topology and Data” in a later post). That aside, I do like how the image on the wiki entry for Vietoris-Rips complex looks like:

A little less intimidating ( now this almost borders on “ofcourse that’s how it is”. I am interested in the same reaction for the paragraph above some months later):

A related application [of the graph laplacian] is “Spectral Clustering”, which is based on the observation that nodal domains of the first eigenvectors of the graph laplacian can be used as indicators for suitably size-balanced minimum cuts.

– From Laplacian Eigenvectors of Graphs linked in the previous post. While this isn’t really as compressed as the lines above, they made me think since I did not know about Courant’s Nodal domain theorem. Like I did in the previous blog post, I would highly recommend this (about 120 page) book. It soon covers the Nodal Domain theorem and things make sense (even in context of links between PCA and k-means and Non-Negative Matrix Factorization and Spectral Clustering, at least in an abstract sense).


Onionesque Reality Home >>

Read Full Post »

I have been involved in a major project on contrast enhancement of Magnetic Resonance Images by using Independent Component Analysis (ICA) and Support Vector Machines (SVM) for the past couple of  months. It is an extremely exciting project and also something new for me, as I have worked on bio-medical images just once before. In the past, I have used ICA and SVM in face recognition/authentication, however this application is quite novel.

This post intends to introduce the problem, discuss a motivating example, some methods, expected work and some problems.


A Simple Introduction and Motivating Example:

The simplest motivating example for this problem is the famous cocktail party problem:

You are at a cocktail party, and there are about 12 people present with each talking simultaneously. Add to that a music source. So that makes it 13.

Suppose you want to follow what each person was saying later and for doing so you place a number of tape recorders at different locations in the room (let’s not worry about the number of recorders right now). When you hear them later, the sounds would hardly be understandable as they would be mixed up.

Now you define an engineering problem : that using these recordings (which are basically mixtures), separate out the different sources with as little distortion as possible. In a real time cocktail party, the brain shows[1][2][3] a remarkable ability to follow one conversation. However such a problem has proved to be quite difficult in signal processing. Let’s just illustrate the cocktail party problem in a cartoon below :


The Cocktail Party Problem

Please listen to a demo of the cocktail party problem at the HUT ICA project page.


The Logic Behind Constructing MR Images in Simple Terms:

Now, keeping the previous brief discussion in mind. Let’s introduce in simple words how MRI works. This is just a simplification to make the idea clearer, and not really how MRI works.  Discussing MRI in detail would divert the focus of the post. To look at how MRI works follow these highly recommended tutorials[4][5][6]:

Suppose your body is placed in a Magnetic Field (let’s not worry about specifics yet). Consider two contiguous tissues in your body – X and Y. When subject to a magnetic field, the particles (protons) in the tissues would get aligned according to the field. The amount of magnetization would depend on the tissue type. Now suppose we want to measure how much a tissue gets magnetized. One way to think about it is like this : First apply the magnetic field, after the application the particles would get excited. Once the field is removed, these particles would tend to relax to their ground state. By being able to measure the time it takes for the particles to return, we would get some measure of the magnetization of the tissue(s). This is because, the greater the time for relaxation, greater the magnetization.

An image is basically a measure of the energy distribution. Now suppose we have the measurements for tissues X and Y, and since they were of a different nature (composition, density of protons etc), their response to the field would have been different. Thus we would get some contrast between them and thus would get an image.

In very simplistic terms, this is how MRI scans are obtained. Though as mentioned above, please follow [4][5][6] for detailed tutorials on MRI.


MRI scans of the Brain and the Cocktail Party Problem :

Now consider the above discussion in context of taking a MRI scan of the brain. The brain has a number of constituents. Some being : Gray Matter, White Matter, Cerebrospinal Fluid (CSF) Fat, Muscle/Skin, Glial Matter etc. Now since each is unique, they would exhibit unique characteristics under a magnetic field. However, while taking a scan, we get one MRI image of the entire brain.

These scans can be considered as an equivalent to the mixtures of the cocktail party example. If we apply blind source separation to these, we should be able to separate out the various constituents such as gray matter, white matter, CSF etc. These images of the independent sources can be used for better diagnosis. This would be something like this :

If suppose the Simulated MR scans (from the McGill Simulated brain Database) were as follows:


Simulated MR Scans



The “ground truth” images for these scans would be as follows :


Ground Truth Images of Different Brain Tissue Substances


Restatement of the Broad Research Problem and Use of ICA and SVM:

Magnetic Resonance Imaging is superior to Computerised Tomography for brain imaging at least, for the reason that it can give much better soft tissue contrast (because even small changes in the proton density and composition in the tissue are well represented).

Like for most techniques, improvements to scans obtained by MRI are much desired to improve diagnosis. Blind source separation has been used to separate physiologically different components from EEG[7]/MEG[8] data (similar to the cocktail party problem), financial data[9] and even in fMRI[10][11]. But it has not received much attention for MRI. Nakai et al[12] used Independent Component Analysis for the purpose of separating physiologically independent components from MRI scans. They took MR images of 10 normal subjects, 3 subjects with brain tumour and 1 subject with multiple sclerosis and performed ICA on the data. They reported success in improving contrast for gray and white matter, which was beneficial for the diagnosis of brain tumour. The demylination in Multiple sclerosis cases was also enhanced in the images. They suggested that ICA could potentially separate out all the tissues which had different relaxation characteristics (different sources of the cocktail party example). This approach thus shows much promise.

In more technical terms : Consider a set of MR frames as a single multispectral image. Where each band is taken during a particular pulse sequence (will be discussed below). Then use ICA on the data to separate out the physiologically independent components. A classifier such as the SVM can improve the contrast further of the separated independent components.

However, using ICA for MRI has been tricky, something I would discuss towards the end of this post and also in future posts.

Before doing so, I intend to touch up on the basics for the sake of completeness.


Magnetic Resonance Imaging:

I had been thinking of writing a detailed tutorial on MRI, mostly because it requires some basic physics. However I don’t think it is required. I would recommend [4][5][6] for a study of the same in sufficient depth. I have recently taken tutorials on MRI, and would be willing to write for the blog if there are requests.


An Introduction to Independent Component Analysis:

Independent Component Analysis was developed initially to solve problems such as the cocktail party problem discussed above.

Let’s formalize a problem like the cocktail party example. For simplicity let us assume that there are only two sources and two mixtures (obtained by keeping two recorders at different locations in the party).

Let’s represent these two mixtures as x_1 and x_2, and let s_1 and s_2 be the two sources that were mixed. Since we are assuming that the two microphones were kept at different locations, the mixtures x_1 and x_2 would be different.

We could write this as:

x_1 = a_{11}s_1 + a_{12}s_2 \quad \cdots \quad (1)

x_2 = a_{21}s_1 + a_{22}s_2 \quad \cdots \quad (2)

The coefficients a_{11}, a_{12}, a_{21}, a_{22} are basically some parameters that depend on the distance of the respective source from the microphones.

Let’s define our problem as : Using only the mixtures x_i estimate the signal sources s_i. It is notable that you do not have any knowledge of the parameters a_{ij}.

This could be illustrated by this :

Consider three signals:

Suppose we have five mixtures obtained from these three signals.

Signals obtained by mixing source signals

If you only have the mixed signals available. And do not know how they were mixed (parameters a_{ij} not known). And from these mixed signals (x_{i}) you have to estimate the source signals (s_{i}). This problem is of considerable difficulty.

One approach would be : Use the statistical properties of the signals (s_i) to estimate the parameters (a_{ij}). It is surprising that it is enough to assume that s_1 and s_2 are statistically independent. This assumption might not be valid in many scenarios. But works well in most situations.

We could write the above system of linear equations in matrix form as :


where, A represents the mixing matrix, x and s represent the mixtures and the sources respectively.

The problem is to estimate s from x without knowing A. The assumption made is that the sources s are statistically independent.

How we go about solving this problem is exciting and an area of active research.  ICA was originally developed for solving such problems. Please follow [12][13][14] for discussions on mutual information, measures of non-gaussianity such as Kurtosis and Negentropy and the fastICA algorithm.


Why can ICA be used in MRI?

One limitation that ICA faces is that it can not work if more than one signal sources have a  Gaussian distribution. This can be illustrated as follows:

Again consider our equation for just two sources :

\displaystyle \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix} \begin{bmatrix} s_1 \\ s_2 \end{bmatrix}

Our problem was : We have to estimate s from x without any knowledge of A. We would first need to estimate the parameters A from x, assuming statistical independence of s. And then we could find s as :

s = Wx, where W=A^{-1} , or the inverse of the estimated mixing matrix A.

To understand how a solution would become impossible if both the sources had a Gaussian distribution, consider this :

Consider two independent components having the following uniform distributions:

P(s_i) = \begin{cases} \frac{1}{2 \sqrt{3}} & \text{if} \quad |s_i| \leq \sqrt{3} \\ 0 & \text{otherwise} \end{cases}

The joint density of the two sources would then be uniform on a square. This follows from the fact that the joint density would be the product of the two marginal densities.


The joint distribution for Si

[ Image Source : Reference [12][13] ]

Now if s_1 and s_2 were mixed by a mixing matrix A

A = \begin{bmatrix} 2 & 3 \\ 2 & 1 \end{bmatrix}

The mixtures obtained are x_1 and x_2. Now since the original sources had a joint distribution on a square, and they were transformed by using a mixing matrix, the joint distribution of the mixtures x_1 and x_2 will be a parallelogram. These mixtures are no longer independent.


Joint Distribution of the mixtures

[ Image Source : Reference [12][13] ]

Now consider the problem once again : We have to estimate the mixing matrix A from the mixtures x_i, and using this estimated A we have to estimate the sources s_i.

From the above joint distribution we have a way to estimate A. The edges of the parallelogram are in a direction given by the columns of A. This is an intuitive way of estimating the mixing matrix : obtain the joint distributions of the mixtures, estimate the columns of the mixing matrix by finding the directions of the edges of the parallelogram. This solution gives a good intuitive feel of a in-principle solution of the problem( however, it isn’t practical).

However, now instead of two independent sources having a uniform distribution consider two independent sources having a Gaussian distribution. The joint distribution would be :


Joint Distribution when both Independent sources are Gaussian

[ Image Source : Reference [12][13] ]

Now going by the above discussion, because of the nature of the above joint distribution, it is not possible to estimate the mixing matrix from it.

Thus ICA fails when one or more independent components have a a gaussian distribution.

Noise in MRI is non-gaussian[16], therefore ICA is suited for MRI.


Problems in Using ICA for MRI Blind Source Separation:

The application of ICA for MRI faces a number of problems. I would discuss these in later blog posts. I would only discuss one major problem – the problem of Over-Complete ICA.

Over-Complete ICA in MRI:

The problem of over complete ICA occurs when there are lesser sensors (tape recorders from our above discussion) than sources. This problem can be understood by the following discussion. Suppose you have 3 mixtures x_1, x_2 and x_3 (imagine you have collected 3 tape recordings in a cocktail party of 6). Therefore you now have to estimate 6 sources from 3 mixtures.

Now the problem becomes something like this :

x_1 = a_{11}s_1 + a_{12}s_2 + a_{13}s_3 + a_{14}s_4 + a_{15}s_5 + a_{16}s_6

x_2 = a_{21}s_1 + a_{22}s_2 + a_{23}s_3 + a_{24}s_4 + a_{25}s_5 + a_{26}s_6

x_3 = a_{31}s_1 + a_{32}s_2 + a_{33}s_3 + a_{34}s_4 + a_{35}s_5 + a_{36}s_6

Assume for a second we can still estimate a_{ij}, still we can not find all the signal sources. As the number of linear equations is just three, while the number of unknowns is 6. This is a considerably harder problem and has been discussed by many groups such as [19][20][21].

Now dropping our assumption, the estimation of a_{ij} is also harder in such a case.

The Case in MRI:

The problem of over-complete ICA doesn’t arise when it comes to functional-MRI. However it is a problem when it comes to MRI[17].

In MRI, by varying the parameters used for imaging, the three kind of images that can be obtained are T1 weighted, T2 weighted and Proton Density images. Going by our discussion in the section on MRI above. These three can be treated as mixtures.

Therefore, we have 3 mixtures at our disposal.  However, as the ground truth images above show: The number of different tissues in the brain exceeds 9. Thus this becomes a considerably difficult problem : We have to estimate 9-10 independent components from just 3 mixtures.

I would discuss methods that can help do that in later blog posts.

If only three mixtures are used, 3 ICs can be estimated. Since the actual number of ICs exceeds 9. It is obvious that the each of 3 ICs have atleast 2 ICs mixed, which means that a certain tissue type is not enhanced as much as it could have been had there been one IC for it. This can be understood by looking at this example.


3 ICs obtained by Applying Fast-ICA on MR scans

[I used FastICA for obtaining these Independent Components ]

To get more ICs, in simple words, we need more mixtures. However we can obtain more mixtures from the existing mixtures itself by a process of Band-Expansion[18].

I would discuss this problem of OC-ICA and it’s possible solutions in later posts.


To Conclude:

A basic idea related to application of ICA to MR scans was discussed. It is clear that even with just three ICs significant tissue contrast enhancement is achieved. Problems related to OC-ICA would be discussed in later posts one by one. I would also discuss quantifying the results obtained using the Tanimoto/Jaccard coefficient of similarity.


References and Resources:

Cocktail Party Problem

[1] “Some Experiments on the Recognition of Speech, with One and with Two Ears“; E. Colin Cherry; The Journal of the Acoustical Society of America; September 1953. (PDF)

[2] “The Attentive Brain“; Stephen Grossberg; Department of Cognitive and Neural Systemss – Boston University; American Scientist, 1995. (PDF)

[3] “The Cocktail Party Problem : A Primer“; Josh H. McDermott; Current Biology Vol 19. No. 22. (PDF)

Magnetic Resonance Imaging

[4] “Magnetic Resonance ImagingTutorial“; H Panepucci and A Tannus; Technical Report; USP, 1994. (PDF)

[5] “10 Video lessons on MRI by Paul Callaghan” (~ an hour in total). (Videos)

[6] “MRI Tutorial for Neuroscience Boot Camp” Melissa Saenz. (PDF)

Sample ICA Applications Similar to The Cocktail Party Problem

[7] “Independent Component Analysis of Electroencephalographic Data“; Makieng, Bell, Jung, Sejnowski; Advances in Neural Information Processing Systems, 1996. (PDF)

[8] “Application of ICA to MEG noise Reduction“; Masaki Kawakatsu; 4th International Symposium on Independent Component Analysis and Blind Source Separation; 2003. (PDF)

[9] “Independent Component Analysis in Financial Data” from the book Computational Finance; Yasser S. Abu-Mostafa; The MIT Press; 2000. (Book Link)

[10] “ICA of functional MRI data : An overview“; Calhoun, Adali, Hansen, Larsen, Pekar; 4th International Symposium on Independent Component Analysis and Blind Source Separation; 2003. (PDF)

[11] “Independent Component Analysis of fMRI Data – Examining the Assumptions“; McKeown, Sejnowski; Human Brain Mapping; 1998. (PDF)

Independent Component Analysis : Tutorials/Books

[12] “Independent Component Analysis : Algorithms and Applications“; Aapo Hyvärinen, Erkki Oja; Neural Networks; 2000. (PDF)

[13] “Independent Component Analysis“; Aapo Hyvärinen, Juha Karhunen, Erkki Oja; John Wiley Publications; 2001. (Book Link)

[14] ICA Tutorial at videolectures.net by Aapo Hyvärinen. (Videos)

Independent Component Analysis for Magnetic Resonance Imaging

[15] “Application of of Independent Component Analysis to Magnetic Resonance Imaging for enhancing the Contrast of Gray and White Matter“; Nakai, Muraki, Bagarinao, Miki, Takehara, Matsuo, Kato, Sakahara, Isoda; NeuroImage; 2004. (Journal Link)

[16] “Noise in MRI“; Albert Macovski; Magnetic Resonance in Medicine; 1996. (PDF)

[17] “Independent Component Analysis in Magnetic Resonance Image Analysis“;  Ouyang, Chen, Chai, Clayton Chen, Poon, Yang, Lee; EURASIP journal on Advances in Signal Processing; 2008 (Journal Link)

[18] “Band Expansion Based Over-Complete Independent Component Analysis for Multispectral Processing of Magnetic Resonance Images “; Ouyang, Chen, Chai, Clayton Chen, Poon, Yang, Lee; IEEE Transactions on Biomedical Imaging; June 2008. (PDF)

Over-Complete ICA:

[19] “Blind Source Separation of More Sources Than Mixtures Using Over Complete Representations“; Lee, Lewicki, Girolami, Sejnowski; IEEE Signal Processing Letters; 1999. (PDF)

[20] “Learning Overcomplete Representations“; Lewicki, Sejnowski. (PDF)

[21] “A Fast Algorithm for estimating over-complete ICA bases for Image Windows “; Hyvarinen, Cristescu, Oja; International Joint Conference on Neural Networks; 1999. (IEEE Xplore link)


Onionesque Reality Home >>

Read Full Post »


Two Reasons:

1. Eigenfaces is probably one of the simplest face recognition methods and also rather old, then why worry about it at all? Because, while it is simple it works quite well. And it’s simplicity also makes it a good way to understand how face recognition/dimensionality reduction etc works.

2. I was thinking of writing a post based on face recognition in Bees next, so this should serve as a basis for the next post too. The idea of this post is to give a simple introduction to the topic with an emphasis on building intuition. For more rigorous treatments, look at the references.



Like almost everything associated with the Human body –  The Brain, perceptive abilities, cognition and consciousness, face recognition in humans is a wonder. We are not yet even close to an understanding of how we manage to do it. What is known is that it is that the Temporal Lobe in the brain is partly responsible for this  ability. Damage to the temporal lobe can result in the condition in which the concerned person can lose the ability to recognize faces. This specific condition where  an otherwise normal person who suffered some damage to a specific region in the temporal lobe loses the ability to recognize faces is called prosopagnosia. It is a very interesting condition  as  the perception of faces remains normal (vision pathways and perception is fine) and the person can recognize people by their voice but not by faces. In one of my previous posts, which had links to a series of lectures by Dr Vilayanur Ramachandran, I did link to one lecture by him in which he talks in detail about this condition. All this aside, not much is known how the perceptual information for a face is coded in the brain too.



A Motivating Parallel

Eigenfaces has a parallel to one of the most fundamental ideas in mathematics and signal processing – The Fourier Series. This parallel is also very helpful to build an intuition to what Eigenfaces (or PCA) sort of does and hence must be exploited. Hence we review the Fourier Series in a few sentences.

Fourier series are named so in the honor of Jean Baptiste Joseph Fourier (Generally Fourier is pronounced as “fore-yay”, however the correct French pronunciation is “foor-yay”) who made important contributions to their development. Representation of a signal in the form of a linear combination of complex sinusoids is called the Fourier Series. What this means is that you can’t just split a periodic signal into simple sines and cosines, but you can also approximately reconstruct that signal given you have information how the sines and cosines that make it up are stacked.

More Formally: Put in more formal terms, suppose f(x) is a periodic function with period 2\pi defined in the interval c\leq x \leq c+2\pi and satisfies a set of conditions called the Dirichlet’s conditions:

1. f(x) is finite, single valued and its integral exists in the interval.

2. f(x) has a finite number of discontinuities in the interval.

3. f(x) has a finite number of extrema in that interval.

then f(x) can be represented by the trigonometric series

f(x) = \displaystyle\frac{a_0}{2} + \displaystyle \sum_{n=1}^\infty [a_n cos(nx) + b_n sin(nx) ]\qquad(1)

The above representation of f(x) is called the Fourier series and the coefficients a_0, a_n and b_n are called the fourier coefficients and are determined from f(x) by Euler’s formulae. The coefficients are given as :

a_0 = \displaystyle \frac{1}{\pi} \int_{c}^{c+2\pi} f(x)dx

a_n = \displaystyle\frac{1}{\pi}\int_{c}^{c+2\pi} f(x)cos(nx)dx

\displaystyle b_n = \frac{1}{\pi}\int_{c}^{c+2\pi} f(x)sin(nx)dx

Note: It is common to define the above using c = -\pi

An example that illustrates \qquad (1) or the Fourier series is:

fourier4[Image Source]

A square wave (given in black) can be approximated to by using a series of sines and cosines (result of this summation shown in blue). Clearly in the limiting case, we could reconstruct the square wave exactly with simply sines and cosines.


Though not exactly the same, the idea behind Eigenfaces is similar. The aim is to represent a face as a linear combination of a set of basis images (in the Fourier Series the bases were simply sines and cosines). That is :

\displaystyle\Phi_i = \displaystyle\sum_{j=1}^{K}w_j u_j

Where \displaystyle \Phi_i represents the i^{th} face with the mean subtracted from it, w_j represent weights and u_j the eigenvectors. If this makes somewhat sketchy sense then don’t worry. This was just like mentioning at the start what we have to do.

The big idea is that you want to find a set of images (called Eigenfaces, which are nothing but Eigenvectors of the training data) that if you weigh and add together should give you back a image that you are interested in (adding images together should give you back an image, Right?). The way you weight these basis images (i.e the weight vector) could be used as a sort of a code for that image-of-interest and could be used as features for recognition.

This can be represented aptly in a figure as:


Click to Enlarge

In the above figure, a face that was in the training database was reconstructed by taking a weighted summation of all the basis faces and then adding to them the mean face. Please note that in the figure the ghost like basis images (also called as Eigenfaces, we’ll see why they are called so) are not in order of their importance. They have just been picked randomly from a pool of 70 by me. These Eigenfaces were prepared using images  from the MIT-CBCL database (also I have adjusted the brightness of the Eigenfaces to make them clearer after obtaining them, therefore the brightness of the reconstructed image looks different than those of the basis images).


An Information Theory Approach:

First of all the idea of Eigenfaces considers face recognition as a 2-D recognition problem, this is based on the assumption that at the time of recognition, faces will be mostly upright and frontal. Because of this, detailed 3-D information about the face is not needed. This reduces complexity by a significant bit.

Before the method for face recognition using Eigenfaces was introduced, most of the face recognition literature dealt with local and intuitive features, such as distance between eyes, ears and similar other features. This wasn’t very effective. Eigenfaces inspired by a method used in an earlier paper was a significant departure from the idea of using only intuitive features. It uses an Information Theory appraoch wherein the most relevant face information is encoded in a group of faces that will best distinguish the faces. It transforms the face images in to a set of basis faces, which essentially are the principal components of the face images.

The Principal Components (or Eigenvectors) basically seek directions in which it is more efficient to  represent the data. This is particularly useful for reducing the computational effort. To understand this,  suppose we get 60 such directions, out of these about 40 might be insignificant and only 20 might represent the variation in data significantly, so for calculations it would work quite well to only use the 20 and leave out the rest.  This is illustrated by this figure:


Click to Enlarge

Such an information theory approach will encode not only the local features but also the global features. Such features may or may not be intuitively understandable. When we find the principal components or the Eigenvectors of the image set, each Eigenvector has some contribution from EACH face used in the training set. So the Eigenvectors also have a face like appearance. These look ghost like and are ghost images or Eigenfaces. Every image in the training set can be represented as a weighted linear combination of these basis faces.

The number of Eigenfaces that we would obtain therefore would be equal to the number of images in the training set. Let us take this number to be M. Like I mentioned one paragraph before, some of these Eigenfaces are more important in encoding the variation in face images, thus we could also approximate faces using only the K most significant Eigenfaces.



1. There are M images in the training set.

2. There are K most significant Eigenfaces using which we can satisfactorily approximate a face. Needless to say K < M.

3. All images are N \times N matrices, which can be represented as N^2 \times 1 dimensional vectors. The same logic would apply to images that are not of equal length and breadths. To take an example: An image of size 112 x 112 can be represented as a vector of dimension 12544 or simply as a point in a 12544 dimensional space.


Algorithm for Finding Eigenfaces:

1. Obtain M training images I_1, I_2I_M, it is very important that the images are centered.


2. Represent each image I_i as a vector \Gamma_i  as discussed above.

I_i = \begin{bmatrix} a_{11} & a_{12} &\ldots & a_{1N} \ a_{21} & a_{22} & \ldots & a_{2N} \ \vdots & \vdots & \ddots & \vdots \ a_{N1} & a_{N2} & \ldots & a_{NN}\end{bmatrix}_{N\times N} \xrightarrow{\rm concatenation} \begin{bmatrix}\ a_{11} \ \vdots \ a_{1N} \ \vdots\ a_{2N} \ \vdots \ a_{NN} \end{bmatrix}_{N^2\times 1} = \Gamma_i

Note: Due to a recent WordPress \LaTeX bug, there is some trouble with constructing matrices with multiple columns. To avoid confusion and to maintain continuity, for the time being I am posting an image for the above formula that’s showing an error message. Same goes for some formulae below in the post.

untitled3. Find the average face vector \Psi.

\Psi = \displaystyle\frac{1}{M}\sum_{i=1}^M\Gamma_i

4. Subtract the mean face from each face vector \Gamma_i to get a set of vectors \Phi_i. The purpose of subtracting the mean image from each image vector is to be left with only the distinguishing features from each face and “removing” in a way information that is common.

\Phi_i = \Gamma_i - \Psi

5. Find the Covariance matrix C:

C = AA^T, where A=[\Phi_1, \Phi_2 \ldots \Phi_M]

Note that the Covariance matrix has simply been made by putting one modified image vector obtained in  one column each.

Also note that C is a N^2 \times N^2 matrix and A is a N^2\times M matrix.

6. We now need to calculate the Eigenvectors u_i of C, However note that C is a N^2 \times N^2 matrix and it would return N^2 Eigenvectors each being N^2 dimensional. For an image this number is HUGE.  The computations required would easily make your system run out of memory. How do we get around this problem?

7. Instead of the Matrix AA^T consider the matrix A^TA. Remember A is a N^2\times M matrix, thus A^TA is a M\times M matrix. If we find the Eigenvectors of this matrix, it would return M Eigenvectors, each of Dimension M \times 1, let’s call these Eigenvectors v_i.

Now from some properties of matrices, it follows that: u_i = Av_i. We have found out v_i earlier. This implies that using v_i we can calculate the M largest Eigenvectors of AA^T. Remember that M\ll N^2 as M is simply the number of training images.

8. Find the best M Eigenvectors of C=AA^T by using the relation discussed above. That is: u_i = Av_i. Also keep in mind that \begin{Vmatrix}u_i\end{Vmatrix}=1.


[6 Eigenfaces for the training set chosen from the MIT-CBCL database, these are not in any order]

9. Select the best K Eigenvectors, the selection of these Eigenvectors is done heuristically.


Finding Weights:

The Eigenvectors found at the end of the previous section, u_i when converted to a matrix in a process that is reverse to that in STEP 2, have a face like appearance. Since these are Eigenvectors and have a face like appearance, they are called Eigenfaces. Sometimes, they are also called as Ghost Images because of their weird appearance.

Now each face in the training set (minus the mean), \Phi_i can be represented as a linear combination of these Eigenvectors u_i:

\Phi_i = \sum_{j=1}^{K}w_ju_jm, where u_j ‘s are Eigenfaces.

These weights can be calculated as :

w_j = u_j^T\Phi_i.

Each normalized training image is represented in this basis as a vector.

\Omega_i = \begin{bmatrix}w_1\w_2\ \vdots \w_k\end{bmatrix}

untitledwhere i = 1,2… M. This means we have to calculate such a vector corresponding to every image in the training set and store them as templates.


Recognition Task:

Now consider we have found out the Eigenfaces for the training images , their associated weights after selecting a set of most relevant Eigenfaces and have stored these vectors corresponding to each training image.

If an unknown probe face \Gamma is to be recognized then:

1. We normalize the incoming probe \Gamma as \Phi = \Gamma - \Psi.

2. We then project this normalized probe onto the Eigenspace (the collection of Eigenvectors/faces) and find out the weights.

w_i = u_i^T\Phi.

3. The normalized probe \Phi can then simply be represented as:

\Omega = \begin{bmatrix}w_1\w_2\vdots\w_K\end{bmatrix}


After the feature vector (weight vector) for the probe has been found out, we simply need to classify it. For the classification task we could simply use some distance measures or use some classifier like Support Vector Machines (something that I would cover in an upcoming post). In case we use distance measures, classification is done as:

Find e_r = min\begin{Vmatrix}\Omega - \Omega_i\end{Vmatrix}. This means we take the weight vector of the probe we have just found out and find its distance with the weight vectors associated with each of the training image.

And if e_r < \Theta, where \Theta is a threshold chosen heuristically, then we can say that the probe image is recognized as the image with which it gives the lowest score.

If however e_r > \Theta then the probe does not belong to the database. I will come to the point on how the threshold should be chosen.

For distance measures the most commonly used measure is the Euclidean Distance. The other being the Mahalanobis Distance. The Mahalanobis distance generally gives superior performance. Let’s take a brief digression and look at these two simple distance measures and then return to the task of choosing a threshold.


Distance Measures:

Euclidean Distance: The Euclidean Distance is probably the most widely used distance metric. It is a special case of a general class of norms and is given as:

\displaystyle\begin{Vmatrix}x-y\end{Vmatrix}_e = \displaystyle\sqrt{\begin{vmatrix}x_i-y_i\end{vmatrix}^2}

The Mahalanobis Distance: The Mahalanobis Distance is a better distance measure when it comes to pattern recognition problems. It takes into account the covariance between the variables and hence removes the problems related to scale and correlation that are inherent with the Euclidean Distance. It is given as:

d(x,y) =\sqrt{ (x-y)^TC^{-1}(x-y)}

Where C is the covariance between the variables involved.


Deciding on the Threshold:

Why is the threshold, \Theta important?

Consider for simplicity we have ONLY 5 images in the training set. And a probe that is not in the training set comes up for the recognition task. The score for each of the 5 images will be found out with the incoming probe. And even if an image of the probe is not in the database, it will still say the probe is recognized as the training image with which its score is the lowest. Clearly this is an anomaly that we need to look at. It is for this purpose that we decide the threshold. The threshold \Theta is decided heuristically.


Click to Enlarge

Now to illustrate what I just said, consider a simpson image as a non-face image, this image will be scored with each of the training images. Let’s say S_4 is the lowest score out of all. But the probe image is clearly not beloning to the database. To choose the threshold we choose a large set of random images (both face and non-face), we then calculate the scores for images of people in the database and also for this random set and set the threshold \Theta (which I have mentioned in the “recognition” part above) accordingly.


More on the Face Space:

To conclude this post, here is a brief discussion on the face space.

face_space[Image Source – [1]]

Consider a simplified representation of the face space as shown in the figure above. The images of a face, and in particular the faces in the training set should lie near the face space. Which in general describes images that are face like.

The projection distance e_r should be under a threshold \Theta as already seen. The images of known individual fall near some face class in the face space.

There are four possible combinations on where an input image can lie:

1. Near a face class and near the face space : This is the case when the probe is nothing but the facial image of a known individual (known = image of this person is already in the database).

2. Near face space but away from face class : This is the case when the probe image is of a person (i.e a facial image), but does not belong to anybody in the database i.e away from any face class.

3. Distant from face space and near face class : This happens when the probe image is not of a face however it still resembles a particular face class stored in the database.

4. Distant from both the face space and face class: When the probe is not a face image i.e is away from the face space and is nothing like any face class stored.

Out of the four, type 3 is responsible for most false positives. To avoid them, face detection is recommended to be a part of such a system.


References and Important Papers

1.Face Recognition Using Eigenfaces, Matthew A. Turk and Alex P. Pentland, MIT Vision and Modeling Lab, CVPR ’91.

2. Eigenfaces Versus Fischerfaces : Recognition using Class Specific Linear Projection, Belhumeur, Hespanha, Kreigman, PAMI ’97.

3. Eigenfaces for Recognition, Matthew A. Turk and Alex P. Pentland, Journal of Cognitive Neuroscience ’91.


Related Posts:

1. Face Recognition/Authentication using Support Vector Machines

2. Why are Support Vector Machines called so



I suppose the MATLAB codes for the above are available on atleast 2-3 locations around the internet.

However it might be useful to put out some starter code. Note that the variables have the same names as those in the description above.You may use this mat file for testing. These are the labels for that mat file.

%This is the starter code for only the part for finding eigenfaces.
load('features_faces.mat'); % load the file that has the images.
%In this case load the .mat filed attached above.
% use reshape command to see an image which is stored as a row in the above.

% The code is only skeletal.
% Find psi - mean image
Psi_train = mean(features_faces')';
% Find Phi - modified representation of training images.
% 548 is the total number of training images.
for i = 1:548
    Phi(:,i) = raw_features(:,i) - Psi_train;
% Create a matrix from all modified vector images
A = Phi;
% Find covariance matrix using trick above
C = A'*A;
[eig_mat, eig_vals] = eig(C);
% Sort eigen vals to get order
eig_vals_vect = diag(eig_vals);
[sorted_eig_vals, eig_indices] = sort(eig_vals_vect,'descend');
sorted_eig_mat = zeros(548);
for i=1:548
    sorted_eig_mat(:,i) = eig_mat(:,eig_indices(i));
% Find out Eigen faces
Eig_faces = (A*sorted_eig_mat);
% Display an eigenface using the reshape command.

% Find out weights for all eigenfaces
% Each column contains weight for corresponding image
W_train = Eig_faces'*Phi;
face_fts = W_train(1:250,:)'; % features using 250 eigenfaces.


Onionesque Reality Home >>

Read Full Post »

I got linked to by a blog today morning and I wanted to see what it was about. It turned out to be a web-log on the author’s digital photography, compositions of multiple images taken by him and images post processing. Some of the processed images are beautiful. Consider a sample:


[The Spirit of Autumn]

The above image christened The Spirit of Autumn is actually a composition of 3 different images taken on a digital camera. The resultant image was then processed further in GIMP (GNU based image manipulation tools) to get a wonderful output.

2678796319_0db988b36d[Through a Glass Wetly]

I am strongly attracted to painted abstract art, and very rarely to digital art (though I like ingenious fractals and mathematical figures) however this work is truly deserving of a hearty applause. Interestingly, the author pursues photography as a hobby. You can check out some of his other art work at his web-log here.

Image processing is one of my most favorite subjects and I have been involved in projects concerning some Image Processing too, however I think I should move beyond looking at MRI scans or X-rays once in a while and try my hand at post processing photographs taken by me (using tools ofcourse, not algorithms as such) to TRY and get as spectacular results as obtained by DJ Lenfirewood as above. LOL.


1. Download GIMP

Onionesque Reality Home >>

Read Full Post »

Before I make a start I would want to make it very clear that inspite of what that the title may suggest, this is not a “sensational” post. It is just something that really intrigued me. It basically falls under the domain of image segmentation and pattern recognition, however it is something that can intrigue a person with a non-scientific background with a like (or dislike) for Franz Kafka’s work equally. I keep the title because it is the title of an original work by Dr Vitorino Ramos and hence making changes to it is not a good thing.

Note: For people who are  not interested in technical details can skip those parts and only read the stuff in bold there.

Franz Kafka is one writer whose works have had a profound impact on me in terms that they disturbed me each time I thought about them. No, not because of his writings per se ONLY but for a greater part because i had read a lot on his rather tragic life and i saw a heart breaking reflection in his works of what happened in his life (i see a lot of similarities between Kafka’s life and that of Premchand albeit that Premchand’s work got published in his lifetime mostly, though he got true critical acclaim after his death). Yes i do think that his writings give a good picture of Europe at that time, on human needs and behavior, but the prior reason outweighs all these. Kafka remains one of my favorite writers, though his works are basically short stories. He mostly wrote on a theme that emphasized the alienation of man and the indifferent society. Kafka’s tormenting thoughts on dehumanization, the cruel world, bureaucratic labyrinths which he experienced as being part of the not so liked Jewish minority in Prague, his experiences in jobs he did, his love life and affairs, on a constant fear of mental and physical collapse as a result of clinical depression and the ill health that he suffered from, reflected in a lot of his works. Including in his novella The Metamorphosis.

W. H Auden rightly wrote about Kafka:

“Kafka is important to us because his predicament is the predicament of the modern man”

In metamorphosis the protagonist Gregor Samsa turns into a giant insect when he wakes up one morning. It is kind of apparent that the “transformation” was meant in a metaphorical sense by Kafka and not in a literal one, mostly based on his fears and his own life experiences. The Novella starts like this. . .

As Gregor Samsa awoke one morning from uneasy dreams he found himself transformed in his bed into a monstrous vermin.

While rummaging through a few scientific papers that explored the problem of pattern recognition using a distributed approach i came across a few by Dr Ramos et al, which dealt with the issue using the artificial colonies approach.

In the previous post i had mentioned that the self organization of neurons into a brain like structure and the self organization of an ant colony were similar in more than a few ways. If it may be implemented then it could have implications in pattern recognition problems, where the perceptive abilities emerge from the local and simple interactions of simple agents. Such decentralized systems, a part of the swarm intelligence paradigm look very promising in applying to pattern recognition and the specific case of image segmentation as basically these may be considered a clustering and combinatorial problem taking the image itself as an ant colony habit.

The basis for this post was laid down in the previous post on colony cognitive maps. We observed the evolution of a pheromonal field there and a simple model for the same:

[Evolution of a distribution of (artificial) ants over time: Image Source]

Click to Enlarge

The above is the evolution of the distribution of artificial ants in a square lattice, this work has been extended to digital image lattices by Ramos et al. Image segmentation is an image processing problem wherein the regions of the image under consideration may be partitioned into different regions. Like into areas of low contrast and areas of high contrast, on basis of texture and grey level and so on. Image segmentation is very important as the output of an image segmentation process may be used as an input in object recognition based scenarios. The work of Ramos et al (In references below) and some of the papers cited in his works have really intrigued me and i would strongly suggest readers to have a look at them if at all they are interested in image segmentation, pattern recognition and self organization in general, some might also be interested in implementing something similar too!

In one of the papers a swarm of artificial ants was thrown on a digital habitat (an image of Albert Einstein) to explore it for 1000 iterations. The Einstein image is replaced by a map image. The evolution of the colony cognitive maps for the Einstein image habitat is shown below for various iterations.

[Evolution of a pheromonal field on an Einstein image habitat for t= 0, 1, 100, 110, 120, 130, 150, 200, 300, 400, 500, 800, 900, 1000: Image Source]

The above is represented most aptly in a .gif image.

[Evolution of a pheromonal field on an Einstein habitat: Image Source]

Now instead of Einstein a Kafka image was taken and was subject to the same. Image Source

The Kafka image habitat is replaced by a red ant in the second row. The abstract of the paper by the same name goes as.

Created with an Artificial Ant Colony, that uses images as Habitats, being sensible to their gray levels. At the second row,  Kafka is replaced as a substrate, by Red Ant. In black, the higher levels of pheromone (a chemical evaporative sugar substance used by swarms on their orientation trough out the trails). It’s exactly this artificial evaporation and the computational ant collective group synergy reallocating their upgrades of pheromone at interesting places, that allows for the emergence of adaptation and “perception” of new images. Only some of the 6000 iterations processed are represented. The system does not have any type of hierarchy, and ants communicate only in indirect forms, through out the successive alteration that they found on the Habitat.

Now what intrigues me is that the transition is extremely rapid. Such perceptive ability with change in the image habitat could have massive implications at how we look at pattern recognition for such cases.

Extremely intriguing!

Resources on Franz Kafka:

1. A Brief Biography

2. The Metamorphosis At Project Gutenberg. Click here >>

3. The Kafka Project

References and STRONGLY recommended papers:

1. Artificial Ant Colonies in Digital Image Habitats – A Mass behavior Effect Study on Pattern Recognition. Vitorino Ramos and Filipe Almeida. Click Here >>

2. Social Cognitive Maps, Swarms Collective Perception and Distributed Search on Dynamic Landscapes. Vitorino Ramos, Carlos Fernandes, Agostinho C. Rosa. Click Here >>

3. Self-Regulated Artificial Ant Colonies on Digital Image Habitats. Carlos Fernandes, Vitorino Ramos, Agostinho C. Rosa. Click Here >>

4. On the Implicit and the Artificial – Morphogenesis and Emergent Aesthetics in Autonomous Collective Systems. Vitorino Ramos. Click Here >>

5. A Strange Metamorphosis [Kafka to Red Ant], Vitorino Ramos.

Onionesque Reality Home >>

Read Full Post »

Motivation: About a couple of months back i was wondering on designing a speaker dependent speech recognizer on the 8051 micro-controller or any of its derivatives for simple machine control. We would of course need an isolated word (or digit) recognizer.

Problem: Speech recognizers can be implemented using Hidden Markov Models or Artificial Neural Networks. There are plenty of such systems in place. However the problem with these algorithms is that they are computationally pretty intensive, and thus can not be implemented on a simple 8 bit fixed point micro-processor, and that is what we need for simple machine control applications. So there is a need for a simpler algorithm.

All these algorithms also employ a short term feature vector to take care of the non-stationary nature of speech. Generally the vector length is so chosen that the nature of the signal in this band is quasi-stationary. Feature vectors are an area of active research. Generally however at the university level, Mel Frequency Cepstrum Coefficients (MFCC) or Linear Predictive Coefficients are taken as features. These too require computations that are beyond the scope of a simple processor/ micro controller like the 8051.

Solution: I was thinking what could be done to reduce this burden and choose a simpler feature so that it could be implemented on 8051. While researching on this i came across a paper[1]. This papers deals with this problem exactly!

The researchers have used only zero crossings of the speech signal to determine the feature vector. Since this novel feature extraction method is based on zero crossings only, it just needs a one bit A to D conversion[2]. This feature extraction is computationally very simple and does not require the speech signal to be pre-processed.

This feature vector is basically the histogram of the time interval between successive zero-crossings of the utterance in a short time window[1]. These feature vectors for each window are then combined together to form a feature matrix. Since we are dealing with only small time series (isolated words), we can employ Dynamic Time Warping to compare the input matrix with the reference matrix’ stored. I will discuss this in another post sometime.

To obtain this vector the following steps need to be followed.

1. The speech signal x(t) is band-pass filtered to give s(t).

2. s(t) is then subjected to infinite amplitude clipping with the help of a ZCD to give u(t).

3. u(t) is then sampled at say 8Khz to give u[n]. The feature extraction is carried out on u[n].

4. u[n] is divided in a number of short time windows for every one of the calculated W samples.

5. The histogram for each of this short time window is found. The histogram(or vector) is found as follows: The number of times ONLY ONE sample is recorded between successive zero crossings will constitute the element number 1 of the vector. The number of times ONLY TWO samples are recorded between successive zero crossings will constitute the element number two of the feature vector and so on. In this way we construct an histogram which is an appropriate feature vector.

These vectors then can be combined for all windows to get the feature matrix. These as i said earlier can be compared using DTW/DDTW/Fast DTW or some other algorithm.

As an example take an utterance for the number three in Hindi which is spoken as “teen”. The first plot below gives the waveform for the utterance. The second plot gives the end-point detected version of the same, end point detection reduces computations (and hence the memory required) by removing the “useless” portions of the utterance which do not contain any intelligence.

The surface plot for the above utterance by me for the matrix (where, as i have mentioned implicitly the rows represent the windows and the columns represent the histogram terms) prepared is as:


[1] A Microprocessor based Speech Recognizer for Isolated Hindi Digits, Ashutosh Saxena and Abhishek Singh, IEEE ACE.

[2] Zero-Crossing-Based Linear Prediction for Speech Recognition, Lipovac, Electronics Letters, pages 90-92, vol. 25 Issue 2,19 Jan 1989.

Onionesque Reality Home >>

Read Full Post »

Older Posts »