Feeds:
Posts

## Neighbourhood Gerrymandering: An Approach to Discriminative Metric Learning via Latent Structured Prediction

An informal summary of a recent project I had some involvement in.

Motivation: Why care about Metric Learning?

In many machine learning algorithms, such as k-means, Support Vector Machines, k-Nearest Neighbour based classification, kernel regression, methods based on Gaussian Processes etc etc – there is a fundamental reliance, that is to be able to measure dissimilarity between two examples. Usually this is done by using the Euclidean distance between points (i.e. points that are closer in this sense are considered more similar), which is usually suboptimal in the sense that will be explained below. Being able to compare examples and decide if they are similar or dissimilar or return a measure of similarity is one of the most fundamental problems in machine learning. Ofcourse a related question is: What does mean by “similar” afterall?

To illustrate the above let us work with k-Nearest Neighbour classification. Before starting, let us just illustrate the really simple idea (of kNN classification) by an example: Consider the following points in $\mathbb{R}^2$, with the classes marked by different colours.

Now suppose we have a new point – marked with black – whose class is unknown. We assign it a class by looking at the nearest neighbors and taking the majority vote:

Some notes on kNN:

A brief digression first before moving on the problem in the above (what is nearest?). kNN classifiers are very simple and yet in many cases they can give excellent performance. For example, consider the performance on the MNIST dataset, it is clear that kNN can give competitive performance as compared to other more complicated models.

Moreover, they are simple to implement, use local information and hence are inherently nonlinear. The biggest advantage in my opinion is that it is easy to add new classes (since no retraining from scratch is required) and since we average across points, kNN is also relatively robust to label noise. It also has some attractive theoretical properties: for example kNN is universally consistent (as the number of points approaches infinity, with appropriate choice of k, the kNN error will approach the Bayes Risk).

Notion of “Nearest”:

At the same time, kNN classifiers also have their disadvantages. One is related to the notion of “nearest” (which falls back on what was talked about at the start) i.e. how does one decide what points are “nearest”. Usually such points are decided on the basis of the Euclidean distance on the native feature space which usually has shortfalls. Why? Because nearness in the Euclidean space may not correspond to nearness in the label space i.e. points that might be far off in the Euclidean space may have similar labels. In such cases, clearly the notion of “near” using the euclidean distance is suboptimal. This is illustrated by a set of figures below (adapted from slides by Kilian Weinberger):

An Illustration:

Consider the image of this lady – now how do we decide what is more similar to it?

Someone might mean similar on the basis of the gender:

Or on the basis of age:

Or on the basis of the hairstyle!

Similarity depends on the context! Something that the euclidean distance in the native feature space would fail to capture. This context is provided by labels.

Distance Metric Learning:

The goal of Metric Learning is to learn a distance metric, so that the above label information is incorporated in the notion of distance i.e. points that are semantically similar are now closer in the new space. The idea is to take the original or native feature space, use the label information and then amplify directions that are more informative and squish directions that are not. This is illustrated in this figure – notice that the point marked in black would be incorrectly classified in the native feature space, however under the learnt metric it would be correctly classified.

It is worthwhile to have a brief look at what this means. The Euclidean distance (with $x_i \in \mathbb{R}^d$) is defined by

$\sqrt{(x_i - x_j)^T (x_i - x_j)}$

as also was evident in the above figure, this corresponds to the following euclidean ball in 2-D

A family of distance measure may be defined using an inner product matrix. These are called the Mahalanobis metrics.

$\sqrt{(x_i - x_j)^T \mathbf{W}(x_i - x_j)}$

The learnt metric affects a rescaling and rotation of the original space. The goal is now to learn this $\mathbf{W} \succeq 0$ using the label information so that the new distances correspond better to the semantic context. It is easy to see that when $\mathbf{W} \succeq 0$, the above is still a distance metric.

Learning $\mathbf{W}$:

Usually the real motivation for metric learning is to optimize for the kNN objective i.e. learn the matrix $\mathbf{W} \succeq 0$ so that the kNN error is reduced. But note that directly optimizing for the kNN loss is intractable because of the combinatorial nature of the optimization (we’ll see this in a bit), so instead, $\mathbf{W}$ is learnt as follows:

1. Define a set of “good” neighbors for each point. The definition of “good” is usually some combination of proximity to the query point and label agreement between the points.

2. Define a set of “bad” neighbours for each point. This might be a set of points that are “close” to the query point but disagree on the label (i.e. inspite of being close to the query point they might give a wrong classification if they were chosen to classify the query point).

3. Set up the optimization problem for $\mathbf{W}$ such that for each query point, “good” neighbours are pulled closer to it while “bad” neighbours are pushed farther away, and thus learn $\mathbf{W}$ so as to minimize the leave one out kNN error.

The exact formulation of “good” and “bad” varies from method to method. Here are some examples:

In one of the earliest papers on distance metric learning by Xing, Ng, Jordan and Russell (2002) – good neighbors are similarly labeled k points. The optimization is done so that each class is mapped into a ball of fixed radius. However no separation is enforced between the classes. This is illustrated in the following figure (the query point is marked with an X, similarly labeled k points are moved into a ball of a fixed radius):

One problem with the above is that the kNN objective does not really require that similarly labeled points are clustered together, hence in a way it optimizes for a harder objective. This is remedied by the LMNN described briefly below.

One of the more famous Metric Learning papers is the Large Margin Nearest Neighbors by Weinberger and Saul (2006). Here good neighbors are similarly labeled k points (and the circle around x is the distance of the farthest of the good neighbours) and “worst offenders” or “bad” neighbours are points that are of a different class but still in the nearest neighbors of the query point. The optimization is basically a semidefinite program that works to pull the good neighbours towards the query point and a margin is enforced by pushing the offending points out of this circle. Thus in a way, the goal in LMNN is to deform the metric in such a way that the neighbourhood for each point is “pure.

There are many approaches to the metric learning problem, however a few more notable ones are:

1. Neighbourhood Components Analysis (Goldberger, Roweis, Hinton and Salakhutdinov, 2004): Here the piecewise constant error of the kNN rule is replaced by a soft version. This leads to a non-convex objective that can be optimized by gradient descent. Basically, NCA tries to optimize for the choice  of neighbour at the price of losing convexity.

2. Collapsing Classes (Globerson and Roweis, 2006): This method attempts to remedy the non-convexity above by optimizing a similar stochastic rule while attempting to collapse each class to one point, making the problem convex.

3. Metric Learning to Rank (McFee and Lankriet, 2010): This paper takes a different take on metric learning, treating it as a ranking problem. Note that given a fixed p.s.d matrix $\mathbf{W}$ a query point induces a permutation on the training set (in order of increasing distance). The idea thus is to optimize the metric for some ranking measure (such as precision@k). But note that this is not necessarily the same as requiring correct classification.

Neighbourhood Gerrymandering:

As a motivation we can look at the cartoon above for LMNN. Since we are looking to optimize for the kNN objective, the requirement to learn the metric should just be correct classification. Thus, we should need to push the points to ensure the same. Thus we can have the circle around x as simply the distance of the farthest point in the k nearest neighbours (irrespective of class). Now, we would like to deform the metric such that enough points are pulled in and pushed out of this circle so as to ensure correct classification. This is illustrated below.

This method is akin to the common practice of Gerrymandering, in drawing up borders of election districts so as to provide advantages to desired political parties. This is done by concentrating voters from a particular party and/or by spreading out voters from other parties. In the above, the “districts” are cells in the Voronoi diagram defined by the Mahalanobis metric and “parties” are class labels voted for by each neighbour.

Motivations and Intuition:

Now we can step back a little from the survey above, and think a bit about the kNN problem in somewhat more precise terms so that the above approach can be motivated better.

For kNN, given a query point and a fixed metric, there is an implicit latent variable: The choice of the k “neighbours”.

Given this latent variable – inference of the label for the query point is trivial – since it is just the majority vote. But notice that for any given query point, there can exist a very large number of  choices of k points that may correspond to correct classification (basically any set of points with majority of correct class will work). Now we basically want to learn a metric so that we prefer one of these sets over any set of k neighbours which would vote for a wrong class. In particular, from the sets that affects correct classification we would like to pick the set that is on average most similar to the query point.

We can write kNN prediction as an inference problem with a structured latent variable being the choice of k neighbours.

The learning then corresponds to  minimizing a sum of structured latent hinge loss and a regularizer. Computing the latent hinge loss involves loss-augmented inference which is basically looking for the worst offending k points (points that have high average similarity with the query point, yet correspond to a high loss). Given the combinatorial nature of the problem, efficient inference and loss-augmented inference is key. Optimization can basically be just gradient descent on the surrograte loss. To make this a bit more clear, the setup is described below:

Problem Setup:

Suppose we are given $N$ training examples that are represented by a “native” feature map, $\mathbf{X} = \{x_1, \dots, x_N\}$ with $x_i \in \mathbb{R}^d$ with class labels $\mathbf{y} = [y_1, \dots, y_N]^T$ with $y_i \in [\mathbf{R}]$, where $[\mathbf{R}]$ stands for the set $\{1, \dots, \mathbf{R}\}$.

Suppose are also provided with a loss matrix $\Lambda$ with $\Lambda(r,r')$ being the loss incurred by predicting $r'$ when the correct class is $r$. We assume that $\Lambda(r,r) = 0$ and $\forall (r,r'), \Lambda(r,r') \geq 0$.

Now let $h \subset \mathbf{X}$ be a set of examples in $\mathbf{X}$.

As stated earlier, we are interested in the Mahalanobis metrics:

$D_W(x,x_i) = (x-x_i)^T W (x-x_i)$

For a fixed $W$ we may define the distance of $h$ with respect to a point $x$ as:

$\displaystyle S_W(x,h) - \sum_{x_j \in h} D_W(x, x_j)$

Therefore, the set of k-Nearest Neighbours of $x$ in $\mathbf{X}$ is:

$h_W(x ) = \arg\max_{|h|=k} S_W(x,h)$

For any set $h$ of $k$ examples from $\mathbf{X}$ we can predict the label of $x$ by a simple majority vote.

$\hat{y}(h) = majority\{y_j: x_j \in h\}$

The kNN classifier therefore predicts $\hat{y}(h_W(x))$.

Thus, the classification loss incurred using the set $h$ can be defined as:

$\Delta(y,h) = \Lambda(y,\hat{y}(h))$

Learning and Inference:

One might want to learn $W$ so as to minimize the training loss:

$\displaystyle \sum_i \Delta(y_i, h_W(x_i))$

However as mentioned in passing above, this fails because of the intractable nature of  the classification loss $\Delta$. Thus we’d have to resort to the usual remedy: define a tractable surrograte loss.

It must be stressed again that the output of prediction is a structured object $h_W$. The loss in structured prediction penalizes the gap between score of the correct structured output and the score of the “worst offending” incorrect output. This leads to the following definition of the surrogate:

$L(x,y,W) = \max_h [S_W(x,h) + \Delta(y,h)] - \max_{h: \Delta(y,h) = 0} S_W(x,h)$

This corresponds to our earlier intuition on wanting to learn $W$ such that the gap between the “good neighbours” and “worst offenders” is increased.

So, although the loss above was arrived at by intuitive arguments, it turns out that our problem is an instance of a familiar type of problem: Latent Structured Prediction and hence the machinery for optimization there can be used here as well. The objective for us corresponds to:

$\displaystyle \min_W \| W\|^2_{F} + C \sum_i (L(x_i, y_i,W))$

Where $\| \cdot \|_F$ is the Frobenius norm.

Note that the regularizer is convex, but the loss is not convex to the subtraction of the max term i.e. now it is a difference of convex functions which means the concave convex procedure may be used for optimization (although we just use stochastic gradient descent). Also note that the optimization at each step needs an efficient subroutine to determine the correct structured output (inference of the best set of neighbours) and the worst offending incorrect structured output (loss augmented inference i.e. finding the worst set of neighbors). Turns out that for this problem this is possible (although not presented here).

It is interesting to think about how this approach extends to regression and to see how it works when the embeddings learnt are not linear.

## Blind Source Separation in Magnetic Resonance Images

I have been involved in a major project on contrast enhancement of Magnetic Resonance Images by using Independent Component Analysis (ICA) and Support Vector Machines (SVM) for the past couple of  months. It is an extremely exciting project and also something new for me, as I have worked on bio-medical images just once before. In the past, I have used ICA and SVM in face recognition/authentication, however this application is quite novel.

This post intends to introduce the problem, discuss a motivating example, some methods, expected work and some problems.

__________

A Simple Introduction and Motivating Example:

The simplest motivating example for this problem is the famous cocktail party problem:

You are at a cocktail party, and there are about 12 people present with each talking simultaneously. Add to that a music source. So that makes it 13.

Suppose you want to follow what each person was saying later and for doing so you place a number of tape recorders at different locations in the room (let’s not worry about the number of recorders right now). When you hear them later, the sounds would hardly be understandable as they would be mixed up.

Now you define an engineering problem : that using these recordings (which are basically mixtures), separate out the different sources with as little distortion as possible. In a real time cocktail party, the brain shows[1][2][3] a remarkable ability to follow one conversation. However such a problem has proved to be quite difficult in signal processing. Let’s just illustrate the cocktail party problem in a cartoon below :

The Cocktail Party Problem

Please listen to a demo of the cocktail party problem at the HUT ICA project page.

__________

The Logic Behind Constructing MR Images in Simple Terms:

Now, keeping the previous brief discussion in mind. Let’s introduce in simple words how MRI works. This is just a simplification to make the idea clearer, and not really how MRI works.  Discussing MRI in detail would divert the focus of the post. To look at how MRI works follow these highly recommended tutorials[4][5][6]:

Suppose your body is placed in a Magnetic Field (let’s not worry about specifics yet). Consider two contiguous tissues in your body – X and Y. When subject to a magnetic field, the particles (protons) in the tissues would get aligned according to the field. The amount of magnetization would depend on the tissue type. Now suppose we want to measure how much a tissue gets magnetized. One way to think about it is like this : First apply the magnetic field, after the application the particles would get excited. Once the field is removed, these particles would tend to relax to their ground state. By being able to measure the time it takes for the particles to return, we would get some measure of the magnetization of the tissue(s). This is because, the greater the time for relaxation, greater the magnetization.

An image is basically a measure of the energy distribution. Now suppose we have the measurements for tissues X and Y, and since they were of a different nature (composition, density of protons etc), their response to the field would have been different. Thus we would get some contrast between them and thus would get an image.

In very simplistic terms, this is how MRI scans are obtained. Though as mentioned above, please follow [4][5][6] for detailed tutorials on MRI.

__________

MRI scans of the Brain and the Cocktail Party Problem :

Now consider the above discussion in context of taking a MRI scan of the brain. The brain has a number of constituents. Some being : Gray Matter, White Matter, Cerebrospinal Fluid (CSF) Fat, Muscle/Skin, Glial Matter etc. Now since each is unique, they would exhibit unique characteristics under a magnetic field. However, while taking a scan, we get one MRI image of the entire brain.

These scans can be considered as an equivalent to the mixtures of the cocktail party example. If we apply blind source separation to these, we should be able to separate out the various constituents such as gray matter, white matter, CSF etc. These images of the independent sources can be used for better diagnosis. This would be something like this :

If suppose the Simulated MR scans (from the McGill Simulated brain Database) were as follows:

Simulated MR Scans

The “ground truth” images for these scans would be as follows :

Ground Truth Images of Different Brain Tissue Substances

__________

Restatement of the Broad Research Problem and Use of ICA and SVM:

Magnetic Resonance Imaging is superior to Computerised Tomography for brain imaging at least, for the reason that it can give much better soft tissue contrast (because even small changes in the proton density and composition in the tissue are well represented).

Like for most techniques, improvements to scans obtained by MRI are much desired to improve diagnosis. Blind source separation has been used to separate physiologically different components from EEG[7]/MEG[8] data (similar to the cocktail party problem), financial data[9] and even in fMRI[10][11]. But it has not received much attention for MRI. Nakai et al[12] used Independent Component Analysis for the purpose of separating physiologically independent components from MRI scans. They took MR images of 10 normal subjects, 3 subjects with brain tumour and 1 subject with multiple sclerosis and performed ICA on the data. They reported success in improving contrast for gray and white matter, which was beneficial for the diagnosis of brain tumour. The demylination in Multiple sclerosis cases was also enhanced in the images. They suggested that ICA could potentially separate out all the tissues which had different relaxation characteristics (different sources of the cocktail party example). This approach thus shows much promise.

In more technical terms : Consider a set of MR frames as a single multispectral image. Where each band is taken during a particular pulse sequence (will be discussed below). Then use ICA on the data to separate out the physiologically independent components. A classifier such as the SVM can improve the contrast further of the separated independent components.

However, using ICA for MRI has been tricky, something I would discuss towards the end of this post and also in future posts.

Before doing so, I intend to touch up on the basics for the sake of completeness.

__________

Magnetic Resonance Imaging:

I had been thinking of writing a detailed tutorial on MRI, mostly because it requires some basic physics. However I don’t think it is required. I would recommend [4][5][6] for a study of the same in sufficient depth. I have recently taken tutorials on MRI, and would be willing to write for the blog if there are requests.

__________

An Introduction to Independent Component Analysis:

Independent Component Analysis was developed initially to solve problems such as the cocktail party problem discussed above.

Let’s formalize a problem like the cocktail party example. For simplicity let us assume that there are only two sources and two mixtures (obtained by keeping two recorders at different locations in the party).

Let’s represent these two mixtures as $x_1$ and $x_2$, and let $s_1$ and $s_2$ be the two sources that were mixed. Since we are assuming that the two microphones were kept at different locations, the mixtures $x_1$ and $x_2$ would be different.

We could write this as:

$x_1 = a_{11}s_1 + a_{12}s_2 \quad \cdots \quad (1)$

$x_2 = a_{21}s_1 + a_{22}s_2 \quad \cdots \quad (2)$

The coefficients $a_{11}, a_{12}, a_{21}, a_{22}$ are basically some parameters that depend on the distance of the respective source from the microphones.

Let’s define our problem as : Using only the mixtures $x_i$ estimate the signal sources $s_i$. It is notable that you do not have any knowledge of the parameters $a_{ij}$.

This could be illustrated by this :

Consider three signals:

Suppose we have five mixtures obtained from these three signals.

Signals obtained by mixing source signals

If you only have the mixed signals available. And do not know how they were mixed (parameters $a_{ij}$ not known). And from these mixed signals ($x_{i}$) you have to estimate the source signals ($s_{i}$). This problem is of considerable difficulty.

One approach would be : Use the statistical properties of the signals ($s_i$) to estimate the parameters ($a_{ij}$). It is surprising that it is enough to assume that $s_1$ and $s_2$ are statistically independent. This assumption might not be valid in many scenarios. But works well in most situations.

We could write the above system of linear equations in matrix form as :

$x=As$

where, $A$ represents the mixing matrix, $x$ and $s$ represent the mixtures and the sources respectively.

The problem is to estimate $s$ from $x$ without knowing $A$. The assumption made is that the sources $s$ are statistically independent.

How we go about solving this problem is exciting and an area of active research.  ICA was originally developed for solving such problems. Please follow [12][13][14] for discussions on mutual information, measures of non-gaussianity such as Kurtosis and Negentropy and the fastICA algorithm.

__________

Why can ICA be used in MRI?

One limitation that ICA faces is that it can not work if more than one signal sources have a  Gaussian distribution. This can be illustrated as follows:

Again consider our equation for just two sources :

$\displaystyle \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix} \begin{bmatrix} s_1 \\ s_2 \end{bmatrix}$

Our problem was : We have to estimate $s$ from $x$ without any knowledge of $A$. We would first need to estimate the parameters $A$ from $x$, assuming statistical independence of $s$. And then we could find $s$ as :

$s = Wx$, where $W=A^{-1}$ , or the inverse of the estimated mixing matrix $A$.

To understand how a solution would become impossible if both the sources had a Gaussian distribution, consider this :

Consider two independent components having the following uniform distributions:

$P(s_i) = \begin{cases} \frac{1}{2 \sqrt{3}} & \text{if} \quad |s_i| \leq \sqrt{3} \\ 0 & \text{otherwise} \end{cases}$

The joint density of the two sources would then be uniform on a square. This follows from the fact that the joint density would be the product of the two marginal densities.

The joint distribution for Si

[ Image Source : Reference [12][13] ]

Now if $s_1$ and $s_2$ were mixed by a mixing matrix $A$

$A = \begin{bmatrix} 2 & 3 \\ 2 & 1 \end{bmatrix}$

The mixtures obtained are $x_1$ and $x_2$. Now since the original sources had a joint distribution on a square, and they were transformed by using a mixing matrix, the joint distribution of the mixtures $x_1$ and $x_2$ will be a parallelogram. These mixtures are no longer independent.

Joint Distribution of the mixtures

[ Image Source : Reference [12][13] ]

Now consider the problem once again : We have to estimate the mixing matrix $A$ from the mixtures $x_i$, and using this estimated $A$ we have to estimate the sources $s_i$.

From the above joint distribution we have a way to estimate $A$. The edges of the parallelogram are in a direction given by the columns of $A$. This is an intuitive way of estimating the mixing matrix : obtain the joint distributions of the mixtures, estimate the columns of the mixing matrix by finding the directions of the edges of the parallelogram. This solution gives a good intuitive feel of a in-principle solution of the problem( however, it isn’t practical).

However, now instead of two independent sources having a uniform distribution consider two independent sources having a Gaussian distribution. The joint distribution would be :

Joint Distribution when both Independent sources are Gaussian

[ Image Source : Reference [12][13] ]

Now going by the above discussion, because of the nature of the above joint distribution, it is not possible to estimate the mixing matrix from it.

Thus ICA fails when one or more independent components have a a gaussian distribution.

Noise in MRI is non-gaussian[16], therefore ICA is suited for MRI.

__________

Problems in Using ICA for MRI Blind Source Separation:

The application of ICA for MRI faces a number of problems. I would discuss these in later blog posts. I would only discuss one major problem – the problem of Over-Complete ICA.

Over-Complete ICA in MRI:

The problem of over complete ICA occurs when there are lesser sensors (tape recorders from our above discussion) than sources. This problem can be understood by the following discussion. Suppose you have 3 mixtures $x_1$, $x_2$ and $x_3$ (imagine you have collected 3 tape recordings in a cocktail party of 6). Therefore you now have to estimate 6 sources from 3 mixtures.

Now the problem becomes something like this :

$x_1 = a_{11}s_1 + a_{12}s_2 + a_{13}s_3 + a_{14}s_4 + a_{15}s_5 + a_{16}s_6$

$x_2 = a_{21}s_1 + a_{22}s_2 + a_{23}s_3 + a_{24}s_4 + a_{25}s_5 + a_{26}s_6$

$x_3 = a_{31}s_1 + a_{32}s_2 + a_{33}s_3 + a_{34}s_4 + a_{35}s_5 + a_{36}s_6$

Assume for a second we can still estimate $a_{ij}$, still we can not find all the signal sources. As the number of linear equations is just three, while the number of unknowns is 6. This is a considerably harder problem and has been discussed by many groups such as [19][20][21].

Now dropping our assumption, the estimation of $a_{ij}$ is also harder in such a case.

The Case in MRI:

The problem of over-complete ICA doesn’t arise when it comes to functional-MRI. However it is a problem when it comes to MRI[17].

In MRI, by varying the parameters used for imaging, the three kind of images that can be obtained are T1 weighted, T2 weighted and Proton Density images. Going by our discussion in the section on MRI above. These three can be treated as mixtures.

Therefore, we have 3 mixtures at our disposal.  However, as the ground truth images above show: The number of different tissues in the brain exceeds 9. Thus this becomes a considerably difficult problem : We have to estimate 9-10 independent components from just 3 mixtures.

I would discuss methods that can help do that in later blog posts.

If only three mixtures are used, 3 ICs can be estimated. Since the actual number of ICs exceeds 9. It is obvious that the each of 3 ICs have atleast 2 ICs mixed, which means that a certain tissue type is not enhanced as much as it could have been had there been one IC for it. This can be understood by looking at this example.

3 ICs obtained by Applying Fast-ICA on MR scans

[I used FastICA for obtaining these Independent Components ]

To get more ICs, in simple words, we need more mixtures. However we can obtain more mixtures from the existing mixtures itself by a process of Band-Expansion[18].

I would discuss this problem of OC-ICA and it’s possible solutions in later posts.

__________

To Conclude:

A basic idea related to application of ICA to MR scans was discussed. It is clear that even with just three ICs significant tissue contrast enhancement is achieved. Problems related to OC-ICA would be discussed in later posts one by one. I would also discuss quantifying the results obtained using the Tanimoto/Jaccard coefficient of similarity.

__________

References and Resources:

Cocktail Party Problem

[1] “Some Experiments on the Recognition of Speech, with One and with Two Ears“; E. Colin Cherry; The Journal of the Acoustical Society of America; September 1953. (PDF)

[2] “The Attentive Brain“; Stephen Grossberg; Department of Cognitive and Neural Systemss – Boston University; American Scientist, 1995. (PDF)

[3] “The Cocktail Party Problem : A Primer“; Josh H. McDermott; Current Biology Vol 19. No. 22. (PDF)

Magnetic Resonance Imaging

[4] “Magnetic Resonance ImagingTutorial“; H Panepucci and A Tannus; Technical Report; USP, 1994. (PDF)

[5] “10 Video lessons on MRI by Paul Callaghan” (~ an hour in total). (Videos)

[6] “MRI Tutorial for Neuroscience Boot Camp” Melissa Saenz. (PDF)

Sample ICA Applications Similar to The Cocktail Party Problem

[7] “Independent Component Analysis of Electroencephalographic Data“; Makieng, Bell, Jung, Sejnowski; Advances in Neural Information Processing Systems, 1996. (PDF)

[8] “Application of ICA to MEG noise Reduction“; Masaki Kawakatsu; 4th International Symposium on Independent Component Analysis and Blind Source Separation; 2003. (PDF)

[9] “Independent Component Analysis in Financial Data” from the book Computational Finance; Yasser S. Abu-Mostafa; The MIT Press; 2000. (Book Link)

[10] “ICA of functional MRI data : An overview“; Calhoun, Adali, Hansen, Larsen, Pekar; 4th International Symposium on Independent Component Analysis and Blind Source Separation; 2003. (PDF)

[11] “Independent Component Analysis of fMRI Data – Examining the Assumptions“; McKeown, Sejnowski; Human Brain Mapping; 1998. (PDF)

Independent Component Analysis : Tutorials/Books

[12] “Independent Component Analysis : Algorithms and Applications“; Aapo Hyvärinen, Erkki Oja; Neural Networks; 2000. (PDF)

[13] “Independent Component Analysis“; Aapo Hyvärinen, Juha Karhunen, Erkki Oja; John Wiley Publications; 2001. (Book Link)

[14] ICA Tutorial at videolectures.net by Aapo Hyvärinen. (Videos)

Independent Component Analysis for Magnetic Resonance Imaging

[15] “Application of of Independent Component Analysis to Magnetic Resonance Imaging for enhancing the Contrast of Gray and White Matter“; Nakai, Muraki, Bagarinao, Miki, Takehara, Matsuo, Kato, Sakahara, Isoda; NeuroImage; 2004. (Journal Link)

[16] “Noise in MRI“; Albert Macovski; Magnetic Resonance in Medicine; 1996. (PDF)

[17] “Independent Component Analysis in Magnetic Resonance Image Analysis“;  Ouyang, Chen, Chai, Clayton Chen, Poon, Yang, Lee; EURASIP journal on Advances in Signal Processing; 2008 (Journal Link)

[18] “Band Expansion Based Over-Complete Independent Component Analysis for Multispectral Processing of Magnetic Resonance Images “; Ouyang, Chen, Chai, Clayton Chen, Poon, Yang, Lee; IEEE Transactions on Biomedical Imaging; June 2008. (PDF)

Over-Complete ICA:

[19] “Blind Source Separation of More Sources Than Mixtures Using Over Complete Representations“; Lee, Lewicki, Girolami, Sejnowski; IEEE Signal Processing Letters; 1999. (PDF)

[20] “Learning Overcomplete Representations“; Lewicki, Sejnowski. (PDF)

[21] “A Fast Algorithm for estimating over-complete ICA bases for Image Windows “; Hyvarinen, Cristescu, Oja; International Joint Conference on Neural Networks; 1999. (IEEE Xplore link)

__________

Onionesque Reality Home >>

## Face Recognition/Authentication Using Support Vector Machines

This post is part of a series on face recognition, I have been posting on face recognition for a while. There would be at least 7-8 more posts in the near future on the topic. Though I can not promise a time frame within which all would be up.

Previous Related Posts:

3. A Huge Collection of Datasets (Post links to a number of face image databases)

This post would reference two of my posts. One on SVMs and the other on Face Recognition using Eigenfaces.

Note: This post focuses on the idea behind using SVMs for face recognition and authentication. In future posts I will cover the various packages that can be used to implement SVMs and how to go about using them, and specifically for face recognition. The same can be easily extended to other similar problems such as content based retrieval systems, speech recognition, character or signature verification systems as well.

_____

Difference between Face Authentication (Verification) and Face Recognition (also called identification):

This might seem like a silly thing to start with. But for the sake of completeness, It is a good point to start with.

Face Authentication can be considered a subset of face recognition. Though due to the small difference there are a few non-concurrent parts in both the systems.

Face Authentication (also called verification) involves a one to one check that compares an input image (also called a query image, probe image or simply probe) with only the image (or class) that the user claims to be. In simple words, if you stand in front of a face authentication system and claim to be a certain user, the system will ONLY check if you are that user or not.

Face Recognition (or Identification) is another thing, though ofcourse related. It involves a one to many comparison of the input image (or probe or query image) with a template library. In simple words, in a face recognition system the input image will be compared with ALL the classes and then a decision will be made so as to identify to WHO the the input image belongs to. Or if it does not belong to the database at all.

Like I just said before, though both Authentication and Recognition are related there are some differences in the method involved, which are obvious due to the different nature of both.

_____

A Touch-Up of Support Vector Machines:

A few posts ago I wrote a post on why Support Vector Machines had this rather “seemingly” un-intuitive name. It had a brief introduction to SVMs as well. For those completely new to Support Vector Machines this post should help. I’ll still add a little for this post.

Support Vector Machine is a binary classification method that finds the optimal linear decision surface between two classes. The decision surface is nothing but a weighted combination of the support vectors. In other words, the support vectors decide the nature of the boundary between the two classes. Take a look at the image below:

The SVM takes in labeled training examples $\{\; x_i, y_i \}$, where $x_i$ represents the features and $y_i$ the class label, that could be either 1 or -1.  On training we obtain a set of Support Vectors $m$, multipliers $\alpha_i$, $y_i$and the term $b$. To understand what $b$ does, look at the above figure. It is somewhat like the intercept term $c$ in the equation of a straight line, $y = mx + c$. The terms $w$ and $x$ determine the orientation of the hyperplane while $b$ determines the actual position of the hyperplane.

As is indicated in the diagram, the linear decision surface is :

$w\star x + b = 0 \qquad(1)$

where $\displaystyle w = \sum_{i=1}^m \alpha_i y_i s_i$

where $s_i$ are the support vectors.

The above holds when the data (classes) is linearly separable. Sometimes however, that’s not the case. Take the following example:

The two classes are indicated by the two different colors. The data is clearly not LINEARLY separable.

However when mapped onto two dimensions, a linear decision surface between them can be made with ease.

Take another example. In this example the data is not linearly separable in 2-D, so they are mapped onto three dimensions where a linear decision surface between the classes can be made.

By Cover’s Theorem it is more likely that a data-set not linearly separable in some dimension would be linearly separable  in a higher dimension. The above two examples are simple, sometimes the data might be linearly separable at very high dimensions, maybe at infinite dimensions.

But how do we realize it? This done by employing the beautiful Kernel Trick. In place of the inner products we use a suitable Mercer Kernel. I don’t believe it is a good idea to discuss kernels here, or it will be a needless digression from face recognition. I promise to discuss it some time later.

Thus the non-linear decision surface changes from $\qquad(1)$ to:

$\displaystyle w = \sum_{i=1}^m \alpha_i y_i K(s_i, x) +b = 0 \qquad(2)$

Where $K$ represents a Kernel. It could be a Radial Basis (Gaussian) Kernel, A linear Kernel, A polynomial Kernel or a custom Kernel. :)

_____

Face Authentication is a two class problem. As I have mentioned earlier, here the system is presented with a claimed identity and it has to make a decision whether the claimant is really that person or not. The SVM in such applications will have to be fed with the images of one person, which will constitute one class and the other class will consist of images of other people other than that person. The SVM will then generate a linear decision surface.

For a input/probe image $p$, the identity is accepted if:

$w \star p + b < 0$

Or it is rejected. We can parameterize the decision surface by modifying the above as:

$w \star x + b = \Delta$

Then, a claim will be accepted if for a probe, $p$

$w \star p + b < \Delta$

_____

Now face recognition is a $\mathcal{K}$ class problem. Where $\mathcal{K}$ is the number of classes (or individuals).  Whereas the traditional Support Vector Machine is a binary classifier. So we’ll make a few changes to the way we are representing the faces to suit our classifier. I will come back to this in a while.

Feature Extraction: The faces will have to be represented by some appropriate features, these could be weights obtained using the Eigenfaces method, or using gabor features or anything else. I have written a post earlier that talked of a face recognition system based on Eigenfaces. I would direct the reader to check face representation using Eigenfaces there.

Using Eigenfaces, each probe $\Phi$could be represented as a vector of weights:

$\Omega = \begin{bmatrix}w_1\\w_2\\ \vdots\\w_M \end{bmatrix}$

After obtaining such a weight vector for the input or probe image and for all the other images stored in the library, we were simply finding the Euclidean or the Mahalanobis distance of the weight vector of the probe image with those of the images in the template library.  And then were recognizing the probe as a face that gave the minimum score provided it was below a certain threshold. I have discussed this is much detail there. And since I have, I would not discuss this again here.

_____

Representation in Difference Space:

SVMs are binary classifiers, that is – they give the class which might be 1 or -1, so we would have to modify the representation of faces a little bit than what we were doing in that previous post to make it somewhat more desirable. In the previous approach that is “a view based or face space approach”, each image was encoded separately. Here, we would change the representation and encode faces into a difference space. The difference space takes into account the dissimilarities between faces.

In the difference space there can be two different classes.

1. The class that encodes the dissimilarities between different images of the same person,

2. The other class encodes the dissimilarities between images of other people. These two classes are then given to a SVM which then generates a decision surface.

As  I wrote earlier, Face recognition traditionally can be thought of as a $\mathcal{K}$ class problem and face authentication can be thought of as a $\mathcal{K}$ instances two class problem. To reduce it to a two class problem we formulate the problem into a difference space as I have already mentioned.

Now consider a training set $\mathcal{T} = \{ \;t_1, \ldots, t_M\}$ having ${M}$ training images belonging to $\mathcal{K}$ individuals. Each individual can have more than one image, that means $M > \mathcal{K}$ ofcourse. It is from $\mathcal{T}$ that we generate the two classes I mentioned above.

1. The within class differences set. This set takes into account the differences in the images of the same class or individual. In more formal terms:

$\mathcal{C}_1 = \{ \; t_i - t_j | t_i \backsim t_j \}$

Where $t_i$ and $t_j$ are images and $t_i \backsim t_j$ indicates that they belong to the same person.

This set contains the differences not just for one individual but for all $\mathcal{K}$ individuals.

2. The between class differences set. This set gives the dissimilarities of different images of different individually. In more formal terms:

$\mathcal{C}_2 = \{ \; t_i - t_j | t_i \nsim t_j\}$

Where $t_i$ and $t_j$ are images and $t_i \nsim t_j$ indicates that they do not belong to the same person.

_____

Face Authentication:

For Authentication the incoming probe $p$ and a claimed identity $i$ is presented.

Using this, we first find out the similarity score:

$\delta = \displaystyle \sum_{i=1}^m \alpha_i y_i K(s_i, ClaimedID - p) +b$

We then accept this claim if it lies below a certain threshold $\Delta$ or else reject it. I have discussed the need for a threshold at the end of this post, please have a look. $\Delta$ is to be found heuristically.

_____

Face Recognition:

Consider a set of images $\mathcal{T} = \{ \;t_1, \ldots, t_M\}$, and a probe $p$ which is to be indentified.

We take $p$ and score it with every image in the set $t_i$:

$\delta = \displaystyle \sum_{i=1}^m \alpha_i y_i K(s_i, t_i - p) + b$

The image with the lowest score but below a threshold is recognized. I have written at the end of this post explaining why this threshold is important. This threshold is mostly chose heuristically.

_____

References and Important Papers

1. Face Recognition Using Eigenfaces, Matthew A. Turk and Alex P. Pentland, MIT Vision and Modeling Lab, CVPR ‘91.

2. Eigenfaces Versus Fischerfaces : Recognition using Class Specific Linear Projection, Belhumeur, Hespanha, Kreigman, PAMI ‘97.

3. Eigenfaces for Recognition, Matthew A. Turk and Alex P. Pentland, Journal of Cognitive Neuroscience ‘91.

4. Support Vector Machines Applied to Face Recognition, P. J. Phillips, Neural Information Processing Systems ’99.

5. The Nature of Statistical Learning Theory (Book), Vladimir Vapnik, Springer ’99.

6. A Tutorial on Support Vector Machines for Pattern Recognition, Christopher J. C. Burges, Data Mining and Knowledge Discovery, ’99

_____

Onionesque Reality Home >>

## Why are Support Vectors Machines called so?

I am often asked by some interested friends : What ARE actually support vector machines? What kind of machines? It is obvious that such questions are put up by people who are not initiated to the idea of SVMs or even learning machines so to say.

But I have also noticed that a lot of people working in the ML/AI/Pattern Recognition domain don’t have a clear idea of why Support Vector Machines are called so. Understanding why Artificial Neural Networks are named so is not so difficult, but Support Vector Machines does sound somewhat abstruse. This is exactly what this post aims to address.

I’d be giving a short introduction first. For those who simply wish to get to the point, I would recommend them to skip to this point.

Let’s parse SVM into the three constituent words and attack them one by one – Support Vector Machines. Let’s look at Machines first.

_____

Learning Machines: Clearly this part is meant for the former category in the two I mentioned in the above introductory paragraph i.e for the uninitiated.

Let me start with a basic definition that I think is necessary to be put for the sake of mere completeness. I hope this does not disgust the more expert reader. ;-)

Machine Learning

Way back in 1959, Arthur Samuel defined Machine Learning as the field of study that gave computers the ability to learn without being explicitly programmed. Samuel was a pioneer in the area as he is widely credited to have made the first self-learning program that played checkers.

Tom Mitchell in Machine Learning puts the definition more formally as: A learning program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.

Given a training set, we feed it into a learning algorithm (like SVM, Artificial Neural Nets, Logistic Regression, Linear Regression etc). The learning algorithm then outputs a function, which for historical reasons is called the hypothesis and denoted by $\hslash$.

So, basically the hypothesis’ job is to take a new input and give out a estimated output or class. The parameters that define the hypothesis are what are “learned” by using the training set. So the term learning machine is used in this sense.

To sum up: The hypothesis can be thought of as a machine that gives a prediction y on some unseen input x. The parameters of the hypothesis are learned (for a toy example see the part on pigeons in this post).

Ofcourse the above is defined in a supervised context but can be easily generalized.

_____

Support Vector Machines

An Introduction: I would prefer to give an extremely short introduction to SVMs first before coming to the origin of the term “Support Vectors”.

Consider the image below. How would you classify this data?

This set looks linearly separable. That means we could draw a straight line to separate the two classes (indicated by the two different colors). Also note that these data points lie in a two dimensional space, so we talk of a straight line. We could easily graduate to higher dimensions, as an example in a 3-D space we would have spoken of constructing a plane to separate the points and a hyperplane in a $\mathcal{N}$ dimensional space.

Coming back, though we can draw a straight line to separate these two classes, there is a problem. There are an infinite number of candidate lines. Which straight line to choose?

There are two intuitions that lead us to the best hyperplane :

1. Confidence in making the correct prediction: Without getting into much detail (as the point of this post is to see why are support vector machines called so and not what they are), this intuition is formalized by the functional margin. The functional margin of a hyperplane given by $wx+b = 0$ w.r.t a specific training example $(x^{(i)},y^{(i)})$ is defined as:

$\displaystyle \hat{\gamma}^{(i)} = y^{(i)}(w^T x^{(i)}+b)$

If $y^{(i)}=1$, for a large functional margin (greater confidence in correct prediction) we want $w^T x^{(i)}+b \gg 0$

If $y^{(i)}= -1$, for a large functional margin we want $w^T x^{(i)}+b \ll 0$.

The above captures our first intuition into a single formal statement that we would like the functional margin to be large.

2. Margin: Another intuition about choosing the best hyperplane is to choose one in which the distance from the training points is the maximum. This is formalized by the geometric margin. Without getting into the details of the derivation, the geometric margin is given by:

$\displaystyle\gamma^{(i)}= \frac{\hat{\gamma}^{(i)}}{\begin{Vmatrix}w\end{Vmatrix}}$

Which is simply the functional margin normalized. So these intuitions leads to the maximum margin classifier which is a precursor to the SVM.

Choose $\gamma$, $w$, $b$ so as to maximize the geometric margin

$max_{\gamma, w, b} \displaystyle \gamma$

subject to the condition that $y^{(i)}(w^{T}x^{(i)}+b) > \gamma$ and $\begin{Vmatrix}w\end{Vmatrix}=1$.

Working on the above optimization problem and trying to formulate it as a convex optimization problem leads to Support Vector Machines.

Also, the data I considered was linearly separable. We could easily extend the idea to non-separable data. For the general case, the dual of the support vector machines (for non-separable data) is given as:

$\displaystyle max L_{D} = \sum_i \alpha_i - \frac{1}{2}\sum_{i,j}\alpha_i\alpha_j y_i y_j \langle x_i \centerdot x_j\rangle$

subject to:

$\displaystyle 0\leq \alpha_i \leq C$,

$\displaystyle\sum_i \alpha_i y_i = 0$

The solution is given by:

$\displaystyle w = \sum_{i=1}^{N_S}\alpha_i y _i x_i \qquad(1)$

where ${N}_S$ is the number of support vectors and the $\alpha$s represent the Lagrangian multipliers.

[I have not made the above diagram by myself. I had taken it quite a while back and don’t remember from where. In case you do know, please do point out]

Support Vectors?

Now that a brief introduction is done, let’s come back to the main point.

In the above diagram we see that the thinner lines mark the distance from the classifier to the closest data points called the support vectors (darkened data points). The distance between the two thin lines is called the margin. The Support Vectors constrain the width of the margin. And since they are very less as compared to the total number of data points, they hand us many advantages but let us not get into that.

The question is, why are these points called Support Vectors at all? To understand  this, consider a mechanical analogy. Consider that the data is in $\mathcal{R}^2$ and suppose the $i^{ith}$ support vector exerts a force of $\displaystyle F_i = \alpha_i y_i \hat{w}$ on a stiff sheet lying along the decision surface. $\hat{w}$ represents the unit vector in the direction $w$.

The solution $\qquad (1)$ then satisfies the conditions for mechanical equilibrium:

$\displaystyle \sum Forces = \sum_i \alpha_i y_i \hat{w} = 0$

$\displaystyle \sum Torques = \sum_i s_i \star (\alpha_i y_i \hat{w}) = o$

$s_i$ are the support vectors and $\star$ denotes the vector product.

This mechanical analogy emphasizes two important points:

1. The most important data points are the support vectors as they have the maximum values of $\alpha$. These points exert the maximum force on the decision sheet. This force exerted is constrained to be below C for any point in the non-separable case.

2. Since the torque exerted by the support vectors comes out to be zero, we can say that these specific data points are “supporting” the hyperplane into “equilibrium”.

I think this explains how the term “Support Vectors” originates.

_____

Onionesque Reality Home >>

## Demystifying Support Vector Machines for Beginners: A Reading List

Though I have worked on pattern recognition in the past I have always wanted to work with Neural Networks for the same. However for some reason or the other I could never do so, I could not even take it as an elective subject due to some constraints. Over the last two years or so I have been promising myself and ordering myself to stick to a schedule and study ANNs properly, however due to a combination of procrastination, over-work and bad planning I have never been able to do anything with them.

However I have now got the opportunity to work with Support Vector Machines and over the past some time I have been reading extensively on the same and have been trying to get playing with them. Now that the actual implementation and work is set to start I am pretty excited to work with them. It is nice that I get to work with SVMs though I could not with ANNs.

Support Vector Machine is a classifier derived from statistical learning theory by Vladimir Vapnik and his co-workers. The foundations for the same were laid by him as late as the 1970s SVM shot to prominence when using pixel maps as input it gave an accuracy comparable with sophisticated Neural Networks with elaborate features in a handwriting recognition task.

Traditionally Neural Networks based approaches have suffered some serious drawbacks, especially with generalization, producing models that can overfit the data. SVMs embodies the structural risk minimization principle that is shown superior to the empirical risk minimization that neural networks use. This difference gives SVMs the greater ability to generalize.

However learning how to work with SVMs can be challenging and somewhat intimidating at first. When i started reading on the topic I took the books by Vapnik on the subject but could not make much head or tail. I could only attain a certain degree of understanding, nothing more. To specialize in something I do well when I start off as a generalist, having a good and quite correct idea of what is exactly going on. Knowing in general what is to be done and what is what, after this initial know-how makes me comfortable I reach the stage of starting with the mathematics which gives profound understanding as anything without mathematics is meaningless. However most books that I came across missed the first point for me, and it was very difficult to make a headstart. There was a book which I could read in two days that helped me get that general picture quite well. I would highly recommend it for most who are in the process of starting with SVMs.The book is titled Support Vector Machines and other Kernel Based Learning methods and is authored by Nello Cristianini and John-Shawe Taylor.

I would highly recommend people who are starting with Support Vector Machines to buy this book. It can  be obtained easily over Amazon.

This book has very less of a Mathematical treatment but it makes clear the ideas involved and this introduces a person studying from it to think more clearly before he/she can refine his/her understanding by reading something heavier mathematically. Another that I would highly recommend is the book Support Vector Machines for Pattern Classification by Shigeo Abe.

Another book that I highly recommend is Learning with Kernels by Bernhard Scholkopf and Alexander Smola. Perfect book for beginners.

Only after one has covered the required stuff from here that I would suggest Vapnik’s books which then would work wonderfully well.

Other than the books there are a number of Video Lectures and tutorials on the Internet that can work as well!

Below is a listing of a large number of good tutorials on the topic. I don’t intend to flood a person interested in starting with too much information, where ever possible i have described what the document carries so that one could decide what should suffice for him/her on the basis of need. Also I have star-marked some of the posts. This marks the ones that i have seen and studied from personally and found them most helpful and i am sure they would work the same way with both beginners and people with reasonable experience alike.

Webcasts/ Video Lectures on Learning Theory, Support Vector Machines and related ideas:

EDIT: For those interested. I had posted about a course on Machine Learning that has been provided by Stanford university. It too is suited for an introduction to Support Vector Machines. Please find the post here. Also this comment might be helpful, suggestions to it according to your learning journey are also welcome.

1. *Machine Learning Workshop, University of California at Berkeley. This series covers most of the basics required. Beginners can skip the sessions on Bayesian models and Manifold Learning.

Workshop Outline:

Session 1: Classification.

Session 2: Regression.

Session 3: Feature Selection

Session 4: Diagnostics

Session 5: Clustering

Session 6: Graphical Models

Session 7: Linear Dimensionality Reduction

Session 8: Manifold Learning and Visualization

Session 9: Structured Classification

Session 10: Reinforcement Learning

Session 11: Non-Parametric Bayesian Models

2. Washington University. Beginners might be interested on the sole talk on the topic of Supervised Learning for Computer Vision Applications or maybe in the talk on Dimensionality Reduction.

3. Reinforcement Learning, Universitat Freiburg.

4. Deep Learning Workshop. Good talks, But I’d say these are meant for only the highly interested.

5. *Introduction to Learning Theory, Olivier Bousquet.

This tutorial focuses on the “larger picture” than on mathematical proofs, it is not restricted to statistical learning theory however. The course comprises of five lectures and is quite good to watch. The Frenchman is both smart and fun!

6. *Statistical Learning Theory, Olivier Bousquet. This course gives a detailed introduction to Learning Theory with a focus on the Classification problem.

Course Outline:

Probabilistic and Concentration inequalities, Union Bounds, Chaining, Measuring the size of a function class, Vapnik Chervonenkis Dimension, Shattering Dimensions and Rademacher averages, Classification with real valued functions.

7. *Statistical Learning Theory, Olivier Bousquet. This is not the repeat of the above course. This one is a more recent lecture series than the above actually. This course has six lectures. Another excellent set.

Course Outline:

Learning Theory: Foundations and Goals

Learning Bounds: Ingredients and Results

Implications: What to conclude from bounds

7. Advanced Statistical Learning Theory, Olivier Bousquet. This set of lectures compliment the above courses on statistical learning theory and give a more detailed exposition of the current advancements in the same.This course has three lectures.

Course Outline:

PAC Bayesian bounds: a simple derivation, comparison with Rademacher averages, Local Rademacher complexity with classification loss, Talagrand’s inequality. Tsybakov noise conditions, Properties of loss functions for classification (influence on approximation and estimation, relationship with noise conditions), Applications to SVM – Estimation and approximation properties, role of eigenvalues of the Gram matrix.

8. *Statistical Learning Theory, John-Shawe Taylor, University of London. One plus point of this course is that is has some good English. Don’t miss this lecture as it has been given by the same professor whose book we just discussed.

9. *Learning with Kernels, Bernhard Scholkopf.

This course covers the basics for Support Vector Machines and related Kernel methods. This course has six lectures.

Course Outline:

Kernel and Feature Spaces, Large Margin Classification, Basic Ideas of Learning Theory, Support Vector Machines, Other Kernel Algorithms.

10. Kernel Methods, Alexander Smola, Australian National University.  This is an advanced course as compared to the above and covers exponential families, density estimation, and conditional estimators such as Gaussian Process classification, regression, and conditional random fields, Moment matching techniques in Hilbert space that can be used to design two-sample tests and independence tests in statistics.

11. *Introduction to Kernel Methods, Bernhard Scholkopf, There are four parts to this course.

Course Outline:

Kernels and Feature Space, Large Margin Classification, Basic Ideas of Learning Theory, Support Vector Machines, Examples of Other Kernel Algorithms.

12. Introduction to Kernel Methods, Partha Niyogi.

13. Introduction to Kernel Methods, Mikhail Belkin, Ohio State University.This lecture is second in part to the above.

14. *Kernel Methods in Statistical Learning, John-Shawe Taylor.

15. *Support Vector Machines, Chih-Jen Lin, National Taiwan University. Easily one of the best talks on SVM. Almost like a run-down tutorial.

Course Outline:

Basic concepts for Support Vector Machines, training and optimization procedures of SVM, Classification and SVM regression.

16. *Kernel Methods and Support Vector Machines, Alexander Smola. A comprehensive six lecture course.

Course Outline:

Introduction of the main ideas of statistical learning theory, Support Vector Machines, Kernel Feature Spaces, An overview of the applications of Kernel Methods.

1. Basics of Probability and Statistics for Machine Learning, Mikaela Keller.

This course covers most of the basics that would be required for the above courses. However sometimes the shooting quality is a little shady. This talk seems to be the most popular on the video lectures site, one major reason in my opinion is that the lady delivering the lecture is quite pretty!

2. Some Mathematical Tools for Machine Learning, Chris Burges.

3. Machine Learning Laboratory, S.V.N Vishwanathan.

4. Machine Learning Laboratory, Chrisfried Webers.

Introductory Tutorials (PDF/PS):

3. *Support Vector Machines- Hype or Hallelujah (K. P. Bennett, RPI). Click Here >>

4. Support Vector Machines and Pattern Recognition (Georgia Tech). Click Here >>

5. An Introduction to Support Vector Machines in Data Mining (Georgia Tech). Click Here >>

8. *A Practical Guide to Support Vector Classification (Hsu, Chang, Lin, Via U-Michigan Ann Arbor). Click Here >>

9. *A Tutorial on Support Vector Machines for Pattern Recognition (Christopher J.C Burges, Bell Labs Lucent Technologies, Data mining and knowledge Discovery). Click Here >>

10. Support Vector Clustering (Hur, Horn, Siegelmann, Journal of Machine Learning Research. Via MIT). Click Here >>

11. *What is a Support Vector Machine (Noble, MIT). Click Here >>

12. Notes on PCA, Regularization, Sparisty and Support Vector Machines (Poggio, Girosi, MIT Dept of Brain and Cognitive Sciences). Click Here >>

13. *CS 229 Lecture Notes on Support Vector Machines (Andrew Ng, Stanford University). Click Here >>

Introductory Slides (mostly lecture slides):

1. Support Vector Machines in Machine Learning (Arizona State University). Click here >>

Lecture Outline:

What is Machine Learning, Solving the Quadratic Programs, Three very different approaches, Comparison on medium and large sets.

Lecture Outline:

The Learning Problem, What do we know about test data, The capacity of a classifier, Shattering, The Hyperplane Classifier, The Kernel Trick, Quadratic Programming.

3. Support Vector Machines, Linear Case (Jieping Ye, Arizona State University). Click Here >>

Lecture Outline:

Linear Classifiers, Maximum Margin Classifier, SVM for Separable data, SVM for non-separable data.

4. Support Vector Machines, Non Linear Case (Jieping Ye, Arizona State University). Click Here >>

Lecture Outline:

Non Linear SVM using basis functions, Non-Linear SVMs using Kernels, SVMs for Multi-class Classification, SVM path, SVM for unbalanced data.

5. Support Vector Machines (Sue Ann Hong, Carnegie Mellon). Click Here >>

6. Support Vector Machines (Carnegie Mellon University Machine Learning 10701/15781). Click Here >>

9. Support Vector Machines (Via U-Maryland at College Park). Click Here >>

10. Support Vector Machines: Algorithms and Applications (MIT OCW). Click Here >>

Papers/Notes on some basic related ideas (No estoric research papers here):

1. Robust Feature Induction for Support Vector Machines (Arizona State University). Click Here >>

3. *Training Data Set for Support Vector Machines (Brown University). Click Here >>

4. Support Vector Machines are Universally Consistent (Journal Of Complexity). Click Here >>

5. Feature Selection for Classification of Variable Length Multi-Attribute Motions (Li, Khan, Prabhakaran). Click Here >>

6. Selecting Data for Fast Support Vector Machine Training (Wang, Neskovic, Cooper). Click Here >>

8. The Support Vector Decomposition Machine (Periera, Gordon, Carnegie Mellon). Click Here >>

10. Supervised Clustering with Support Vector Machines (Finley, Joachims, Cornell University). Click Here >>

11. Metric Learning: A Support Vector Approach (Cornell University). Click Here >>

12. Training Linear SVMs in Linear Time (Joachims, Cornell Unversity). Click Here >>

13. *Rule Extraction from Linear Support Vector Machines (Fung, Sandilya, Rao, Siemens Medical Solutions). Click Here >>

14. Support Vector Machines, Reproducing Kernel Hilbert Spaces and Randomizeed GACV (Wahba, University of Wisconsian at Madison). Click Here >>

15. The Mathematics of Learning: Dealing with Data (Poggio, Girosi, AI Lab, MIT). Click Here >>

16. Training Invariant Support Vector Machines (Decoste, Scholkopf, Machine Learning). Click Here >>

*As I have already mentioned above, the star marked courses/lectures/tutorials/papers are the ones that I have seen and studied from personally (and hence can vouch for) and these in my opinion should work best for beginners.

Onionesque Reality Home >>