
Archive for March, 2009

Recently I attended a series of 6 talks each by Jiri Navratil of the IBM Thomas J. Watson research center and Frédéric Bimbot of IRISA. Both are among the best researchers I have met: very helpful and extremely humble. In one of Jiri's opening talks he mentioned an adage from the speech/language/speaker recognition literature that interested me. It applies not only to speech processing research but to recognition problems in general.

[Image: Jiri Navratil speaking. Photo taken by me with his permission.]

It went like this:

It is easier to reject imposters than it is to accept true speakers.

People's voices are distinctive. That is, a person's speech exhibits distinctive characteristics that indicate the identity of the speaker. We are all familiar with this, and we all use it in our everyday lives to help us interact with others. Of course, from time to time we might notice that a person sounds very much like another person we know, or we might even momentarily mistake one person for another because of the sound of their voice. But this similarity between the voices of different individuals is not what the technical challenge in speaker recognition is all about.

The challenge in speaker recognition is variance, not similarity. That is, the challenge is to decode a highly variable speech signal into the characteristics that indicate the speaker's identity. These variations are formidable and myriad. The principal cause of variance is the speaker.

An explanation for why the speaker's variability is such a vexing problem is that speech, unlike fingerprints or handprints or retinal patterns, is to a very large degree a result of what the person "does" rather than "who the person is": speech is a "performing art", and each performance is unique.

The above are excerpts from G. R. Doddington et al., "The NIST Speaker Recognition Evaluation – Overview, Methodology, Systems, Results, Perspective", Speech Communication, vol. 31, pp. 225–254, 2000.

Over the course of their talks, Dr Navratil spoke mainly on Acoustics and Phonotactics in Language Identification, while Dr Bimbot spoke on Gaussian Mixture Models and Universal Background Models.

_____

Quick Links:

1. Official Webpage of Jiri Navratil

2. Official Webpage of Frédéric Bimbot

_____

Onionesque Reality Home >>



I am often asked by interested friends: what actually ARE support vector machines? What kind of machines? It is obvious that such questions come from people who have not been initiated into the idea of SVMs, or of learning machines in general.

But I have also noticed that a lot of people working in the ML/AI/Pattern Recognition domain don't have a clear idea of why Support Vector Machines are called that. Understanding why Artificial Neural Networks are named as they are is not difficult, but "Support Vector Machines" does sound somewhat abstruse. This is exactly what this post aims to address.

I'll give a short introduction first. Those who simply wish to get to the point can skip ahead to this point.

Let’s parse SVM into the three constituent words and attack them one by one – Support Vector Machines. Let’s look at Machines first.

_____

Learning Machines: Clearly this part is meant for the first of the two categories I mentioned in the introductory paragraph above, i.e. for the uninitiated.

Let me start with a basic definition that I think needs to be stated for the sake of mere completeness. I hope this does not disgust the more expert reader. ;-)

Machine Learning

Way back in 1959, Arthur Samuel defined Machine Learning as the field of study that gives computers the ability to learn without being explicitly programmed. Samuel was a pioneer in the area and is widely credited with having written the first self-learning program, which played checkers.

Tom Mitchell in Machine Learning puts the definition more formally as: A learning program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.

Given a training set, we feed it into a learning algorithm (such as an SVM, Artificial Neural Nets, Logistic Regression, Linear Regression, etc.). The learning algorithm then outputs a function, which for historical reasons is called the hypothesis and is denoted by h.

[Diagram: training set → learning algorithm → hypothesis h, which maps a new input x to a predicted output y]

So, basically, the hypothesis' job is to take a new input and give out an estimated output or class. The parameters that define the hypothesis are what are "learned" by using the training set. The term learning machine is used in this sense.

To sum up: The hypothesis can be thought of as a machine that gives a prediction y on some unseen input x. The parameters of the hypothesis are learned (for a toy example see the part on pigeons in this post).

Of course, the above is defined in a supervised context, but it can easily be generalized.
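To make this concrete, here is a minimal sketch of the training-set-in, hypothesis-out round trip in Python, assuming scikit-learn is installed; the toy data and the choice of SVC as the learning algorithm are just placeholders for illustration.

```python
# A minimal sketch of the "learning machine" idea, assuming scikit-learn is available.
# The data is a made-up toy set; any learning algorithm could stand in for SVC here.
import numpy as np
from sklearn.svm import SVC

# Toy training set: inputs x^(i) in R^2 with labels y^(i) in {-1, +1}
X_train = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.5],
                    [6.0, 1.0], [7.0, 2.0], [8.0, 0.5]])
y_train = np.array([1, 1, 1, -1, -1, -1])

# The learning algorithm outputs the hypothesis h (here, a fitted classifier object)
h = SVC(kernel='linear').fit(X_train, y_train)

# The hypothesis maps a new, unseen input x to a predicted class y
x_new = np.array([[4.0, 2.0]])
print(h.predict(x_new))   # e.g. [1] or [-1], depending on the learned parameters
```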

_____

Support Vector Machines

An Introduction: I would prefer to give an extremely short introduction to SVMs first before coming to the origin of the term “Support Vectors”.

Consider the image below. How would you classify this data?

[Image: a scatter of data points belonging to two classes, shown in two colors]

This set looks linearly separable. That means we could draw a straight line to separate the two classes (indicated by the two different colors). Also note that these data points lie in a two-dimensional space, so we speak of a straight line. We could easily graduate to higher dimensions: in a 3-D space, for example, we would speak of constructing a plane to separate the points, and of a hyperplane in an \mathcal{N}-dimensional space.

Coming back, though we can draw a straight line to separate these two classes, there is a problem. There are an infinite number of candidate lines. Which straight line to choose?

[Image: the same data with many possible separating straight lines drawn through it]

There are two intuitions that lead us to the best hyperplane:

1. Confidence in making the correct prediction: Without getting into much detail (as the point of this post is to see why support vector machines are called so, not what they are), this intuition is formalized by the functional margin. The functional margin of a hyperplane given by w^T x+b = 0 w.r.t. a specific training example (x^{(i)},y^{(i)}) is defined as:

\displaystyle \hat{\gamma}^{(i)} = y^{(i)}(w^T x^{(i)}+b)

If y^{(i)}=1, for a large functional margin (greater confidence in correct prediction) we want w^T x^{(i)}+b \gg 0

If y^{(i)}= -1, for a large functional margin we want w^T x^{(i)}+b \ll 0.

The above captures our first intuition in a single formal statement: we would like the functional margin to be large.

2. Margin: Another intuition about choosing the best hyperplane is to choose the one whose distance from the training points is maximum. This is formalized by the geometric margin. Without getting into the details of the derivation, the geometric margin is given by:

\displaystyle\gamma^{(i)}= \frac{\hat{\gamma}^{(i)}}{\lVert w\rVert}

This is simply the functional margin normalized by \lVert w\rVert. These intuitions lead to the maximum margin classifier, which is a precursor to the SVM. A small numeric check of both margins is sketched below.
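Here is that check in Python, assuming numpy is available; the hyperplane (w, b) and the training example are made up purely for illustration.

```python
# A small numeric check of the functional and geometric margins for one training
# example, using an assumed hyperplane (w, b) and a made-up point (x_i, y_i).
import numpy as np

w = np.array([2.0, -1.0])        # assumed hyperplane normal
b = 0.5                          # assumed intercept
x_i = np.array([1.5, -0.5])      # hypothetical training example
y_i = 1                          # its label, in {-1, +1}

functional_margin = y_i * (np.dot(w, x_i) + b)             # \hat{\gamma}^{(i)}
geometric_margin = functional_margin / np.linalg.norm(w)   # \gamma^{(i)}

print(functional_margin, geometric_margin)
# The geometric margin is just the functional margin divided by ||w||,
# so it does not change if w and b are rescaled by the same factor.
```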

To sum up: To realize these intuitions and get the best hyperplane, the optimization problem is:

Choose \gamma, w, b so as to maximize the geometric margin

\displaystyle \max_{\gamma, w, b} \; \gamma

subject to y^{(i)}(w^{T}x^{(i)}+b) \geq \gamma for every training example, and \lVert w\rVert=1.

Working on the above optimization problem and trying to formulate it as a convex optimization problem leads to Support Vector Machines.
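For the curious, the standard convex reformulation takes the form: minimize \frac{1}{2}\lVert w\rVert^2 subject to y^{(i)}(w^T x^{(i)}+b) \geq 1 for all i. Here is a sketch of solving it on made-up separable data with cvxpy (assumed installed); this is only an illustration, not a production SVM solver.

```python
# A sketch of the convex reformulation of the max-margin problem,
#   minimize (1/2)||w||^2   subject to   y_i (w^T x_i + b) >= 1 for all i,
# solved with cvxpy on made-up linearly separable data.
import numpy as np
import cvxpy as cp

X = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.5],
              [6.0, 1.0], [7.0, 2.0], [8.0, 0.5]])
y = np.array([1, 1, 1, -1, -1, -1])

w = cp.Variable(2)
b = cp.Variable()
constraints = [cp.multiply(y, X @ w + b) >= 1]
problem = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)), constraints)
problem.solve()

print(w.value, b.value)   # parameters of the maximum margin hyperplane
```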

Also, the data I considered was linearly separable. The idea extends easily to data that is not separable. For the general (non-separable) case, the dual of the support vector machine is given as:

\displaystyle \max_{\alpha} L_{D} = \sum_i \alpha_i - \frac{1}{2}\sum_{i,j}\alpha_i\alpha_j y_i y_j \langle x_i, x_j\rangle

subject to:

\displaystyle 0\leq \alpha_i \leq C,

\displaystyle\sum_i \alpha_i y_i = 0

The solution is given by:

\displaystyle w = \sum_{i=1}^{N_S}\alpha_i y_i x_i \qquad (1)

where N_S is the number of support vectors and the \alpha_i are the Lagrange multipliers.
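Equation (1) can be checked numerically: for a linear-kernel SVM, the weight vector should equal a weighted sum over the support vectors alone. Below is a sketch using scikit-learn's SVC (assumed installed), whose dual_coef_ attribute stores the products \alpha_i y_i for the support vectors; the data is the same made-up toy set as before.

```python
# A sketch illustrating equation (1): the weight vector of a linear SVM is a
# weighted sum of the support vectors alone. dual_coef_ holds alpha_i * y_i.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.5],
              [6.0, 1.0], [7.0, 2.0], [8.0, 0.5]])
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel='linear', C=10.0).fit(X, y)

# w = sum over support vectors of (alpha_i * y_i) * x_i
w_from_svs = clf.dual_coef_ @ clf.support_vectors_

print(w_from_svs)   # matches clf.coef_, the hyperplane normal
print(clf.coef_)
```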

[Diagram: the optimal margin classifier, showing the separating hyperplane, the margin and the support vectors (darkened points)]

[I did not make the above diagram myself. I took it quite a while back and don't remember where from. If you happen to know the source, please do point it out.]

Support Vectors?

Now that a brief introduction is done, let’s come back to the main point.

In the above diagram, the thinner lines mark the distance from the classifier to the closest data points, which are called the support vectors (the darkened data points). The distance between the two thin lines is called the margin. The support vectors constrain the width of the margin, and since they are few compared to the total number of data points, they hand us many advantages; but let us not get into that here.

The question is, why are these points called Support Vectors at all? To understand this, consider a mechanical analogy. Suppose the data lies in \mathbb{R}^2 and that the i^{th} support vector exerts a force \displaystyle F_i = \alpha_i y_i \hat{w} on a stiff sheet lying along the decision surface, where \hat{w} is the unit vector in the direction of w.

The solution (1) then satisfies the conditions for mechanical equilibrium:

\displaystyle \sum \text{Forces} = \sum_i \alpha_i y_i \hat{w} = 0

\displaystyle \sum \text{Torques} = \sum_i s_i \star (\alpha_i y_i \hat{w}) = 0

Here the s_i are the support vectors and \star denotes the vector (cross) product.
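The force-balance condition can also be checked numerically: since \hat{w} is common to all terms, \sum_i \alpha_i y_i \hat{w} = 0 reduces to \sum_i \alpha_i y_i = 0, which is exactly one of the constraints of the dual problem above. A quick sketch, again with scikit-learn's SVC on the same made-up data:

```python
# A numeric check of the force balance: dual_coef_ stores alpha_i * y_i for the
# support vectors, so their sum should be (numerically) zero, matching the
# dual constraint sum_i alpha_i y_i = 0.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.5],
              [6.0, 1.0], [7.0, 2.0], [8.0, 0.5]])
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel='linear', C=10.0).fit(X, y)

print(np.sum(clf.dual_coef_))   # approximately 0: the "forces" cancel out
```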

This mechanical analogy emphasizes two important points:

1. The most important data points are the support vectors, as they have the largest values of \alpha. These points exert the largest force on the decision sheet; in the non-separable case, the force exerted by any point is constrained to be at most C.

2. Since the net torque exerted by the support vectors comes out to be zero, we can say that these specific data points are "supporting" the hyperplane in "equilibrium".

I think this explains how the term “Support Vectors” originates.

_____

Onionesque Reality Home >>
