I am often asked by some interested friends : What ARE actually support vector machines? What kind of machines? It is obvious that such questions are put up by people who are not initiated to the idea of SVMs or even learning machines so to say.
But I have also noticed that a lot of people working in the ML/AI/Pattern Recognition domain don’t have a clear idea of why Support Vector Machines are called so. Understanding why Artificial Neural Networks are named so is not so difficult, but Support Vector Machines does sound somewhat abstruse. This is exactly what this post aims to address.
I’d be giving a short introduction first. For those who simply wish to get to the point, I would recommend them to skip to this point.
Let’s parse SVM into the three constituent words and attack them one by one – Support Vector Machines. Let’s look at Machines first.
Learning Machines: Clearly this part is meant for the former category in the two I mentioned in the above introductory paragraph i.e for the uninitiated.
Let me start with a basic definition that I think is necessary to be put for the sake of mere completeness. I hope this does not disgust the more expert reader. ;-)
Way back in 1959, Arthur Samuel defined Machine Learning as the field of study that gave computers the ability to learn without being explicitly programmed. Samuel was a pioneer in the area as he is widely credited to have made the first self-learning program that played checkers.
Tom Mitchell in Machine Learning puts the definition more formally as: A learning program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.
Given a training set, we feed it into a learning algorithm (like SVM, Artificial Neural Nets, Logistic Regression, Linear Regression etc). The learning algorithm then outputs a function, which for historical reasons is called the hypothesis and denoted by .
So, basically the hypothesis’ job is to take a new input and give out a estimated output or class. The parameters that define the hypothesis are what are “learned” by using the training set. So the term learning machine is used in this sense.
To sum up: The hypothesis can be thought of as a machine that gives a prediction y on some unseen input x. The parameters of the hypothesis are learned (for a toy example see the part on pigeons in this post).
Ofcourse the above is defined in a supervised context but can be easily generalized.
Support Vector Machines
An Introduction: I would prefer to give an extremely short introduction to SVMs first before coming to the origin of the term “Support Vectors”.
Consider the image below. How would you classify this data?
This set looks linearly separable. That means we could draw a straight line to separate the two classes (indicated by the two different colors). Also note that these data points lie in a two dimensional space, so we talk of a straight line. We could easily graduate to higher dimensions, as an example in a 3-D space we would have spoken of constructing a plane to separate the points and a hyperplane in a dimensional space.
Coming back, though we can draw a straight line to separate these two classes, there is a problem. There are an infinite number of candidate lines. Which straight line to choose?
There are two intuitions that lead us to the best hyperplane :
1. Confidence in making the correct prediction: Without getting into much detail (as the point of this post is to see why are support vector machines called so and not what they are), this intuition is formalized by the functional margin. The functional margin of a hyperplane given by w.r.t a specific training example is defined as:
If , for a large functional margin (greater confidence in correct prediction) we want
If , for a large functional margin we want .
The above captures our first intuition into a single formal statement that we would like the functional margin to be large.
2. Margin: Another intuition about choosing the best hyperplane is to choose one in which the distance from the training points is the maximum. This is formalized by the geometric margin. Without getting into the details of the derivation, the geometric margin is given by:
Which is simply the functional margin normalized. So these intuitions leads to the maximum margin classifier which is a precursor to the SVM.
To sum up: To realize these intuitions and get the best hyperplane, the optimization problem is:
Choose , , so as to maximize the geometric margin
subject to the condition that and .
Working on the above optimization problem and trying to formulate it as a convex optimization problem leads to Support Vector Machines.
Also, the data I considered was linearly separable. We could easily extend the idea to non-separable data. For the general case, the dual of the support vector machines (for non-separable data) is given as:
The solution is given by:
where is the number of support vectors and the s represent the Lagrangian multipliers.
[I have not made the above diagram by myself. I had taken it quite a while back and don't remember from where. In case you do know, please do point out]
Now that a brief introduction is done, let’s come back to the main point.
In the above diagram we see that the thinner lines mark the distance from the classifier to the closest data points called the support vectors (darkened data points). The distance between the two thin lines is called the margin. The Support Vectors constrain the width of the margin. And since they are very less as compared to the total number of data points, they hand us many advantages but let us not get into that.
The question is, why are these points called Support Vectors at all? To understand this, consider a mechanical analogy. Consider that the data is in and suppose the support vector exerts a force of on a stiff sheet lying along the decision surface. represents the unit vector in the direction .
The solution then satisfies the conditions for mechanical equilibrium:
are the support vectors and denotes the vector product.
This mechanical analogy emphasizes two important points:
1. The most important data points are the support vectors as they have the maximum values of . These points exert the maximum force on the decision sheet. This force exerted is constrained to be below C for any point in the non-separable case.
2. Since the torque exerted by the support vectors comes out to be zero, we can say that these specific data points are “supporting” the hyperplane into “equilibrium”.
I think this explains how the term “Support Vectors” originates.