Support Vector Machines for Machine Learning

Before we start understanding SVM in this article, let me clarify that there are already multiple blogs and videos about SVM available on the internet. My intention here is not to repeat the same details. Instead, this article can be a trigger point for understanding what more you need to learn about SVM.

Support Vector Machines are perhaps one of the most popular and talked about machine learning algorithms.

After reading this article I hope you will know:

·      How to untangle the jargon used to refer to SVM.

·      How an SVM model can be used to make predictions for new data.

·      How and when to use an SVM model.

·      Where you might look to get more information on SVM.

SVM is an exciting algorithm and the concepts are relatively simple.

This article is for developers with little or no background in statistics, calculus and linear algebra.

Let’s go through this article point by point…

What is Support Vector Machine?

By definition, a Support Vector Machine (SVM) is a discriminative classifier formally defined by a separating hyperplane. In other words, given labeled training data (supervised learning), the algorithm outputs an optimal hyperplane which classifies new data. In two-dimensional space this hyperplane is a line dividing the plane into two parts, with each class lying on either side; in multi-dimensional (n-dimensional) space the hyperplane is an (n-1)-dimensional plane that divides the data into classes.

Did you get anything out of this definition? Don’t worry…..

We will demystify this in “How Support Vector Machine works”

But for now, we should at least know that SVM is a supervised classification algorithm, which can also be used for regression analysis.

For simplicity, read the below definition:

Support Vector Machine (SVM) is primarily a classifier method that performs classification tasks by constructing hyperplanes in a multidimensional space that separate cases of different class labels. SVM supports both regression and classification tasks and can handle multiple continuous and categorical variables. For categorical variables, a dummy variable is created with case values of either 0 or 1. Thus, a categorical dependent variable consisting of three levels, say (A, B, C), is represented by a set of three dummy variables:

A: {1 0 0}, B: {0 1 0}, C: {0 0 1}
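
As a quick illustration of this dummy-variable (one-hot) encoding, here is a minimal sketch using pandas; the column name and the toy values are made up purely for the example:

```python
import pandas as pd

# A toy column with three categorical levels (illustrative data).
labels = pd.Series(["A", "B", "C", "A"], name="label")

# One-hot / dummy encoding: each level becomes its own 0/1 column,
# matching A: {1 0 0}, B: {0 1 0}, C: {0 0 1} above.
dummies = pd.get_dummies(labels).astype(int)
print(dummies)
#    A  B  C
# 0  1  0  0
# 1  0  1  0
# 2  0  0  1
# 3  1  0  0
```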

Hope this is a bit clearer now… Let’s check the other details one by one now…

What are some use cases for SVM?

SVM is often a good first choice for a classification task.

The reason: SVM is one of the most robust and accurate classification algorithms available.

The aim of using SVM is to correctly classify unseen data. SVMs have a number of applications in several fields.

Some common applications of SVM are-

·      Face detection – SVMs classify parts of an image as face or non-face and create a square boundary around the face. However, neural networks are mostly used here, as they outperform SVMs for this task.

·      Text and hypertext categorization – SVMs allow text and hypertext categorization for both inductive and transductive models. They use training data to classify documents into different categories. Categorization is done on the basis of a generated score, which is then compared with a threshold value.

·      Classification of images – SVMs provide better search accuracy for image classification, outperforming traditional query-based searching techniques.

·      Bioinformatics – This includes protein classification and cancer classification. We use SVM to classify genes, to classify patients on the basis of their genes, and for other biological problems.

·      Protein fold and remote homology detection – SVM algorithms are applied for protein remote homology detection.

·      Handwriting recognition – SVMs are widely used to recognize handwritten characters.

·      Generalized predictive control (GPC) – SVM-based GPC is used to control chaotic dynamics with useful parameters.

When do we use a Support Vector Machine for classification?

SVM is one of the best classifiers, but not always the best. In fact, no single classifier is the best for every problem; which classifier is suitable depends on the problem.

As far as SVM is concerned, it is a suitable classifier in the following cases:

1) When the number of features (variables) and the number of training instances are not very large (i.e., not millions of features and millions of instances). When the data is too big, neural networks outperform SVM; likewise, for tasks such as image classification, neural networks outperform SVM.

2) When sparsity in the problem is very high, i.e., most of the features have zero values.

3) It is among the best choices for document classification problems, where sparsity is high and the numbers of features and instances are also very high (see the sketch after this list).

4) It also performs very well for problems like gene classification and drug disambiguation, where the number of features is not very high.
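
To make point 3 concrete, here is a minimal sketch (assuming scikit-learn is available; the tiny corpus and labels are made up purely for illustration) of a sparse, high-dimensional document classification task handled by a linear SVM:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# A made-up toy corpus: real document-classification datasets have
# thousands of documents and tens of thousands of sparse features.
texts = [
    "win money now cheap offer",
    "meeting agenda attached for monday",
    "claim your free prize today",
    "quarterly project status report",
]
labels = ["spam", "ham", "spam", "ham"]

# TF-IDF yields a very sparse feature matrix -- the setting where
# linear SVMs are known to perform well.
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(texts, labels)

print(model.predict(["free money offer"]))  # expected: ['spam']
```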

How does SVM compare to other ML algorithms?

It is one of the best classification algorithms, for the following reasons:

1) It uses the kernel trick (I will talk about this later).

2) It is an optimal-margin-based classification technique in machine learning.

3) A good number of algorithms have been proposed that exploit problem structure and other refinements, such as problem shrinking during optimization.

But again, the selection of an algorithm depends entirely on what problem we are trying to solve, what type of dataset we are handling, and so on.

How does a Support Vector Machine work?

In SVM, we are trying to fit a hyperplane between the data points in the best possible way, so as to classify/separate the data into two or more classes. The hyperplane is fitted in such a way that we have the maximum gap (called the margin) on both sides of the plane from the nearest points (called support vectors) on each side.

Check the below diagram…

[Diagram: two classes separated by a hyperplane, with the margin on either side and the support vectors closest to the boundary]

This is why we need to understand that, in SVM, we are dealing with a constrained optimization problem when we try to classify the data.

Constrained optimization!!!!!  Confused???

Let’s try to simplify ….

Optimization means we have to maximize the margin, and the constraint means the dots can’t be on the road, i.e., inside the margin.

[Diagram: the margin shown as a “road” between the two classes, with no points allowed inside it]

This constrained optimization problem can be solved using the Lagrange multipliers technique, which involves quite a lot of maths. I will not cover the maths here, but now at least we know at a high level how SVM works.

There are two classes in the diagram, green and blue. The line in the middle is the decision boundary. The highlighted points are the support vectors. The distance between all of these points and the line is the maximum possible among all possible decision boundaries, thus making this the most optimal one.
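
Here is a minimal sketch of that picture in code (using scikit-learn’s SVC; the six data points are made up for illustration). Fitting a linear SVM gives us the maximum-margin decision boundary, and the fitted model exposes the support vectors directly:

```python
import numpy as np
from sklearn.svm import SVC

# Two small, linearly separable clusters (illustrative points).
X = np.array([[1, 2], [2, 3], [2, 1], [6, 5], [7, 7], [8, 6]])
y = np.array([0, 0, 0, 1, 1, 1])

# A linear SVM solves the constrained optimization described above:
# maximize the margin while keeping every point off the "road".
clf = SVC(kernel="linear", C=1e6)   # a very large C approximates a hard margin
clf.fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]             # decision boundary: w . x + b = 0
print("Support vectors:\n", clf.support_vectors_)  # the points that define the hyperplane
print("Margin width:", 2 / np.linalg.norm(w))      # the maximized gap between the classes
```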

So, now what?

Now, I have to tell you that what I just described above with the diagram is the simplest form of SVM classification: it is linear classification, since the data points are linearly separable. So, when we need to classify nonlinear data, SVM classification becomes a little more complex, which we will work through slowly…

Let’s go step by step through each of the jargon terms of SVM…

 What are Support Vectors?

By definition, the vectors (cases) that define the hyperplane are the support vectors. In practice, these are the points nearest to the hyperplane, as shown in the diagram.

 What is Hyperplane?

A hyperplane is a subspace whose dimension is one less than that of its ambient space. If a space is 3-dimensional then its hyperplanes are the 2-dimensional planes, while if the space is 2-dimensional, its hyperplanes are the 1-dimensional lines. By its nature, a hyperplane separates the space into two half-spaces.

Linear vs Nonlinear Classification?

At the most fundamental level, linear methods can only solve problems that are linearly separable (usually via a hyperplane). If you can solve your problem with a linear method, you're usually better off. However, if a linear method isn't working for your particular problem, the next step is to use a nonlinear method, which typically involves applying some type of transformation to your input dataset. After the transformation, many techniques then try to use a linear method for separation. Basically, if we have n-dimensional data that is not linearly separable, we transform it into a higher-dimensional space in which the data can be separated by a hyperplane.

A method is linear (very basically) if our classification threshold is linear (a line, a plane or a hyperplane). Otherwise, it is non linear. Obviously, linear methods involve only linear combinations of data, leading to easier implementations, etc.

Put simply, this means transforming the data features into a new space. Technically, non-linear methods transform the data to a new representational space (based on the kernel function) and then apply classification techniques. Most of the time, this transformation describes the data features in a clearer structure than the original space, so the classification algorithm can build a more accurate predictor in the new space. So, if a classification technique transforms the data features into a new space before applying the classifier, we usually call it a non-linear method.
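
A short sketch of this linear-vs-nonlinear distinction, assuming scikit-learn: on data shaped like concentric circles, no straight line can separate the classes, but an SVM with a nonlinear (RBF) kernel, which implicitly works in a transformed space, handles it easily:

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Concentric circles: a classic example of data no straight line can separate.
X, y = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for kernel in ("linear", "rbf"):
    acc = SVC(kernel=kernel).fit(X_train, y_train).score(X_test, y_test)
    print(f"{kernel} kernel accuracy: {acc:.2f}")
# Typically the linear kernel hovers near chance (~0.5),
# while the RBF kernel gets close to 1.0.
```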

What is the Kernel trick?

SVMs can efficiently perform a non-linear classification using what is called the kernel trick, implicitly mapping their inputs into high-dimensional feature spaces.

The original maximum-margin hyperplane algorithm proposed by Vapnik in 1963 constructed a linear classifier. However, in 1992, Bernhard E. Boser, Isabelle M. Guyon and Vladimir N. Vapnik suggested a way to create nonlinear classifiers by applying the kernel trick.

How about the above definition?? Too much, right?

Many of us who try to learn about SVM find it difficult to grasp the intuition behind kernels. I'll be honest: it took me a considerable amount of time going through lots of resources, but at the end of the day I got over the fence. So, let me try to help you do that too.

[Diagram: a hyperplane separating linearly separable data]

In the above diagram we have seen the hyperplane separating the data, and I should say that this is applicable only to linearly separable data. What if the data are not linearly separable (as shown in the diagrams below)? Here comes the concept of kernels. The idea is that data which isn’t linearly separable in n-dimensional space may be linearly separable in a higher-dimensional space. To understand how kernels work, some math would be necessary, but I will not cover the maths part here and will go through the concept only.

[Diagrams: data that is not linearly separable in its original space]

Let’s consider the decision boundary. A hyperplane separating the classes has weights (coefficients) given by a vector ‘W’. We try to maximize the distance (margin) from this hyperplane to the nearest points (the support vectors), and this becomes our constrained problem. We solve the following:

W(α) = Σᵢ αᵢ − ½ Σᵢ Σⱼ αᵢ αⱼ yᵢ yⱼ (xᵢ · xⱼ),  subject to αᵢ ≥ 0 and Σᵢ αᵢ yᵢ = 0, with the sums running over i, j = 1, …, l

Let’s understand the equation:

l ---> the number of data points in our training data

y ---> denote the outputs of the data points, which for our convenience, we express as +1 or -1.

X ---> the feature vector of each training example

α ---> the Lagrange multipliers (one per training point)


For our understanding, Lagrange multipliers are used to fold the constraints into a minimization or maximization problem, allowing us not to worry about them separately while reaching our solution.

While optimizing W(α) [the objective as a function of alpha], we see the term xᵢ·xⱼ (an inner product, x·xᵀ). That is to say, we do not need the exact data points themselves, but only their inner products, to compute our decision boundary.

How does this help?

What it implies is that if we want to transform our existing data into higher-dimensional data, which in many cases helps us classify better (see the diagram below), we need not compute the exact transformation of our data; we just need the inner products of our data in that higher-dimensional space. This works like a charm on datasets which aren’t linearly separable!

It is much easier to get the inner product in a higher dimensional space than the actual points in a higher dimensional space. 
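
Here is a tiny numeric sketch of that idea, using a degree-2 polynomial kernel on 2-D points as the example: the kernel value (x · z)² equals the inner product of the two points after an explicit map into 3-D, yet we never have to compute that map.

```python
import numpy as np

x = np.array([1.0, 2.0])
z = np.array([3.0, 4.0])

# Kernel trick: compute the higher-dimensional inner product directly in 2-D.
k = (x @ z) ** 2                      # degree-2 polynomial kernel

# The explicit 3-D feature map this kernel corresponds to:
# phi(v) = (v1^2, sqrt(2)*v1*v2, v2^2)
def phi(v):
    return np.array([v[0] ** 2, np.sqrt(2) * v[0] * v[1], v[1] ** 2])

print(k, phi(x) @ phi(z))   # 121.0 121.0 -- same value, without ever visiting 3-D
```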

[Diagram: 2-D data that is not linearly separable becomes separable after a 3-D transformation via the kernel function]

(In the above diagram you can see that data which was not linearly separable becomes separable in higher dimensions after using the kernel trick…)

Polynomial kernels of degree m implicitly map the data into a much higher-dimensional space, which can be effective for the solution. In this way, we get our solution from a higher-dimensional space without ever visiting it.

There is, of course, a lot more math and logic involved than what I’ve explained here, but I hope you get a decent perception about the kernel trick.

I will cover the complete maths of SVM in another article, where we will go through the loss function (what to minimize) and the objective function (what to optimize), but for now this is what SVM is all about.


More information on SVM:

Vladimir Vapnik, one of the inventors of the technique has two books that are considered influential on the topic. They are very mathematical.

·      The Nature of Statistical Learning Theory, Vapnik, 1995

·      Statistical Learning Theory, Vapnik, 1998

There are countless tutorials and journal articles on SVM on the internet.

Below is a link to a seminal paper on SVM by Cortes and Vapnik and another to an excellent introductory tutorial.

·      Support-Vector Networks [PDF] by Cortes and Vapnik 1995

·      A Tutorial on Support Vector Machines for Pattern Recognition [PDF] 1998

Wikipedia provides some good (although dense) information on SVM.
