Support Vector Machine — an overview

The Support Vector Machine, originally developed by Vladimir Vapnik and Alexey Chervonenkis in the 1960s, remains one of the most popular machine learning classifiers. Support Vector Machines (SVMs) are a particularly powerful and flexible class of supervised algorithms, used for both classification and regression, and they can also be applied to outlier detection.

The objective of the Support Vector Machine is to find the best splitting boundary between data. In a two-dimensional space, you can think of the splitting boundary as the best-fit 'line' that divides your dataset. With a Support Vector Machine we are working in a vector space, so the separating line is really a separating 'hyperplane'. The best separating hyperplane is the one that leaves the widest margin to the nearest points of each class (the support vectors). The hyperplane may also be referred to as a decision boundary.
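
In the standard linear formulation, the hyperplane is the set of points x satisfying w · x + b = 0 for a weight vector w and offset b, and the margin width works out to 2 / ||w||; training therefore amounts to minimizing ||w|| subject to every labelled point (x_i, y_i), with y_i in {-1, +1}, satisfying y_i (w · x_i + b) >= 1.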

In this section, we will develop the intuition behind support vector machines and their use in classification problems. Here we consider discriminative classification, in which we simply find a line or curve (in two dimensions) or a manifold (in higher dimensions) that divides the classes from each other.

As an example of this, consider the simple case of a classification task, in which the two classes of points are well separated:

[Figure: scatter plot of two well-separated classes of points]

A linear discriminative classifier would attempt to draw a straight line separating the two sets of data, and thereby create a model for classification. For two-dimensional data like that shown here, this is a task we could do by hand. But immediately we see a problem: there is more than one possible dividing line that can perfectly discriminate between the two classes!

[Figure: three different straight lines, each of which perfectly separates the two classes, with a new point marked 'X']

These are three very different separators that, nevertheless, perfectly discriminate between these samples. Depending on which you choose, a new data point (for example, the one marked by the ‘X’ in the above figure) will be assigned a different label! Evidently our simple intuition of ‘drawing a line between classes’ is not enough and we need to think deeper.

Support Vector Machines offer one way to improve on this: rather than simply drawing a zero-width line between the classes, we can draw around each line a margin of some width, extending to the nearest point.

[Figure: the candidate separators, each drawn with a margin extending to the nearest point]

In support vector machines, the line that maximizes this margin is the one we choose as the optimal model; support vector machines are therefore an example of a maximum-margin estimator.

Fitting a support vector machine

Let’s see the result of an actual fit to this data: we will use Scikit-Learn’s support vector classifier to train an SVM model on this data.

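A minimal sketch of what such a fit might look like; the make_blobs dataset below is an assumption standing in for the scatter plot shown earlier:

from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two well-separated clusters of 2-D points (a stand-in for the data in the
# figures above)
X, y = make_blobs(n_samples=200, centers=2, random_state=0, cluster_std=0.60)

# Linear support vector classifier; a very large C approximates a hard margin
model = SVC(kernel='linear', C=1E10)
model.fit(X, y)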

To better visualize what’s happening here, let’s plot SVM decision boundaries:
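
A sketch of how such a plot might be produced with Matplotlib, continuing from the fit above (the plotting details are one possible choice, not the only one):

import numpy as np
import matplotlib.pyplot as plt

# Scatter the training points, coloured by class
plt.scatter(X[:, 0], X[:, 1], c=y, cmap='autumn')

# Evaluate the decision function on a grid and contour it at the levels
# -1, 0 and +1: the two margin edges and the decision boundary itself
ax = plt.gca()
xlim, ylim = ax.get_xlim(), ax.get_ylim()
xx, yy = np.meshgrid(np.linspace(*xlim, 30), np.linspace(*ylim, 30))
P = model.decision_function(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
ax.contour(xx, yy, P, levels=[-1, 0, 1], colors='k',
           linestyles=['--', '-', '--'])

# Circle the support vectors
ax.scatter(model.support_vectors_[:, 0], model.support_vectors_[:, 1],
           s=200, facecolors='none', edgecolors='k')
plt.show()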

[Figure: the maximum-margin separator, with the margin edges drawn as dashed lines and the support vectors circled]

This is the dividing line that maximizes the margin between the two sets of points. Notice that a few of the training points just touch the margin; they are indicated by the black circles in the figure above. These points are the pivotal elements of the fit: they are known as the support vectors, and they give the algorithm its name. In Scikit-Learn, the identity of these points is stored in the support_vectors_ attribute of the classifier.

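For example, continuing with the model fitted above:

# Coordinates of the points that lie on the margin
print(model.support_vectors_)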

A key to this classifier's success is that, for the fit, only the positions of the support vectors matter; any points further from the margin that are on the correct side do not modify the fit at all. Technically, this is because such points do not contribute to the loss function used to fit the model, so their position and number do not matter as long as they do not cross the margin.

We can see this, for example, if we plot the model learned from the first 60 points and first 120 points of the dataset:

[Figure: left panel, the SVM fit and support vectors for the first 60 training points; right panel, the fit for the first 120 points, with an unchanged decision boundary and support vectors]

In the left panel, we see the model and support vectors for 60 training points. In the right panel, we have doubled the number of training points, but the model has not changed: the three support vectors from the left panel are still the support vectors in the right panel. This insensitivity to the exact behaviour of distant points is one of the strengths of the SVM model.
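
A quick way to check this on our running example (assuming the X, y arrays from the fit above, which contain at least 120 points):

from sklearn.svm import SVC

# Fit on the first 60 points and on the first 120 points
model60 = SVC(kernel='linear', C=1E10).fit(X[:60], y[:60])
model120 = SVC(kernel='linear', C=1E10).fit(X[:120], y[:120])

# If none of the extra points cross the old margin, the support vectors
# (and hence the decision boundary) come out identical
print(model60.support_vectors_)
print(model120.support_vectors_)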

Kernel SVM: When the data is not linearly separable

SVM becomes far more powerful when it is combined with kernels. To motivate the need for kernels, let's look at some data that is not linearly separable:

[Figure: scatter plot in which one class forms a ring around a central clump of the other class]

It is clear that no linear discriminator will ever be able to separate this data. But we can think about how we might project the data into a higher dimension such that a linear separator would be sufficient. For example, one simple projection we could use is to compute a radial basis function centered on the middle clump. We can visualize this extra data dimension using a three-dimensional plot:

[Figure: three-dimensional plot of the data, with the radial basis function value r as the third axis]

We can see that with this additional dimension, the data becomes trivially linearly separable, by drawing a separating plane at, say, r=0.7.
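
A sketch of this projection, assuming the data were generated with Scikit-Learn's make_circles helper:

import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # registers the 3D projection on older Matplotlib
from sklearn.datasets import make_circles

# One class forming a ring around a central clump of the other class
X, y = make_circles(n_samples=100, factor=0.1, noise=0.1, random_state=0)

# Radial basis function centred on the middle clump: close to 1 near the
# centre and decaying towards the outer ring
r = np.exp(-(X ** 2).sum(axis=1))

# Plot the two original features plus the new dimension r
ax = plt.figure().add_subplot(projection='3d')
ax.scatter(X[:, 0], X[:, 1], r, c=y, cmap='autumn')
ax.set_xlabel('x'); ax.set_ylabel('y'); ax.set_zlabel('r')
plt.show()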

Here we had to choose and carefully tune our projection: if we had not centered our radial basis function (RBF) on the middle clump, we would not have seen such clean, linearly separable results. In general, the need to make such a choice is a problem: we would like to find the best basis functions to use automatically.

One strategy to this end is to compute a basis function centered at every point in the dataset and let the SVM algorithm sift through the results. This type of basis function transformation is known as a kernel transformation.

A potential problem with this strategy — projecting N points into N dimensions — is that it might become very computationally intensive as N grows large. However, because of a neat little procedure known as the kernel trick, a fit on kernel transformed data can be done implicitly — that is, without ever building the full N-dimensional representation of the kernel projection! This kernel trick is built into the SVM, and is one of the reasons the method is so powerful.
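
Concretely, the trick relies on the fact that the fit only ever needs inner products between projected points, and a kernel function returns those inner products directly; for the RBF kernel used below, k(x, x') = exp(-gamma * ||x - x'||^2), and the projection itself is never computed explicitly.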

In Scikit-Learn, we can apply a kernelized SVM simply by changing our linear kernel to an RBF (radial basis function) kernel, using the kernel hyperparameter of the model:

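A minimal sketch, reusing the circular data from the projection example above (the exact value of C is an assumption):

from sklearn.svm import SVC

# Same data as before, but with a radial basis function kernel; the kernel
# trick lets the fit happen in the implicit high-dimensional space
clf = SVC(kernel='rbf', C=1E6)
clf.fit(X, y)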

Using this kernelized support vector machine, we learn a suitable nonlinear decision boundary. This kernel transformation strategy is used often in machine learning to turn fast linear methods into fast nonlinear methods, especially for models in which the kernel trick can be used.

Softening the margins

So far we have discussed very clean datasets, in which a perfect decision boundary exists. But what if your data has some amount of overlap?

[Figure: scatter plot of two classes that overlap to some degree]

To handle this case, the SVM implementation has a feature that softens the margin: it allows some of the points to creep into the margin if that gives a better fit. The hardness of the margin is controlled by a tuning parameter, most often known as C. For a very large C, the margin is hard and points cannot lie inside it. For a smaller C, the margin is softer and can grow to encompass some points.
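
A small sketch of the effect (the overlapping make_blobs data and the two C values are illustrative assumptions):

from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Overlapping data, standing in for the figure above
X, y = make_blobs(n_samples=100, centers=2, random_state=0, cluster_std=1.2)

# Hard-ish margin: a large C penalizes margin violations heavily
hard = SVC(kernel='linear', C=10.0).fit(X, y)

# Soft margin: a small C lets the margin grow and absorb some points
soft = SVC(kernel='linear', C=0.1).fit(X, y)

# More points typically end up on or inside the softer margin
print(len(hard.support_vectors_), len(soft.support_vectors_))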

The plot shown below gives a visual picture of how a changing C parameter affects the final fit via the softening of the margin:

[Figure: SVM fits with a large and a small value of C, showing the margin hardening and softening]

The optimal value of the C parameter will depend on your dataset, and should be tuned via cross-validation or a similar procedure.
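
For instance, a minimal cross-validation sketch using Scikit-Learn's GridSearchCV (the grid of C values, and reusing the overlapping X, y from above, are assumptions):

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Try a handful of candidate C values with 5-fold cross-validation
param_grid = {'C': [0.1, 1, 10, 100]}
grid = GridSearchCV(SVC(kernel='rbf'), param_grid, cv=5)
grid.fit(X, y)
print(grid.best_params_)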

Pros and Cons associated with SVM

Pros:

  1. SVM is effective in high-dimensional spaces.
  2. It remains effective in cases where the number of dimensions is greater than the number of samples.
  3. It uses only a subset of the training points in the decision function (the support vectors), so it is also memory efficient.
  4. It works well when there is a clear margin of separation between the classes.

Cons:

  1. SVMs don’t directly provide probability estimates; when required, these are calculated using an expensive five-fold cross-validation.
  2. SVMs don’t perform well on large datasets, because the required training time is high.

