Support Vector Machines in Machine Learning
“Support Vector Machine” (SVM) is a supervised machine learning algorithm that can be used for both classification or regression challenges. However, it is mostly used in classification problems.
The support vector machine is highly preferred by many as it produces significant accuracy with less computation power. Support Vector Machine, abbreviated as SVM.
What is Support Vector Machine?
The objective of the support vector machine algorithm is to find a hyperplane in N-dimensional space(N — the number of features) that distinctly classifies the data points.
To separate the two classes of data points, there are many possible hyperplanes that could be chosen. Our objective is to find a plane that has the maximum margin, i.e the maximum distance between data points of both classes. Maximizing the margin distance provides some reinforcement so that future data points can be classified with more confidence.
Hyperplanes:
Hyperplanes are decision boundaries that help classify the data points. Data points falling on either side of the hyperplane can be attributed to different classes.
Also, the dimension of the hyperplane depends upon the number of features. If the number of input features is 2, then the hyperplane is just a line. If the number of input features is 3, then the hyperplane becomes a two-dimensional plane. It becomes difficult to imagine when the number of features exceeds 3.
Support Vectors:
Support vectors are data points that are closer to the hyperplane and influence the position and orientation of the hyperplane. Using these support vectors, we maximize the margin of the classifier. Deleting the support vectors will change the position of the hyperplane. These are the points that help us build our SVM.
Computing the Margin:
Here we have considered 3 planes:
Plane w`x + b = 0 is the central plane which separates the data points and let we denote it by π.
Plane w`x + b = 1 is the plane above which are the positive points lies and let we denote it by π(+)
Plane w`x + b = -1 is the plane below which all negative points lie and we denote it by π(-).
Now the margin is the distance between the π(+) and π(-) and our task is to find this margin and maximize it.
Consider a point x_0 on the plane π(-) such that w`x_0 + b = -1. Let the distance between the π(+) and π(-) be ‘r’.
Now our constrained optimization problem becomes :
max(w) { 2/||w|| }
Constraints: All positive points lie above the π(+) and all negative points lie below the π(-) and no points lie in between the margin. Such a model is known as the “ Hard Margin SVM ”.
Hard Margin SVM:
Hard Margin SVM is suitable only when our data is completely linear separable and the optimization problem holds in the case of Hard Margin SVM is what we discussed above.
max(w) { 2/||w|| }
Soft Margin SVM:
When our data is not linearly separable what should we do? Should we do feature transformation? Should we modify our earlier optimization problem which can classify our non-linear data? So the answer to the above thing is Soft-Max SVMs !! What are they ?? Here we will learn every aspect of these surely in great detail. Please bear with me!!!
Mathematical Formulation of SVMs:
Since our key idea behind the SVM is to maximize the margin so let’s first compute the constraints for this optimization problem.
For the support vectors, we know :
=> y_i * (w`x + b ) = 1
For the non support vectors we know:
y_i * (w`x + b ) >1
So our optimization problem becomes :
max(w) { 2/||w|| } such that y_i * (w`x_i + b ) ≥1.
Wow! we got our first constrained optimization problem which computes the w* using which we can get the maximizing margin hyperplane.
But, here is a problem with this formulation! Can you think of it?? Well, what about the scenario when the points are in the ‘Margin’ area.
Consider the points other than A, B, C. For all these points my hard max SVM would even work good but it will fail for the A, B, C points. Let the ||w|| = 1 and distance between the π and π(+) become 1(this is just for easy math I am doing)
For point A: The distance from the π, D{π, A}= -0.5(as normal is in the direction of positive points).
We can write this distance as D{π, A}= 1 — (1.5) let’s call this 1.5 as zeta{i} symbolized as ‘ ζ ’. So ζ_i = 1.5 .
Significance of ζ{i}: The ζ_i repersents the distance of the points away from the correct hyperplane in the incorrect direction.
Here ζ_i = 1.5 it means point A is at a distance of 1.5 from the π(+) and -0.5 from the π hyperplane.
For point B: The distance from the π, D{π, B}= -1.5(as normal is in the direction of positive points).We can write this distance as D{π, B}= 1-(2.5) let’s call this 2.5 as zeta{i} symbolized as ‘ ζ ’. So ζ_i = 2.5.
For point C: The distance from the π, D{π, C}= -1.5(as normal is in the direction of positive points but the point is negative and we consider only signed distances during the classification).We can write this distance as D{π, C}= 1-(2.5) let’s call this 2.5 as zeta{i} symbolized as ‘ ζ ’. So ζ_i = 2.5.
For the support vectors and the correctly classified points we have ζ_i = 0 and for the misclassified points ζ_i > 0.This is obvious for the above figure. I hope till here we are cool !
Now we will formulate the optimizing problem of Soft-max SVM. Before doing that I want to discuss the basic framework of formulating the Optimization problem. We try to minimize the error or loss that can occur in the model along with minimizing /maximizing the variables.
So the total error in our case is :
As we know max(f(x)) = min(-f(x))
max(f(x)) = 1/min(f(x))
So we can write max(w) { 2/||w|| } as min(w) 1/2 ||w||
But min(||w||) is same as min(w){||w`w||}
So we can write our final optimal constrained equation with loss term as:
Let’s talk about the C: It is the hyperparameter that tunes how much error rate we have to optimize. As the C value increases the importance of loss term increases and our model tries to reduce the loss as much as possible which is an alarming bell condition for the Overfitting Model. When the value of C decreases our model is under a fitted or Highly Biased Model.
Primal And Dual Forms of SVM:
Now we will discuss the primal and dual forms of Hard Margin SVM and Soft Margin SVM. What is Primal Form? What is dual form and why we want them? These are the basic questions which come to our mind.
The optimization problem which we formulated in each case of Hard and Soft Margin is called constrained optimization problem and they are also called a primal form of the optimization problem.
When we transform the constrained optimization problem into the other form using the concept of Lagrange’s multipliers we get the Dual form of the optimization problem which is quite easy to solve as compared to Primal form.
As we know the optimization problem for the Hard Margin is given by :
This is called the Primal Form of the hard Margin SVM. Now we will derive the Dual form of the Hard Margin SVM which is very easy friends as you will see it. In order to convert into Dual Form, we make use of Lagrange Multipliers. Since we have only one constrained here we will make use of only one Lagrange multiplier α.
By solving three KKT equations we will get following the final dual form of SVM.
Hinge Loss function in SVM:
Purple color is hinge loss, yellow is log loss, blue is 0 - 1 loss and green is mean squared error.
By looking at the plots above, this nature of curves brings out a few major differences between logistic loss and hinge loss —
- Note that the logistic loss diverges faster than hinge loss. So, in general, it will be more sensitive to outliers — why? Because, assuming there is an outlier in our data, the logistic loss (due to its diverging nature) will be very high compared to the hinge loss for that outlier. This means greater adjustments to our weights.
- Note that the logistic loss does not go to zero even if the point is classified sufficiently confidently — The horizontal axis is the confidence of ‘predicted y’ value, if we take a value like ‘1.5’ on x-axis, then the corresponding logistic loss (yellow line) still shows some loss (close to 0.2 from the above plot and hence still not very confident of our prediction), whereas the hinge loss is ‘0’ ( which means there is no loss and we are more confident of our prediction). This nature of logistic loss might lead to minor degradation in accuracy.
- Logistic regression has a more probabilistic interpretation.
Overview of Kernel:
Can you try to solve the above problem linearly? No
The red and blue balls cannot be separated by a straight line as they are randomly distributed and this, in reality, is how most real-life problem data are -randomly distributed.
In machine learning, a “kernel” is usually used to refer to the kernel trick, a method of using a linear classifier to solve a non-linear problem. It entails transforming linearly inseparable data to linearly separable ones. The kernel function is what is applied to each data instance to map the original non-linear observations into a higher-dimensional space in which they become separable.
The dual form of SVM contains xi^T * xj, will replace it with kernel function
in the equation as K(x_i , x_j) = φ(x`_i)^T φ(x`_j) where φ(x) represents the transformed feature of x_i vector in Z space. In nutshell, the Kernel helps in performing the inner product or dot product precisely of the pair of each data points in Z space. For example, we have a polynomial kernel which is given by:
These functions can be of different types. For example linear, nonlinear, polynomial, radial basis function (RBF), and sigmoid.
Examples of SVM Kernels
Let us see some common kernels used with SVMs and their uses:
Polynomial kernel
It is popular in image processing.
Equation is:
where d is the degree of the polynomial.
Gaussian kernel
It is a general-purpose kernel; used when there is no prior knowledge about the data. Equation is:
Gaussian radial basis function (RBF)
It is a general-purpose kernel; used when there is no prior knowledge about the data.
Equation is:
, for:
Sometimes parametrized using:
Laplace RBF kernel
It is a general-purpose kernel; used when there is no prior knowledge about the data.
Equation is:
Hyperbolic tangent kernel
We can use it in neural networks.
Equation is:
Hyperbolic tangent kernel equation, for some (not every) k>0 and c<0.
Sigmoid kernel
We can use it as the proxy for neural networks. Equation is:
Bessel function of the first kind Kernel
We can use it to remove the cross term in mathematical functions. Equation is :
where j is the Bessel function of the first kind.
ANOVA radial basis kernel
We can use it in regression problems. Equation is:
Linear splines kernel in one-dimension
It is useful when dealing with large sparse data vectors. It is often used in text categorization. The splines kernel also performs well in regression problems. Equation is:
Support Vector Regression (SVR):
SVC (Support vector classifier):
Linear SVC:
nu - SVC:
SVR (Support vector regression):
nu - SVR:
Real-world cases:
Feature Engineering/ Feature Transformation:
SVM tends to works well if we chose the right kernel for feature engineering otherwise we can use general-purpose kernel RBF.
SVM using similarity function:
We can find distance using similarity function and apply the SVM.
Feature importance:
There is no way to get feature importance for kernel SVM (Forward feature selection/Backward feature selection). For linear SVM we can get feature importance using coefficient vector W the same as linear regression.
Impact of outliers on SVM:
The impact of outliers on SVM is very low, in SVM only thing matters is support vectors.
Bias Variance Trade-off:
High value 'C' gives overfitting and low value of 'C' gives underfitting, we have to select 'C' and 'sigma' values in such a way that it can trade of Bias and Variance.
Time Complexity:
Training time O(n^2) - n is large training time is high.
Testing time O(Kd) - k is the number of support vectors.
Interview Questions:
Question Context: 1 – 2
Suppose you are using a Linear SVM classifier with a 2 class classification problem. Now you have been given the following data in which some points are circled red that are representing support vectors.
1) If you remove the following anyone red points from the data. Will the decision boundary change?
A) Yes
B) No
Solution: A
These three examples are positioned such that removing any one of them introduces slack in the constraints. So the decision boundary would completely change.
2) If you remove the non-red circled points from the data, the decision boundary will change?
A) True
B) False
Solution: B
On the other hand, the rest of the points in the data won’t affect the decision boundary much.
3) What do you mean by generalization error in terms of the SVM?
A) How far the hyperplane is from the support vectors
B) How accurately the SVM can predict outcomes for unseen data
C) The threshold amount of error in an SVM
Solution: B
Generalization error in statistics is generally the out-of-sample error which is the measure of how accurately a model can predict values for previously unseen data.
4) When the C parameter is set to infinite, which of the following holds true?
A) The optimal hyperplane if exists, will be the one that completely separates the data
B) The soft-margin classifier will separate the data
C) None of the above
Solution: A
At such a high level of misclassification penalty, the soft margin will not hold existence as there will be no room for error.
5) What do you mean by a hard margin?
A) The SVM allows very low error in classification
B) The SVM allows a high amount of error in classification
C) None of the above
Solution: A
A hard margin means that an SVM is very rigid in classification and tries to work extremely well in the training set, causing overfitting.
6) The minimum time complexity for training an SVM is O(n2). According to this fact, what sizes of datasets are not best suited for SVM’s?
A) Large datasets
B) Small datasets
C) Medium-sized datasets
D) Size does not matter
Solution: A
Datasets that have a clear classification boundary will function best with SVM’s.
7) The effectiveness of an SVM depends upon:
A) Selection of Kernel
B) Kernel Parameters
C) Soft Margin Parameter C
D) All of the above
Solution: D
The SVM effectiveness depends upon how you choose the basic 3 requirements mentioned above in such a way that it maximizes your efficiency, reduces error and overfitting.
8) Support vectors are the data points that lie closest to the decision surface.
A) TRUE
B) FALSE
Solution: A
They are the points closest to the hyperplane and the hardest ones to classify. They also have a direct bearing on the location of the decision surface.
9) The SVM’s are less effective when:
A) The data is linearly separable
B) The data is clean and ready to use
C) The data is noisy and contains overlapping points
Solution: C
When the data has noise and overlapping points, there is a problem in drawing a clear hyperplane without misclassifying.
10) Suppose you are using RBF kernel in SVM with high Gamma value. What does this signify?
A) The model would consider even far away points from hyperplane for modeling
B) The model would consider only the points close to the hyperplane for modeling
C) The model would not be affected by the distance of points from hyperplane for modeling
D) None of the above
Solution: B
The gamma parameter in SVM tuning signifies the influence of points either near or far away from the hyperplane.
For a low gamma, the model will be too constrained and include all points of the training dataset, without really capturing the shape.
For a higher gamma, the model will capture the shape of the dataset well.
11) The cost parameter in the SVM means:
A) The number of cross-validations to be made
B) The kernel to be used
C) The tradeoff between misclassification and simplicity of the model
D) None of the above
Solution: C
The cost parameter decides how much an SVM should be allowed to “bend” with the data. For a low cost, you aim for a smooth decision surface and for a higher cost, you aim to classify more points correctly. It is also simply referred to as the cost of misclassification.
12)
Suppose you are building an SVM model on data X. The data X can be error-prone which means that you should not trust any specific data point too much. Now think that you want to build an SVM model that has quadratic kernel function of polynomial degree 2 that uses Slack variable C as one of its hyperparameters. Based upon that give the answer for the following question.
What would happen when you use a very large value of C(C->infinity)?
Note: For small C was also classifying all data points correctly
A) We can still classify data correctly for the given set of hyperparameter C
B) We can not classify data correctly for the given setting of hyperparameter C
C) Can’t Say
D) None of these
Solution: A
For large values of C, the penalty for misclassifying points is very high, so the decision boundary will perfectly separate the data if possible.
13) What would happen when you use very small C (C~0)?
A) Misclassification would happen
B) Data will be correctly classified
C) Can’t say
D) None of these
Solution: A
The classifier can maximize the margin between most of the points while misclassifying a few points because the penalty is so low.
14) If I am using all features of my dataset and I achieve 100% accuracy on my training set, but ~70% on the validation set, what should I look out for?
A) Underfitting
B) Nothing, the model is perfect
C) Overfitting
Solution: C
If we’re achieving 100% training accuracy very easily, we need to check to verify if we’re overfitting our data.
15) Which of the following are real-world applications of the SVM?
A) Text and Hypertext Categorization
B) Image Classification
C) Clustering of News Articles
D) All of the above
Solution: D
SVM’s are highly versatile models that can be used for practically all real-world problems ranging from regression to clustering and handwriting recognitions.
Question Context: 16 – 18
Suppose you have trained an SVM with a linear decision boundary after training SVM, you correctly infer that your SVM model is underfitting.
16) Which of the following option would you more likely to consider iterating SVM next time?
A) You want to increase your data points
B) You want to decrease your data points
C) You will try to calculate more variables
D) You will try to reduce the features
Solution: C
The best option here would be to create more features for the model.
17) Suppose you gave the correct answer in the previous question. What do you think that is actually happening?
1. We are lowering the bias
2. We are lowering the variance
3. We are increasing the bias
4. We are increasing the variance
A) 1 and 2
B) 2 and 3
C) 1 and 4
D) 2 and 4
Solution: C
A better model will lower the bias and increase the variance
18) In the above question suppose you want to change one of it’s(SVM) hyperparameter so that effect would be the same as previous questions i.e model will not underfit?
A) We will increase the parameter C
B) We will decrease the parameter C
C) Changing in C don’t affect
D) None of these
Solution: A
Increasing C parameter would be the right thing to do here, as it will ensure the regularized model
19) We usually use feature normalization before using the Gaussian kernel in SVM. What is true about feature normalization?
1. We do feature normalization so that new feature will dominate other
2. Some times, feature normalization is not feasible in case of categorical variables
3. Feature normalization always helps when we use Gaussian kernel in SVM
A) 1
B) 1 and 2
C) 1 and 3
D) 2 and 3
Solution: B
Statements one and two are correct.
Question Context: 20-22
Suppose you are dealing with 4 class classification problems and you want to train an SVM model on the data for that you are using the One-vs-all method. Now answer the below questions?
20) How many times do we need to train our SVM model in such a case?
A) 1
B) 2
C) 3
D) 4
Solution: D
For a 4 class problem, you would have to train the SVM at least 4 times if you are using a one-vs-all method.
21) Suppose you have the same distribution of classes in the data. Now, say for training 1 time in one vs all setting the SVM is taking 10 seconds. How many seconds would it require to train one-vs-all method end to end?
A) 20
B) 40
C) 60
D) 80
Solution: B
It would take 10×4 = 40 seconds
22) Suppose your problem has changed now. Now, data has only 2 classes. What would you think about how many times we need to train SVM in such a case?
A) 1
B) 2
C) 3
D) 4
Solution: A
Training the SVM only one time would give you appropriate results
Question context: 23 – 24
Suppose you are using SVM with the linear kernel of polynomial degree 2, Now think that you have applied this on data and found that it perfectly fits the data that means, Training and testing accuracy is 100%.
23) Now, think that you increase the complexity(or degree of the polynomial of this kernel). What would you think will happen?
A) Increasing the complexity will overfit the data
B) Increasing the complexity will underfit the data
C) Nothing will happen since your model was already 100% accurate
D) None of these
Solution: A
Increasing the complexity of the data would make the algorithm overfit the data.
24) In the previous question after increasing the complexity, you found that training accuracy was still 100%. According to you what is the reason behind that?
1. Since data is fixed and we are fitting more polynomial term or parameters so the algorithm starts memorizing everything in the data
2. Since data is fixed and SVM doesn’t need to search in big hypothesis space
A) 1
B) 2
C) 1 and 2
D) None of these
Solution: C
Both the given statements are correct.
25) What is/are true about kernel in SVM?
1. Kernel function map low dimensional data to high dimensional space
2. It’s a similarity function
A) 1
B) 2
C) 1 and 2
D) None of these
Solution: C
Both the given statements are correct.
Marketing Data Analyst @ AAA Club Alliance | SAS Certified Specialist, Data Scientist
2 年Great!