Classification in Data Science
What is Classification?
Although classification can be performed on both structured and unstructured data, it is mainly used for structured data where the target feature is already labelled. Classification is a technique in which we categorise data into a given number of classes. The main goal of a classification problem is to identify the category/class into which a new observation will fall.
Before we get into the details of the classification algorithms of data science, let's get familiar with a few key terms:
- Classifier: An algorithm that maps the input data to a specific category.
- Classification model: A classification model tries to draw some conclusion from the input values given for training. It will predict the class labels/categories for the new data.
- Feature: A feature is an individual measurable property of a phenomenon being observed.
- Binary Classification: Classification task with two possible outcomes, e.g. gender classification (male/female).
- Multi-class Classification: Classification with more than two classes. In multi-class classification each sample is assigned to one and only one target label, e.g. an animal can be a cat or a dog or a cow or a deer, but not two at the same time.
- Multi-label Classification: Classification task where each sample is mapped to a set of target labels (more than one class), e.g. a news article can be about a sport, a person, and a location at the same time.
Types of classification algorithms:
Based on the type of problem statement and the type of target feature, classification can be categorised into multiple types. Some of the widely used algorithms are described below:
Naïve Bayes
The Naive Bayes algorithm is based on Bayes' theorem with the assumption of independence between every pair of features. Naive Bayes classifiers work well in many real-world situations such as document classification and spam filtering.
Naive Bayes is a probabilistic classifier inspired by Bayes' theorem, under the simple assumption that the attributes are conditionally independent.
The classification is conducted by deriving the maximum posterior, i.e. the class Ci with the maximal P(Ci|X), obtained by applying Bayes' theorem under the above assumption. Concretely, Bayes' theorem gives P(Ci|X) = P(X|Ci)·P(Ci)/P(X); the independence assumption factors P(X|Ci) into P(x1|Ci)·P(x2|Ci)·…·P(xn|Ci), and since P(X) is the same for every class, the prediction is simply the class that maximises P(Ci)·P(x1|Ci)·…·P(xn|Ci). This assumption greatly reduces the computational cost, since only the per-class distributions of individual attributes need to be counted. Even though the assumption is rarely valid in practice, because the attributes are usually dependent, Naive Bayes has nevertheless been able to perform impressively well.
Naive Bayes is a very simple algorithm to implement, and good results have been obtained in most cases. It scales easily to larger datasets, since it takes linear time rather than the expensive iterative approximation used by many other types of classifiers.
Naive Bayes can suffer from the zero-probability problem: when the conditional probability for an attribute is zero, the model fails to give a valid prediction. This needs to be fixed explicitly using a Laplacian estimator (Laplace smoothing).
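As a minimal sketch (a toy, made-up spam-filtering example using scikit-learn's MultinomialNB), the alpha parameter below is exactly this Laplace smoothing, which keeps the conditional probabilities away from zero:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy spam-filtering data (invented for illustration)
messages = [
    "win a free prize now",
    "limited offer claim your prize",
    "meeting rescheduled to monday",
    "see you at lunch tomorrow",
]
labels = ["spam", "spam", "ham", "ham"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(messages)  # bag-of-words word counts

# alpha=1.0 is Laplace smoothing; it avoids the zero-probability problem
clf = MultinomialNB(alpha=1.0)
clf.fit(X, labels)

print(clf.predict(vectorizer.transform(["claim your free prize"])))  # expected: ['spam']
```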
Stochastic Gradient Descent
Stochastic gradient descent is a simple and very efficient approach to fitting linear models. It is particularly useful when the number of samples is very large. It supports different loss functions and penalties for classification.
The steps of the algorithm are as follows (a code sketch of these steps is given after the list):
1. Find the slope of the objective function with respect to each parameter/feature; in other words, compute the gradient of the function. (To clarify, in the parabola example, differentiate "y" with respect to "x". If we had more features like x1, x2, etc., we would take the partial derivative of "y" with respect to each of them.)
2. Pick a random initial value for the parameters.
3. Update the gradient function by plugging in the parameter values.
4. Calculate the step size for each feature as: step size = gradient * learning rate.
5. Calculate the new parameters as: new params = old params - step size.
6. Repeat steps 3 to 5 until the gradient is almost 0.
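As a minimal sketch of these steps on the parabola example from step 1 (y = x², whose gradient is 2x; the starting value and learning rate are arbitrary illustrative choices):

```python
# Gradient descent on y = x**2: the gradient dy/dx is 2*x (step 1),
# and the minimum lies at x = 0.
def gradient(x):
    return 2 * x

x = 5.0              # step 2: arbitrary initial parameter value
learning_rate = 0.1  # illustrative choice

for _ in range(100):
    grad = gradient(x)                 # step 3: plug the parameter into the gradient
    step_size = grad * learning_rate   # step 4
    x = x - step_size                  # step 5: new params = old params - step size
    if abs(grad) < 1e-6:               # step 6: stop when the gradient is almost 0
        break

print(x)  # close to 0, the minimum of y = x**2
```

In scikit-learn, the same idea applied to linear classifiers is available as SGDClassifier, whose loss and penalty arguments select the loss function and the regularisation penalty.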
K-Nearest Neighbours
Neighbours-based classification is a type of lazy learning: it does not attempt to construct a general internal model, but simply stores instances of the training data. Classification is computed from a simple majority vote of the k nearest neighbours of each point.
It is a non-parametric, lazy algorithm. It is non-parametric since it makes no assumption about the data distribution (the data does not have to be normally distributed). It is lazy since it does not really learn a model or generalise from the data (it does not train the parameters of some function where input X gives output y).
So strictly speaking, this is not really a learning algorithm. It simply classifies objects based on feature similarity (feature = input variables).
Fig: classifying a new example based on its distance to the training instances.
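A minimal sketch with scikit-learn's KNeighborsClassifier (the 2-D points and labels below are invented for illustration; n_neighbors is the k used in the majority vote):

```python
from sklearn.neighbors import KNeighborsClassifier

# Toy 2-D training points and their classes (invented for illustration)
X_train = [[1, 1], [1, 2], [2, 1],   # class "A"
           [6, 6], [6, 7], [7, 6]]   # class "B"
y_train = ["A", "A", "A", "B", "B", "B"]

# k = 3: a new point gets the majority class of its 3 nearest neighbours
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)  # "lazy": this essentially just stores the training data

print(knn.predict([[2, 2], [6, 5]]))  # expected: ['A' 'B']
```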
Decision Tree
Given data of attributes together with their classes, a decision tree produces a sequence of rules that can be used to classify the data.
A decision tree, as its name says, makes decisions with a tree-like model. It splits the sample into two or more homogeneous sets (leaves) based on the most significant differentiators among the input variables. To choose a differentiator (predictor), the algorithm considers all features and does a binary split on each of them (for categorical data, split by category; for continuous data, pick a cut-off threshold). It then chooses the split with the least cost (i.e. highest accuracy), and repeats recursively until it successfully splits the data in all leaves (or reaches the maximum depth).
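As a minimal sketch, fitting scikit-learn's DecisionTreeClassifier on the built-in iris dataset makes the "sequence of rules" concrete; export_text prints the learned if/else rules (max_depth is the maximum depth mentioned above):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# max_depth limits how many times the data is split before reaching the leaves
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# The fitted tree is literally a sequence of if/else rules over the features
print(export_text(tree, feature_names=list(iris.feature_names)))
```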
Random Forest
A random forest classifier is a meta-estimator that fits a number of decision trees on various sub-samples of the dataset and averages their predictions to improve the predictive accuracy of the model and control over-fitting. The sub-sample size is, by default, the same as the original input sample size, but the samples are drawn with replacement.
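A minimal sketch with scikit-learn's RandomForestClassifier on a synthetic dataset (the data is generated purely for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic two-class dataset, purely for illustration
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# n_estimators = number of decision trees; each tree is fit on a bootstrap
# sample of the training data (drawn with replacement) and the forest
# averages the trees' predictions
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

print(forest.score(X_test, y_test))  # mean accuracy on the held-out data
```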
Support Vector Machine
A support vector machine represents the training data as points in space, separated into categories by a clear gap that is as wide as possible. New examples are then mapped into that same space and predicted to belong to a category based on which side of the gap they fall.
Description of the diagram (examples of hyperplanes):
· H1 is not a good hyperplane, as it does not separate the classes
· H2 does separate them, but only with a small margin
· H3 separates them with the maximum margin (distance)
Parameters of SVM
There are three main parameters we can play with when constructing an SVM classifier:
· Type of kernel
· Gamma value
· C value
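As a minimal sketch, the three parameters above map directly to the kernel, gamma and C arguments of scikit-learn's SVC (the dataset and the parameter values below are illustrative choices):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X_train, X_test, y_train, y_test = train_test_split(
    *load_iris(return_X_y=True), test_size=0.3, random_state=0)

# kernel: shape of the decision boundary ('linear', 'rbf', 'poly', ...)
# gamma:  how far the influence of a single training example reaches (rbf/poly kernels)
# C:      trade-off between a wide margin and classifying every training point correctly
svm = SVC(kernel="rbf", gamma="scale", C=1.0)
svm.fit(X_train, y_train)

print(svm.score(X_test, y_test))  # mean accuracy on the held-out data
```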