Classification in Data Science


What is Classification?

Classification can be performed on both structured and unstructured data, but it is mainly used on structured data where the target feature is already labelled. Classification is a technique for categorising data into a given number of classes, and the main goal of a classification problem is to identify the category/class under which a new data point falls.

Before we get into the details of the classification algorithms of data science, let's get familiar with a few of the terminologies:

  • Classifier: An algorithm that maps the input data to a specific category.
  • Classification model: A classification model tries to draw some conclusion from the input values given for training. It will predict the class labels/categories for the new data.
  • Feature: A feature is an individual measurable property of a phenomenon being observed.
  • Binary classification: Classification task with two possible outcomes, e.g. gender classification (male / female).
  • Multi-class classification: Classification with more than two classes. In multi-class classification each sample is assigned to one and only one target label, e.g. an animal can be a cat or dog or cow or deer, but not any two at the same time.
  • Multi-label classification: Classification task where each sample is mapped to a set of target labels (more than one class), e.g. a news article can be about sports, a person, and a location at the same time (see the label sketch after this list).
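As a rough sketch of how these three tasks differ in practice, the label structures below (made-up values, purely for illustration) show the shape of the target in each case:

```python
# Illustrative label structures for the three task types (made-up values).

# Binary classification: one of two possible labels per sample
y_binary = ["male", "female", "female", "male"]

# Multi-class classification: exactly one label per sample, drawn from more than two classes
y_multiclass = ["cat", "dog", "cow", "deer"]

# Multi-label classification: each sample maps to a *set* of labels
y_multilabel = [
    {"sports", "person"},      # article about an athlete
    {"person", "location"},    # article about a politician's visit
    {"sports"},                # plain match report
]
```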

Types of classification algorithms:

Based on the type of problem statement and the type of target feature, classification can be categorised into multiple types. Some of the widely used algorithms are described below:

Naïve Bayes

The Naive Bayes algorithm is based on Bayes' theorem with the assumption of independence between every pair of features. Naive Bayes classifiers work well in many real-world situations such as document classification and spam filtering.


Naive Bayes is a probabilistic classifier built on Bayes' theorem, under the simple assumption that the attributes are conditionally independent given the class.


Classification is performed by finding the maximum posterior, i.e. the class Ci that maximises P(Ci|X), with the independence assumption applied to Bayes' theorem. This assumption greatly reduces the computational cost, since we only need to count the class and per-attribute distributions. Even though the assumption rarely holds exactly (attributes are often dependent), Naive Bayes surprisingly manages to perform impressively well.
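In symbols, a minimal sketch of that rule (writing X = (x1, …, xn) for the attribute vector and Ci for a class):

```latex
P(C_i \mid X) = \frac{P(X \mid C_i)\, P(C_i)}{P(X)},
\qquad
P(X \mid C_i) \approx \prod_{k=1}^{n} P(x_k \mid C_i)
\quad \text{(conditional independence)}
```

The predicted class is the Ci that maximises P(Ci) · Πk P(xk|Ci), since P(X) is the same for every class.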

Naive Bayes is a very simple algorithm to implement, and good results are obtained in most cases. It scales easily to larger datasets, since training takes linear time rather than the expensive iterative approximation used by many other types of classifiers.

Naive Bayes can suffer from the zero-probability problem: when the conditional probability of an attribute value is zero for a class, the classifier fails to give a valid prediction. This needs to be fixed explicitly using a Laplace estimator (Laplace smoothing).
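As a concrete illustration of the document-classification use case, here is a minimal scikit-learn sketch; the tiny corpus, its labels, and the choice of MultinomialNB are assumptions made purely for illustration, and alpha=1.0 is the Laplace smoothing mentioned above.

```python
# A minimal Naive Bayes sketch for document classification with scikit-learn.
# The tiny corpus and labels below are made up purely for illustration.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = ["win a free prize now", "meeting agenda for monday",
        "free lottery winner claim prize", "project status and agenda"]
labels = ["spam", "ham", "spam", "ham"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)        # word-count features

# alpha=1.0 applies Laplace (add-one) smoothing, avoiding zero probabilities
model = MultinomialNB(alpha=1.0)
model.fit(X, labels)

print(model.predict(vectorizer.transform(["claim your free prize"])))  # likely 'spam'
```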

Stochastic Gradient Descent

Stochastic gradient descent (SGD) is a simple and very efficient approach for fitting linear models. It is particularly useful when the number of samples is very large. It supports different loss functions and penalties for classification.


The steps of the algorithm are:

1. Find the slope of the objective function with respect to each parameter/feature; in other words, compute the gradient of the function. (To clarify, in the parabola example we differentiate "y" with respect to "x". If we had more features like x1, x2, etc., we would take the partial derivative of "y" with respect to each of those features.)

2. Pick random initial values for the parameters.

3. Update the gradient function by plugging in the parameter values.

4. Calculate the step size for each feature as: step size = gradient * learning rate.

5. Calculate the new parameters as: new parameters = old parameters − step size.

6. Repeat steps 3 to 5 until the gradient is almost 0 (see the sketch after this list).
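A minimal sketch of these steps in plain Python, minimising the parabola y = (x − 3)² from the example in step 1; the learning rate, starting value, and stopping tolerance are arbitrary illustrative choices. In the stochastic variant used for large classification datasets, the gradient at each update would be estimated from a single training sample (or a small batch) rather than the full objective.

```python
# A minimal (batch) gradient-descent sketch for the parabola y = (x - 3)**2.
# Learning rate, starting point, and tolerance are illustrative choices.

def gradient(x):
    # Step 1: dy/dx of (x - 3)**2 is 2 * (x - 3)
    return 2 * (x - 3)

x = 10.0              # Step 2: an arbitrary initial parameter value
learning_rate = 0.1

while abs(gradient(x)) > 1e-6:           # Step 6: repeat until the gradient is almost 0
    grad = gradient(x)                   # Step 3: plug the current parameter into the gradient
    step_size = grad * learning_rate     # Step 4: step size = gradient * learning rate
    x = x - step_size                    # Step 5: new parameter = old parameter - step size

print(round(x, 4))    # converges to 3.0, the minimum of the parabola
```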

K-Nearest Neighbours

Neighbours-based classification is a type of lazy learning: it does not attempt to construct a general internal model, but simply stores instances of the training data. Classification is computed from a simple majority vote of the k nearest neighbours of each point.


It is a non-parametric, lazy algorithm. It is non-parametric since it makes no assumption about the data distribution (the data does not have to be normally distributed). It is lazy since it does not really learn a model or generalise from the data (it does not train the parameters of some function that maps input X to output y).

So strictly speaking, this is not really a learning algorithm. It simply classifies objects based on feature similarity (feature = input variables).


Fig: Classifying a new example based on its distance to the training instances.

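A minimal scikit-learn sketch of k-nearest-neighbours classification; the toy 2-D points, their labels, and the choice of k = 3 are assumptions made purely for illustration.

```python
# A minimal k-nearest-neighbours sketch with scikit-learn.
# Toy 2-D points and labels; k=3 is an arbitrary illustrative choice.
from sklearn.neighbors import KNeighborsClassifier

X_train = [[1, 1], [1, 2], [2, 1],      # class 0 cluster
           [6, 6], [6, 7], [7, 6]]      # class 1 cluster
y_train = [0, 0, 0, 1, 1, 1]

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)               # "training" just stores the instances

# A new point is labelled by a majority vote of its 3 nearest neighbours
print(knn.predict([[2, 2], [6, 5]]))    # expected: [0 1]
```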

Decision Tree

Given data of attributes together with their classes, a decision tree produces a sequence of rules that can be used to classify the data.


A decision tree, as its name says, makes decisions with a tree-like model. It splits the sample into two or more homogeneous sets (leaves) based on the most significant differentiators among your input variables. To choose a differentiator (predictor), the algorithm considers all features and tries a binary split on each (for categorical data, split by category; for continuous data, pick a cut-off threshold). It then chooses the split with the least cost (i.e. highest accuracy) and repeats recursively until the data in every leaf is successfully split (or the maximum depth is reached).
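A minimal scikit-learn sketch of the procedure described above; the toy data, the 'gini' criterion, and max_depth=3 are illustrative assumptions, not recommendations.

```python
# A minimal decision-tree sketch with scikit-learn.
# Toy data; criterion and max_depth are illustrative assumptions.
from sklearn.tree import DecisionTreeClassifier, export_text

# Features: [age, income]; labels: 1 = buys, 0 = does not buy (made-up data)
X = [[22, 20], [25, 32], [47, 70], [52, 85], [46, 60], [56, 90]]
y = [0, 0, 1, 1, 1, 1]

tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)
tree.fit(X, y)

# The learned sequence of if/else rules, one line per split
print(export_text(tree, feature_names=["age", "income"]))
print(tree.predict([[30, 40]]))
```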


Random Forest

A random forest classifier is a meta-estimator that fits a number of decision trees on various sub-samples of the dataset and uses averaging to improve the predictive accuracy of the model and control over-fitting. The sub-sample size is the same as the original input sample size, but the samples are drawn with replacement (bootstrap sampling).
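A minimal scikit-learn sketch; the synthetic dataset and the choice of 100 trees are assumptions made for illustration only.

```python
# A minimal random-forest sketch with scikit-learn.
# Synthetic data and n_estimators=100 are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# bootstrap=True draws each tree's sub-sample with replacement,
# matching the behaviour described above
forest = RandomForestClassifier(n_estimators=100, bootstrap=True, random_state=0)
forest.fit(X_train, y_train)

print("test accuracy:", forest.score(X_test, y_test))
```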


Support Vector Machine

A support vector machine represents the training data as points in space, separated into categories by a clear gap that is as wide as possible. New examples are then mapped into that same space and predicted to belong to a category based on which side of the gap they fall on.


Examples of hyperplanes:

  • H1 is not a good hyperplane, as it does not separate the classes.
  • H2 separates them, but only with a small margin.
  • H3 separates them with the maximum margin (distance).

Parameters of SVM

There are three main parameters we can tune when constructing an SVM classifier (a sketch follows this list):

  • Type of kernel
  • Gamma value
  • C value
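A minimal scikit-learn sketch showing where these three parameters appear; the toy data and the specific kernel, gamma, and C values are assumptions for illustration, not recommendations.

```python
# A minimal SVM sketch with scikit-learn, showing the three main parameters.
# Toy data; the kernel, gamma and C values are illustrative only.
from sklearn.svm import SVC

X_train = [[1, 1], [2, 1], [1, 2],      # class 0
           [6, 5], [7, 7], [6, 7]]      # class 1
y_train = [0, 0, 0, 1, 1, 1]

model = SVC(
    kernel="rbf",   # type of kernel (e.g. 'linear', 'poly', 'rbf')
    gamma=0.5,      # gamma: how far the influence of a single example reaches
    C=1.0,          # C: trade-off between a wide margin and misclassification
)
model.fit(X_train, y_train)

print(model.predict([[2, 2], [6, 6]]))  # expected: [0 1]
```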

 

 

