Classification in Data Science
What is Classification?
Although classification can be performed on both structured and unstructured data, it is mainly used for structured data where the target feature is already labelled. Classification is a technique in which we categorise data into a given number of classes. The main goal of a classification problem is to identify the category/class into which a new observation will fall.
Before we get into the details of the classification algorithms of data science, let's get familiar with a few key terms:
- Classifier: An algorithm that maps the input data to a specific category.
- Classification model: A classification model tries to draw some conclusion from the input values given for training. It will predict the class labels/categories for the new data.
- Feature: A feature is an individual measurable property of a phenomenon being observed.
- Binary Classification: Classification task with two possible outcomes, e.g. gender classification (male/female).
- Multi-class Classification: Classification with more than two classes. In multi-class classification each sample is assigned to one and only one target label, e.g. an animal can be a cat or a dog or a cow or a deer, but not two at the same time.
- Multi-label Classification: Classification task where each sample is mapped to a set of target labels (more than one class), e.g. a news article can be about a sport, a person, and a location at the same time.
Types of classification algorithms:
Based on the type of problem statement and the type of target feature, classification can be categorised into multiple types. Some of the widely used algorithms are described below:
Naïve Bayes
The Naive Bayes algorithm is based on Bayes' theorem with the assumption of independence between every pair of features. Naive Bayes classifiers work well in many real-world situations such as document classification and spam filtering.
Naive Bayes is a probabilistic classifier inspired by Bayes' theorem, under the simple assumption that the attributes are conditionally independent.
The classification is conducted by deriving the maximum posterior, i.e. the class Ci with the maximal P(Ci|X), obtained by applying Bayes' theorem under the above assumption. Concretely, Bayes' theorem gives P(Ci|X) = P(X|Ci)·P(Ci)/P(X); the independence assumption factors P(X|Ci) into P(x1|Ci)·P(x2|Ci)·…·P(xn|Ci), and since P(X) is the same for every class, the prediction is simply the class that maximises P(Ci)·P(x1|Ci)·…·P(xn|Ci). This assumption greatly reduces the computational cost, since only the per-class distributions of individual attributes need to be counted. Even though the assumption is rarely valid in practice, because the attributes are usually dependent, Naive Bayes has nevertheless been able to perform impressively well.
Naive Bayes is a very simple algorithm to implement, and good results have been obtained in most cases. It scales easily to larger datasets, since it takes linear time rather than the expensive iterative approximation used by many other types of classifiers.
Naive Bayes can suffer from the zero-probability problem: when the conditional probability for an attribute is zero, the model fails to give a valid prediction. This needs to be fixed explicitly using a Laplacian estimator (Laplace smoothing).
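As a minimal sketch (a toy, made-up spam-filtering example using scikit-learn's MultinomialNB), the alpha parameter below is exactly this Laplace smoothing, which keeps the conditional probabilities away from zero:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy spam-filtering data (invented for illustration)
messages = [
    "win a free prize now",
    "limited offer claim your prize",
    "meeting rescheduled to monday",
    "see you at lunch tomorrow",
]
labels = ["spam", "spam", "ham", "ham"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(messages)  # bag-of-words word counts

# alpha=1.0 is Laplace smoothing; it avoids the zero-probability problem
clf = MultinomialNB(alpha=1.0)
clf.fit(X, labels)

print(clf.predict(vectorizer.transform(["claim your free prize"])))  # expected: ['spam']
```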
Stochastic Gradient Descent
Stochastic gradient descent is a simple and very efficient approach to fitting linear models. It is particularly useful when the number of samples is very large. It supports different loss functions and penalties for classification.
The steps of the algorithm are as follows (a code sketch of these steps is given after the list):
1. Find the slope of the objective function with respect to each parameter/feature; in other words, compute the gradient of the function. (To clarify, in the parabola example, differentiate "y" with respect to "x". If we had more features like x1, x2, etc., we would take the partial derivative of "y" with respect to each of them.)
2. Pick a random initial value for the parameters.
3. Update the gradient function by plugging in the parameter values.
4. Calculate the step size for each feature as: step size = gradient * learning rate.
5. Calculate the new parameters as: new params = old params - step size.
6. Repeat steps 3 to 5 until the gradient is almost 0.
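As a minimal sketch of these steps on the parabola example from step 1 (y = x², whose gradient is 2x; the starting value and learning rate are arbitrary illustrative choices):

```python
# Gradient descent on y = x**2: the gradient dy/dx is 2*x (step 1),
# and the minimum lies at x = 0.
def gradient(x):
    return 2 * x

x = 5.0              # step 2: arbitrary initial parameter value
learning_rate = 0.1  # illustrative choice

for _ in range(100):
    grad = gradient(x)                 # step 3: plug the parameter into the gradient
    step_size = grad * learning_rate   # step 4
    x = x - step_size                  # step 5: new params = old params - step size
    if abs(grad) < 1e-6:               # step 6: stop when the gradient is almost 0
        break

print(x)  # close to 0, the minimum of y = x**2
```

In scikit-learn, the same idea applied to linear classifiers is available as SGDClassifier, whose loss and penalty arguments select the loss function and the regularisation penalty.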
K-Nearest Neighbours
Neighbours-based classification is a type of lazy learning: it does not attempt to construct a general internal model, but simply stores instances of the training data. Classification is computed from a simple majority vote of the k nearest neighbours of each point.
It is a non-parametric, lazy algorithm. It is non-parametric since it makes no assumption about the data distribution (the data does not have to be normally distributed). It is lazy since it does not really learn a model or generalise from the data (it does not train the parameters of some function where input X gives output y).
So strictly speaking, this is not really a learning algorithm. It simply classifies objects based on feature similarity (feature = input variables).
Fig: classifying a new example based on its distance to the training instances.
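A minimal sketch with scikit-learn's KNeighborsClassifier (the 2-D points and labels below are invented for illustration; n_neighbors is the k used in the majority vote):

```python
from sklearn.neighbors import KNeighborsClassifier

# Toy 2-D training points and their classes (invented for illustration)
X_train = [[1, 1], [1, 2], [2, 1],   # class "A"
           [6, 6], [6, 7], [7, 6]]   # class "B"
y_train = ["A", "A", "A", "B", "B", "B"]

# k = 3: a new point gets the majority class of its 3 nearest neighbours
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)  # "lazy": this essentially just stores the training data

print(knn.predict([[2, 2], [6, 5]]))  # expected: ['A' 'B']
```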
Decision Tree
Given data of attributes together with their classes, a decision tree produces a sequence of rules that can be used to classify the data.
A decision tree, as its name says, makes decisions with a tree-like model. It splits the sample into two or more homogeneous sets (leaves) based on the most significant differentiators among the input variables. To choose a differentiator (predictor), the algorithm considers all features and does a binary split on each of them (for categorical data, split by category; for continuous data, pick a cut-off threshold). It then chooses the split with the least cost (i.e. highest accuracy), and repeats recursively until it successfully splits the data in all leaves (or reaches the maximum depth).
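As a minimal sketch, fitting scikit-learn's DecisionTreeClassifier on the built-in iris dataset makes the "sequence of rules" concrete; export_text prints the learned if/else rules (max_depth is the maximum depth mentioned above):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# max_depth limits how many times the data is split before reaching the leaves
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# The fitted tree is literally a sequence of if/else rules over the features
print(export_text(tree, feature_names=list(iris.feature_names)))
```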
Random Forest
A random forest classifier is a meta-estimator that fits a number of decision trees on various sub-samples of the dataset and averages their predictions to improve the predictive accuracy of the model and control over-fitting. The sub-sample size is, by default, the same as the original input sample size, but the samples are drawn with replacement.
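A minimal sketch with scikit-learn's RandomForestClassifier on a synthetic dataset (the data is generated purely for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic two-class dataset, purely for illustration
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# n_estimators = number of decision trees; each tree is fit on a bootstrap
# sample of the training data (drawn with replacement) and the forest
# averages the trees' predictions
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

print(forest.score(X_test, y_test))  # mean accuracy on the held-out data
```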
Support Vector Machine
A support vector machine represents the training data as points in space, separated into categories by a clear gap that is as wide as possible. New examples are then mapped into that same space and predicted to belong to a category based on which side of the gap they fall.
Description of the diagram (examples of hyperplanes):
· H1 is not a good hyperplane, as it does not separate the classes
· H2 does separate them, but only with a small margin
· H3 separates them with the maximum margin (distance)
Parameters of SVM
There are three main parameters we can play with when constructing an SVM classifier:
· Type of kernel
· Gamma value
· C value
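As a minimal sketch, the three parameters above map directly to the kernel, gamma and C arguments of scikit-learn's SVC (the dataset and the parameter values below are illustrative choices):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X_train, X_test, y_train, y_test = train_test_split(
    *load_iris(return_X_y=True), test_size=0.3, random_state=0)

# kernel: shape of the decision boundary ('linear', 'rbf', 'poly', ...)
# gamma:  how far the influence of a single training example reaches (rbf/poly kernels)
# C:      trade-off between a wide margin and classifying every training point correctly
svm = SVC(kernel="rbf", gamma="scale", C=1.0)
svm.fit(X_train, y_train)

print(svm.score(X_test, y_test))  # mean accuracy on the held-out data
```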