Classification vs Regression
Alka Pandey
Aspiring Data Analyst ?? | Skilled in Data Visualization ?? | Proficient in Python, SQL, & Excel ?? | Enthusiastic about Uncovering Business Insights ?? | Learning from Google Data Analytics Program ??
These are Supervised Learning algorithms that work on labeled data sets and are used for prediction. Both of these techniques do Predictive Modelling.
Predictive Modeling: Developing models that use historical data to make new predictions.
Regression
It describes the relationship between dependent and independent variables of variables in a data set.
To put it simply, if we are making numerical predictions on a data set, we use “regression” algorithms.
Example: Price estimation using the features of the house, Salary estimation, height estimation by age, etc.
Frequently Used Regression Algorithms
Classification
An attempt is made to predict the y class categorically using x (input values).
To simplify it, the value we predict in classification problems is not a numerical value (such as salary, height), but categorical values such as sick/healthy, rainy/sunny, successful/unsuccessful, positive/negative.
Example: Spam email detection is one of the best examples of classification. Trained on millions of emails, the model classifies new emails as “spam” and “not spam.” Once spam is detected, the email is sent to the spam folder.
Frequently Used Classification Algorithms
Support Vector Machines (SVM)
Support Vector Machines (SVM) are robust and effective machine learning algorithms that transform your data into a linear decision space. The algorithm then determines the optimal hyperplane in this linear decision space, which separates your training data into different classes, such as legitimate or spam email.
SVM is particularly good at separating similar things. For instance, all the examples in a group might be similar, while some examples in a different group might be very different from each other.?
This makes it ideal for classification tasks that can be well classified with a straight line, which, as you can imagine, does not apply to all classification tasks
K-Nearest Neighbors (KNN)
The K-nearest neighbors algorithm is one of the most basic classification algorithms. It compares each example of a particular class to all examples of that class by proximity.
This is useful for two reasons:??
It's easy to implement because we can just take the mean of each class, which is easy to calculate. KNN can also be used to solve classification problems with multiple classes, since it's not sensitive to the number of attributes that can be used.
Decision Trees
Decision trees are a very common way to represent and visualize possible outcomes of a decision or an action based on probabilities. In a nutshell, a decision tree is a hierarchical representation of the outcome space where every node represents a choice or an action, with leaves representing states of the outcome.
Decision trees are highly human-readable since they generally follow a particular structure such as:
领英推荐
Now, we can visualize decision trees as a treelike structure where each leaf represents a state or an outcome, and the internal branches represent the various paths that could lead to that leaf. This is demonstrated in the following figure.
In our example above, consider whether or not the weather conditions are OK to play a football game. The first step would be to start at the root node labeled “Weather”, representing current weather conditions. We then move onto a branch, and the next nodes, depending on whether it’s sunny, overcast, or rainy. Finally, we continue down the decision tree until we have an outcome: could play or couldn’t play.
However, this simplicity is deceptive because decision trees can be used across a broad range of applications. For example, they are widely used in:
Decision trees are simpler than their random forest counterpart, which combines many decision trees into one model, which may be better suited for multi-class classification and other complex artificial intelligence tasks.?
While decision trees and neural networks each represent a different way of classifying (or grouping) data into clusters that share common characteristics (or features), there are some key differences between them.?
Essentially, decision trees work best for simple cases with few variables, while neural networks perform better when the data has more complex relationships between features or values (i.e., it’s “dense”).?
As such, decision trees are often used as the first line classification method in simple data science projects. However, they may not scale well when faced with large amounts of high-dimensional data, i.e., it’s difficult to interpret meaningful results from their analysis.?
Neural networks often become necessary if we seek to analyze large amounts of high-dimensional data, so let’s dive into neural nets.
Artificial Neural Networks (ANNs)
Artificial Neural Networks have been one of the most commonly used machine learning algorithms. As the name suggests, they emulate biological neural networks in computers.
More complex models end up requiring more parameters which can be slower at classifying new data points compared to simpler methods such as linear or logistic regression algorithms. However, this flexibility and scalability mean they cancan handle large amounts of unlabeled data with ease.
The way ANNs work is by training a set of parameters on data. These parameters are then used to determine the model’s output, which may be an input or an action.
For example, if we have a model for predicting whether or not someone will buy a car based on previous purchases and demographics, we can give it new data to predict what’s next. We can use multiple inputs, including things like age, marital status, income, and so on to drive its predictions.
Nodes in a neural network are connected by weighted links based on their relevance to the outcome, which is typically discovered through an error-minimization method known as backpropagation. Each layer in the network, and its corresponding neurons, roughly correlate to certain features or attributes.
The simplest form of artificial neural networks is the perceptron. Nowadays, deep neural networks are used for more complex problems like image categorization, where there are countless possible inputs (e.g., different road conditions for a self-driving car).
The end goal of the training is to tune the parameters (weights) for each node so that they best represent the data it has seen. When all nodes are trained, they should be able to make decisions based on their parameters.
One drawback of ANNs is that they are very dependent on the data used to train them. If the data is not representative of what we’re trying to predict, it can lead to overfitting. Akkio mitigates these issues with regularization methods.
They also tend to be rather slow compared to other algorithms — especially when compared against the likes of linear regression — which may impact their performance when training large models. Akkio combats this with a proprietary model training process that's 100x faster than the competition. In other words, you can build an artificial neural network in seconds.
Another drawback is that neural networks can be difficult to explain to others who may not be as familiar with the field. That said, no other AI architecture has managed such an enormous amount of growth in such a short time.
Logistic Regression
Logistic regression is actually a non-linear extension of linear regression that allows us to handle classes. This is achieved by classifying predictions into a given class based on a probability threshold.
Consider the following example of the probability of passing an exam versus hours of studying. Suppose we have a variable Y representing the probability of passing an exam and a variable X representing hours spent studying. In that case, we can fit a line through these points using a regression predictor.
We may then classify the point into one of two categories: pass or fail, based on how close its line is to a threshold. While this is a simple example, logistic regression is used in many real-world situations, such as when determining creditworthiness along a spectrum of categories, as well as other multi-label classification problems.
Logistic regression is an alternative to the naive Bayes classifier, which uses Bayes’ theorem, and results in higher bias but lower variance.?
Difference between Classification and Regression
Former Special Director General, Region Chennai at CPWD Govt. Of India
11 个月Thorough and good for learning.
Developer at Deloitte | Data Science | SAC | Founder of TechStudio | Speaker
11 个月Quite useful ??