A broad overview of machine learning models and their classifications


A general definition of Machine Learning (ML) is “the art and science of programming computers so they can learn from data”. My favourite definition is the one below because it tells you instantly what a Machine Learning program does.

According to Tom Mitchell (1997)

“A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.”

We use Machine Learning every day in most of the apps we use, from recommendation systems to email. A general example of a Machine Learning program that we are all familiar with is the email spam filter. The examples the spam filter learns from make up the training set, and each email used is known as a training instance. The task T here is to flag new emails as spam or not, and the experience E is the training set. The data scientist defines what the performance measure P should be. P could be defined as the ratio of correctly classified emails, which is known as accuracy in a classification model such as this one.
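
To make the E/T/P framing concrete, here is a minimal sketch of computing the performance measure P as accuracy; the emails and labels are made up for illustration:

```python
def accuracy(predictions, labels):
    """Performance measure P: the fraction of emails classified correctly."""
    correct = sum(p == l for p, l in zip(predictions, labels))
    return correct / len(labels)

# Hypothetical spam-filter outputs on four test emails versus the true labels.
predictions = ["spam", "ham", "spam", "ham"]
labels      = ["spam", "ham", "ham",  "ham"]

print(accuracy(predictions, labels))  # 0.75
```

Here one of the four emails is misclassified, so P is 0.75; a higher value after more training would mean the program has "learned from experience E".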

A well-built ML system can adapt to changes in the input data without being rewritten. Automating a task this way saves the time you would otherwise spend writing new scripts or programs each time the rules or data change.

The advantage of using an ML algorithm in the spam filter example is that the data scientist can inspect the trained model to see which words it flags as indicators of spam, which might uncover underlying patterns and trends in the data that were not obvious previously.

The task of using ML algorithms to analyse a vast amount of data to discover patterns that are not immediately apparent is called data mining. 

Therefore, Machine Learning is useful in solving the following problems:

  • Problems that would otherwise require a long list of hand-written rules, which a single ML algorithm can often replace
  • Complex problems for which a traditional, rule-based approach yields no good solution
  • Constantly changing environments with new data streams, since ML models can be retrained or updated as new data arrives
  • Uncovering insights into complex problems, and patterns and trends in big data

So, how many types of Machine Learning algorithm are there? The discipline is continuously evolving, and technological advancements keep producing new algorithms. However, the main types of ML used in data science projects fall into a few broad categories.

The main classifications of ML models are:

  1. Supervised versus unsupervised learning
  2. Online versus batch learning
  3. Instance-based versus model-based learning

Based on the amount of supervision required during training, ML models can be divided into supervised learning, unsupervised learning, semi-supervised learning and reinforcement learning.

Supervised learning involves including the desired results, called labels, in the training set. A typical supervised learning task is classification, such as the spam filter mentioned earlier, which categorises emails as spam or non-spam.

Regression is another supervised learning task: predicting a target numeric value, such as the price of a house, from a set of features known as predictors, such as the square footage of the house, distance to a good school, and proximity to transport links. Despite its name, logistic regression is commonly used for classification tasks, for example to estimate the probability that an email is spam.
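
As an illustration of the logistic regression idea, the sketch below maps a weighted sum of email features to a spam probability via the sigmoid function. The features ("free" count, "winner" count), weights and bias are invented for the example; a real model learns them from the training set:

```python
import math

def spam_probability(features, weights, bias):
    """Logistic regression: squash a weighted sum of features into a 0-1 probability."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 / (1 + math.exp(-z))  # the sigmoid function

# Hypothetical features: [count of "free", count of "winner"], with made-up weights.
p_spammy = spam_probability([3, 1], weights=[1.2, 2.0], bias=-4.0)  # high probability
p_clean  = spam_probability([0, 0], weights=[1.2, 2.0], bias=-4.0)  # low probability
```

An email would then be classified as spam whenever the probability crosses a chosen cut-off, commonly 0.5.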

In data science, the main types of supervised learning algorithms used are:

  • k-Nearest Neighbors
  • Linear Regression
  • Logistic Regression
  • Support Vector Machines (SVMs)
  • Decision Trees and Random Forests
  • Neural networks
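
Of the algorithms listed above, k-Nearest Neighbors is simple enough to sketch in a few lines. This toy version, assuming small in-memory data and Euclidean distance, labels a query point by majority vote among its k closest training instances:

```python
from collections import Counter

def knn_predict(train, query, k=3):
    """k-Nearest Neighbors: label the query by majority vote of its k closest points.

    `train` is a list of (feature_vector, label) pairs.
    """
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    nearest = sorted(train, key=lambda pair: dist(pair[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# A made-up two-feature dataset with two classes.
train = [((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"), ((8, 8), "B"), ((9, 8), "B")]
print(knn_predict(train, (1.5, 1.5)))  # "A"
```

Note that kNN stores the whole training set and compares against it at prediction time, which is why it also serves as the textbook example of instance-based learning discussed later in this article.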

In unsupervised learning, as the name suggests, the training set is unlabelled and the system tries to learn without a teacher. Examples include:

  • Clustering: k-Means, Hierarchical Cluster Analysis (HCA), Expectation Maximisation.
  • Visualisation and dimensionality reduction: Principal Component Analysis (PCA), Kernel PCA, Locally-Linear Embedding (LLE), t-Distributed Stochastic Neighbour Embedding (t-SNE).
  • Association rule learning: Apriori, Eclat.

Clustering means grouping similar instances together. For example, I can analyse my blog visitors by running a clustering algorithm to group similar visitors together. You can then use a hierarchical clustering algorithm to subdivide each group into subgroups for more insight.
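
The blog-visitor example can be sketched with a bare-bones k-Means implementation. The visitor data here is invented, and the initialisation is deliberately naive (the first k points); real implementations use random restarts or k-means++:

```python
def k_means(points, k=2, iterations=10):
    """Bare-bones k-Means: repeatedly assign each point to its nearest
    centroid, then move each centroid to the mean of its cluster."""
    centroids = points[:k]  # naive initialisation for a deterministic sketch
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])))
            clusters[nearest].append(p)
        # Move each centroid to its cluster's mean; keep it if the cluster emptied.
        centroids = [
            tuple(sum(coords) / len(c) for coords in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids

# Two obvious groups of "blog visitors", each described by two features.
visitors = [(1, 1), (1, 2), (2, 1), (9, 9), (9, 10), (10, 9)]
centroids = sorted(k_means(visitors))
print(centroids)  # one centroid near (1.3, 1.3), one near (9.3, 9.3)
```

Each returned centroid summarises one group of similar visitors; assigning a new visitor to the nearest centroid places them in a group.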

Other unsupervised techniques include anomaly detection and association rule learning. Anomaly detection use cases include credit card fraud prevention (detecting unusual transactions), identifying manufacturing defects, and automatically removing outliers from a dataset before feeding it into another algorithm. The system is shown mostly normal instances during training; when new data comes in, the algorithm assesses whether it looks normal or is an anomaly.
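
The credit card example can be illustrated with the simplest possible anomaly detector: flag any transaction whose z-score (distance from the mean in standard deviations) exceeds a threshold. The transaction amounts and threshold are made up, and real fraud systems use far richer models:

```python
import statistics

def find_anomalies(amounts, threshold=2.0):
    """Flag values whose z-score exceeds the threshold."""
    mean = statistics.mean(amounts)
    sd = statistics.stdev(amounts)
    return [a for a in amounts if abs(a - mean) / sd > threshold]

# Hypothetical card transactions: mostly small, one unusually large.
transactions = [12.5, 9.9, 14.2, 11.0, 13.3, 10.8, 950.0]
print(find_anomalies(transactions))  # [950.0]
```

One caveat worth noting: a large outlier inflates the mean and standard deviation it is measured against, which is why a fairly low threshold is used here and why robust statistics are preferred in practice.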

Association rule learning is used to discover interesting relationships between attributes, such as items that customers buy together. You can use it to categorise customers by spending patterns or shopping habits. For instance, analysing customers' shopping baskets might reveal that people who buy nappies also tend to buy beer. This can inform the way items are arranged in the store to attract customers and improve sales.
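
The nappies-and-beer example boils down to computing the support and confidence of an association rule. A minimal sketch, with invented baskets:

```python
def confidence(baskets, antecedent, consequent):
    """Confidence of the rule antecedent -> consequent:
    of the baskets containing the antecedent, what fraction also contain the consequent?"""
    with_antecedent = [b for b in baskets if antecedent in b]
    with_both = [b for b in with_antecedent if consequent in b]
    return len(with_both) / len(with_antecedent) if with_antecedent else 0.0

# Hypothetical shopping baskets.
baskets = [
    {"nappies", "beer", "milk"},
    {"nappies", "beer"},
    {"nappies", "bread"},
    {"milk", "bread"},
]
print(confidence(baskets, "nappies", "beer"))  # 2/3: most nappy-buyers also buy beer
```

Algorithms such as Apriori and Eclat, listed earlier, are essentially efficient ways of searching for rules like this whose support and confidence exceed chosen minimums.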

Dimensionality reduction algorithms can be used to clean or simplify a dataset and improve model performance. One approach is to combine correlated variables or features into one. For instance, a car's age and mileage could be merged into a single new feature representing its wear and tear.
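
A hand-rolled version of that feature combination might look like the sketch below. The normalising constants (maximum age and mileage) are invented assumptions; PCA and related methods learn such combinations from the data instead:

```python
def wear_and_tear(age_years, mileage_km, max_age=25.0, max_mileage=300_000.0):
    """Combine two correlated features into one 0-1 'wear' score."""
    age_part = min(age_years / max_age, 1.0)
    mileage_part = min(mileage_km / max_mileage, 1.0)
    return (age_part + mileage_part) / 2  # simple average of the two normalised parts

# A 10-year-old car with 120,000 km gets a mid-range wear score.
print(wear_and_tear(10, 120_000))  # 0.4
```

Replacing two features with one reduces the dimensionality of the dataset while keeping the information relevant to, say, predicting the car's price.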

Semi-supervised learning works with training data in which only a few samples are labelled, leaving the algorithm to do the rest. These systems are usually combinations of supervised and unsupervised algorithms. An example is deep belief networks (DBNs), in which unsupervised components called restricted Boltzmann machines (RBMs) are stacked on top of one another; the network is trained in an unsupervised manner and then fine-tuned using a supervised learning method. A familiar application is photo services such as Google Photos, which can recognise family members in new uploads after you have named each person just once.

Reinforcement Learning: Here, the learning program, called the agent, observes the environment, then selects and performs actions and receives rewards in return (or penalties for bad actions). Over time, the system learns the best strategy for the situation, called a policy. Reinforcement learning is used, for example, to teach robots to perform tasks at the required level.
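
The agent/reward/policy loop can be sketched with a tiny multi-armed bandit, a classic minimal reinforcement-learning setting. The reward values, noise level and hyperparameters below are invented for the example:

```python
import random

def train_bandit(true_rewards, episodes=2000, epsilon=0.1, seed=42):
    """Epsilon-greedy agent: try actions, observe noisy returns, and keep a
    running value estimate per action. The learned 'policy' is to pick the
    action with the highest estimated value."""
    rng = random.Random(seed)
    values = [0.0] * len(true_rewards)   # estimated value of each action
    counts = [0] * len(true_rewards)
    for _ in range(episodes):
        # Explore a random action occasionally, otherwise exploit the best estimate.
        if rng.random() < epsilon:
            action = rng.randrange(len(true_rewards))
        else:
            action = max(range(len(true_rewards)), key=lambda i: values[i])
        reward = true_rewards[action] + rng.gauss(0, 0.1)  # noisy return
        counts[action] += 1
        values[action] += (reward - values[action]) / counts[action]  # incremental mean
    return max(range(len(true_rewards)), key=lambda i: values[i])

# Action 2 has the highest true reward, so the learned policy should pick it.
best_action = train_bandit([0.1, -0.5, 1.0])
```

Full reinforcement learning adds states and sequences of decisions on top of this loop, but the core idea (learn action values from rewards, then act greedily) is the same.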

Batch and Online Learning: In batch learning, the system is trained on all the available data offline, and its performance is evaluated before it is launched. In online learning, the system is trained incrementally as data is fed in; such a system is useful in applications with fast-changing data such as stock prices. However, bad data fed to an online system degrades it for end users, so you may need an anomaly detection algorithm to screen incoming data, and you should monitor the system closely so that learning can be switched off once a problem is detected.
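
The difference from batch learning is visible in even the smallest possible online learner: a running average that updates one observation at a time, with no retraining on the full history. The price stream here is invented:

```python
class OnlineMean:
    """Online learning in miniature: the estimate is updated one observation
    at a time, so the model adapts as new data streams in (e.g. a running
    average of a fast-moving stock price)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0

    def update(self, x):
        self.n += 1
        self.mean += (x - self.mean) / self.n  # incremental update, no retraining
        return self.mean

model = OnlineMean()
for price in [100.0, 101.0, 99.0, 102.0]:
    model.update(price)
print(model.mean)  # 100.5
```

A batch learner would instead recompute from the whole dataset each time; the online update costs the same no matter how much data has already streamed past.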

Instance-Based Versus Model-Based Learning: This classifies Machine Learning models by how they generalise. In instance-based learning, the system learns the training examples by heart and generalises to new cases using a similarity measure. In model-based learning, a model is built from the training examples and then used to make predictions. Regression analysis is an example of model-based learning and can be used to uncover trends and patterns in a dataset.

This leads us to the next logical question: what challenges can reduce the predictive power of a Machine Learning algorithm? Fundamentally, two things can go wrong: a bad algorithm or bad data. Let's break this down. The main bottlenecks are:

Insufficient data: You need a large amount of data to obtain accurate predictions from most machine learning models. With very little data, you may be better off with a simpler statistical model.

Lack of a representative training set: The training set should be representative of the cases the model will see at inference time. Data ethics also matters here, to reduce the risk of bias.

Lack of quality data: It is important to clean your training set to remove errors and outliers, and to handle missing values and noise.

Feature quality: You need to ensure that only relevant features are included in the model, a process known as feature engineering. Feature engineering comprises three things: feature selection, feature extraction (for example, using a dimensionality reduction algorithm to combine existing features), and creating new features by collecting new data.

Overfitting the training data: Overfitting occurs when the model fits the random noise in the training set rather than just the underlying relationships between the analysed features (variables), producing misleading p-values, coefficients and R-squared values. Regularisation constrains the model to reduce overfitting, and you tune a hyperparameter to control how strong that constraint is.
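
Regularisation is easiest to see on one-feature ridge regression with no intercept, where the closed-form solution makes the shrinking effect explicit. The data and penalty values below are invented for the example:

```python
def ridge_slope(xs, ys, lam):
    """One-feature ridge regression (no intercept): minimise
    sum((y - w*x)^2) + lam * w^2, whose closed-form solution is
    w = sum(x*y) / (sum(x^2) + lam). The penalty `lam` shrinks w toward 0."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

xs, ys = [1, 2, 3], [2, 4, 6]          # data lying exactly on y = 2x
print(ridge_slope(xs, ys, 0.0))        # 2.0  (no penalty: the unregularised fit)
print(ridge_slope(xs, ys, 14.0))       # 1.0  (a strong penalty shrinks the slope)
```

Choosing `lam` is exactly the hyperparameter-tuning problem mentioned above: too small and the model can still overfit, too large and it underfits.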

Underfitting the training data: This is the opposite of the situation described above. Underfitting can be addressed by using a more powerful model with more parameters (a multivariate instead of a univariate model, for example), better feature engineering, or reducing the regularisation hyperparameter.

Finally, how do you test or validate a Machine Learning model before it is deployed? The recommended approach is to divide your data into two sets: a training set for fitting the model and a test set for assessing its predictive power. The error rate on the test set, measured on instances the model has never seen, is called the generalisation error, and it tells you how well the model will perform on new cases. To avoid choosing the wrong hyperparameters based on a single test set, cross-validation splits the training set into complementary subsets: the model is trained on different combinations of these subsets and validated on the remaining parts. For forecasting problems, the Mean Absolute Percentage Error (MAPE) is a common error measure.
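
A minimal sketch of the hold-out split and the MAPE measure mentioned above, with toy numbers. Note the split here is a naive, unshuffled cut; real workflows shuffle the data or use time-aware splits for forecasting:

```python
def train_test_split(data, test_ratio=0.25):
    """Hold out the last portion of the data as the test set."""
    cut = int(len(data) * (1 - test_ratio))
    return data[:cut], data[cut:]

def mape(actual, predicted):
    """Mean Absolute Percentage Error: average of |error| / |actual|, as a percentage."""
    return 100 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)

data = list(range(1, 9))               # 8 toy observations
train, test = train_test_split(data)   # 6 for training, 2 held out for testing
print(len(train), len(test))           # 6 2

# Two forecasts, each off by 10% of the actual value, give a MAPE of 10%.
error = mape([100, 200], [110, 180])
```

The model is fitted only on `train`, and the error measured on `test` is the generalisation error the paragraph above describes.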

An example of reading validation results: if the training error is low (few mistakes on the training set) but the generalisation error is high, the model is overfitting the training data. If both the training error and the generalisation error are high, the model is underfitting and needs fine-tuning, such as a more powerful model or better features.

Clearly, Machine Learning is an exciting part of data science, with many capabilities and use cases for solving business problems and improving the performance of existing applications.

In summary, the main goal of Machine Learning is to train predictive models that can be used to enhance applications. Therefore, it is important for both technical and non-technical people to understand the fundamental concepts as presented in this article.

Useful Document:

Microsoft ML Cheatsheet

More articles by Aisha Ekundayo, PhD