A broad overview of machine learning models and their classifications


A general definition of Machine Learning (ML) is “the art and science of programming computers so they can learn from data”. My favourite definition is the one below because it tells you instantly what a Machine Learning program does.

According to Tom Mitchell (1997)

“A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.”

We use Machine Learning every day in most of the apps we use, from recommendation systems to email. A general example of a Machine Learning program that we are all familiar with is the email spam filter. The examples the spam filter learns from make up the training set, and each email used is known as a training instance. The task T here is to flag new emails as spam or not, and the experience E is the training set. The data scientist defines what the performance measure P should be. P could be defined as the ratio of correctly classified emails, which is known as accuracy in a classification model such as this one.
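
To make the E/T/P framing concrete, here is a minimal sketch of computing the performance measure P as accuracy; the emails and labels are made up for illustration:

```python
def accuracy(predictions, labels):
    """Performance measure P: the fraction of emails classified correctly."""
    correct = sum(p == l for p, l in zip(predictions, labels))
    return correct / len(labels)

# Hypothetical spam-filter outputs on four test emails versus the true labels.
predictions = ["spam", "ham", "spam", "ham"]
labels      = ["spam", "ham", "ham",  "ham"]

print(accuracy(predictions, labels))  # 0.75
```

Here one of the four emails is misclassified, so P is 0.75; a higher value after more training would mean the program has "learned from experience E".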

A well-built ML system can adapt to changes in the input data without being rewritten. Automating a task this way saves the time you would otherwise spend writing new scripts or programs each time the rules or data change.

The advantage of using an ML algorithm in the spam filter example is that the data scientist can inspect the trained model to see which words it flags as indicators of spam, which might uncover underlying patterns and trends in the data that were not obvious previously.

The task of using ML algorithms to analyse a vast amount of data to discover patterns that are not immediately apparent is called data mining. 

Therefore, Machine Learning is useful in solving the following problems:

  • Problems that would otherwise require a long list of hand-written rules, which a single ML algorithm can often replace
  • Complex problems for which a traditional, rule-based approach yields no good solution
  • Constantly changing environments with new data streams, since ML models can be retrained or updated as new data arrives
  • Uncovering insights into complex problems, and patterns and trends in big data

So, how many types of Machine Learning algorithm are there? The discipline is continuously evolving, and technological advancements keep producing new algorithms. However, the main types of ML used in data science projects fall into a few broad categories.

The main classifications of ML models are:

  1. Supervised versus unsupervised learning
  2. Online versus batch learning
  3. Instance-based versus model-based learning

Based on the amount of supervision required during training, ML models can be divided into supervised learning, unsupervised learning, semi-supervised learning and reinforcement learning.

Supervised learning involves including the desired results, called labels, in the training set. A typical supervised learning task is classification, such as the spam filter mentioned earlier, which categorises emails as spam or non-spam.

Regression is another supervised learning task: predicting a target numeric value, such as the price of a house, from a set of features known as predictors, such as the square footage of the house, distance to a good school, and proximity to transport links. Despite its name, logistic regression is commonly used for classification tasks, for example to estimate the probability that an email is spam.
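
As an illustration of the logistic regression idea, the sketch below maps a weighted sum of email features to a spam probability via the sigmoid function. The features ("free" count, "winner" count), weights and bias are invented for the example; a real model learns them from the training set:

```python
import math

def spam_probability(features, weights, bias):
    """Logistic regression: squash a weighted sum of features into a 0-1 probability."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 / (1 + math.exp(-z))  # the sigmoid function

# Hypothetical features: [count of "free", count of "winner"], with made-up weights.
p_spammy = spam_probability([3, 1], weights=[1.2, 2.0], bias=-4.0)  # high probability
p_clean  = spam_probability([0, 0], weights=[1.2, 2.0], bias=-4.0)  # low probability
```

An email would then be classified as spam whenever the probability crosses a chosen cut-off, commonly 0.5.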

In data science, the main types of supervised learning algorithms used are:

  • k-Nearest Neighbors
  • Linear Regression
  • Logistic Regression
  • Support Vector Machines (SVMs)
  • Decision Trees and Random Forests
  • Neural networks
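
Of the algorithms listed above, k-Nearest Neighbors is simple enough to sketch in a few lines. This toy version, assuming small in-memory data and Euclidean distance, labels a query point by majority vote among its k closest training instances:

```python
from collections import Counter

def knn_predict(train, query, k=3):
    """k-Nearest Neighbors: label the query by majority vote of its k closest points.

    `train` is a list of (feature_vector, label) pairs.
    """
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    nearest = sorted(train, key=lambda pair: dist(pair[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# A made-up two-feature dataset with two classes.
train = [((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"), ((8, 8), "B"), ((9, 8), "B")]
print(knn_predict(train, (1.5, 1.5)))  # "A"
```

Note that kNN stores the whole training set and compares against it at prediction time, which is why it also serves as the textbook example of instance-based learning discussed later in this article.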

In unsupervised learning, as the name suggests, the training set is unlabelled and the system tries to learn without a teacher. Examples include:

  • Clustering: k-Means, Hierarchical Cluster Analysis (HCA), Expectation Maximisation.
  • Visualisation and dimensionality reduction: Principal Component Analysis (PCA), Kernel PCA, Locally-Linear Embedding (LLE), t-Distributed Stochastic Neighbour Embedding (t-SNE).
  • Association rule learning: Apriori, Eclat.

Clustering means grouping similar instances together. For example, I can analyse my blog visitors by running a clustering algorithm to group similar visitors together. You can then use a hierarchical clustering algorithm to subdivide each group into subgroups for more insight.
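
The blog-visitor example can be sketched with a bare-bones k-Means implementation. The visitor data here is invented, and the initialisation is deliberately naive (the first k points); real implementations use random restarts or k-means++:

```python
def k_means(points, k=2, iterations=10):
    """Bare-bones k-Means: repeatedly assign each point to its nearest
    centroid, then move each centroid to the mean of its cluster."""
    centroids = points[:k]  # naive initialisation for a deterministic sketch
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])))
            clusters[nearest].append(p)
        # Move each centroid to its cluster's mean; keep it if the cluster emptied.
        centroids = [
            tuple(sum(coords) / len(c) for coords in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids

# Two obvious groups of "blog visitors", each described by two features.
visitors = [(1, 1), (1, 2), (2, 1), (9, 9), (9, 10), (10, 9)]
centroids = sorted(k_means(visitors))
print(centroids)  # one centroid near (1.3, 1.3), one near (9.3, 9.3)
```

Each returned centroid summarises one group of similar visitors; assigning a new visitor to the nearest centroid places them in a group.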

Other unsupervised techniques include anomaly detection and association rule learning. Anomaly detection use cases include credit card fraud prevention (detecting unusual transactions), identifying manufacturing defects, and automatically removing outliers from a dataset before feeding it into another algorithm. The system is shown mostly normal instances during training; when new data comes in, the algorithm assesses whether it looks normal or is an anomaly.
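
The credit card example can be illustrated with the simplest possible anomaly detector: flag any transaction whose z-score (distance from the mean in standard deviations) exceeds a threshold. The transaction amounts and threshold are made up, and real fraud systems use far richer models:

```python
import statistics

def find_anomalies(amounts, threshold=2.0):
    """Flag values whose z-score exceeds the threshold."""
    mean = statistics.mean(amounts)
    sd = statistics.stdev(amounts)
    return [a for a in amounts if abs(a - mean) / sd > threshold]

# Hypothetical card transactions: mostly small, one unusually large.
transactions = [12.5, 9.9, 14.2, 11.0, 13.3, 10.8, 950.0]
print(find_anomalies(transactions))  # [950.0]
```

One caveat worth noting: a large outlier inflates the mean and standard deviation it is measured against, which is why a fairly low threshold is used here and why robust statistics are preferred in practice.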

Association rule learning is used to discover interesting relationships between attributes, such as items that customers buy together. You can use it to categorise customers by spending patterns or shopping habits. For instance, analysing customers' shopping baskets might reveal that people who buy nappies also tend to buy beer. This can inform the way items are arranged in the store to attract customers and improve sales.
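
The nappies-and-beer example boils down to computing the support and confidence of an association rule. A minimal sketch, with invented baskets:

```python
def confidence(baskets, antecedent, consequent):
    """Confidence of the rule antecedent -> consequent:
    of the baskets containing the antecedent, what fraction also contain the consequent?"""
    with_antecedent = [b for b in baskets if antecedent in b]
    with_both = [b for b in with_antecedent if consequent in b]
    return len(with_both) / len(with_antecedent) if with_antecedent else 0.0

# Hypothetical shopping baskets.
baskets = [
    {"nappies", "beer", "milk"},
    {"nappies", "beer"},
    {"nappies", "bread"},
    {"milk", "bread"},
]
print(confidence(baskets, "nappies", "beer"))  # 2/3: most nappy-buyers also buy beer
```

Algorithms such as Apriori and Eclat, listed earlier, are essentially efficient ways of searching for rules like this whose support and confidence exceed chosen minimums.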

Dimensionality reduction algorithms can be used to clean or simplify a dataset and improve model performance. One approach is to combine correlated variables or features into one. For instance, a car's age and mileage could be merged into a single new feature representing its wear and tear.
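
A hand-rolled version of that feature combination might look like the sketch below. The normalising constants (maximum age and mileage) are invented assumptions; PCA and related methods learn such combinations from the data instead:

```python
def wear_and_tear(age_years, mileage_km, max_age=25.0, max_mileage=300_000.0):
    """Combine two correlated features into one 0-1 'wear' score."""
    age_part = min(age_years / max_age, 1.0)
    mileage_part = min(mileage_km / max_mileage, 1.0)
    return (age_part + mileage_part) / 2  # simple average of the two normalised parts

# A 10-year-old car with 120,000 km gets a mid-range wear score.
print(wear_and_tear(10, 120_000))  # 0.4
```

Replacing two features with one reduces the dimensionality of the dataset while keeping the information relevant to, say, predicting the car's price.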

Semi-supervised learning works with training data in which only a few samples are labelled, leaving the algorithm to do the rest. These systems are usually combinations of supervised and unsupervised algorithms. An example is deep belief networks (DBNs), in which unsupervised components called restricted Boltzmann machines (RBMs) are stacked on top of one another; the network is trained in an unsupervised manner and then fine-tuned using a supervised learning method. A familiar application is photo services such as Google Photos, which can recognise family members in new uploads after you have named each person just once.

Reinforcement Learning: Here, the learning program, called the agent, observes the environment, then selects and performs actions and receives rewards in return (or penalties for bad actions). Over time, the system learns the best strategy for the situation, called a policy. Reinforcement learning is used, for example, to teach robots to perform tasks at the required level.
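
The agent/reward/policy loop can be sketched with a tiny multi-armed bandit, a classic minimal reinforcement-learning setting. The reward values, noise level and hyperparameters below are invented for the example:

```python
import random

def train_bandit(true_rewards, episodes=2000, epsilon=0.1, seed=42):
    """Epsilon-greedy agent: try actions, observe noisy returns, and keep a
    running value estimate per action. The learned 'policy' is to pick the
    action with the highest estimated value."""
    rng = random.Random(seed)
    values = [0.0] * len(true_rewards)   # estimated value of each action
    counts = [0] * len(true_rewards)
    for _ in range(episodes):
        # Explore a random action occasionally, otherwise exploit the best estimate.
        if rng.random() < epsilon:
            action = rng.randrange(len(true_rewards))
        else:
            action = max(range(len(true_rewards)), key=lambda i: values[i])
        reward = true_rewards[action] + rng.gauss(0, 0.1)  # noisy return
        counts[action] += 1
        values[action] += (reward - values[action]) / counts[action]  # incremental mean
    return max(range(len(true_rewards)), key=lambda i: values[i])

# Action 2 has the highest true reward, so the learned policy should pick it.
best_action = train_bandit([0.1, -0.5, 1.0])
```

Full reinforcement learning adds states and sequences of decisions on top of this loop, but the core idea (learn action values from rewards, then act greedily) is the same.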

Batch and Online Learning: In batch learning, the system is trained on all the available data offline, and its performance is evaluated before it is launched. In online learning, the system is trained incrementally as data is fed in; such a system is useful in applications with fast-changing data such as stock prices. However, bad data fed to an online system degrades it for end users, so you may need an anomaly detection algorithm to screen incoming data, and you should monitor the system closely so that learning can be switched off once a problem is detected.
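
The difference from batch learning is visible in even the smallest possible online learner: a running average that updates one observation at a time, with no retraining on the full history. The price stream here is invented:

```python
class OnlineMean:
    """Online learning in miniature: the estimate is updated one observation
    at a time, so the model adapts as new data streams in (e.g. a running
    average of a fast-moving stock price)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0

    def update(self, x):
        self.n += 1
        self.mean += (x - self.mean) / self.n  # incremental update, no retraining
        return self.mean

model = OnlineMean()
for price in [100.0, 101.0, 99.0, 102.0]:
    model.update(price)
print(model.mean)  # 100.5
```

A batch learner would instead recompute from the whole dataset each time; the online update costs the same no matter how much data has already streamed past.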

Instance-Based Versus Model-Based Learning: This classifies Machine Learning models by how they generalise. In instance-based learning, the system learns the training examples by heart and generalises to new cases using a similarity measure. In model-based learning, a model is built from the training examples and then used to make predictions. Regression analysis is an example of model-based learning and can be used to uncover trends and patterns in a dataset.

This leads us to the next logical question: what challenges can reduce the predictive power of a Machine Learning algorithm? Fundamentally, two things can go wrong: a bad algorithm or bad data. Let's break this down. The main bottlenecks are:

Insufficient data: You need a large amount of data to obtain accurate predictions from most machine learning models. With very little data, you may be better off with a simpler statistical model.

Lack of a representative training set: The training set should be representative of the cases the model will see at inference time. Data ethics also matters here, to reduce the risk of bias.

Lack of quality data: It is important to clean your training set to remove errors and outliers, and to handle missing values and noise.

Feature quality: You need to ensure that only relevant features are included in the model, a process known as feature engineering. Feature engineering comprises three things: feature selection, feature extraction (for example, using a dimensionality reduction algorithm to combine existing features), and creating new features by collecting new data.

Overfitting the training data: Overfitting occurs when the model fits the random noise in the training set rather than just the underlying relationships between the analysed features (variables), producing misleading p-values, coefficients and R-squared values. Regularisation constrains the model to reduce overfitting, and you tune a hyperparameter to control how strong that constraint is.
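
Regularisation is easiest to see on one-feature ridge regression with no intercept, where the closed-form solution makes the shrinking effect explicit. The data and penalty values below are invented for the example:

```python
def ridge_slope(xs, ys, lam):
    """One-feature ridge regression (no intercept): minimise
    sum((y - w*x)^2) + lam * w^2, whose closed-form solution is
    w = sum(x*y) / (sum(x^2) + lam). The penalty `lam` shrinks w toward 0."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

xs, ys = [1, 2, 3], [2, 4, 6]          # data lying exactly on y = 2x
print(ridge_slope(xs, ys, 0.0))        # 2.0  (no penalty: the unregularised fit)
print(ridge_slope(xs, ys, 14.0))       # 1.0  (a strong penalty shrinks the slope)
```

Choosing `lam` is exactly the hyperparameter-tuning problem mentioned above: too small and the model can still overfit, too large and it underfits.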

Underfitting the training data: This is the opposite of the situation described above. Underfitting can be addressed by using a more powerful model with more parameters (a multivariate instead of a univariate model, for example), better feature engineering, or reducing the regularisation hyperparameter.

Finally, how do you test or validate a Machine Learning model before it is deployed? The recommended approach is to divide your data into two sets: a training set for fitting the model and a test set for assessing its predictive power. The error rate on the test set, measured on instances the model has never seen, is called the generalisation error, and it tells you how well the model will perform on new cases. To avoid choosing the wrong hyperparameters based on a single test set, cross-validation splits the training set into complementary subsets: the model is trained on different combinations of these subsets and validated on the remaining parts. For forecasting problems, the Mean Absolute Percentage Error (MAPE) is a common error measure.
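
A minimal sketch of the hold-out split and the MAPE measure mentioned above, with toy numbers. Note the split here is a naive, unshuffled cut; real workflows shuffle the data or use time-aware splits for forecasting:

```python
def train_test_split(data, test_ratio=0.25):
    """Hold out the last portion of the data as the test set."""
    cut = int(len(data) * (1 - test_ratio))
    return data[:cut], data[cut:]

def mape(actual, predicted):
    """Mean Absolute Percentage Error: average of |error| / |actual|, as a percentage."""
    return 100 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)

data = list(range(1, 9))               # 8 toy observations
train, test = train_test_split(data)   # 6 for training, 2 held out for testing
print(len(train), len(test))           # 6 2

# Two forecasts, each off by 10% of the actual value, give a MAPE of 10%.
error = mape([100, 200], [110, 180])
```

The model is fitted only on `train`, and the error measured on `test` is the generalisation error the paragraph above describes.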

An example of reading validation results: if the training error is low (few mistakes on the training set) but the generalisation error is high, the model is overfitting the training data. If both the training error and the generalisation error are high, the model is underfitting and needs fine-tuning, such as a more powerful model or better features.

Clearly, Machine Learning is an exciting part of data science, with many capabilities and use cases for solving business problems and improving the performance of existing applications.

In summary, the main goal of Machine Learning is to train predictive models that can be used to enhance applications. Therefore, it is important for both technical and non-technical people to understand the fundamental concepts as presented in this article.

Useful Document:

Microsoft ML Cheatsheet

More articles by Aisha Ekundayo, PhD