How to set and deploy your machine learning experiment with R
Valentina Alto
AI and App Innovation Technical Architect at Microsoft | Tech Author | MSc in Data Science
The aim of this article is to provide a foretaste of the potential of machine learning algorithms in R, by following step by step a standard procedure that, once you are familiar with it, can be a good starting point for designing customized models.
The idea behind each model is indeed the same. In a nutshell, it consists of finding an algorithm which fits the data well, training it on part of our dataset (called the training set), evaluating it on the remaining portion (the test set) and then letting it chew on new data to make predictions. The algorithm, once fed with new data, will retain its previous knowledge and improve itself without any further human intervention.
Needless to say, this definition fails to do justice to the phenomenal intersection of Statistics and Computer Science that is Machine Learning. However, it will be sufficient to let you grasp the structure of the experiment and, perhaps, make you curious enough to dive deeper into the topic.
The code is provided at the end of this article, with some related comments.
So here are the steps we are going to follow to set up our experiment:
- Downloading a dataset available in the R environment
- Defining the task we want to solve
- Splitting labels and predictors into training and test sets
- Picking the most suitable algorithm for the chosen task
- Training the model and then testing it on the test set
- Evaluating its performance
Let’s begin. The dataset I’m going to use is the well-known Iris dataset. It contains measurements of Iris flowers from three related species. Let’s have a look at the first 10 rows and at some related stats:
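A minimal sketch of how this first look might be done (the iris dataset ships with base R, so no download is needed):

```r
# Load the built-in Iris dataset
data(iris)

# First 10 rows: four numeric features plus the Species label
head(iris, 10)

# Summary statistics for each column
summary(iris)
```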
The dataset contains 150 observations of flowers, each having four features (or independent variables) and one label, the species (or dependent variable). Hence, we can already conclude that the type of training we will use is Supervised Learning, since the data are already labelled and the aim is to decide which group an observation belongs to.
We can also visualize our data, first considering only two features and then the whole set of independent variables:
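One way such plots could be produced, first with ggplot2 for the two-feature view and then with base R's pairs() for the full scatter-plot matrix (the specific pair of features shown here is my own choice):

```r
library(ggplot2)

# Two features only, coloured by species
ggplot(iris, aes(x = Sepal.Length, y = Petal.Length, colour = Species)) +
  geom_point(size = 2)

# All four independent variables at once, as a scatter-plot matrix
pairs(iris[, 1:4], col = as.integer(iris$Species),
      main = "Iris features, coloured by species")
```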
There is plenty of exploratory analysis that can be performed on the data to start investigating possible correlations, and only a few of those analyses will be discussed here. However, it is striking how much information can be gathered just by looking at some plots and stats, without even starting to train any algorithm.
First, I want to know whether my data are balanced. Indeed, working with imbalanced data calls for further interventions and rebalancing procedures before any modelling, in order not to end up with meaningless results. Furthermore, it is always a good starting point to check the probability distribution of our variables, in case we want to run some tests afterwards. Finally, a first visualization of possible correlations is a good way to set the basis of our analysis.
Surprisingly, we can visualize all of these metrics with just one plot, using the ggpairs function from the GGally package:
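A sketch of that single call (mapping the colour to Species is an assumption on my part):

```r
library(GGally)
library(ggplot2)

# One call: class balance, per-variable distributions,
# pairwise scatter plots and correlation coefficients
ggpairs(iris, mapping = aes(colour = Species, alpha = 0.5))
```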
Nice, isn’t it? With just a few lines of code, we derived very meaningful information, namely the fact that the classes are perfectly balanced (look at the lower right graph).
Okay, now that we are familiar with our dataset, let’s split it. The idea is to create a validation set, where the model will be tested and evaluated, and a training set, where it will be fitted. However, since the main goal of this analysis is to generalize our algorithm to new, unlabelled data, we want to make sure the training set we select is random, not biased. Hence, the next step is splitting the training set itself into K folds: through an iterative rotation, the algorithm is trained on K-1 folds and tested on the remaining one, K times. The error estimate is then averaged over all K trials, so that “lucky” rounds are compensated by “unlucky” ones. This approach is called Cross Validation.
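A possible implementation with the caret package; the 80/20 split proportion, the seed and the choice of 10 folds are assumptions on my part:

```r
library(caret)

set.seed(123)  # make the random split reproducible

# Hold out a validation set (here 20%), stratified by Species
validation_index <- createDataPartition(iris$Species, p = 0.80, list = FALSE)
training   <- iris[validation_index, ]
validation <- iris[-validation_index, ]

# 10-fold cross-validation on the training set
control <- trainControl(method = "cv", number = 10)
metric  <- "Accuracy"
```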
The most important step of the whole process is formulating the right question and building the model accordingly. Here, my aim might be, when facing an unlabelled, unknown flower, to have a set of rules which could tell me “look, since the flower has these features, it is 99% a Setosa”. This set of rules, or decision boundary, is nothing but the output of the model trained on those data.
Once the task is set, the next step is picking the most suitable algorithm. As we are facing a classification problem, the algorithm I’m going to employ is the Support Vector Machine (SVM), but be aware that this is neither the only solution nor necessarily the most accurate: R libraries offer a variety of algorithm recipes and, once you have targeted the family of algorithms you are interested in (in this case, classification), your choice will depend on the kind of task, data and dimensionality you are facing (as well as on your personal taste).
I went for SVM since it is one of the most popular classifiers, easy to visualize and built on a very intuitive idea. This idea, in a nutshell, is finding a decision boundary, called a hyperplane, which separates the data in the most accurate way. The optimal hyperplane is the one which guarantees the largest “area of freedom” for future observations, an area within which they can deviate from their pattern without undermining the model. This area, which represents the largest separation between classes, is called the margin. Thus, we choose the hyperplane so that the distance from it to the nearest data point on each side is maximized (under some constraints I'm not going to dwell on here).
Lots of words. Let's visualize it with a simple graph:
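If you want to reproduce a similar picture yourself, a quick sketch with the e1071 package on two made-up, linearly separable clusters could look like this (the toy data are invented purely for illustration):

```r
library(e1071)

set.seed(42)
# Two made-up, linearly separable clusters
toy <- data.frame(
  x = c(rnorm(20, mean = 1), rnorm(20, mean = 4)),
  y = c(rnorm(20, mean = 1), rnorm(20, mean = 4)),
  class = factor(rep(c("A", "B"), each = 20))
)

# Fit a linear SVM and plot its decision boundary between the clusters
toy_svm <- svm(class ~ ., data = toy, kernel = "linear")
plot(toy_svm, toy)
```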
Now it looks far more straightforward.
To make it even clearer, we can implement it on our data using only two features, Sepal Length and Petal Length, and only two labels, Setosa and Versicolor.
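A sketch of how this two-feature, two-class subset and its scatter plot might be obtained (the iris_sub name is my own choice):

```r
# Keep only Setosa and Versicolor, and only the two chosen features
iris_sub <- subset(iris, Species %in% c("setosa", "versicolor"),
                   select = c(Sepal.Length, Petal.Length, Species))
iris_sub$Species <- droplevels(iris_sub$Species)

# Scatter plot of the two clusters
plot(iris_sub$Sepal.Length, iris_sub$Petal.Length,
     col = as.integer(iris_sub$Species), pch = 19,
     xlab = "Sepal Length", ylab = "Petal Length")
legend("topleft", legend = levels(iris_sub$Species), col = 1:2, pch = 19)
```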
As you can see, the two clusters are clearly linearly separable. The SVM algorithm will build the hyperplane which separates them in the most general way:
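Continuing with the hypothetical iris_sub data frame from the previous snippet, a linear SVM and its decision regions could be drawn with e1071 (a sketch, not necessarily the exact code behind the original figure):

```r
library(e1071)

# Linear SVM on the two-feature, two-class subset
svm_two <- svm(Species ~ ., data = iris_sub, kernel = "linear")

# plot.svm shades the two decision regions and marks
# support vectors with an "x", the other points with an "o"
plot(svm_two, iris_sub)
```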
As the previous figure displays, the space is chopped into two pieces, separating observations labelled as "Setosa" from those labelled as "Versicolor". Now, observing this plot, we notice that some data points are displayed as "x". Does this have a special meaning? It does. Those points are the so-called Support Vectors (SVs) and they are fundamental for our algorithm: in fact, the SVs are the only points that matter to the algorithm. It means that all the other observations could be moved from their current positions without affecting the model, since, again, it is determined only by the SVs.
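To see this concretely, you can inspect the support vectors directly on the fitted object (still the hypothetical svm_two model from the sketch above):

```r
# Row indices of the support vectors within iris_sub
svm_two$index

# The support vectors themselves (on the scaled feature space)
svm_two$SV

# How many support vectors there are per class
svm_two$nSV
```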
Now let’s apply the same procedure to the whole training set and display some related stats:
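A sketch of this step with caret, reusing the training set, control and metric objects defined earlier; the radial-kernel SVM (method "svmRadial", backed by kernlab) is an assumption consistent with the C and sigma parameters mentioned below:

```r
library(caret)
library(kernlab)  # provides the SVM implementation behind "svmRadial"

set.seed(123)
svm_fit <- train(Species ~ ., data = training,
                 method = "svmRadial",
                 metric = metric, trControl = control)

# Cross-validated Accuracy and Kappa, plus the tuned sigma and C values
print(svm_fit)
```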
Besides accuracy (the number of correctly classified observations over the total number of observations), some further quantities are displayed, namely C, Kappa and sigma. However, here we are only interested in measuring accuracy, since our aim is to evaluate the model's performance in terms of correctly predicted observations.
We are now at the final step of the experiment: making predictions with the trained model on our validation set, whose labels, remember, we already know. Hence, we can immediately evaluate the performance of the SVM.
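A sketch of this evaluation step, assuming the svm_fit model and the validation set from the earlier snippets:

```r
# Predict the species of the held-out flowers
predictions <- predict(svm_fit, newdata = validation)

# Compare the predictions with the true labels
confusionMatrix(predictions, validation$Species)
```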
Again, many metrics are displayed, but for now let’s consider only the confusion matrix: on the main diagonal lie all the correctly classified observations, and we can immediately see that our model made pretty accurate predictions (indeed, the overall accuracy is 94.4%).
Could we say this solution is satisfying? Are we done now that we are happy with our algorithm? Well, it would be far too negligent to train only one algorithm without comparing it to potential competitors. Furthermore, many refinements and tuning interventions could still be made to make the chosen algorithm perform better.
Nevertheless, the model we trained is far from useless: it did its job with dignity and returned good results. Even perfectionists couldn’t deny that, at the very least, these results are a good starting point for further analyses and tests.
Conclusions? With a basic knowledge of the families of algorithms and keeping in mind the task we want to solve, building a machine learning model can be very straightforward and fast, while maintaining a high-quality performance.
Here is the whole experiment coded in R:
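What follows is a compact end-to-end sketch that puts the snippets above together; the split proportion, the seed and the radial-kernel choice remain assumptions on my part:

```r
library(caret)
library(GGally)
library(ggplot2)
library(kernlab)

# 1. Load and explore the data
data(iris)
head(iris, 10)
summary(iris)
ggpairs(iris, mapping = aes(colour = Species, alpha = 0.5))

# 2. Split into training and validation sets (assumed 80/20, stratified)
set.seed(123)
validation_index <- createDataPartition(iris$Species, p = 0.80, list = FALSE)
training   <- iris[validation_index, ]
validation <- iris[-validation_index, ]

# 3. Set up 10-fold cross-validation
control <- trainControl(method = "cv", number = 10)
metric  <- "Accuracy"

# 4. Train a radial-kernel SVM on the training set
set.seed(123)
svm_fit <- train(Species ~ ., data = training,
                 method = "svmRadial",
                 metric = metric, trControl = control)
print(svm_fit)

# 5. Evaluate it on the validation set
predictions <- predict(svm_fit, newdata = validation)
confusionMatrix(predictions, validation$Species)
```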