The machine learning workflow

This blog post describes the most common steps in the machine learning workflow. These are the basics, mixed with some personal insights. The post does not cover everything, but I hope you find something useful!

Acquire data 

The first step of the machine learning workflow is usually to acquire or generate data. It can be difficult to assess at an early stage which variables will prove to be important in the end. Therefore, it is better to measure and acquire all reasonable variables. There are many ways to estimate variable importance and reject insignificant variables later in the process.
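As a rough illustration of that later step, here is a minimal sketch (using scikit-learn and a synthetic dataset, so all names are just placeholders) of one common way to rank variable importance with a random forest:

```python
# Minimal sketch: rank variable importance with a random forest so that
# weak variables can be dropped later in the process.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=8, n_informative=3,
                           random_state=0)
feature_names = [f"var_{i}" for i in range(X.shape[1])]

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
ranking = sorted(zip(feature_names, forest.feature_importances_),
                 key=lambda pair: pair[1], reverse=True)
for name, importance in ranking:
    print(f"{name}: {importance:.3f}")
```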

Pre-processing 

Much of the work associated with machine learning has to do with pre-processing of the training data. We want to start by removing oddities in the data: for example, missing values, data that has not been measured correctly, or data that should not be available to the classifier at a later stage. The formatting of the data might also need to be changed and homogenized before it can be fed to a learning algorithm. As an example, categorical data might need to be converted from text to numbers.
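A minimal sketch of that last conversion, one-hot encoding a text-valued categorical column with pandas (the column names here are made up):

```python
# Minimal sketch: convert a categorical text column to numeric columns.
import pandas as pd

df = pd.DataFrame({"age": [34, 51, 29],
                   "country": ["SE", "US", "SE"]})
df_numeric = pd.get_dummies(df, columns=["country"])
print(df_numeric)
```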

For numerical variables it is good practice to mean-center (subtract the average value) and scale to unit variance (divide the values by the standard deviation). This means that variables with different units, for example, “years” and “dollars”, can be compared more easily. Some classifiers are sensitive to data of different scales, for example the k-Nearest Neighbor classifier. However, even if the chosen classifier is supposed to be robust towards different scales, scaling can still prove beneficial when performing classifier hyper-parameter optimization later.
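A minimal sketch of this centering and scaling step with scikit-learn's StandardScaler (note that the scaler is fitted on the training data only and then reused on validation and test data):

```python
# Minimal sketch: mean-center and scale the training data to unit variance.
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[2.0, 100.0], [4.0, 300.0], [6.0, 500.0]])  # e.g. years, dollars
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
print(X_train_scaled.mean(axis=0), X_train_scaled.std(axis=0))  # roughly 0 and 1
```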

After this step there is the task of mitigating noise.  

In signal- and image processing it is common to apply filters to enhance the features that we are trying to capture and to suppress everything else. Common steps are, for example, equalization of the image brightness (see figure below) and smoothing.

(Left) Image with low contrast. (Right) Histogram equalization of the same image.
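A minimal sketch of these two filtering steps, histogram equalization followed by smoothing, using scikit-image (one library option among several):

```python
# Minimal sketch: enhance contrast, then suppress high-frequency noise.
from skimage import data, exposure, filters

image = data.camera()                              # example grayscale image
equalized = exposure.equalize_hist(image)          # spread out the brightness histogram
smoothed = filters.gaussian(equalized, sigma=1.0)  # Gaussian smoothing
```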

Extract information/features 

The next step is often to select or extract meaningful higher-level features from the data. The output of a sensor, for example an image or an audio recording, might not be very descriptive in its raw form, and only parts of it might be useful. A better measurement value, or feature, could likely be extracted from the data, and it is good to exclude unimportant variables once we have found out which they are.

In simple cases these variables/features can be handcrafted, but nowadays it is becoming more and more common to calculate, or learn, these features using convolutional neural nets, at least in automated image analysis.
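As a sketch of the learned-features approach, a pretrained convolutional network from torchvision can be used as a fixed feature extractor (this assumes a fairly recent torchvision version, and the ResNet-18 backbone is just an example):

```python
# Minimal sketch: use a pretrained CNN backbone as a feature extractor
# by removing its classification head, leaving one 512-dimensional
# feature vector per image.
import torch
from torchvision.models import resnet18, ResNet18_Weights

backbone = resnet18(weights=ResNet18_Weights.DEFAULT)
feature_extractor = torch.nn.Sequential(*list(backbone.children())[:-1])
feature_extractor.eval()

batch = torch.rand(4, 3, 224, 224)          # four dummy RGB images
with torch.no_grad():
    features = feature_extractor(batch).flatten(1)
print(features.shape)                       # torch.Size([4, 512])
```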


Training

In this step, the chosen classifier (or classifiers) is trained using the available data. A model is fitted and thresholds for each variable are set by the learning method. To obtain an unbiased classifier it is important that the classes are balanced, i.e., that there are roughly equal numbers of observations for each class in the training data.
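A minimal sketch of checking the class balance before training, and compensating for a mild imbalance with class weights in scikit-learn (the data here is dummy data):

```python
# Minimal sketch: inspect the class counts, then let the classifier
# weight the classes to counteract a mild imbalance.
import numpy as np
from sklearn.linear_model import LogisticRegression

y_train = np.array([0, 0, 0, 0, 1, 1])
classes, counts = np.unique(y_train, return_counts=True)
print(dict(zip(classes, counts)))           # observations per class

X_train = np.random.rand(len(y_train), 3)   # dummy feature matrix
clf = LogisticRegression(class_weight="balanced").fit(X_train, y_train)
```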

Machine learning algorithms can be very sensitive to differences in the input data. To the naked eye, the new data might look very similar to previously acquired data, but if the new variation is not included in the training data, the models will likely fail miserably in reality.

In the best of worlds, we would always have a stable, controlled setup and input data that does not vary over time. But as is usually the case, the input data varies a lot over time, and we then need to augment our training data with added distortions ("crapify" the data, as Jeremy Howard jokingly called it :)

A few common data "crapification" steps for images are, e.g., random cropping, shifting, zooming, flipping, and changes of resolution and brightness.
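A minimal sketch of such augmentation with torchvision transforms (the specific parameter values are arbitrary examples):

```python
# Minimal sketch: random crop/zoom, shift, flip and brightness changes
# applied on the fly to each training image.
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.6, 1.0)),       # crop + zoom
    transforms.RandomAffine(degrees=0, translate=(0.1, 0.1)),  # shift
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.4),
    transforms.ToTensor(),
])
# augmented = augment(pil_image)   # apply to a PIL image during training
```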


Evaluate the result

At this point the classification accuracy is evaluated on validation data, which gives an indication of how often the classifier will make the right decision on future data acquisitions. Now is also a good time to evaluate the significance of the variables and discard those that are insignificant. The process of finding good classifier-specific settings, also known as hyperparameters, starts here. This process is usually automated by looping through reasonable combinations of parameter values while keeping track of the classification accuracy on the validation set.
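A minimal sketch of such an automated search, here a grid search over k-Nearest Neighbor settings scored with cross-validation on synthetic data:

```python
# Minimal sketch: loop through hyperparameter combinations and keep
# track of the validation accuracy via cross-validation.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, random_state=0)
param_grid = {"n_neighbors": [3, 5, 7, 11], "weights": ["uniform", "distance"]}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5).fit(X, y)
print(search.best_params_, search.best_score_)
```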

After the results look good enough on the validation data, it is time to test the accuracy on a new test set which the classifier has never seen before. Only then can we be confident that the classifier has not been overfitted and can generalize well to new data.

It is sometimes desirable to manipulate the classifier outputs. For example, if a classifier is to assess whether a patient has a certain disease, it might be better to be “safe than sorry” rather than simply report the most likely answer.
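A minimal sketch of this "safe rather than sorry" idea: instead of taking the most likely class, the disease is flagged whenever its predicted probability exceeds a deliberately low threshold (the 0.2 used here is an arbitrary example):

```python
# Minimal sketch: flag the "disease" class at a lowered probability
# threshold instead of taking the argmax of the class probabilities.
import numpy as np

# proba is assumed to come from clf.predict_proba(X_new), column 1 = "disease"
proba = np.array([[0.9, 0.1], [0.7, 0.3], [0.4, 0.6]])
flag_disease = proba[:, 1] > 0.2
print(flag_disease)        # [False  True  True]
```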

Deployment

Finally, when the model has proven to work well on new test data, we are ready to deploy our model and use it in the real-world application. Best of luck!
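As one small example of what deployment can involve, the trained model can be persisted to disk and loaded again in the production application (assuming a scikit-learn model and joblib; the dummy classifier below just stands in for the real trained model):

```python
# Minimal sketch: save the trained model at training time and load it
# in the deployed application, so it can be reused without retraining.
import joblib
from sklearn.dummy import DummyClassifier

clf = DummyClassifier(strategy="most_frequent").fit([[0], [1]], [0, 1])  # stand-in model
joblib.dump(clf, "model.joblib")            # at training time
model = joblib.load("model.joblib")         # in the deployed application
print(model.predict([[0]]))
```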



