The machine learning workflow

This blog post describes the most common steps in the machine learning workflow. These are the basics, mixed with some personal insights. The post does not cover everything, but I hope you find something useful!

Acquire data 

The first step of the machine learning workflow is usually to acquire or generate data. It can be difficult to assess at an early stage which variables will prove to be important in the end. Therefore, it is better to measure and acquire all reasonable variables. There are many ways to estimate variable importance and reject insignificant variables later in the process.
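As a rough illustration of that later step, here is a minimal sketch (using scikit-learn and a synthetic dataset, so all names are just placeholders) of one common way to rank variable importance with a random forest:

```python
# Minimal sketch: rank variable importance with a random forest so that
# weak variables can be dropped later in the process.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=8, n_informative=3,
                           random_state=0)
feature_names = [f"var_{i}" for i in range(X.shape[1])]

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
ranking = sorted(zip(feature_names, forest.feature_importances_),
                 key=lambda pair: pair[1], reverse=True)
for name, importance in ranking:
    print(f"{name}: {importance:.3f}")
```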

Pre-processing 

Much of the work associated with machine learning has to do with pre-processing of the training data. We want to start by removing oddities in the data: for example, missing values, data that has not been measured correctly, or data that should not be available to the classifier at a later stage. The formatting of the data might also need to be changed and homogenized before it can be fed to a learning algorithm. As an example, categorical data might need to be converted from text to numbers.
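A minimal sketch of that last conversion, one-hot encoding a text-valued categorical column with pandas (the column names here are made up):

```python
# Minimal sketch: convert a categorical text column to numeric columns.
import pandas as pd

df = pd.DataFrame({"age": [34, 51, 29],
                   "country": ["SE", "US", "SE"]})
df_numeric = pd.get_dummies(df, columns=["country"])
print(df_numeric)
```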

For numerical variables it is good practice to mean-center (subtract the average value) and scale to unit variance (divide the values by the standard deviation). This means that variables with different units, for example, “years” and “dollars”, can be compared more easily. Some classifiers are sensitive to data of different scales, for example the k-Nearest Neighbor classifier. However, even if the chosen classifier is supposed to be robust towards different scales, scaling can still prove beneficial when performing classifier hyper-parameter optimization later.
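A minimal sketch of this centering and scaling step with scikit-learn's StandardScaler (note that the scaler is fitted on the training data only and then reused on validation and test data):

```python
# Minimal sketch: mean-center and scale the training data to unit variance.
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[2.0, 100.0], [4.0, 300.0], [6.0, 500.0]])  # e.g. years, dollars
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
print(X_train_scaled.mean(axis=0), X_train_scaled.std(axis=0))  # roughly 0 and 1
```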

After this step there is the task of mitigating noise.  

In signal- and image processing it is common to apply filters to enhance the features that we are trying to capture and to suppress everything else. Common steps are, for example, equalization of the image brightness (see figure below) and smoothing.

(Left) Image with low contrast. (Right) Histogram equalization of the same image.
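A minimal sketch of these two filtering steps, histogram equalization followed by smoothing, using scikit-image (one library option among several):

```python
# Minimal sketch: enhance contrast, then suppress high-frequency noise.
from skimage import data, exposure, filters

image = data.camera()                              # example grayscale image
equalized = exposure.equalize_hist(image)          # spread out the brightness histogram
smoothed = filters.gaussian(equalized, sigma=1.0)  # Gaussian smoothing
```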

Extract information/features 

The next step is often to select or extract meaningful higher-level features from the data. The output of a sensor, for example an image or an audio recording, might not be very descriptive in its raw form, and only parts of it might be useful. A better measurement value, or feature, could likely be extracted from the data, and it is good to exclude unimportant variables once we have found out which they are.

In simple cases these variables/features can be handcrafted, but nowadays it is becoming more and more common to calculate, or learn, these features using convolutional neural nets, at least in automated image analysis.
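As a sketch of the learned-features approach, a pretrained convolutional network from torchvision can be used as a fixed feature extractor (this assumes a fairly recent torchvision version, and the ResNet-18 backbone is just an example):

```python
# Minimal sketch: use a pretrained CNN backbone as a feature extractor
# by removing its classification head, leaving one 512-dimensional
# feature vector per image.
import torch
from torchvision.models import resnet18, ResNet18_Weights

backbone = resnet18(weights=ResNet18_Weights.DEFAULT)
feature_extractor = torch.nn.Sequential(*list(backbone.children())[:-1])
feature_extractor.eval()

batch = torch.rand(4, 3, 224, 224)          # four dummy RGB images
with torch.no_grad():
    features = feature_extractor(batch).flatten(1)
print(features.shape)                       # torch.Size([4, 512])
```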


Training

In this step, the chosen classifier (or classifiers) is trained using the available data. A model is fitted and thresholds for each variable are set by the learning method. To obtain an unbiased classifier it is important that the classes are balanced, i.e., that there are roughly equal numbers of observations for each class in the training data.
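A minimal sketch of checking the class balance before training, and compensating for a mild imbalance with class weights in scikit-learn (the data here is dummy data):

```python
# Minimal sketch: inspect the class counts, then let the classifier
# weight the classes to counteract a mild imbalance.
import numpy as np
from sklearn.linear_model import LogisticRegression

y_train = np.array([0, 0, 0, 0, 1, 1])
classes, counts = np.unique(y_train, return_counts=True)
print(dict(zip(classes, counts)))           # observations per class

X_train = np.random.rand(len(y_train), 3)   # dummy feature matrix
clf = LogisticRegression(class_weight="balanced").fit(X_train, y_train)
```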

Machine learning algorithms can be very sensitive to differences in the input data. To the naked eye, the new data might look very similar to previously acquired data, but if the new variation is not included in the training data, the models will likely fail miserably in reality.

In the best of worlds, we would always have a stable, controlled setup and input data that does not vary over time. But as is usually the case, the input data varies a lot over time, and we then need to augment our training data with added distortions ("crapify" the data, as Jeremy Howard jokingly called it :)

A few common data "crapification" steps for images are, e.g., random cropping, shifting, zooming, flipping, and changes of resolution and brightness.
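A minimal sketch of such augmentation with torchvision transforms (the specific parameter values are arbitrary examples):

```python
# Minimal sketch: random crop/zoom, shift, flip and brightness changes
# applied on the fly to each training image.
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.6, 1.0)),       # crop + zoom
    transforms.RandomAffine(degrees=0, translate=(0.1, 0.1)),  # shift
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.4),
    transforms.ToTensor(),
])
# augmented = augment(pil_image)   # apply to a PIL image during training
```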


Evaluate the result

At this point the classification accuracy is evaluated on validation data, which gives an indication of how often the classifier will make the right decision on future data acquisitions. Now is also a good time to evaluate the significance of the variables and discard those that are insignificant. The process of finding good classifier-specific settings, also known as hyperparameters, starts here. This process is usually automated by looping through reasonable combinations of parameter values while keeping track of the classification accuracy on the validation set.
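A minimal sketch of such an automated search, here a grid search over k-Nearest Neighbor settings scored with cross-validation on synthetic data:

```python
# Minimal sketch: loop through hyperparameter combinations and keep
# track of the validation accuracy via cross-validation.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, random_state=0)
param_grid = {"n_neighbors": [3, 5, 7, 11], "weights": ["uniform", "distance"]}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5).fit(X, y)
print(search.best_params_, search.best_score_)
```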

After the results look good enough on the validation data, it is time to test the accuracy on a new test set which the classifier has never seen before. Only then can we be confident that the classifier has not been overfitted and can generalize well to new data.

It is sometimes desirable to manipulate the classifier outputs. For example, if a classifier is to assess whether a patient has a certain disease, it might be better to be “safe than sorry” rather than simply report the most likely answer.
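A minimal sketch of this "safe rather than sorry" idea: instead of taking the most likely class, the disease is flagged whenever its predicted probability exceeds a deliberately low threshold (the 0.2 used here is an arbitrary example):

```python
# Minimal sketch: flag the "disease" class at a lowered probability
# threshold instead of taking the argmax of the class probabilities.
import numpy as np

# proba is assumed to come from clf.predict_proba(X_new), column 1 = "disease"
proba = np.array([[0.9, 0.1], [0.7, 0.3], [0.4, 0.6]])
flag_disease = proba[:, 1] > 0.2
print(flag_disease)        # [False  True  True]
```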

Deployment

Finally, when the model has proven to work well on new test data, we are ready to deploy our model and use it in the real-world application. Best of luck!
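As one small example of what deployment can involve, the trained model can be persisted to disk and loaded again in the production application (assuming a scikit-learn model and joblib; the dummy classifier below just stands in for the real trained model):

```python
# Minimal sketch: save the trained model at training time and load it
# in the deployed application, so it can be reused without retraining.
import joblib
from sklearn.dummy import DummyClassifier

clf = DummyClassifier(strategy="most_frequent").fit([[0], [1]], [0, 1])  # stand-in model
joblib.dump(clf, "model.joblib")            # at training time
model = joblib.load("model.joblib")         # in the deployed application
print(model.predict([[0]]))
```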



