Machine Learning Workflow
Machine learning is in the spotlight now, but it has actually been around since the 1950s and 1960s. Over time, a series of steps has emerged that defines the machine-learning workflow.

Our first step is to define the question we want the solution to answer. We need to define this question in a way that guides the remaining steps in the right direction. This requires thinking carefully about the goals we want to achieve, the data we need, and the processing we can perform.

Once the question is defined, we can gather the data we need to answer it. This can be tricky, because we often need data from many sources. Fortunately, Azure has tools that make it easier to process data from different sources.

Data also often has problems with quality and cleanliness. That is, data is often incomplete, inaccurate, or conflicting, so before we use it in machine learning, we need to clean it. Azure Machine Learning has tools that help with data cleaning and transformation. But even with these tools, do not be surprised if you need to spend a considerable amount of time massaging your data into the format you need.

When you have the question defined and the data the way you need it, you can consider which algorithm to use. This is not an easy task, as there are many algorithms available. But if we define the question properly, the question itself will help us select the right algorithm. As we will see when we build models, Azure Machine Learning supports a wide variety of algorithms that are optimized to work in the Azure environment. These are grouped according to the type of learning being performed and the type of results we want. Once we have selected the algorithm, we need to use a subset of our data to train it.
This training process produces a trained model that predicts results on similar data and, as with the rest of Azure Machine Learning, setting up the training is a drag-and-drop operation. Once we have a trained model, we need to test its accuracy on new data that was not used to train it. We do this in Azure Machine Learning with built-in modules that provide both graphical and numeric information on the performance of our algorithms. Evaluating the results generates statistics that we can use to determine whether the model meets our requirements or needs further refinement. If refinement is needed, we often have to rework earlier steps in the workflow. We may need to alter the data or get more of it, change to a different algorithm, adjust parameters, or, often, some combination of these.
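To make the steps above more concrete, here is a minimal sketch of the same workflow in plain Python. It is not Azure Machine Learning code and does not use any Azure API. The toy dataset, the 10% rate of missing values, the 75/25 split, and the simple threshold "algorithm" are all assumptions made up for illustration, as are the function names.

```python
import random


def gather_data(n=200, seed=0):
    """Build a toy dataset: one numeric feature, label 1 when the feature is above 5."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.uniform(0, 10)
        label = 1 if x > 5 else 0
        # Simulate incomplete records: roughly 10% of rows lose their feature value.
        if rng.random() < 0.1:
            x = None
        rows.append((x, label))
    return rows


def clean(rows):
    """Data cleaning step: drop rows whose feature value is missing."""
    return [(x, y) for x, y in rows if x is not None]


def split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of the rows and hold back a fraction for testing."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def train(train_rows):
    """Training step: learn a threshold halfway between the two class means."""
    zeros = [x for x, y in train_rows if y == 0]
    ones = [x for x, y in train_rows if y == 1]
    return (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2


def evaluate(threshold, rows):
    """Evaluation step: fraction of rows the threshold classifies correctly."""
    correct = sum(1 for x, y in rows if (1 if x > threshold else 0) == y)
    return correct / len(rows)


raw_rows = gather_data()
clean_rows = clean(raw_rows)
train_rows, test_rows = split(clean_rows)
model = train(train_rows)
test_accuracy = evaluate(model, test_rows)

print(f"rows gathered: {len(raw_rows)}, rows after cleaning: {len(clean_rows)}")
print(f"learned threshold: {model:.2f}")
print(f"accuracy on held-out test rows: {test_accuracy:.2f}")
```

The important part is the order of the calls at the bottom: the model only ever sees the training rows, and the accuracy is measured on rows it has never seen. If that accuracy is not good enough, you would go back to an earlier function (more data, better cleaning, a different algorithm) rather than only tweaking the last step.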
When using the machine-learning workflow, there are some important things to keep in mind. First, there is an inherent hierarchy in the steps: the earliest steps are the most important, because the later steps depend on them. That is, you need to correctly define the question for which you are creating a solution. Then you have to get the correct data, which will allow you to train your algorithm to come up with the prediction, and only when you have a trained model can you evaluate its accuracy.

When moving through the workflow, it is not unusual to have to return to a previous step. For example, as you work with the data, it may become apparent that you are asking the question incorrectly. As for the data itself, what you find will almost never be in the format you need, so expect to spend a considerable amount of time locating and transforming it into a structure that you can use. Also, within reason, more data is usually better. Remember that the mathematical equations needed to model your data may be complex, with strange quirks. The more corner cases you can cover with your data, the better the model will be trained and the more accurate your results will be.

Finally, try not to push a bad solution. It is easy to fall into the trap of thinking that with just a few more tweaks, the stars will align and your model will start performing correctly. If you do find yourself in this situation, it is better to take a step back and ask yourself: do I have the right data, do I need more data, do I need to do more pre-processing, or do I even have enough information to continue?
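The point that more data is usually better, within reason, can be checked on a toy problem. The sketch below trains the same simple model on larger and larger slices of one training pool and scores each version on the same held-out test rows. Everything here is an assumption for illustration: the noisy synthetic data, the training sizes, and the threshold model. It is plain Python, not Azure Machine Learning.

```python
import random


def make_rows(n, rng):
    """Toy data: one feature in [0, 10]; the label is noisy around the value 5."""
    rows = []
    for _ in range(n):
        x = rng.uniform(0, 10)
        label = 1 if x + rng.gauss(0, 1) > 5 else 0
        rows.append((x, label))
    return rows


def fit_threshold(rows):
    """Learn a threshold halfway between the two class means."""
    zeros = [x for x, y in rows if y == 0]
    ones = [x for x, y in rows if y == 1]
    if not zeros or not ones:
        # Too little data to see both classes: fall back to the overall mean.
        return sum(x for x, _ in rows) / len(rows)
    return (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2


def accuracy(threshold, rows):
    """Fraction of rows the threshold classifies correctly."""
    correct = sum(1 for x, y in rows if (1 if x > threshold else 0) == y)
    return correct / len(rows)


rng = random.Random(42)
train_pool = make_rows(1000, rng)
test_rows = make_rows(500, rng)  # held out, never used for training

sizes = [5, 20, 100, 1000]
curve = []
for size in sizes:
    threshold = fit_threshold(train_pool[:size])
    curve.append((size, accuracy(threshold, test_rows)))

for size, acc in curve:
    print(f"{size:>5} training rows -> test accuracy {acc:.3f}")
```

On a problem this simple, the accuracy typically levels off quickly, which is the "within reason" part: past a certain point, extra rows stop helping, and the remaining errors come from noise in the data or from the limits of the algorithm. That is a hint to revisit an earlier step instead of just collecting more rows.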
See you in my next blog post, which will be about Azure and Machine Learning. :-)