Building a Machine Learning Pipeline – Modeling
Ankush Seth
CTO @ Mi Analyst | Helping businesses accelerate growth and efficiency with Gen AI
Welcome back, everyone. Let’s dive into the Modeling aspect of the machine learning workflow. For those who missed Part 1 on Exploration and Data Processing, it can be found here. The main goal of this phase is to determine what kind of prediction model best suits the data at hand and the problem statement one is trying to solve.
Determining this is not an exact science and requires some experimentation. However, I’ve found that starting the journey by answering a couple of questions (listed below) helps narrow down the options fairly quickly. Each of the potential models that emerges then needs to be evaluated through a training and validation cycle.
Here are the two questions I recommend answering to kick off the decision-making process:
- Is the output we are seeking easily achieved by models based on statistical classification or regression analysis? If so, will a relatively simple model such as a Support Vector Machine (SVM) suffice, or do we need to move toward neural networks such as CNNs, RNNs, or GANs? (A baseline SVM sketch follows this list.)
- Do we want to use models that are already tuned and trained, or do we want to build our own and train it on our data? For example, an NLP (natural language processing) model from AWS that is pre-trained on vast amounts of data will probably achieve a lower error rate than a custom model trained locally (assuming the data available is not at the same scale). On the other hand, if you are trying to solve a very specific problem, going custom by building your own algorithm or modifying an existing one might be the way to go. A happy medium is transfer learning: build on top of a pre-trained model and fine-tune it with your own data (see the second sketch below).
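To make the first question concrete, here is a minimal sketch of a classical baseline, assuming scikit-learn; the dataset is a stand-in for your own features and labels. If something this simple already performs well, you may not need a neural network at all.

```python
# A minimal SVM classification baseline (scikit-learn assumed;
# the dataset here is illustrative, not from the article).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# SVMs are sensitive to feature scale, so scaling lives inside the pipeline.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
model.fit(X_train, y_train)
print(f"Test accuracy: {model.score(X_test, y_test):.3f}")
```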
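And for the happy medium, a sketch of transfer learning, assuming Keras and an ImageNet-pre-trained MobileNetV2; the class count and the train_ds / val_ds datasets are placeholders for your own problem. The pre-trained base is frozen, so only a small task-specific head is trained on your data.

```python
# Transfer learning sketch (Keras assumed): reuse a network pre-trained
# on ImageNet and train only a small classification head on top.
import tensorflow as tf

base = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3), include_top=False, weights="imagenet"
)
base.trainable = False  # freeze the pre-trained weights

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(10, activation="softmax"),  # 10 classes, illustrative
])
model.compile(
    optimizer="adam",
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)
# model.fit(train_ds, validation_data=val_ds, epochs=5)  # your own datasets
```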
Once we have answers to the above questions, we can draw up a list of candidate models to train and validate. Be mindful of common gotchas like overfitting and underfitting during training (a simple check is sketched below). Validate each model by running it on the test dataset and analyzing the results to determine how well it performs. If the decision is to go with a custom model, designing the model itself becomes a major precursor task.
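One quick way to spot over- and underfitting, again assuming scikit-learn: compare training scores against cross-validated scores. The unconstrained decision tree here is just an illustration of a model prone to memorizing its training data.

```python
# Over/underfitting check via cross-validation (scikit-learn assumed;
# the dataset and model choice are illustrative).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
scores = cross_validate(
    DecisionTreeClassifier(random_state=0), X, y,
    cv=5, return_train_score=True,
)
print(f"Train accuracy:      {scores['train_score'].mean():.3f}")
print(f"Validation accuracy: {scores['test_score'].mean():.3f}")
# A large train/validation gap signals overfitting; low scores on
# both sides signal underfitting and point back to model or feature choices.
```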
Depending upon the test results, one may be ready to move to the next stage, deployment. But, as often happens, the test results are not impressive, and one has to try tuning the hyperparameters or potentially go back to the drawing board. Since this whole process is very iterative in nature, I recommend taking a lean approach to assessing models: for example, capping the size of the training and test datasets, and defining a few key metrics that give a good indication of how performant each candidate is (a tuning sketch follows below). However, this should not come at the expense of thinking long term. The opportunity cost of switching from one model to another can be especially high if the training dataset is particularly large.
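As an illustration of hyperparameter tuning, here is a grid search over the SVM baseline from earlier, assuming scikit-learn; the grid values are illustrative, not recommendations. Randomized or Bayesian search are common alternatives once the grid gets large.

```python
# Hyperparameter tuning via grid search (scikit-learn assumed;
# grid values are illustrative).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

pipe = make_pipeline(StandardScaler(), SVC())
grid = GridSearchCV(
    pipe,
    param_grid={
        "svc__C": [0.1, 1, 10],
        "svc__gamma": ["scale", 0.01, 0.001],
    },
    cv=5,
    scoring="accuracy",
)
grid.fit(X_train, y_train)
print("Best params:", grid.best_params_)
# Score the best candidate on data it never saw during tuning.
print(f"Held-out test accuracy: {grid.score(X_test, y_test):.3f}")
```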
That’s it for now. Happy modeling. Next week we will go over the deployment phase.