Building a Machine Learning Pipeline – Exploration and Data Processing
Ankush Seth
CTO @ Mi Analyst | Helping businesses accelerate growth and efficiency with Gen AI
In this three-part blog series, we are going to explore how to build a machine learning pipeline (defined below). Each part will cover one key area of the pipeline-building process. I am a big believer in hands-on learning, so throughout this journey the entire end-to-end process will be illustrated by building a functional machine learning pipeline. Some experience using machine learning models or frameworks like PyTorch, as well as experience building web services, will come in handy during the practical portion of this series.
So let's get started…
What is a machine learning pipeline?
Organizations trying to leverage AI or machine learning capabilities generally find themselves in a conundrum of where to begin. They have all this structured and unstructured data and are not sure how to connect and utilize it. Sometimes this data is available in a data lake, and other times it is spread over multiple data sources. In addition, knowing how the different elements can be stitched together to reach the end goal of an intelligent insight can look like a daunting task. A machine learning pipeline is a step-wise approach to making sure one is able to derive value from the data that is already available. After all, just having the data and/or the machine learning model is no good if there is no efficient way of connecting all the elements and delivering the value that AI promises. In general, a pipeline provides a way to derive insights from the available data in either a batch or real-time manner.
The entire machine learning workflow/pipeline building process can be broken down into three major phases -
1) Exploration and Data processing
2) Modeling (Will be covered in Part 2)
3) Deployment (Will be covered in Part 3)
Exploration and Data processing
This is the initial preparation phase where we get the data ready for use. This phase can further be broken down into –
1) Gathering the Data – Organizing the data as files on the file system or in a database of some sort.
2) Exploration and Sanitization – This involves exploring and visualizing the data to map out the most interesting features within the data set. As part of sanitization, we want to remove any outliers or errors that may skew the results or introduce bias into the model.
3) Transformation – This step involves transforming the data (normalization, translation, encoding, etc.) so that it can be used to train the model. A short sketch covering all three steps follows this list.
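To make these steps concrete, here is a minimal sketch in Python using pandas (a library commonly used alongside PyTorch for this kind of preparation). The file name customers.csv and the columns age, income, and country are hypothetical stand-ins for your own data, and the 3-standard-deviation outlier rule is just one simple choice among many:

```python
import pandas as pd

# 1) Gathering the data: load a (hypothetical) CSV file into a DataFrame
df = pd.read_csv("customers.csv")

# 2) Exploration and sanitization: summarize the data, drop missing rows,
#    and remove outliers (here, 'age' values more than 3 std devs from the mean)
print(df.describe())
df = df.dropna()
age_mean, age_std = df["age"].mean(), df["age"].std()
df = df[(df["age"] - age_mean).abs() <= 3 * age_std]

# 3) Transformation: normalize numeric columns to zero mean / unit variance
#    and one-hot encode the categorical 'country' column
for col in ["age", "income"]:
    df[col] = (df[col] - df[col].mean()) / df[col].std()
df = pd.get_dummies(df, columns=["country"])

print(df.head())  # the data is now ready to be fed into a model
```

In the practical portion of the series, a cleaned and transformed DataFrame like this is what we would convert into tensors for model training.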
In summary, a successful machine learning journey starts with the right data and with making sure it is ready for consumption.
In the next article we will cover the modeling phase of the machine learning workflow. After we have covered all the phases, we will walk through a practical example of building the entire workflow using industry-standard tools like Jupyter Notebooks, PyTorch, Anaconda, AWS API Gateway, etc. I will also publish a separate article on how to get your machine learning environment set up on your laptop. Happy learning!