Data Modeling for AI and ML Infrastructure Setup
Harikrishnan Sugumaran
Enterprise Architect | Solution Architect | Digital Transformation Architect | TOGAF | COBIT | Master of Strategy Leadership Certified | SAP LeanIX Certified | Cloud, Big Data, AI, IoT, RPA, Blockchain
Artificial Intelligence (AI) and Machine Learning (ML) are being adopted by businesses in almost every industry. Many businesses are looking to ML Infrastructure platforms to accelerate their adoption of AI. Understanding the various platforms and offerings can be a challenge: the ML Infrastructure space is crowded, confusing, and complex, with many platforms and tools spanning a variety of functions across the model building workflow.
To understand the ecosystem, it helps to break the machine learning workflow into three stages: data preparation, model building, and production. Understanding the goals and challenges of each stage can help you make an informed decision about which ML Infrastructure platforms are best suited to your business's needs.
Each of these broad stages of the machine learning workflow (Data Preparation, Model Building, and Production) has several vertical functions. Some of these functions are part of a larger end-to-end platform, while others are the sole focus of specialized platforms.
Since models learn from data, the first step of building a model is data preparation: the process of extracting model inputs from raw data. There are several tools to help data scientists source data, transform data, and add labels to datasets. In this blog post, we will dive into the goals of data preparation, the challenges organizations face in this stage of the ML workflow, and how data scientists decide it is time to move on to the next stage.
What is Data Preparation?
Ask any data scientist and they will tell you A LOT of their time is spent on data preparation. This phase of the pipeline turns raw data into the input features used to train the model. Features are transformations of the cleaned data that provide the actual model inputs.
In the early stages of the pipeline, raw data is sourced across the different data stores and lakes in an organization. The next stage involves data processing to clean, transform, and extract features to generate consistent inputs for the feature selection stage. Large tech companies at the forefront of ML Infrastructure (Google, Facebook, Uber, etc.) typically have central feature storage so that many teams can extract value without duplicating work.
The data preparation stage involves several steps: sourcing data, ensuring completeness, adding labels, and data transformations to generate features.
Sourcing Data
Sourcing data is the first step and often the first challenge. Data can live in various data stores, with different access permissions, and can be littered with personally identifiable information (PII).
The first step in data preparation involves sourcing data from the right places and consolidating data from different data lakes within an organization. This can be difficult if the model’s inputs, predictions, and actuals are received at different time periods and stored in separate data stores. Setting a common prediction or transaction ID can help tie predictions with their actuals.
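As a minimal sketch of this idea, the snippet below joins a hypothetical predictions table with an actuals table on a shared prediction_id; the table and column names are illustrative, not from any specific platform:

```python
import pandas as pd

# Hypothetical prediction log written at serving time
predictions = pd.DataFrame({
    "prediction_id": ["a1", "a2", "a3"],
    "predicted_churn": [0.81, 0.12, 0.45],
    "predicted_at": pd.to_datetime(["2023-01-05", "2023-01-05", "2023-01-06"]),
})

# Ground-truth outcomes that arrive later from a different data store
actuals = pd.DataFrame({
    "prediction_id": ["a1", "a3"],
    "churned": [1, 0],
})

# A shared prediction_id lets us tie each prediction back to its actual,
# even though the two tables live in different systems and time periods.
joined = predictions.merge(actuals, on="prediction_id", how="left")
print(joined)
```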
This stage often involves data management, data governance, and legal teams to determine which data sources are available for use. The roles working in this stage usually include the data engineer, data scientist, legal, and IT.
Example ML Infrastructure Companies in Data Storage: Elasticsearch, Hive, Qubole
Completeness
Once the data is sourced, a series of completeness checks is needed to determine whether the collected data can be turned into meaningful features. First, it is important to understand how much historical data is available. This helps determine whether the model builder has enough data for training purposes (a year's worth of data, etc.). Having data that covers seasonal cycles and known anomalies helps the model build resilience.
Data completeness also includes checking whether the data has proper labels. Many companies struggle with the cleanliness of their raw data: multiple labels can mean the same thing, and some data will be unlabeled or mislabeled. Several vendors offer data labeling services that employ a mix of technology and people to add labels to data and clean up issues.
Example ML Infrastructure Companies in Data Labeling: Scale AI, Figure Eight, Labelbox, Amazon SageMaker
It is also important to check whether the data represents the distribution the model will actually see. Was the data collected over an unusual period? This is a tougher question because it is specific to the business, and data will continue to change over time.
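As an illustration of these completeness checks, a few quick pandas operations can surface how much history a dataset covers, how many records are unlabeled, and whether label spellings are inconsistent (the file and column names below are hypothetical):

```python
import pandas as pd

# Hypothetical dataset with an event date and a label column
df = pd.read_csv("transactions.csv", parse_dates=["event_date"])

# How much history is available? Enough to cover seasonal cycles?
span = df["event_date"].max() - df["event_date"].min()
print(f"History covered: {span.days} days")

# How many rows are missing a label entirely?
print(f"Unlabeled rows: {df['label'].isna().sum()}")

# Do multiple label spellings mean the same thing (e.g. 'Fraud' vs 'fraud')?
print(df["label"].dropna().astype(str).str.strip().str.lower().value_counts())
```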
Data Processing
Once the data is collected and there is enough of it across time with the proper labels, a series of data transforms takes the raw data to features the model can understand. This stage is specific to the types of data the business is using. For categorical values, it is common practice to use one-hot encoding. For numeric values, there is usually some form of normalization based on the distribution of the data. A key part of this process is understanding your data, including its distributions.
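As a minimal sketch of these two common transforms, with made-up feature names:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "device_type": ["ios", "android", "web", "ios"],   # categorical
    "purchase_amount": [12.0, 250.0, 40.0, 9.5],        # numeric
})

# One-hot encode the categorical column
encoded = pd.get_dummies(df, columns=["device_type"])

# Normalize the numeric column based on its distribution (zero mean, unit variance)
scaler = StandardScaler()
encoded["purchase_amount"] = scaler.fit_transform(encoded[["purchase_amount"]])

print(encoded)
```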
Data processing can also involve additional data cleaning and data quality checks. Since models depend on the data they are trained on, it is important to ensure clean data by removing duplicated events and fixing indexing and other data quality issues.
A set of data wrangling companies allows data scientists, business analysts, and data engineers to define transformation rules to clean and prepare the data. These platforms range from no-code and low-code options to developer-focused tools.
Lastly, ongoing data quality checks are run on training data to make sure that what is clean today will be clean tomorrow.
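A bare-bones sketch of such checks, using plain pandas assertions and a hypothetical event_id key (a dedicated data quality framework would be more robust in practice):

```python
import pandas as pd

def run_quality_checks(df: pd.DataFrame) -> pd.DataFrame:
    """Drop duplicates and fail loudly on basic data quality issues."""
    # Remove duplicated events so the model does not over-weight them
    df = df.drop_duplicates(subset=["event_id"])  # 'event_id' is a hypothetical key

    # The key should be unique after deduplication
    assert df["event_id"].is_unique, "Duplicate event_id values remain"

    # Required columns must not contain nulls
    for col in ["event_id", "label"]:
        assert df[col].notna().all(), f"Null values found in required column '{col}'"

    return df
```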
Data preparation is integral to the model's performance, and getting complete, clean data poses many challenges. With all of the work that goes into building a training dataset, from data sourcing through data transformations, it can be difficult to track the versioned transformations that impact model performance. As an organization grows, a feature store with common data transformations can reduce duplicative work and compute costs.
ML Infrastructure Companies in Data Wrangling: Trifacta, Paxata, Alteryx
ML Infrastructure Companies in Data Processing: Spark, Databricks, Qubole, Hive
ML Infrastructure Companies in Data Versioning, Feature Storage & Feature Extraction: Stealth Startups, Pachyderm, Alteryx
What Happens After Data Preparation
Once data scientists have the data ready, the handoff between data preparation and model building varies. In some cases it is structured, with a data file or feature store holding the processed data; in other cases the handoff is fluid. In larger organizations, the data engineering team is responsible for getting the data into a format that data scientists can use for model building.
In many managed notebooks, such as Databricks Managed Notebooks, Cloudera Data Science Workbench, and Domino Data Lab notebooks, the data preparation workflow is not separate from model building. Feature selection depends on the data, so that function begins to blur the line between data preparation and model building.
ML Infrastructure Companies in Notebook Management: Databricks, Cloudera Workbench, Domino, Stealth Startups
What is Model Building?
The first step of model building begins with understanding the business needs: what business need is the model addressing? This step starts much earlier, at the planning and ideation phase of the ML workflow. During that phase, much like the software development lifecycle, data scientists gather requirements, consider feasibility, and create a plan for data preparation, model building, and production. In the model building stage, they use the data to explore the various experiments they considered during planning.
Feature Exploration and Selection
As part of this experimental process, data scientists explore various data input options to select features. Feature selection is the process of finding the feature inputs for machine learning models. For a new model, this can be a lengthy process of understanding the data inputs available, the importance of each input, and the relationships between different feature candidates. A number of decisions are made here to balance model interpretability, training time, the cost of acquiring features, and the risk of overfitting. Figuring out the right features is a constant, iterative process.
ML Infrastructure companies in Feature Extraction: Alteryx/Feature Labs, Paxata(DataRobot)
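As a simple illustration of exploring feature importance during selection (using generic scikit-learn utilities on synthetic data, not any particular vendor's tooling):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import mutual_info_classif

# Synthetic data standing in for prepared features
X, y = make_classification(n_samples=1000, n_features=8, n_informative=3, random_state=0)

# Mutual information gives a quick, model-free view of each candidate feature
mi_scores = mutual_info_classif(X, y, random_state=0)

# A fitted tree ensemble gives a model-based view of feature importance
model = RandomForestClassifier(random_state=0).fit(X, y)

for i, (mi, imp) in enumerate(zip(mi_scores, model.feature_importances_)):
    print(f"feature_{i}: mutual_info={mi:.3f}, rf_importance={imp:.3f}")
```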
Model Management
There are several modelling approaches a data scientist can try, and some types of models are better for certain tasks than others (e.g., tree-based models are more interpretable). By the end of the ideation phase, it will be evident whether the problem is supervised or unsupervised, classification or regression, and so on. However, deciding which modelling approach, hyperparameters, and features to use depends on experimentation. Some AutoML platforms will try several different models with various parameters, which can help establish a baseline. Even when done manually, exploring various options can give the model builder insight into model behavior and interpretability.
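To make this concrete, here is a minimal sketch of the kind of baseline sweep an AutoML tool automates, comparing a few standard scikit-learn models with cross-validation on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(random_state=0),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
}

# Each candidate establishes a baseline; the best one anchors further experiments
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
    print(f"{name}: mean AUC = {scores.mean():.3f}")
```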
Experiment Tracking
While there are advantages and tradeoffs amongst the various types of models, in general this phase involves many experiments. Several platforms exist to track these experiments, modelling dependencies, and model storage; these functions are broadly categorized as model management. Some platforms focus primarily on experiment tracking. Other companies that have training and/or serving components include model management features for comparing the performance of various models, tracking training/test datasets, tuning and optimizing hyperparameters, storing evaluation metrics, and enabling detailed lineage and version control. Like GitHub for software, these model management platforms should enable version control, historical lineage, and reproducibility.
A tradeoff between these various model management platforms is the cost of integration. Some lightweight platforms only offer experiment tracking, but integrate easily with the current environment and can be imported into data science notebooks. Others require heavier integration work and ask model builders to move onto their platform so that model management is centralized.
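As an illustration of the lightweight end of this spectrum, the snippet below logs parameters and metrics for a run using MLflow's tracking API (one of the tools listed below); the parameter names and metric values are hypothetical:

```python
import mlflow

# A lightweight tracking workflow: log parameters and metrics per run
with mlflow.start_run(run_name="baseline_rf"):
    mlflow.log_param("n_estimators", 200)   # hypothetical hyperparameter
    mlflow.log_param("max_depth", 8)
    mlflow.log_metric("train_auc", 0.93)    # hypothetical training metric
    mlflow.log_metric("val_auc", 0.87)      # hypothetical validation metric
    # Model artifacts, dataset hashes, etc. can also be logged here
```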
In this phase of the machine learning workflow, data scientists usually spend their time building models in notebooks, training models, storing the model weights in a model store, and then evaluating the results on a validation set. Several platforms exist to provide compute resources for training, and there are several storage options depending on how teams want to store the model object.
ML Infrastructure AutoML: H2O, SageMaker, DataRobot, Google Cloud ML, Microsoft ML
ML Infrastructure companies in Experiment Tracking: Weights and Biases, Comet ML, MLflow, Domino, TensorBoard
ML Infrastructure companies in Model Management: Domino Data Labs, SageMaker
ML Infrastructure companies in Hyperparameter Optimization: SigOpt, Weights and Biases, SageMaker
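For teams tuning by hand before adopting one of these platforms, a randomized search over an assumed hyperparameter space is a common starting point; the ranges below are purely illustrative:

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# Illustrative search space; real ranges depend on the model and data
param_distributions = {
    "n_estimators": randint(50, 400),
    "max_depth": randint(3, 20),
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=20,
    cv=5,
    scoring="roc_auc",
    random_state=0,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```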
Model Evaluation
Once an experimental model has been trained on a training data set with the selected features, it is evaluated on a test set. This evaluation phase involves the data scientist trying to understand the model's performance and areas for improvement. Some more advanced ML teams have an automated backtesting framework to evaluate model performance on historical data.
Each experiment tries to beat the baseline model's performance while considering the tradeoffs in compute cost, interpretability, and ability to generalize. In more regulated industries, this evaluation process can also encompass compliance reviews and auditing by external reviewers to verify the model's reproducibility, performance, and adherence to requirements.
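A bare-bones version of this evaluation step against a held-out test set might look like the following sketch (synthetic data, standard scikit-learn metrics):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

# Evaluate on the held-out test set, not the data the model was trained on
probs = model.predict_proba(X_test)[:, 1]
print("Test AUC:", roc_auc_score(y_test, probs))
print(classification_report(y_test, model.predict(X_test)))
```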
ML Infrastructure Model Evaluation: Fiddler AI, TensorBoard, Stealth Startups
ML Infrastructure Pre-Launch Validation: Fiddler AI, Arize AI
One Platform to Rule Them All
A few companies that center on AutoML or model building pitch a single platform for everything. They are vying to be the single AI platform an enterprise uses across data preparation, model building, and production. These companies include DataRobot, H2O, SageMaker, and a few others.
This set splits into low-code versus developer-centric solutions. DataRobot seems focused on the no-code/low-code option that allows BI or finance teams to take up data science projects. This contrasts with SageMaker and H2O, which seem to cater to data scientist or developer-first teams, the more common data science organizations today. The markets in both cases are large and can coexist, but it is worth noting that not all ML Infrastructure companies are selling to the same people or teams.
Several of the more recent entrants in the space can be thought of as best-of-breed solutions for a specific part of the ML Infrastructure food chain. The best analog is the software engineering space, where your GitHub, your IDE, and your production monitoring are not all the same end-to-end system. There are reasons why they are different pieces of software; they provide very different functions with clear differentiation.
Challenges
Unlike its software development parallel, reproducibility of models is often a challenge, primarily due to a lack of version control on the data the model was trained on.
There are a number of challenges in understanding a model's performance. How can one compare experiments and determine which version of the model strikes the best balance of performance and tradeoffs? One tradeoff might be accepting a slightly less performant model that is more interpretable. Some data scientists use built-in model explainability features or explore feature importance using SHAP or LIME.
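As a small sketch of the SHAP approach (assuming a tree-based model trained on synthetic data; the plotting call is left commented out):

```python
import shap
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=6, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X, y)

# TreeExplainer computes per-feature contributions for each prediction
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X[:100])

# shap.summary_plot(shap_values, X[:100])  # visual summary of feature impact
print("SHAP values computed for", len(X[:100]), "rows")
```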
Another challenge is not knowing how model performance in this experimental stage will translate to the real world. This is best mitigated by making sure the training data set is representative of the distribution the model is likely to see in production, which helps prevent overfitting to the training set. This is where cross-validation and backtesting frameworks are helpful.
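One simple way to sanity-check whether training data resembles production data is to compare feature distributions, for example with a two-sample Kolmogorov-Smirnov test; the sketch below uses synthetic stand-in samples and is a rough heuristic, not a full drift-detection setup:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
train_feature = rng.normal(loc=0.0, scale=1.0, size=5000)  # stand-in for a training feature
prod_feature = rng.normal(loc=0.3, scale=1.0, size=5000)   # the same feature as seen in production

# A small p-value suggests the two samples come from different distributions
stat, p_value = ks_2samp(train_feature, prod_feature)
print(f"KS statistic = {stat:.3f}, p-value = {p_value:.3g}")
```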
What happens next?
For a data scientist, it is important to determine a criterion for when the model is ready to be pushed to production. If a pre-existing model is already deployed, the criterion might be that the new version's performance is higher. Regardless, setting one is important for moving the experiment into a real-world environment.
Once the model has been trained, the model image/weights are stored in a model store. This is usually when the data scientist or engineer responsible for deploying the model into production can fetch the model and use it for serving. On some platforms, deployment can be even simpler, and a deployed model can be configured with a REST API that external services can call.
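As an illustrative sketch (not any particular platform's deployment path), a model fetched from a model store can be exposed behind a simple REST endpoint, here using FastAPI with a hypothetical model path and feature schema:

```python
from typing import List

import joblib
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = joblib.load("model_store/churn_model.joblib")  # hypothetical path into a model store

class Features(BaseModel):
    values: List[float]  # ordered feature vector expected by the model

@app.post("/predict")
def predict(features: Features):
    # Score a single feature vector and return the prediction
    prediction = model.predict([features.values])[0]
    return {"prediction": float(prediction)}
```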
A sample AI / ML Infrastructure setup: