Data Modeling for AI and ML Infrastructure Setup
Harikrishnan Sugumaran
Enterprise Architect | Solution Architect | Digital Transformation Architect | TOGAF | COBIT | Master of Strategy Leadership Certified | SAP LeanIX Certified | Cloud, Big Data, AI, IoT, RPA, Blockchain
Artificial Intelligence (AI) and Machine Learning (ML) are being adopted by businesses in almost every industry. Many businesses are looking to ML Infrastructure platforms to accelerate their adoption of AI. Understanding the various platforms and offerings can be a challenge: the ML Infrastructure space is crowded, confusing, and complex, with many platforms and tools spanning a variety of functions across the model building workflow.
To understand the ecosystem, it helps to break the machine learning workflow into three stages: data preparation, model building, and production. Understanding the goals and challenges of each stage can help you make an informed decision about which ML Infrastructure platforms are best suited to your business's needs.
Each of these broad stages of the machine learning workflow (Data Preparation, Model Building, and Production) has several vertical functions. Some of these functions are part of a larger end-to-end platform, while others are the sole focus of specialized platforms.
Since models learn from data, the first step of building a model is data preparation: the process of extracting model inputs from raw data. There are several tools to help data scientists source data, transform data, and add labels to datasets. In this blog post, we will dive into the goals of data preparation, the challenges organizations face in this stage of the ML workflow, and how data scientists decide it is time to move on to the next stage.
What is Data Preparation?
Ask any data scientist and they will tell you A LOT of their time is spent on data preparation. This phase of the pipeline turns raw data into the input features used to train the model. Features are transformations of the cleaned data that provide the actual model inputs.
In the early stages of the pipeline, raw data is sourced across the different data stores and lakes in an organization. The next stage involves data processing to clean, transform, and extract features to generate consistent inputs for the feature selection stage. Large tech companies at the forefront of ML Infrastructure (Google, Facebook, Uber, etc.) typically have central feature storage so that many teams can extract value without duplicating work.
The data preparation stage involves several steps: sourcing data, ensuring completeness, adding labels, and data transformations to generate features.
Sourcing Data
Sourcing data is the first step and often the first challenge. Data can live in various data stores, with different access permissions, and can be littered with personally identifiable information (PII).
The first step in data preparation involves sourcing data from the right places and consolidating data from different data lakes within an organization. This can be difficult if the model’s inputs, predictions, and actuals are received at different time periods and stored in separate data stores. Setting a common prediction or transaction ID can help tie predictions with their actuals.
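As a minimal sketch of this idea, the snippet below joins a hypothetical predictions table with an actuals table on a shared prediction_id; the table and column names are illustrative, not from any specific platform:

```python
import pandas as pd

# Hypothetical prediction log written at serving time
predictions = pd.DataFrame({
    "prediction_id": ["a1", "a2", "a3"],
    "predicted_churn": [0.81, 0.12, 0.45],
    "predicted_at": pd.to_datetime(["2023-01-05", "2023-01-05", "2023-01-06"]),
})

# Ground-truth outcomes that arrive later from a different data store
actuals = pd.DataFrame({
    "prediction_id": ["a1", "a3"],
    "churned": [1, 0],
})

# A shared prediction_id lets us tie each prediction back to its actual,
# even though the two tables live in different systems and time periods.
joined = predictions.merge(actuals, on="prediction_id", how="left")
print(joined)
```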
This stage often involves data management, data governance, and legal teams to determine which data sources are available for use. The roles working in this stage usually include the data engineer, data scientist, legal, and IT.
Example ML Infrastructure Companies in Data Storage: Elasticsearch, Hive, Qubole
Completeness
Once the data is sourced, a series of completeness checks is needed to determine whether the collected data can be turned into meaningful features. First, it is important to understand how much historical data is available. This helps determine whether the model builder has enough data for training purposes (a year's worth of data, etc.). Having data that covers seasonal cycles and known anomalies helps the model build resilience.
Data completeness also includes checking whether the data has proper labels. Many companies struggle with the cleanliness of their raw data: multiple labels can mean the same thing, and some data will be unlabeled or mislabeled. Several vendors offer data labeling services that employ a mix of technology and people to add labels to data and clean up issues.
Example ML Infrastructure Companies in Data Labeling: Scale AI, Figure Eight, Labelbox, Amazon SageMaker
It is also important to check whether the data represents the distribution the model will actually see. Was the data collected over an unusual period? This is a tougher question because it is specific to the business, and data will continue to change over time.
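As an illustration of these completeness checks, a few quick pandas operations can surface how much history a dataset covers, how many records are unlabeled, and whether label spellings are inconsistent (the file and column names below are hypothetical):

```python
import pandas as pd

# Hypothetical dataset with an event date and a label column
df = pd.read_csv("transactions.csv", parse_dates=["event_date"])

# How much history is available? Enough to cover seasonal cycles?
span = df["event_date"].max() - df["event_date"].min()
print(f"History covered: {span.days} days")

# How many rows are missing a label entirely?
print(f"Unlabeled rows: {df['label'].isna().sum()}")

# Do multiple label spellings mean the same thing (e.g. 'Fraud' vs 'fraud')?
print(df["label"].dropna().astype(str).str.strip().str.lower().value_counts())
```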
Data Processing
Once the data is collected and there is enough of it across time with the proper labels, a series of data transforms takes the raw data to features the model can understand. This stage is specific to the types of data the business is using. For categorical values, it is common practice to use one-hot encoding. For numeric values, there is usually some form of normalization based on the distribution of the data. A key part of this process is understanding your data, including its distributions.
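As a minimal sketch of these two common transforms, with made-up feature names:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "device_type": ["ios", "android", "web", "ios"],   # categorical
    "purchase_amount": [12.0, 250.0, 40.0, 9.5],        # numeric
})

# One-hot encode the categorical column
encoded = pd.get_dummies(df, columns=["device_type"])

# Normalize the numeric column based on its distribution (zero mean, unit variance)
scaler = StandardScaler()
encoded["purchase_amount"] = scaler.fit_transform(encoded[["purchase_amount"]])

print(encoded)
```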
Data processing can also involve additional data cleaning and data quality checks. Since models depend on the data they are trained on, it is important to ensure clean data by removing duplicated events and fixing indexing and other data quality issues.
A set of data wrangling companies allows data scientists, business analysts, and data engineers to define transformation rules to clean and prepare the data. These platforms range from no-code and low-code options to developer-focused tools.
Lastly, ongoing data quality checks are run on training data to make sure that what is clean today will be clean tomorrow.
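A bare-bones sketch of such checks, using plain pandas assertions and a hypothetical event_id key (a dedicated data quality framework would be more robust in practice):

```python
import pandas as pd

def run_quality_checks(df: pd.DataFrame) -> pd.DataFrame:
    """Drop duplicates and fail loudly on basic data quality issues."""
    # Remove duplicated events so the model does not over-weight them
    df = df.drop_duplicates(subset=["event_id"])  # 'event_id' is a hypothetical key

    # The key should be unique after deduplication
    assert df["event_id"].is_unique, "Duplicate event_id values remain"

    # Required columns must not contain nulls
    for col in ["event_id", "label"]:
        assert df[col].notna().all(), f"Null values found in required column '{col}'"

    return df
```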
Data preparation is integral to the model's performance, and getting complete, clean data poses many challenges. With all of the work that goes into building a training dataset, from data sourcing through data transformations, it can be difficult to track the versioned transformations that impact model performance. As an organization grows, a feature store with common data transformations can reduce duplicative work and compute costs.
ML Infrastructure Companies in Data Wrangling: Trifacta, Paxata, Alteryx
ML Infrastructure Companies in Data Processing: Spark, Databricks, Qubole, Hive
ML Infrastructure Companies in Data Versioning, Feature Storage & Feature Extraction: Stealth Startups, Pachyderm, Alteryx
What Happens After Data Preparation
Once data scientists have the data ready, the handoff between data preparation and model building varies. In some cases it is structured, with a data file or feature store holding the processed data; in other cases the handoff is fluid. In larger organizations, the data engineering team is responsible for getting the data into a format that data scientists can use for model building.
In many managed notebooks, such as Databricks Managed Notebooks, Cloudera Data Science Workbench, and Domino Data Lab notebooks, the data preparation workflow is not separate from model building. Feature selection depends on the data, so that function begins to blur the line between data preparation and model building.
ML Infrastructure Companies in Notebook Management: Databricks, Cloudera Workbench, Domino, Stealth Startups
What is Model Building?
The first step of model building begins with understanding the business needs: what business need is the model addressing? This step starts much earlier, at the planning and ideation phase of the ML workflow. During that phase, much like the software development lifecycle, data scientists gather requirements, consider feasibility, and create a plan for data preparation, model building, and production. In the model building stage, they use the data to explore the various experiments they considered during planning.
Feature Exploration and Selection
As part of this experimental process, data scientists explore various data input options to select features. Feature selection is the process of finding the feature inputs for machine learning models. For a new model, this can be a lengthy process of understanding the data inputs available, the importance of each input, and the relationships between different feature candidates. A number of decisions are made here to balance model interpretability, training time, the cost of acquiring features, and the risk of overfitting. Figuring out the right features is a constant, iterative process.
ML Infrastructure companies in Feature Extraction: Alteryx/Feature Labs, Paxata(DataRobot)
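As a simple illustration of exploring feature importance during selection (using generic scikit-learn utilities on synthetic data, not any particular vendor's tooling):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import mutual_info_classif

# Synthetic data standing in for prepared features
X, y = make_classification(n_samples=1000, n_features=8, n_informative=3, random_state=0)

# Mutual information gives a quick, model-free view of each candidate feature
mi_scores = mutual_info_classif(X, y, random_state=0)

# A fitted tree ensemble gives a model-based view of feature importance
model = RandomForestClassifier(random_state=0).fit(X, y)

for i, (mi, imp) in enumerate(zip(mi_scores, model.feature_importances_)):
    print(f"feature_{i}: mutual_info={mi:.3f}, rf_importance={imp:.3f}")
```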
Model Management
There are several modelling approaches a data scientist can try, and some types of models are better for certain tasks than others (e.g., tree-based models are more interpretable). By the end of the ideation phase, it will be evident whether the problem is supervised or unsupervised, classification or regression, and so on. However, deciding which modelling approach, hyperparameters, and features to use depends on experimentation. Some AutoML platforms will try several different models with various parameters, which can help establish a baseline. Even when done manually, exploring various options can give the model builder insight into model behavior and interpretability.
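To make this concrete, here is a minimal sketch of the kind of baseline sweep an AutoML tool automates, comparing a few standard scikit-learn models with cross-validation on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(random_state=0),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
}

# Each candidate establishes a baseline; the best one anchors further experiments
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
    print(f"{name}: mean AUC = {scores.mean():.3f}")
```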
Experiment Tracking
While there are advantages and tradeoffs amongst the various types of models, in general this phase involves many experiments. Several platforms exist to track these experiments, modelling dependencies, and model storage; these functions are broadly categorized as model management. Some platforms focus primarily on experiment tracking. Other companies that have training and/or serving components include model management features for comparing the performance of various models, tracking training/test datasets, tuning and optimizing hyperparameters, storing evaluation metrics, and enabling detailed lineage and version control. Like GitHub for software, these model management platforms should enable version control, historical lineage, and reproducibility.
A tradeoff between these various model management platforms is the cost of integration. Some lightweight platforms only offer experiment tracking, but integrate easily with the current environment and can be imported into data science notebooks. Others require heavier integration work and ask model builders to move onto their platform so that model management is centralized.
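As an illustration of the lightweight end of this spectrum, the snippet below logs parameters and metrics for a run using MLflow's tracking API (one of the tools listed below); the parameter names and metric values are hypothetical:

```python
import mlflow

# A lightweight tracking workflow: log parameters and metrics per run
with mlflow.start_run(run_name="baseline_rf"):
    mlflow.log_param("n_estimators", 200)   # hypothetical hyperparameter
    mlflow.log_param("max_depth", 8)
    mlflow.log_metric("train_auc", 0.93)    # hypothetical training metric
    mlflow.log_metric("val_auc", 0.87)      # hypothetical validation metric
    # Model artifacts, dataset hashes, etc. can also be logged here
```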
In this phase of the machine learning workflow, data scientists usually spend their time building models in notebooks, training models, storing the model weights in a model store, and then evaluating the results on a validation set. Several platforms exist to provide compute resources for training, and there are several storage options depending on how teams want to store the model object.
ML Infrastructure AutoML: H2O, SageMaker, DataRobot, Google Cloud ML, Microsoft ML
ML Infrastructure companies in Experiment Tracking: Weights and Biases, Comet ML, MLflow, Domino, TensorBoard
ML Infrastructure companies in Model Management: Domino Data Labs, SageMaker
ML Infrastructure companies in Hyperparameter Optimization: SigOpt, Weights and Biases, SageMaker
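For teams tuning by hand before adopting one of these platforms, a randomized search over an assumed hyperparameter space is a common starting point; the ranges below are purely illustrative:

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# Illustrative search space; real ranges depend on the model and data
param_distributions = {
    "n_estimators": randint(50, 400),
    "max_depth": randint(3, 20),
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=20,
    cv=5,
    scoring="roc_auc",
    random_state=0,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```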
Model Evaluation
Once an experimental model has been trained on a training data set with the selected features, it is evaluated on a test set. This evaluation phase involves the data scientist trying to understand the model's performance and areas for improvement. Some more advanced ML teams have an automated backtesting framework to evaluate model performance on historical data.
Each experiment tries to beat the baseline model's performance while considering the tradeoffs in compute cost, interpretability, and ability to generalize. In more regulated industries, this evaluation process can also encompass compliance reviews and auditing by external reviewers to verify the model's reproducibility, performance, and adherence to requirements.
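A bare-bones version of this evaluation step against a held-out test set might look like the following sketch (synthetic data, standard scikit-learn metrics):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

# Evaluate on the held-out test set, not the data the model was trained on
probs = model.predict_proba(X_test)[:, 1]
print("Test AUC:", roc_auc_score(y_test, probs))
print(classification_report(y_test, model.predict(X_test)))
```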
ML Infrastructure Model Evaluation: Fiddler AI, TensorBoard, Stealth Startups
ML Infrastructure Pre-Launch Validation: Fiddler AI, Arize AI
One Platform to Rule Them All
A few companies that center on AutoML or model building pitch a single platform for everything. They are vying to be the single AI platform an enterprise uses across data preparation, model building, and production. These companies include DataRobot, H2O, SageMaker, and a few others.
This set splits into low-code versus developer-centric solutions. DataRobot seems focused on the no-code/low-code option that allows BI or finance teams to take up data science projects. This contrasts with SageMaker and H2O, which seem to cater to data scientist or developer-first teams, the more common data science organizations today. The markets in both cases are large and can coexist, but it is worth noting that not all ML Infrastructure companies are selling to the same people or teams.
Several of the more recent entrants in the space can be thought of as best-of-breed solutions for a specific part of the ML Infrastructure food chain. The best analog is the software engineering space, where your GitHub, your IDE, and your production monitoring are not all the same end-to-end system. There are reasons why they are different pieces of software; they provide very different functions with clear differentiation.
Challenges
Unlike its software development parallel, reproducibility of models is often a challenge, primarily due to a lack of version control on the data the model was trained on.
There are a number of challenges in understanding a model's performance. How can one compare experiments and determine which version of the model strikes the best balance of performance and tradeoffs? One tradeoff might be accepting a slightly less performant model that is more interpretable. Some data scientists use built-in model explainability features or explore feature importance using SHAP or LIME.
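As a small sketch of the SHAP approach (assuming a tree-based model trained on synthetic data; the plotting call is left commented out):

```python
import shap
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=6, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X, y)

# TreeExplainer computes per-feature contributions for each prediction
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X[:100])

# shap.summary_plot(shap_values, X[:100])  # visual summary of feature impact
print("SHAP values computed for", len(X[:100]), "rows")
```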
Another challenge is not knowing how model performance in this experimental stage will translate to the real world. This is best mitigated by making sure the training data set is representative of the distribution the model is likely to see in production, which helps prevent overfitting to the training set. This is where cross-validation and backtesting frameworks are helpful.
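One simple way to sanity-check whether training data resembles production data is to compare feature distributions, for example with a two-sample Kolmogorov-Smirnov test; the sketch below uses synthetic stand-in samples and is a rough heuristic, not a full drift-detection setup:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
train_feature = rng.normal(loc=0.0, scale=1.0, size=5000)  # stand-in for a training feature
prod_feature = rng.normal(loc=0.3, scale=1.0, size=5000)   # the same feature as seen in production

# A small p-value suggests the two samples come from different distributions
stat, p_value = ks_2samp(train_feature, prod_feature)
print(f"KS statistic = {stat:.3f}, p-value = {p_value:.3g}")
```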
What happens next?
For a data scientist, it is important to determine a criterion for when the model is ready to be pushed to production. If a pre-existing model is already deployed, the criterion might be that the new version's performance is higher. Regardless, setting one is important for moving the experiment into a real-world environment.
Once the model has been trained, the model image/weights are stored in a model store. This is usually when the data scientist or engineer responsible for deploying the model into production can fetch the model and use it for serving. On some platforms, deployment can be even simpler, and a deployed model can be configured with a REST API that external services can call.
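As an illustrative sketch (not any particular platform's deployment path), a model fetched from a model store can be exposed behind a simple REST endpoint, here using FastAPI with a hypothetical model path and feature schema:

```python
from typing import List

import joblib
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = joblib.load("model_store/churn_model.joblib")  # hypothetical path into a model store

class Features(BaseModel):
    values: List[float]  # ordered feature vector expected by the model

@app.post("/predict")
def predict(features: Features):
    # Score a single feature vector and return the prediction
    prediction = model.predict([features.values])[0]
    return {"prediction": float(prediction)}
```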
A sample AI / ML Infrastructure setup: