In artificial intelligence (AI) and data science projects, baseline models are built using the available data, domain knowledge, a set of original or transformed features, and machine learning (ML) algorithms suitable for the domain, the problem type, the size of the data and feature set, and the infrastructure and high-performance computing (HPC) resources available. The performance of the baseline model is then assessed, and improvement is usually attempted using the following techniques:
- Further capturing and embedding domain knowledge and business processes into the AI and ML algorithms can provide more perspective on and understanding of the problem, the variables (inputs and targets), and the relationships between variables.
- Collecting and adding more high-quality data to the training dataset provides more information for the algorithms to learn from and is likely to improve model performance. Adding low-quality, non-conformant data not only does not help but can worsen model performance.
- Quality control and data wrangling, e.g. cleaning or imputation (replacing missing values with meaningful data), are likely to improve model performance; precautions need to be taken, because if important data is removed in the data-wrangling step or missing data is replaced with wrong values, performance is likely to deteriorate (see the imputation sketch after this list).
- Feature selection, transformation, and engineering, if done appropriately, can boost the performance and accuracy of the models significantly. All three require domain knowledge and help extract more information from the data (a feature-engineering sketch follows the list).
- Hyperparameter tuning of the algorithms is both an art and a science; it is usually time-consuming, requires experience, and usually cannot be fully automated. There are many tools and packages to facilitate or automate the process; however, human intelligence combined with those tools usually yields better results than fully automated hyperparameter tuning (a tuning sketch follows the list).
- The split and cross-validation strategy strongly affects the results of hyperparameter tuning. With luck, a simple train-validation-test split may yield a good improvement in the accuracy metrics; however, a more robust and generalisable model is likely to be achieved with k-fold cross-validation (for large sample sizes) or leave-one-out cross-validation (for small sample sizes).
- Algorithm selection usually depends on the problem type, the context, and the size of the data (number of samples and features). Other important selection factors are the ability to tune the algorithm to balance bias and variance (correcting under- and over-fitting), explainability, and runtime and HPC requirements (a comparison sketch follows the list).
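As a concrete illustration of technique 3, below is a minimal sketch of quality control and imputation with pandas and scikit-learn. The file name and column names are hypothetical placeholders, and the median fill is only a conservative default that domain knowledge may override.

```python
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical input file and column names.
df = pd.read_csv("training_data.csv")

# Quality control: drop rows whose target is missing rather than fabricate it.
df = df.dropna(subset=["target"])

# Impute gaps in numeric features with the median, a conservative default;
# domain knowledge may suggest better per-column fill values.
numeric_cols = df.select_dtypes(include="number").columns.drop("target", errors="ignore")
imputer = SimpleImputer(strategy="median")
df[numeric_cols] = imputer.fit_transform(df[numeric_cols])
```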
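For technique 4, the sketch below shows two simple feature-engineering moves: a log transform of a skewed variable and a ratio of two raw inputs. The column names (volume, pressure, temperature) are purely illustrative assumptions.

```python
import numpy as np
import pandas as pd

def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with illustrative engineered features added."""
    out = df.copy()
    # Log-transform a heavily skewed variable so the model sees a more
    # symmetric distribution.
    out["log_volume"] = np.log1p(out["volume"])
    # Combine raw inputs into a ratio with physical meaning, which often
    # carries more signal than either column alone.
    out["pressure_per_temperature"] = out["pressure"] / out["temperature"]
    return out
```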
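For techniques 5 and 6, the following sketch combines hyperparameter tuning and k-fold cross-validation using scikit-learn's GridSearchCV. The estimator, the parameter grid, and the X_train/y_train arrays are assumptions; for small sample sizes, KFold could be swapped for LeaveOneOut.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, KFold

# Illustrative parameter grid; the useful ranges depend on the data.
param_grid = {
    "n_estimators": [200, 500],
    "max_depth": [None, 10, 20],
    "min_samples_leaf": [1, 5],
}

# 5-fold cross-validation; LeaveOneOut is an option for very small datasets.
cv = KFold(n_splits=5, shuffle=True, random_state=42)

search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    cv=cv,
    scoring="neg_mean_absolute_error",
    n_jobs=-1,
)
search.fit(X_train, y_train)  # X_train, y_train assumed to be prepared already
print(search.best_params_, search.best_score_)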
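Finally, for technique 7, a quick way to shortlist candidate algorithms is to score each under the same cross-validation scheme. The two candidate models and the X/y data below are illustrative assumptions.

```python
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Illustrative candidates; in practice the shortlist follows from the
# problem type, data size, and explainability requirements.
candidates = {
    "ridge": Ridge(alpha=1.0),
    "gbm": GradientBoostingRegressor(random_state=42),
}

for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_absolute_error")
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```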
Of the seven techniques above, 1 to 3 relate to the data. Our experience in real-life AI applications has shown that the data-related techniques are the more impactful and helpful ones for improving model performance. This contrasts with the toy, learning-purpose, and standard AI and ML applications found on the net, for which data science and AI enthusiasts rely on techniques 4 to 7 to improve performance.
We always need to be mindful of the concept of garbage in, garbage out (GIGO). To avoid the garbage-in situation, errors and biases in the data need to be identified and corrected with help from subject matter experts (SMEs). If they are not corrected, input data errors lead to false or biased results, regardless of the algorithm, the techniques, and the effort we put into model improvement.
Comments:
Professor, GeoData Science: Spatial Stats & UQ (2 yr): How about the model set-up? The way you feed the data can vary dramatically.
Senior Data Scientist at Marathon Petroleum Corporation (2 yr): Exactly Aso, I spend all my time on the quality and transformation of the (subsurface) data, particularly if you can coerce it into non-dimensional groups or compound features with physical relevance (e.g. material-balance groups). Usually the first-pass ML model isn't much improved by changing the model family or hyperparameter tuning (assuming you made a reasonable first selection of the statistical model to be used), but generalisability can really be improved by adding physical constraints to the loss function.