What if you don't know the ground truth?

SujithKumar Chandrasekaran

Engagement Manager @ HSBC | Driving Data Ingestion Growth

Published: July 1, 2022

John, a class 12 student, has studied well, taken the revision tests several times, and improved his marks in internal exams. As a result, he is 100% confident that he can score 95% or above in the final board exam. This confidence rests on the assumption that all questions will come from the syllabus he has prepared.

He sat the exam, and the board announced that it would publish the results after six months. Most students waited for the results before starting work.

However, a Fortune 500 company hired John based on his internal marks while the board results were still awaited. So what assumption is the organisation making here?

The assumption is that his score will be more or less the same as in the internal exams, as long as the syllabus is the same. So before the ground truth is available, i.e. before the board exam results are published, the organisation hires him based on how many questions came from the syllabus John had prepared. As all the questions came from the syllabus, John is expected to score 95% in the board exam, in line with his internal test results.

Why are we discussing this here? Because this is what we do in machine learning as well.

The ground truth is often not immediately available for assessing a model deployed in production. For example, a fraud detection model classifies customers as fraudulent or not, but whether they are actual fraudsters is not known until the bank's compliance team investigates them. This process usually takes six months to one year, which means you can assess the model's performance only after that time. But can we wait until then? What if the model doesn't perform well in production? If it doesn't, you would potentially have made incorrect decisions throughout those six months. Isn't that a disaster? So what should we do? The answer is to check whether the data in production follows a similar pattern to the training dataset. It is like checking whether the board exam question paper contains questions from outside the syllabus.
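One common way to run such a check on a single numeric feature is the Population Stability Index (PSI), which compares the histogram of the training data with the histogram of the production data. The sketch below is a minimal illustration and not a method the article prescribes: the function name, the choice of 10 equal-width bins, the small floor for empty bins, and the sample values are all assumptions made for this example.

```python
import math
from typing import List, Sequence


def population_stability_index(
    train: Sequence[float], prod: Sequence[float], bins: int = 10
) -> float:
    """Compare two samples of one numeric feature; 0.0 means identical histograms."""
    lo, hi = min(train), max(train)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 for a constant feature

    def bin_shares(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            # Values outside the training range are clamped into the edge bins.
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1
        # A small floor (an assumption) avoids log(0) for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    p, q = bin_shares(train), bin_shares(prod)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))


# Made-up transaction amounts, for illustration only.
train_amounts = [float(x % 100) for x in range(1000)]
same = population_stability_index(train_amounts, train_amounts)
shifted = population_stability_index(train_amounts, [x + 40 for x in train_amounts])
```

Here `same` is zero because the two samples are identical, while `shifted` is clearly positive because the production values have moved outside much of the training range. What counts as a "large" PSI is a judgment call that depends on the use case.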

If you find that the data in production is different, the data has drifted. Depending on how big the drift is, you decide whether to redo the feature engineering and retrain, or to rebuild the model from scratch. By the way, why do we have data drift?

There are three potential high-level causes of data drift.

1. Sampling bias: If your training dataset is not a complete reflection of the population, you will see data drift and a potential drop in performance very soon in production.

2. Non-stationary environment: If you are predicting sales, be mindful that seasonal fluctuations can affect future sales. Suppose your training data was collected in the first half of the year, and the model is deployed to predict sales for the entire year. In that case, it will encounter significant data drift in the second half due to major festivals and changes in customers' purchasing patterns.

3. Change in market conditions: This is the most crucial reason for data drift. Innovation, new product development, the digital revolution, lifestyle changes, and government policies affect how we do business, learn, purchase, bank, travel, etc. Therefore, the data is bound to change over time.
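The seasonal case in point 2 can be illustrated with a small sketch. It uses the two-sample Kolmogorov-Smirnov statistic (the largest gap between two empirical distribution functions) as one possible drift measure; this choice, the function name, and the synthetic sales figures are all assumptions for illustration and are not taken from real data.

```python
from bisect import bisect_right
from typing import Sequence


def ks_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    """Largest gap between the empirical CDFs of two samples (0 = same, 1 = disjoint)."""
    sa, sb = sorted(a), sorted(b)
    gap = 0.0
    # The largest gap is always reached at one of the observed values.
    for x in sa + sb:
        cdf_a = bisect_right(sa, x) / len(sa)
        cdf_b = bisect_right(sb, x) / len(sb)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


# Synthetic daily sales, invented for illustration only.
first_half = [100 + (day % 7) * 5 for day in range(180)]  # training period
second_half = [sale + 15 for sale in first_half]           # assumed festival uplift

no_drift = ks_statistic(first_half, first_half)
seasonal_drift = ks_statistic(first_half, second_half)
```

A model trained only on `first_half` never saw the higher sales levels in `second_half`, and the statistic makes that gap visible before any ground truth arrives: `no_drift` is zero, while `seasonal_drift` is well above zero.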

Whatever the reason, high data drift means a strong likelihood that the model's performance is compromised. But how often should we retrain or rebuild? It depends on many things, but here are the factors you should consider.

If the data drifts faster than the lag between the prediction time and the time at which the ground truth becomes known, it is risky to deploy the model in production.

But if the data drifts more slowly than the lag, retraining once a year is usually a good frequency.
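The rule of thumb above can be written down as a simple comparison. This is only a sketch: the function name, the idea of expressing drift speed as "days until the drift becomes material", and the example numbers are hypothetical, and estimating that drift horizon in practice is a separate problem.

```python
def deployment_risk(days_until_material_drift: float, label_lag_days: float) -> str:
    """Compare how fast the data drifts with how long the ground truth takes to arrive."""
    if days_until_material_drift < label_lag_days:
        # The data changes before any prediction can be verified.
        return "risky"
    return "acceptable"


# Label lag of roughly 180 days, as in the fraud investigation example;
# the drift horizons (90 and 720 days) are made-up values.
fast_drift = deployment_risk(days_until_material_drift=90, label_lag_days=180)
slow_drift = deployment_risk(days_until_material_drift=720, label_lag_days=180)
```

With these inputs, `fast_drift` is `"risky"` and `slow_drift` is `"acceptable"`.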

In some cases, the ground truth is available within minutes. For instance, a recommendation engine recommends something, and if the customer purchases based on the recommendation, the ground truth is generated immediately. Does that mean we need to assess the performance every minute? Well, that is overkill; you could instead set a minimum interval for retraining, such as once a day or once a week.

I hope you enjoyed reading this article. Please like, share and comment.

Views are personal.

Image Credit:

Photo by Eren Li: https://www.pexels.com/photo/hispanic-girl-whispering-secret-on-ear-of-friend-7168996/

References:

https://en.wikipedia.org/wiki/Ground_truth

https://www.oreilly.com/library/view/introducing-mlops/9781492083283/
