What if you don't know the ground truth?

John, a class 12 student, has studied well, appeared for the revision tests several times, and improved his marks in internal exams. As a result, he is 100% confident that he can score 95% or above in the final board exam. This confidence is based on the assumption that all questions will come from the syllabus he has prepared.

He sat the exam, and the board announced that it would publish the results after six months. Most students waited for the results before looking for work.

However, a Fortune 500 company hired John on the strength of his internal marks, even though the board results were still awaited. So what assumption is the organisation making here?

The assumption is that his board score will be more or less the same as his internal scores, as long as the syllabus is the same. So before the ground truth is available, i.e. before the board exam results are published, the organisation judges him by how many of the questions came from the syllabus John prepared. Since all the questions came from the syllabus, John is expected to score about 95% in the board exam, in line with his internal test results.

Why are we discussing this here? Because this is what we do in machine learning as well.

The ground truth needed to assess a model deployed in production is often not immediately known. For example, a fraud detection model classifies customers as fraudulent or not. But whether they are actually fraudulent is not known until the bank's compliance team investigates them, and this process usually takes six months to a year. It means that you can assess the model's performance only after six months.

But can we wait until then? What if the model doesn't perform well in production? If it doesn't, you would have been making potentially incorrect decisions for those six months. Isn't that a disaster? So what should we do? The answer is to check whether the data in production follows a similar pattern to the training dataset. It is like checking whether the board exam question paper contains questions from outside the syllabus.
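One way to make this check concrete is to compare the distribution of a feature in training against the same feature in production. The sketch below uses the Population Stability Index (PSI), which is only one of several possible drift measures and is not prescribed by this article. The function name `psi`, the bin count, and the synthetic transaction amounts are all assumptions made for illustration.

```python
import math
import random


def psi(train, prod, n_bins=10, eps=1e-6):
    """Population Stability Index between two numeric samples (illustrative)."""
    # Bin edges are taken from the training sample's quantiles
    # (a design choice of this sketch, not a fixed rule).
    ordered = sorted(train)
    edges = [ordered[int(len(ordered) * i / n_bins)] for i in range(1, n_bins)]

    def proportions(values):
        counts = [0] * n_bins
        for v in values:
            # Bin index = number of edges at or below the value.
            counts[sum(v >= e for e in edges)] += 1
        # eps avoids log(0) when a bin is empty.
        return [max(c / len(values), eps) for c in counts]

    p_train, p_prod = proportions(train), proportions(prod)
    return sum((q - p) * math.log(q / p) for p, q in zip(p_train, p_prod))


# Synthetic data, made up purely for this example.
random.seed(0)
train_amounts = [random.gauss(100, 20) for _ in range(5000)]
prod_same = [random.gauss(100, 20) for _ in range(5000)]     # same "syllabus"
prod_shifted = [random.gauss(130, 20) for _ in range(5000)]  # questions "outside the syllabus"

psi_same = psi(train_amounts, prod_same)
psi_shifted = psi(train_amounts, prod_shifted)
print(f"same distribution:    PSI = {psi_same:.3f}")
print(f"shifted distribution: PSI = {psi_shifted:.3f}")
```

A score near zero suggests production still looks like training; a large score suggests the data has moved, even though no ground-truth labels have arrived yet.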

If the production data looks different, the data has drifted. Depending on how large the drift is, you decide whether to redo the feature engineering and retrain, or rebuild the model from scratch. But why does data drift in the first place?
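The decision described above can be sketched as a simple rule that maps a drift score to an action. This assumes you already have a single drift score such as a Population Stability Index. The cut-offs 0.1 and 0.25 are a commonly quoted rule of thumb for PSI, not values from this article, and the function name `choose_action` is hypothetical. In practice you would tune the thresholds to your own model and data.

```python
def choose_action(drift_score, retrain_at=0.1, rebuild_at=0.25):
    """Map a drift score to an action (thresholds are assumptions)."""
    if drift_score < retrain_at:
        # Small drift: production still resembles training.
        return "keep monitoring"
    if drift_score < rebuild_at:
        # Moderate drift: refresh the features and retrain.
        return "redo feature engineering and retrain"
    # Large drift: the old model's assumptions may no longer hold.
    return "rebuild from scratch"


for score in (0.02, 0.15, 0.40):
    print(score, "->", choose_action(score))
```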

There are three broad causes of data drift.

1. Sampling bias: If your training dataset is not a complete reflection of the population, you will see data drift, and a potential drop in performance, very soon in production.

2. Non-stationary environment: If you are predicting sales, you should be mindful that seasonal fluctuations can affect future sales. Suppose your training data was collected in the first half of the year, and the model is deployed to predict sales for the entire year. In that case, it will encounter significant data drift in the second half because of major festivals and changes in customers' purchasing patterns.

3. Change in market conditions: This is the most crucial reason for data drift. Innovation, new product development, the digital revolution, lifestyle changes, and government policies affect how we do business, learn, purchase, bank, travel, etc. Therefore, the data is bound to change over time.
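The second cause is easy to see with a toy example. The monthly sales figures below are invented for illustration only, and the "model" is deliberately naive: it always predicts the average of the first six months. It is a sketch of the idea, not a real forecasting method.

```python
# Hypothetical monthly sales; the second half includes a festival season.
monthly_sales = [100, 102, 98, 101, 99, 100,    # Jan to Jun (training period)
                 105, 110, 120, 150, 160, 140]  # Jul to Dec (production period)

first_half, second_half = monthly_sales[:6], monthly_sales[6:]

# Naive "model": always predict the average seen during training.
prediction = sum(first_half) / len(first_half)


def mean_abs_error(actuals, predicted):
    """Average absolute gap between actual values and one fixed prediction."""
    return sum(abs(a - predicted) for a in actuals) / len(actuals)


mae_h1 = mean_abs_error(first_half, prediction)
mae_h2 = mean_abs_error(second_half, prediction)
print(f"error on first half:  {mae_h1:.1f}")
print(f"error on second half: {mae_h2:.1f}")
```

The model looks accurate on the period it was trained on and much worse once the seasonal pattern kicks in, even though nothing about the model itself has changed.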

Whatever the reason, high data drift means a strong likelihood that the model's performance is compromised. But how often should we retrain or rebuild? It depends on many things, but here are the factors you should consider.

If the data drifts faster than the lag between the prediction time and the time at which the ground truth is known, it is risky to deploy the model in production.

But if the data drifts more slowly than the lag, then retraining about once a year is usually a good frequency.
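The comparison between drift speed and label lag can be written down as a small rule. The sketch below assumes you can estimate how long the data takes to drift noticeably, which is itself a judgment call. The function name `deployment_advice` and the example durations are hypothetical.

```python
from datetime import timedelta


def deployment_advice(time_to_drift, label_lag):
    """Compare how fast data drifts with how long ground truth takes to arrive."""
    if time_to_drift < label_lag:
        # The data changes before you can ever measure real performance.
        return "risky to deploy: data drifts before ground truth arrives"
    # Ground truth arrives before the data has moved much.
    return "deploy: retraining about once a year is usually enough"


# Example: fraud labels take about six months to arrive.
label_lag = timedelta(days=180)
print(deployment_advice(timedelta(days=60), label_lag))
print(deployment_advice(timedelta(days=365), label_lag))
```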

In some cases, the ground truth is available within minutes. For instance, when a recommendation engine recommends something and the customer purchases based on the recommendation, the ground truth is generated immediately. Does that mean we need to assess the performance every minute? Well, that is overkill; you can instead set a minimum interval between retraining runs, such as once a day or once a week.
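A minimum retraining interval can be enforced with a simple time check. This is a minimal sketch: the function name `should_retrain` and the one-day default are assumptions, and a real pipeline would usually also look at how much new labelled data has accumulated.

```python
from datetime import datetime, timedelta


def should_retrain(last_retrained, now, min_interval=timedelta(days=1)):
    """Allow retraining only if the minimum interval has passed."""
    return now - last_retrained >= min_interval


last_run = datetime(2024, 1, 1, 9, 0)
# New ground truth arrives every minute, but retraining waits for the interval.
print(should_retrain(last_run, datetime(2024, 1, 1, 9, 1)))  # one minute later
print(should_retrain(last_run, datetime(2024, 1, 2, 9, 0)))  # one day later
```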

I hope you enjoyed reading this article. Please like, share and comment.

Views are personal.

Image Credit:

Photo by Eren Li: https://www.pexels.com/photo/hispanic-girl-whispering-secret-on-ear-of-friend-7168996/

