Mr. Wolf p-hacked and fooled the team and management | Learn about AB-Testing and p-value

After reading about the HBO intern who triggered a test email to a lot of users, I was reminded of an intern who worked with me some time back. Let's call him Mr. Wolf Gupta (identity hidden).

What did Mr. Wolf do wrong?

Mr. Wolf p-hacked an experiment and fooled me, the team, and the management.

What was Mr. Wolf working on?

Mr. Wolf was working on a data science model for a recommendation engine. To prove it was an improvement, he A/B tested it against the existing running model version, with users split equally between the two.

After running the experiment for 30 days, Mr. Wolf saw a 12% CTR improvement (on aggregate numbers) for his new model compared with the old version.

Mr. Wolf rejoiced and celebrated!!!

Perform Hypothesis Testing: Two-Sided T-Test

I asked Mr. Wolf to prove that the improvement was not a random occurrence by performing a two-sided t-test on the per-day CTR distributions of both model versions.

He found a p-value > 0.05, i.e. he failed to reject the null hypothesis. That does not prove both models are the same; it means the data did not give enough evidence that they perform differently.
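To make the test concrete, here is a minimal sketch of a two-sided, two-sample t-test on per-day CTR values, written with only the Python standard library. It uses Welch's variant (unequal variances), which is my assumption; the post does not say which variant Mr. Wolf ran. The function names and the daily CTR numbers are hypothetical, purely for illustration.

```python
import math
from statistics import mean, variance


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_coef = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_coef - (df + 1) / 2 * math.log1p(x * x / df))


def two_sided_p(t_stat, df, steps=10_000):
    """Two-sided p-value, 1 - P(-|t| < T < |t|), using Simpson's rule on [0, |t|]."""
    upper = abs(t_stat)
    if upper == 0:
        return 1.0
    h = upper / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_mass = 2 * total * h / 3  # probability mass between -|t| and |t|
    return max(0.0, 1.0 - central_mass)


def welch_t_test(a, b):
    """Welch's two-sample t-test. Returns (t statistic, degrees of freedom, p-value)."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t_stat = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t_stat, df, two_sided_p(t_stat, df)


# Hypothetical per-day CTRs (one value per day, per model version).
old_model_ctr = [0.050, 0.047, 0.055, 0.049, 0.052, 0.046, 0.054]
new_model_ctr = [0.056, 0.049, 0.058, 0.051, 0.060, 0.048, 0.057]

t_stat, df, p_value = welch_t_test(new_model_ctr, old_model_ctr)
print(f"t = {t_stat:.3f}, df = {df:.1f}, p = {p_value:.3f}")
```

The point of testing per-day values instead of the aggregate is that the day-to-day spread decides whether a lift is distinguishable from noise. The test should be run once, on a plan fixed before looking at the data.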

Mr. Wolf was shocked to see the results and kept them a secret.

Being naive, Mr. Wolf still believed his model was better just from observing the CTR improvement. He thought there was some issue with the hypothesis testing technique.

He prayed to God and performed hypothesis testing multiple times on the same distributions. The p-value kept changing but stayed > 0.05. He kept on praying, and then...

Finally, he found a p-value < 0.05, which means the null hypothesis is rejected, implying the models are different.

He rejoiced and celebrated, took screenshots of the test results, and told the team that his new model was the best.
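Why is "keep testing until p < 0.05" a problem? A rough way to see it: if there is truly no difference, each test still has a 5% chance of crossing the threshold by luck, so the chance of at least one lucky hit grows with every retry. The sketch below assumes the repeated tests are independent, which is a simplification (re-runs on overlapping data are correlated), and the function names are hypothetical.

```python
def chance_of_false_positive(num_tests, alpha=0.05):
    """P(at least one p < alpha) across num_tests independent tests when the null is true."""
    return 1 - (1 - alpha) ** num_tests


def bonferroni_alpha(num_tests, alpha=0.05):
    """Per-test threshold under the Bonferroni correction."""
    return alpha / num_tests


for k in (1, 5, 10, 20):
    print(f"{k:>2} tests: false-positive chance {chance_of_false_positive(k):.3f}, "
          f"Bonferroni threshold {bonferroni_alpha(k):.4f}")
```

Under this assumption, reporting only the one run that came out significant says very little about the model.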

Everyone shouted "Significant, woohoo!" and celebrated the 12% CTR win on the experiment.

WAIT, but Mr. Wolf p-hacked the experiment (which I learned from him after a larger phase release and an analysis of the metrics).

Never be like Mr. Wolf. If someone had asked him for the power of the test (the probability of correctly rejecting the null hypothesis when it is false), he would have been in trouble.
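For anyone who does get asked that question, here is a rough sketch of a power calculation using a normal approximation for a two-sided, two-sample comparison with equal group sizes and equal standard deviations. All of the numbers (baseline CTR, day-to-day standard deviation) and function names are hypothetical; they are not from Mr. Wolf's experiment.

```python
import math
from statistics import NormalDist


def approx_power(effect, sd, n_per_group, alpha=0.05):
    """Approximate power of a two-sided two-sample test (normal approximation)."""
    z = NormalDist()
    se = sd * math.sqrt(2 / n_per_group)  # standard error of the difference in means
    z_crit = z.inv_cdf(1 - alpha / 2)
    shift = abs(effect) / se
    # Probability of landing in either rejection region when the true effect is `effect`.
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


def samples_needed(effect, sd, power=0.8, alpha=0.05):
    """Observations per group needed to reach the target power (same approximation)."""
    z = NormalDist()
    n = 2 * ((z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)) * sd / effect) ** 2
    return math.ceil(n)


# Hypothetical setup: baseline daily CTR of 5%, a 12% relative lift,
# and a day-to-day standard deviation of 1.5 percentage points.
baseline_ctr = 0.05
effect = baseline_ctr * 0.12
daily_sd = 0.015

print("power with 30 days per model:", round(approx_power(effect, daily_sd, 30), 3))
print("days per model for 80% power:", samples_needed(effect, daily_sd))
```

Here each "sample" is one day's CTR, matching the per-day test in the story. A low power value means a real improvement could easily fail to reach significance, which is a reason to run longer or redesign the test, not to re-run it until it passes.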

It is completely fine if your experiment or model fails. Failure is a stepping stone towards the best model version, and a failed experiment also gives learning in return.

“Do not be embarrassed by your failures, learn from them and start again.” —Richard Branson
Shaurya Uppal

Data Scientist | MS CS, Georgia Tech | AI, Python, SQL, GenAI | Inventor of Ads Personalization RecSys Patent | Makro | InMobi (Glance) | 1mg | Fi

Book definition of p-hacking: we say an experiment is p-hacked when we incorrectly exploit the statistical analysis and falsely conclude that we can reject the null hypothesis.
