Mr. Wolf p-hacked and fooled the team and management | Learn about A/B testing and the p-value
Shaurya Uppal
Data Scientist | MS CS, Georgia Tech | AI, Python, SQL, GenAI | Inventor of Ads Personalization RecSys Patent | Makro | InMobi (Glance) | 1mg | Fi
Reading about the HBO intern who triggered a test email to a large number of users reminded me of an intern who worked with me some time back. Let's call him Mr. Wolf Gupta (identity hidden).
What did Mr. Wolf do wrong?
Mr. Wolf p-hacked an experiment and fooled me, the team, and management.
What was Mr. Wolf working on?
Mr. Wolf was working on a new recommendation engine model. To prove it was an improvement, he A/B tested it against the existing model version already running, with an equal split of users.
After running the experiment for a few days (n = 30 days), Mr. Wolf saw a 12% CTR improvement (on aggregate numbers) for his new model compared to the old version.
Mr. Wolf rejoiced and celebrated!
Perform Hypothesis Testing: Two-Sample T-Test
I asked Mr. Wolf to prove this was not just a random occurrence: perform a two-sample t-test on the per-day CTR distributions of the two model versions.
He found the p-value > 0.05, i.e., he failed to reject the null hypothesis, which means there was no statistically significant difference between the two models.
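A minimal sketch of such a check, assuming we have one CTR value per day for each variant (the distribution parameters and variable names below are made up for illustration):

```python
# Two-sample t-test on per-day CTR for the two model variants.
# ctr_old / ctr_new are hypothetical daily CTR values (one per day, n = 30 days).
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
ctr_old = rng.normal(loc=0.040, scale=0.004, size=30)  # existing model
ctr_new = rng.normal(loc=0.041, scale=0.004, size=30)  # Mr. Wolf's new model

# Welch's t-test: does not assume equal variances between the two variants
t_stat, p_value = stats.ttest_ind(ctr_new, ctr_old, equal_var=False)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")

if p_value > 0.05:
    print("Fail to reject H0: no significant difference between the models")
else:
    print("Reject H0: the models differ significantly")
```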
Mr. Wolf was shocked to see the results and kept them secret.
Being naive, Mr. Wolf still believed his model was better simply because of the observed CTR improvement. He thought there must be some issue with the hypothesis testing technique.
He prayed to God and performed the hypothesis test multiple times on the same distributions. The p-value kept changing but stayed above 0.05; he kept praying, and then...
Finally, he got a p-value < 0.05, which rejects the null hypothesis and (seemingly) implies the models are different.
He rejoiced and celebrated, took screenshots of the test results, and shared them with the team, claiming his new model was the best.
Everyone shouted "Significant! Woohoo!" and celebrated the 12% CTR win on the experiment.
WAIT. But Mr. Wolf had p-hacked the experiment (which I only learned from him after a larger phased release, when we analyzed the metrics).
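To see why re-running the test until it "works" is a problem, here is a rough simulation. The exact way Mr. Wolf re-ran his tests isn't stated, so this sketch simply assumes fresh per-day samples on each retry, with both variants drawn from the same distribution, so any p < 0.05 is a false positive; all numbers are illustrative:

```python
# Simulation of the "keep testing until it's significant" approach.
# Both "models" come from the SAME CTR distribution, so every rejection
# of H0 here is a false positive.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_days, n_retries, n_experiments = 30, 50, 1000
false_positives = 0

for _ in range(n_experiments):
    for _ in range(n_retries):
        a = rng.normal(0.040, 0.004, n_days)  # "old" model
        b = rng.normal(0.040, 0.004, n_days)  # "new" model, identical distribution
        _, p = stats.ttest_ind(a, b, equal_var=False)
        if p < 0.05:  # stop at the first "significant" result, like Mr. Wolf did
            false_positives += 1
            break

print(f"'Significant' result found in {false_positives / n_experiments:.0%} of experiments")
```

With 50 retries at alpha = 0.05, roughly 1 - 0.95^50 ≈ 92% of identical-model experiments will eventually look "significant" purely by chance, which is exactly how repeated testing fools you.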
Never be like Mr. Wolf. If someone had asked him for the power of his test (the probability of correctly rejecting the null hypothesis when it is false), he would have gotten himself into trouble.
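As a hedged illustration of what that check could look like: with only 30 daily observations per variant, the power to detect a small effect is low. The effect size below is an assumed value, not a figure from the experiment:

```python
# Statistical power of a two-sample t-test with 30 daily observations per variant.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
power = analysis.solve_power(effect_size=0.3,   # assumed small effect (Cohen's d)
                             nobs1=30,          # 30 days of CTR per variant
                             alpha=0.05,
                             ratio=1.0,         # equal split between variants
                             alternative='two-sided')
print(f"Power = {power:.2f}")  # roughly 0.2, well below the usual 0.8 target
```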
It is completely fine if your experiment/model fails. Failure is a stepping stone toward the best model version, and a failed experiment still gives you learning in return.
“Do not be embarrassed by your failures, learn from them and start again.” —Richard Branson
Shaurya Uppal (2 years ago): Mr. Wolf is back with another mischief: https://www.dhirubhai.net/posts/shaurya-uppal_datascience-newsletter-dataleakage-activity-6902443991955296256-ZdA0
Shaurya Uppal (3 years ago): Book definition of p-hacking: an experiment is p-hacked when we incorrectly exploit the statistical analysis and falsely conclude that we can reject the null hypothesis.