The Kaggle Kernel That Made Me Grow As a Data Scientist
My inspiration: https://www.kaggle.com/code/mark4h/vsb-1st-place-solution/notebook

Four years ago, just after I graduated from college, I spent quite a few weeks competing (and lurking) in Kaggle competitions. Having just earned my degree in Math & CS, I was eager to apply all the theory I had studied, laboring under the youthful delusion that the right combination of esoteric mathematical transformations could extract a signal from any data, regardless of whether I actually took the time to understand that data. Luckily for me, this naive view did not last long, as one day I chanced upon a write-up by a recent competition winner that changed my perspective completely. So much so that I still have their work bookmarked to this day :)


This competition’s goal was to predict whether a power line was damaged based on three-phase electrical signal data. I was only a lurker during this particular competition, but looking back at what others had written and developed, I can see that many of them had fallen victim to the same mindset I had. People were using all of the most cutting-edge stuff, I thought at the time. Fourier transforms, LSTMs, k-fold CV, CNNs! So cool! Yet while I saw a lot of mathy jargon, I saw few people posting deep analyses trying to understand the patterns in the data. Inevitably, after submissions closed, the hidden test set was revealed and the leaderboard rearranged in a classic Kaggle shakeup. The winner was announced and, after working quietly throughout the competition, finally published their solution.


My young self was astounded when I saw that the winning model was nothing more than a gradient-boosted decision tree with only 9 features, a level of complexity so low that I couldn’t believe it had beaten the others by a long shot. Despite my skepticism, I began to read and quickly came upon this remark by the author describing their approach to the competition: “I spent most of the time trying to understand the data”. These words from the winner began to shift the gears in my head, but everything really clicked once I saw the diagram they had created to demonstrate their most important feature.


It turned out the secret sauce of this model was a clever feature that measured how similar the signal was to a sawtooth shape, a pattern this competitor had seen over and over right before a fault event while manually inspecting the data (the dotted line in the article image). While extremely creative and resourceful, it was not fancy. Yet this feature (along with a few other carefully selected ones), fed into a simple model, beat out all of the convoluted theoretical contraptions I had so enjoyed reading about. After digesting all of this, I realized I had vicariously learned a crucial lesson in data science.
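
For concreteness, here is a minimal sketch of what a template-matching feature like that, fed into a small gradient-boosted tree, could look like. This is my own illustration, not the winner’s actual code: the sawtooth period, the helper names, and the extra summary features are all hypothetical, and the real solution extracted its features from the raw phase signals far more carefully.

```python
import numpy as np
from scipy.signal import sawtooth
from sklearn.ensemble import GradientBoostingClassifier

def sawtooth_similarity(signal: np.ndarray, period: int) -> float:
    """Peak cross-correlation between a z-normalised signal and one period
    of a sawtooth template -- a rough "how sawtooth-like is this?" score."""
    template = sawtooth(2 * np.pi * np.arange(period) / period)  # ramp from -1 to 1
    s = (signal - signal.mean()) / (signal.std() + 1e-9)
    w = (template - template.mean()) / (template.std() + 1e-9)
    return float(np.correlate(s, w, mode="valid").max() / period)

def make_features(signal: np.ndarray, period: int = 500) -> list:
    """A handful of simple, hand-picked features per signal (hypothetical)."""
    return [
        sawtooth_similarity(signal, period),
        signal.std(),
        np.abs(np.diff(signal)).mean(),  # average step-to-step change
    ]

# signals: list of 1-D numpy arrays; labels: 0/1 fault indicators (not shown here)
# X = np.array([make_features(s) for s in signals])
# model = GradientBoostingClassifier().fit(X, labels)
```

The point of the sketch is not the specific numbers but the shape of the workflow: stare at the data until a pattern emerges, encode it as a small number of features, and let a simple model do the rest.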


You hear it all the time: start with a simple model and build intuition about the problem. But at the same time, we have neural nets and Spark clusters with 50 machines and phantom metrics to chase after. The devil on your shoulder tempts you to choose the path of complexity. To avoid doing the “boring” work of visualizing and interpreting the seemingly meaningless swaths of data. To neglect truly understanding the pattern, under the premise that enough computational firepower and fancy math can brute-force the task faster and more thoroughly than you can. But when we do this, we forget that the most elegant solution is the simplest one. And the simplest solution is the one that is hard-earned.
