The Kaggle Kernel That Made Me Grow As a Data Scientist
My inspiration: https://www.kaggle.com/code/mark4h/vsb-1st-place-solution/notebook

Four years ago, just after I graduated from college, I spent quite a few weeks competing (and lurking) in Kaggle competitions. Having just earned my degree in Math & CS, I was eager to apply all the theory I had studied, laboring under the youthful delusion that the right combination of esoteric mathematical transformations could extract a signal from any data, regardless of whether I actually took the time to understand that data. Luckily for me, this naive view did not last long, as one day I chanced upon a write-up by a recent competition winner that changed my perspective completely. So much so that I still have their work bookmarked to this day :)


This competition’s goal was to predict whether a power line was damaged based on three-phase electrical signal data. I was only a lurker during this particular competition, but looking back at what others had written and developed, I can see that many of them had fallen victim to the same mindset I had. People were using all of the most cutting-edge stuff, I thought at the time. Fourier transforms, LSTMs, k-fold CV, CNNs! So cool! Yet while I saw a lot of mathy jargon, I saw few people posting deep analyses trying to understand the patterns in the data. Inevitably, after submissions closed, the hidden test set was revealed and the leaderboard rearranged in a classic Kaggle shakeup. The winner was announced and, after working quietly throughout the competition, finally published their solution.


My young self was astounded when I saw that the winning model was nothing more than a gradient-boosted decision tree with only 9 features, a level of complexity so low that I couldn’t believe it had beaten the others by a long shot. Despite my skepticism, I began to read and quickly came upon this remark by the author describing their approach to the competition: “I spent most of the time trying to understand the data”. These words from the winner began to shift the gears in my head, but everything really clicked once I saw the diagram they had created to demonstrate their most important feature.


It turned out the secret sauce of this model was a clever feature that measured how similar the signal was to a sawtooth shape, a pattern this competitor had seen over and over right before a fault event while manually inspecting the data (the dotted line in the article image). While extremely creative and resourceful, it was not fancy. Yet this feature (along with a few other carefully selected ones), fed into a simple model, beat out all of the convoluted theoretical contraptions I had so enjoyed reading about. After digesting all of this, I realized I had vicariously learned a crucial lesson in data science.
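
For concreteness, here is a minimal sketch of what a template-matching feature like that, fed into a small gradient-boosted tree, could look like. This is my own illustration, not the winner’s actual code: the sawtooth period, the helper names, and the extra summary features are all hypothetical, and the real solution extracted its features from the raw phase signals far more carefully.

```python
import numpy as np
from scipy.signal import sawtooth
from sklearn.ensemble import GradientBoostingClassifier

def sawtooth_similarity(signal: np.ndarray, period: int) -> float:
    """Peak cross-correlation between a z-normalised signal and one period
    of a sawtooth template -- a rough "how sawtooth-like is this?" score."""
    template = sawtooth(2 * np.pi * np.arange(period) / period)  # ramp from -1 to 1
    s = (signal - signal.mean()) / (signal.std() + 1e-9)
    w = (template - template.mean()) / (template.std() + 1e-9)
    return float(np.correlate(s, w, mode="valid").max() / period)

def make_features(signal: np.ndarray, period: int = 500) -> list:
    """A handful of simple, hand-picked features per signal (hypothetical)."""
    return [
        sawtooth_similarity(signal, period),
        signal.std(),
        np.abs(np.diff(signal)).mean(),  # average step-to-step change
    ]

# signals: list of 1-D numpy arrays; labels: 0/1 fault indicators (not shown here)
# X = np.array([make_features(s) for s in signals])
# model = GradientBoostingClassifier().fit(X, labels)
```

The point of the sketch is not the specific numbers but the shape of the workflow: stare at the data until a pattern emerges, encode it as a small number of features, and let a simple model do the rest.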


You hear it all the time: start with a simple model and build intuition about the problem. But at the same time, we have neural nets and Spark clusters with 50 machines and phantom metrics to chase after. The devil on your shoulder tempts you to choose the path of complexity. To avoid doing the “boring” work of visualizing and interpreting the seemingly meaningless swaths of data. To neglect truly understanding the pattern, under the premise that enough computational firepower and fancy math can brute-force the task faster and more thoroughly than you can. But when we do this, we forget that the most elegant solution is the simplest one. And the simplest solution is the one that is hard-earned.
