Let’s not fall in love with our tools
Eduardo Barbaro
Head of Security Analytics ING | Visiting Researcher TUDelft | OpenGroup Thought Leader Data Scientist
In an age where computers are getting more powerful and more “super”, it is easy to praise how fast we can dedupe, manipulate, transform, or access our (always!) “big” data sets. What strikes me most is how many data-science discussions remain annoyingly superficial, revolving around “how many tools I know” that can, in fact, all do the same thing. These talks often end in a tedious showing-off competition: “actually, if you use A and B instead, you can do that much faster”.

Don’t get me wrong; most of the tools out there are significant and fundamentally important. How could we do our jobs without mastering Python or R, and how could our databases serve us if we didn’t know SQL or Mongo? But in an age where Machine Learning is the new black, it is easy to fall in love with Clouds and Clusters, Hadoops and Sparks, Hives and Pigs (I am sure you get the idea by now). They are fancy, there are plenty of trendy discussions around them, and you get to speak encrypted “data-scientist language” (the more unintelligible, the better). Most importantly, though: you get to skip the hard work and keep swimming at the shore of tooling.

Yes, most of us (data scientists) forget that these are just tools! I find it remarkable that I cannot remember many discussions about integration methods, differential equations, or highly skewed distributions. People tend to forget (ignore?) that behind the curtains of that fancy library there is a lot of Math and Stats we must understand, or at least spend much more time thinking about. I know, you may be thinking: “Mathematics and Statistics are also just tools”. True, they are there to serve us, but at a much more fundamental level. And by the way, I am sure Math and Stats will still be around in a year or two, or until the next ultimate tool pops up.
Let’s spend (much) more time thinking about what we can do with our data: how we can generate insights from our datasets, or help our clients answer their important questions. Let’s use our time to design more accurate numerical experiments, and to think of sharper research questions and hypotheses.
And finally, let’s focus on learning the abstraction instead of praising some tool.
I am sorry, but I have to go now. I just found out that I can read my big data in R much faster as a data.table than as a data.frame. Can’t wait to try...
Project Manager Data Analytics EMEA at Altair Engineering
8y Very true, Eduardo. The best tool is always... whatever works! The answer lies in the data, not in the tools. And we know that, with the same data, many tools will yield similar results. If you spend your time arguing about tools, you are doing more Science than Data, not focusing on solving real problems.
Product Development & Solution Consultant
8y Very true! Understanding our data better and finding out what we can do with it matters more than the systems or tools themselves.