Not more, get better data!
Focus on the basics.

There is often this tendency to just dump in whatever you have and let the Deep Learning model figure it out. Indeed, this is one of the biggest promises of #deeplearning - minimal feature engineering. But those heavy Deep Learning models come with their own requirements on data, and a common solution has been to amass and throw more data at the model. I believe that such an approach is the lazy way out, and it is only spoiling us and dumbing us down collectively.

Whatever happened to the importance of parsimony? Generations of Data Science professionals have painstakingly learnt to be frugal with data - to do more with less. I myself have built several predictive models with fewer than 100 examples. To do so, I needed to (see the sketch after this list):

  • be very involved in feature engineering,
  • really question the importance of each predictor,
  • carefully evaluate the model for stability and generalizability,
  • be creative in sampling and use statistical methods for estimation.
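
To make that concrete, here is a minimal sketch of what that discipline can look like in code. It is an illustration, not one of my actual projects: scikit-learn's built-in breast-cancer data is subsampled to ~80 rows to stand in for a genuinely small dataset, repeated cross-validation is used to judge stability, and permutation importance is used to question each predictor.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer          # stand-in for a small in-house dataset
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.inspection import permutation_importance

# Pretend we only have ~80 labelled examples.
data = load_breast_cancer()
rng = np.random.default_rng(0)
idx = rng.choice(len(data.target), size=80, replace=False)
X, y = data.data[idx], data.target[idx]

# A simple, well-understood model - frugal in parameters as well as in data.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Repeated CV: with so few rows, the spread of scores matters as much as the mean.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
print(f"AUC: {scores.mean():.3f} +/- {scores.std():.3f}")

# Question each predictor: does it carry real signal, or just noise?
model.fit(X, y)
imp = permutation_importance(model, X, y, n_repeats=30, random_state=0)
for i in np.argsort(imp.importances_mean)[::-1][:5]:
    print(f"{data.feature_names[i]}: {imp.importances_mean[i]:.4f}")
```

The point is not the specific library calls, but the habit: measure variability, and be sceptical of every column you feed in.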

I wasn't alone in working with low volumes of data; it was the case across the board. Many 'older' data science professionals like myself ascribe value to the steps above. But it seems these steps are losing their importance. More and more practitioners I speak to aren't concerned with, or are simply not aware of, these considerations. Machine Learning has seemingly become a largely automated task, reduced to "just getting the best accuracy" from the predictive model.

[Image: XKCD comic]

This popular XKCD comic doesn't seem funny to me anymore. It's just scary.

"More data beats clever algorithms, but better data beats more data."

Coming from Peter Norvig, you can be sure this is wisdom from decades of experiments and experience on the 'front line' of the 'cutting edge'. And it is not just him - several other thought leaders with decades of experience under their belt share the same opinion: good data is far more important than more data.

Not all situations requiring predictive modeling have high volumes of data. Not always do you need the maximum possible accuracy from the model. Not always do you need all possible homeomorphisms of the data to automatically generate powerful features. Not always do you really need that data-hungry algorithm promising the highest accuracy!

Remember that in the industry, you employ Machine Learning to solve a problem. Employing a certain algorithm is rarely the goal in itself. Many a time, you do just as much as you need to solve the problem and move on to the next one waiting for you. Often there are constraints (business constraints, infrastructure constraints, data volume and many more) that severely limit the kind of modeling algorithms you can employ. In such situations, you can't take the easy way out and dump everything into a Deep Learning model.

Focus on the data.

Use your domain understanding to perform proper cleanup.

Derive features that inject your understanding into the process.
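
For instance - a purely hypothetical retail example, with made-up column names - a few lines of pandas can turn raw transaction logs into recency, frequency and spend features that encode how someone who knows the business actually thinks about customers:

```python
import pandas as pd

# Hypothetical raw transaction log (columns invented for illustration).
txns = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2],
    "amount": [120.0, 30.0, 500.0, 20.0, 15.0],
    "timestamp": pd.to_datetime(
        ["2023-01-03", "2023-02-20", "2023-01-10", "2023-02-01", "2023-02-25"]
    ),
})

snapshot = pd.Timestamp("2023-03-01")

# Domain knowledge as features: how often, how much, how recently.
features = txns.groupby("customer_id").agg(
    n_txns=("amount", "size"),
    avg_amount=("amount", "mean"),
    days_since_last=("timestamp", lambda s: (snapshot - s.max()).days),
)
print(features)
```

A handful of honest, explainable columns like these will often take a simple model further than a pile of raw rows dumped in as-is.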

Indeed, a well-crafted dataset with the right features can not only get you great results with classical #machinelearning algorithms, but can also help you get the most out of your state-of-the-art #deeplearning / #artificialintelligence model. It can never harm you.

Don't lose the 'art' in Data Science! You can do much more than rely on a chain of matrix multiplications to figure everything out. Add your invaluable knowledge, your wisdom, your domain expertise to the process and make it truly awesome.

Shailesh kumar Tripathi

Artificial Lift Specialist | AI Engineer | Poet in Building Sukoon.fit, An Early Diagnostic App for Doctors & Patients

3y

Exactly :) Just this evening I was discussing a use case with my colleague, and we were planning to populate missing data to feed into DL models. Later we realised it's better to use the already available correct data to solve the problem instead of increasing volume.

Shruti Roy

Data Scientist @ Amazon

3y

Great post Rahim! This is very true. There was this one problem I was solving in which the data source itself wasn't reliable. No matter what technique I used - even deep learning - the model would be learning from wrong data! In the end, models should be treated as tools and not the solution. It's important to have the right data.

Rohan Singh

Software engineer at Accenture Song

3y

Not correct. Actually, all data speaks. Wading through unstructured data is the science, not the other way around. Google thrives on such data, in fact. Not sure what kind of AI insights, or "who does better insights than Google", we are talking about here. Getting "better" data is exactly analogous to saying getting "unbiased" data. Sorry, not happening in real-world insights. Maybe it works for some fancy overpriced dashboards powered by "limited" data. The XKCD comic strip is not wrong.
