Not more, get better data!
Focus on the basics.

There is often this tendency to just dump in whatever you have and let the Deep Learning model figure it out. Indeed, this is one of the biggest promises of #deeplearning - minimal feature engineering. But those heavy Deep Learning models come with their own requirements on data, and a common solution has been to amass and throw more data at the model. I believe that such an approach is the lazy way out, and it is only spoiling us and dumbing us down collectively.

Whatever happened to the importance of parsimony? Generations of Data Science professionals have painstakingly learnt to be frugal with data - to do more with less. I myself have built several predictive models with fewer than 100 examples. To do so, I needed to (see the sketch after this list):

  • be very involved in feature engineering,
  • really question the importance of each predictor,
  • carefully evaluate the model for stability and generalizability,
  • be creative in sampling and use statistical methods for estimation.
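
To make that concrete, here is a minimal sketch of what that discipline can look like in code. It is an illustration, not one of my actual projects: scikit-learn's built-in breast-cancer data is subsampled to ~80 rows to stand in for a genuinely small dataset, repeated cross-validation is used to judge stability, and permutation importance is used to question each predictor.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer          # stand-in for a small in-house dataset
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.inspection import permutation_importance

# Pretend we only have ~80 labelled examples.
data = load_breast_cancer()
rng = np.random.default_rng(0)
idx = rng.choice(len(data.target), size=80, replace=False)
X, y = data.data[idx], data.target[idx]

# A simple, well-understood model - frugal in parameters as well as in data.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Repeated CV: with so few rows, the spread of scores matters as much as the mean.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
print(f"AUC: {scores.mean():.3f} +/- {scores.std():.3f}")

# Question each predictor: does it carry real signal, or just noise?
model.fit(X, y)
imp = permutation_importance(model, X, y, n_repeats=30, random_state=0)
for i in np.argsort(imp.importances_mean)[::-1][:5]:
    print(f"{data.feature_names[i]}: {imp.importances_mean[i]:.4f}")
```

The point is not the specific library calls, but the habit: measure variability, and be sceptical of every column you feed in.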

I wasn't alone in working with low volumes of data; it was the case across the board. Many 'older' data science professionals like myself ascribe value to the steps above. But it seems these steps are losing their importance. More and more practitioners I speak to aren't concerned with, or are simply not aware of, these considerations. Machine Learning has seemingly become a largely automated task, reduced to "just getting the best accuracy" from the predictive model.

[Image: XKCD comic]

This popular XKCD comic doesn't seem funny to me anymore. It's just scary.

"More data beats clever algorithms, but better data beats more data."

Coming from Peter Norvig, you can be sure this is wisdom from decades of experiments and experience on the 'front line' of the 'cutting edge'. And it is not just him - several other thought leaders with decades of experience under their belt share the same opinion: good data is far more important than more data.

Not all situations requiring predictive modeling have high volumes of data. Not always do you need the maximum possible accuracy from the model. Not always do you need all possible homeomorphisms of the data to automatically generate powerful features. Not always do you really need that data-hungry algorithm promising the highest accuracy!

Remember that in the industry, you employ Machine Learning to solve a problem. Employing a certain algorithm is rarely the goal in itself. Many a time, you do just as much as you need to solve the problem and move on to the next one waiting for you. Often there are constraints (business constraints, infrastructure constraints, data volume and many more) that severely limit the kind of modeling algorithms you can employ. In such situations, you can't take the easy way out and dump everything into a Deep Learning model.

Focus on the data.

Use your domain understanding to perform proper cleanup.

Derive features that inject your understanding into the process.
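
For instance - a purely hypothetical retail example, with made-up column names - a few lines of pandas can turn raw transaction logs into recency, frequency and spend features that encode how someone who knows the business actually thinks about customers:

```python
import pandas as pd

# Hypothetical raw transaction log (columns invented for illustration).
txns = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2],
    "amount": [120.0, 30.0, 500.0, 20.0, 15.0],
    "timestamp": pd.to_datetime(
        ["2023-01-03", "2023-02-20", "2023-01-10", "2023-02-01", "2023-02-25"]
    ),
})

snapshot = pd.Timestamp("2023-03-01")

# Domain knowledge as features: how often, how much, how recently.
features = txns.groupby("customer_id").agg(
    n_txns=("amount", "size"),
    avg_amount=("amount", "mean"),
    days_since_last=("timestamp", lambda s: (snapshot - s.max()).days),
)
print(features)
```

A handful of honest, explainable columns like these will often take a simple model further than a pile of raw rows dumped in as-is.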

Indeed, a well-crafted dataset with the right features can not only get you great results with classical #machinelearning algorithms, but can also help you get the most out of your state-of-the-art #deeplearning / #artificialintelligence model. It can never harm you.

Don't lose the 'art' in Data Science! You can do much more than rely on a chain of matrix multiplications to figure everything out. Add your invaluable knowledge, your wisdom, your domain expertise to the process and make it truly awesome.

Shailesh kumar Tripathi

Artificial Lift Specialist | AI Engineer | Poet in Building Sukoon.fit, An Early Diagnostic App for Doctors & Patients

3y

Exactly :) Just this evening I was discussing a use case with my colleague, and we were planning to populate missing data to feed into DL models. Later we realised it's better to use the already available correct data to solve the problem instead of increasing volume.

Shruti Roy

Data Scientist @ Amazon

3y

Great post Rahim! This is very true. There was this one problem I was solving in which the data source itself wasn't reliable. No matter what technique I used - even deep learning - the model would be learning from wrong data! In the end, models should be treated as tools and not the solution. It's important to have the right data.

Rohan Singh

Software engineer at Accenture Song

3y

Not correct. Actually, all data speaks. Wading through unstructured data is the science, not the other way around. Google thrives on such data, in fact. Not sure what kind of AI insights, or "who does better insights than Google", we are talking about here. Getting "better" data is exactly analogous to saying getting "unbiased" data. Sorry, not happening in real-world insights. Maybe it works for some fancy overpriced dashboards powered by "limited" data. The XKCD comic strip is not wrong.
