Problems with data

Data is central to almost every kind of analytical problem. Your model is only as good as the data it is built upon! Collected data can demand a lot of work: cleaning, munging, outlier handling, feature engineering and exploratory analysis. Even after that, the same data can be viewed and analyzed from different perspectives. Among the various data processing problems, the two I have faced most often are missing values and correlation mistaken for causation.

Missing information/missing values - I once heard a story about a study of pedestrian traffic patterns in Amsterdam. It found that tourists passing through one particular spot were being delayed for no apparent reason. As it turned out, a dustbin stood there, and people were walking up to the spot to throw away garbage. The map apparently did not record the dustbin. Imagine how missing information such as this could lead to wrong conclusions.

Correlation without causation - Murder rates are reportedly higher in summer than in winter. Summer heat also means more ice cream consumption. So someone could well find a high correlation between murder rate and ice cream consumption. It would obviously be wrong to conclude that ice cream causes murders or, conversely, that murders cause ice cream consumption. Both simply follow the same seasonal pattern. Correlation may hint at causation, but it does not prove it.
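
To make the ice cream example concrete, here is a minimal sketch in Python using only the standard library. All numbers are synthetic and made up for illustration: the temperature curve, the coefficients 2.0 and 0.5, the noise levels, and the variable names are assumptions, not real crime or sales data. Both series are generated from temperature alone, so any correlation between them comes purely from the shared seasonal driver. Correlating the residuals left after regressing each series on temperature (a simple partial correlation) shows what remains once the season is accounted for.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def residuals(ys, xs):
    """What is left of ys after removing a least-squares straight-line fit on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return [y - (my + slope * (x - mx)) for x, y in zip(xs, ys)]


rng = random.Random(0)  # fixed seed so the run is repeatable
n_days = 2000

# Synthetic daily temperature with a yearly cycle (assumed shape, not real data).
temperature = [15 + 10 * math.sin(2 * math.pi * d / 365) for d in range(n_days)]

# Both series depend on temperature only; neither depends on the other.
ice_cream = [2.0 * t + rng.gauss(0, 3) for t in temperature]
murders = [0.5 * t + rng.gauss(0, 1) for t in temperature]

raw_corr = pearson(ice_cream, murders)
partial_corr = pearson(residuals(ice_cream, temperature),
                       residuals(murders, temperature))

print(f"raw correlation:              {raw_corr:.3f}")
print(f"controlling for temperature:  {partial_corr:.3f}")
```

In this setup the raw correlation comes out high, while the correlation after controlling for temperature falls close to zero. In a real project the confounder is rarely handed to you like this, so the sketch only shows the mechanism, not a way to detect it.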

You may have run into other problems more often in the projects you have worked on. Feel free to share your views!

Rahul Jha

∞ | Data Science Consultant with expertise in AI/ML and Analytics

5 years ago

Thanks for sharing Madhur. It was really informative.


Rightly said Madhur. Also we must understand the assumptions underlying the data and the purpose for which we are analysing the data.

Asesh Datta

Training / Counselor / Industrial Engineering / Software Developer / Life Planner and General Insurance Proposer

8 years ago

Madhur Modi, very nicely titled: "Problems with data". Both your examples are quite interesting. What is missing in data analytics is our ability to come up with observations which are totally new and could not have been reasoned out, like the "murder and ice cream" link revealed by data analytics. With present technology, this kind of analytics will reveal interesting inferences. The next step is our ability to infer and take appropriate corrective steps. Let the analytics lead us to possible corrective steps with a % confidence level. Otherwise we only research the problems with excessive data. Data is data as long as it is authentic and secured. The rest lies in our ability to infer from that data. Thanks for the post, and regards.

Pratyush Choudhury

Principal @ Together Fund | AI Investor | x-AWS | DMs: superdm.me/177pc

8 years ago

Wonderful insights!

Amrita Thakur

Senior Manager- Efficio Consulting | Ex Accenture Strategy, EY| IIM Kozhikode

8 years ago

Hey Madhur, those were pretty good examples.
