7 Common Data Science Mistakes and How to Avoid Them

By Khushbu Shah, DeZyre.

“Mistakes are the portals of discovery.” - James Joyce (famous Irish novelist). This holds in most fields, and for data scientists, mistakes can help uncover new trends and patterns in the data. That said, data scientists have a very small margin for error. They are hired after a lot of deliberation and at a high cost, so organizations cannot afford to overlook bad data practices and repeated mistakes. Mistakes and bad practices in data science can cost a data scientist her or his career. It is vital for data scientists to track all data science experiments, learn from their mistakes, and avoid them in future projects.

A famous quote from Sherlock Holmes captures how closely the role of a data scientist in business resembles that of a detective:

“My name is Sherlock Holmes. It is my business to know what other people don’t know.”

For a business to stay competitive, it has to do more than just Big Data analytics. Unless it assesses the quality of the data it has, the outcome it wants, and the profit it expects from the analysis, it is difficult to figure out which data science projects will be profitable and which will not. A data science mistake is acceptable once, given that there is a learning curve, but if the same mistakes keep recurring, they can cost the business.

Common Data Science Mistakes to Avoid

  1. Confusion between Correlation and Causation

Mistaking correlation for causation can be costly for any data scientist. The best-known example comes from Freakonomics: an analysis showed that having books at home is correlated with higher test scores, and treating that correlation as causation led Illinois to propose sending books to every student in the state. Further analysis showed that students from homes with many books performed better academically even if they had never read them. This corrected the earlier assumption: homes where parents routinely buy books tend to provide a more stimulating learning environment, and it is that environment, not the books themselves, that goes with the higher scores.

Many data scientists working with big data assume that correlation directly implies causation. Using big data to understand the correlation between two variables is often good practice, but always reading it as cause and effect can produce false predictions and unproductive decisions. To get the best results from big data, data scientists must understand the difference between correlation and root cause. Correlation means X and Y tend to be observed together, whereas causality means X causes Y. These are two completely different things, yet the difference is often ignored. A decision based on correlation alone can be good enough to act on without knowing the cause, but that depends entirely on the kind of data and the problem being solved.

A lesson every data scientist must learn is that correlation is not causation. If two items appear to be related, it does not mean that one causes the other.
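The books example can be sketched as a small simulation. This is only an illustration under stated assumptions, not the analysis from Freakonomics: the hidden "engagement" factor, the noise levels, and the sample size are all hypothetical. Books and scores both depend on engagement and never on each other, yet they come out clearly correlated. When books are handed out at random instead, the correlation with scores disappears.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(0)
n_homes = 5000  # hypothetical sample size

# Hypothetical hidden factor: how much each household values learning.
engagement = [rng.gauss(0, 1) for _ in range(n_homes)]

# Books and scores each depend on engagement plus independent noise.
# Neither variable appears in the other's formula, so there is no causal link.
books = [e + rng.gauss(0, 1) for e in engagement]
scores = [e + rng.gauss(0, 1) for e in engagement]

# Observational data: a clear positive correlation (0.5 in theory for this setup).
r_observed = pearson(books, scores)

# Intervention: books are assigned at random, independent of engagement.
mailed_books = [rng.gauss(0, 1) for _ in range(n_homes)]
r_intervention = pearson(mailed_books, scores)

print(f"books vs scores (observed):      {r_observed:.2f}")
print(f"books vs scores (randomly sent): {r_intervention:.2f}")
```

Randomizing the supposed cause, as in the second measurement, is the idea behind a controlled experiment: a correlation that survives randomization is much stronger evidence of a causal effect than one that only shows up in observational data.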

  2. Not Choosing the Right Visualization Tools

Most data scientists concentrate on learning the technical aspects of analysis. They neglect to explore the data with different visualization techniques, which could help them derive insight much faster. The value of even the best machine learning model is diluted if a data scientist does not choose the right kind of visualization to support model development, to carry out exploratory data analysis, or to present the results. In fact, many data scientists choose a chart type based on aesthetic taste rather than the characteristics of their dataset. This can be avoided by defining the goal of the visualization as the first step.

Even if a data scientist develops the best possible machine learning model, it will not cry out “Eureka”. What is needed is effective visualization of the results, which bridges the gap between a pattern existing in the data and that pattern being recognized and used for business outcomes. As the popular saying goes, “A picture is worth a thousand words.” Data scientists should not only familiarize themselves with data visualization tools but also understand the principles of effective data visualization, so that they can present results in a compelling way.

A crucial step in solving any data science problem is to gain insight into what the data is about by representing it through rich visuals, which can form the foundation for analysis and modelling.
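One way to see why looking at the data matters is Anscombe's quartet, a classic set of four small datasets that is not part of the original article and is brought in here purely as an illustration. All four share nearly the same mean and correlation, so the summary numbers alone cannot tell them apart. The `ascii_scatter` helper below is a hypothetical stand-in for a real plotting library, used only to keep the sketch dependency-free.

```python
def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def ascii_scatter(xs, ys, width=30, height=8):
    """Render points as '*' on a character grid, with the origin at bottom-left."""
    x_lo, x_hi = min(xs), max(xs)
    y_lo, y_hi = min(ys), max(ys)
    grid = [[" "] * width for _ in range(height)]
    for x, y in zip(xs, ys):
        col = round((x - x_lo) / (x_hi - x_lo) * (width - 1))
        row = round((y - y_lo) / (y_hi - y_lo) * (height - 1))
        grid[height - 1 - row][col] = "*"
    return "\n".join("".join(line) for line in grid)


# Anscombe's quartet (Anscombe, 1973). Datasets I to III share the same x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

# Summary statistics per dataset: (mean of y, correlation of x and y).
summary = {
    name: (sum(ys) / len(ys), pearson(xs, ys))
    for name, (xs, ys) in quartet.items()
}

for name, (xs, ys) in quartet.items():
    mean_y, r = summary[name]
    print(f"Dataset {name}: mean(y) = {mean_y:.2f}, r = {r:.3f}")
    print(ascii_scatter(xs, ys))
    print()
```

The printed statistics are almost identical across the four datasets, while the plots show a noisy line, a curve, a line with one outlier, and a vertical cluster with a single distant point. In practice a plotting library would replace the text rendering, but the goal of the visualization is the same: check the shape of the data before trusting a summary number.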

Read the rest at KDnuggets: 7 Common Data Science Mistakes and How to Avoid Them

https://www.kdnuggets.com/2016/01/7-common-data-science-mistakes.html

Martina Pugliese, PhD

Data Science | ML | AI | Data vizzing | Open Data | Mentoring

9 years ago

This will sound pretentious, but someone who makes mistake num. 1 is not a data scientist to me.

Dr. Mark A. Biernbaum

Developmental Psychologist; Researcher; Methodologist, Statistical Analyst, Teacher; Therapist; Painter; Journalist

9 years ago

I didn't know that Data Scientists made mistakes! This is shaking the ground beneath my feet!

Hena Jose

Genomics | Healthtech | Lifescience | Information Technology

9 years ago

Good points

pradeep sharma

junior data scientist

9 years ago

Gregory nice to hear our mistakes
