7 Common Data Science Mistakes and How to Avoid Them
Gregory Piatetsky-Shapiro
By Khushbu Shah, DeZyre.
“Mistakes are the portals of discovery.” - James Joyce (famous Irish novelist). This holds in most fields, and for data scientists, mistakes can help uncover new trends and patterns in the data. Even so, data scientists have a very small margin for error. They are hired after much deliberation and at a high cost, so organizations cannot afford to overlook bad data practices and repeated mistakes. Mistakes and bad practices in data science can cost a data scientist her or his career. It is vital for data scientists to track all data science experiments, learn from their mistakes, and avoid them in future projects.
A famous line from Sherlock Holmes captures how similar the role of a data scientist in business is to that of a detective:
“My name is Sherlock Holmes. It is my business to know what other people don’t know.”
For a business to stay competitive, it has to do more than just Big Data Analytics. Unless a business assesses the quality of its data, the outcome it wants, and the profit it expects from the analysis, it is difficult to work out which data science projects will be profitable and which will not. A data science mistake is acceptable once, given the learning curve, but if the same mistakes keep recurring, they can cost the business.
Common Data Science Mistakes to Avoid
- Confusion between Correlation and Causation
Mistaking correlation for causation can be costly for any data scientist. The best-known example comes from Freakonomics: confusing correlation with causation led Illinois to send books to every student in the state, because an analysis had revealed that books available at home are directly correlated with high test scores. Further analysis showed that students from homes with several books performed better academically even if they had never read the books. This corrected the earlier assumption: homes where parents usually buy books tend to provide a more stimulating learning environment, and it is that environment, rather than the books themselves, that goes along with better results.
Many data scientists working with big data assume that correlation directly implies causation. Using big data to understand the correlation between two variables is good practice, but always reading it as “cause and effect” can produce false predictions and unproductive decisions. To get the best results from big data, data scientists must understand the difference between correlation and root cause. Correlation means that X and Y tend to be observed together, whereas causality means that X causes Y. These are two completely different things, yet the difference is often ignored. A decision based on correlation alone may be good enough to act on without knowing the cause, but this depends entirely on the kind of data and the problem being solved.
A lesson every data scientist must learn is that “correlation is not causation.” If two items appear to be related, it does not mean that one causes the other.
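The books-at-home example can be illustrated with a small simulation. This is only a sketch on synthetic data, not the original analysis: the variable names (`engagement`, `books`, `scores`) and the noise model are assumptions chosen for illustration. A hidden factor drives both books and scores, books have no direct effect on scores, and yet the two are clearly correlated until the hidden factor is controlled for with a partial correlation.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def partial_corr(xs, ys, zs):
    """Correlation of xs and ys after removing the linear effect of zs."""
    r_xy = pearson(xs, ys)
    r_xz = pearson(xs, zs)
    r_yz = pearson(ys, zs)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Synthetic data (an assumption for illustration, not real survey data).
rng = random.Random(0)
n = 5000
# Hidden common cause, e.g. how engaged the parents are.
engagement = [rng.gauss(0, 1) for _ in range(n)]
# Books at home depend on engagement plus noise.
books = [z + rng.gauss(0, 1) for z in engagement]
# Scores depend on engagement plus noise; books do NOT appear here.
scores = [z + rng.gauss(0, 1) for z in engagement]

raw = pearson(books, scores)
controlled = partial_corr(books, scores, engagement)

print(f"correlation(books, scores)              = {raw:.2f}")
print(f"same, controlling for parent engagement = {controlled:.2f}")
```

In this setup the raw correlation is substantial while the controlled one is close to zero, which is what one would expect if sending books alone changes nothing. In real data the common cause is usually not measured so cleanly, so a check like this is a starting point rather than proof.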
- Not Choosing the Right Visualization Tools
Many data scientists concentrate on learning the technical aspects of analysis and neglect visualization techniques that could help them derive insight much faster. The value of even the best machine learning model is diluted if the data scientist does not choose the right kind of visualization for exploratory data analysis, for monitoring model development, or for presenting results. In fact, many data scientists choose a chart type based on aesthetic taste instead of the characteristics of their dataset. This can be avoided by defining the goal of the visualization as the first step.
Even the best machine learning model will not shout “Eureka” on its own. It takes effective visualization of the results to move from a pattern merely existing in the data to that pattern being recognized and used for business outcomes. As the popular saying goes, “A picture is worth a thousand words.” Data scientists should not only familiarize themselves with data visualization tools but also understand the principles of effective data visualization, so that they can present results in a compelling way.
A crucial step towards solving any data science problem is to gain insight into what the data is about by representing it through rich visuals, which can form the foundation for analysis and modelling.
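One way to put "define the goal first" into practice is to make the goal and the data characteristics explicit inputs to the chart choice. The helper below is a hypothetical rule-of-thumb sketch, not an established API: the function name `suggest_chart`, the goal labels and the mapping itself are assumptions, and real projects will need more cases.

```python
from typing import Optional

# Hypothetical rules of thumb: (goal, kind of x, kind of y) -> chart type.
# The labels and the mapping are illustrative assumptions, not a standard.
CHART_RULES = {
    ("distribution", "numeric", None): "histogram",
    ("distribution", "categorical", None): "bar chart of counts",
    ("relationship", "numeric", "numeric"): "scatter plot",
    ("relationship", "categorical", "numeric"): "box plot",
    ("trend", "datetime", "numeric"): "line chart",
    ("comparison", "categorical", "numeric"): "bar chart",
}


def suggest_chart(goal: str, x_kind: str, y_kind: Optional[str] = None) -> str:
    """Suggest a chart type from the stated goal and the variable kinds."""
    try:
        return CHART_RULES[(goal, x_kind, y_kind)]
    except KeyError:
        raise ValueError(
            f"no rule for goal={goal!r}, x={x_kind!r}, y={y_kind!r}; "
            "decide what the chart should show before choosing one"
        ) from None


print(suggest_chart("relationship", "numeric", "numeric"))
print(suggest_chart("trend", "datetime", "numeric"))
```

The point of the lookup is that taste never enters it: the chart follows from what the reader should learn and from the type of each variable, and an unsupported combination fails loudly instead of defaulting to a favorite chart.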
Read the rest at KDnuggets: 7 Common Data Science Mistakes and How to Avoid Them
https://www.kdnuggets.com/2016/01/7-common-data-science-mistakes.html