Seven deadly sins in the world of Data Science
Ashish Khandelwal
Head of Artificial Intelligence @ Yes Bank | Artificial Intelligence, Machine Learning and Automation
“It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness…” Charles Dickens
These words encapsulate the duality that exists in today’s Data Science landscape. While it brims with limitless possibilities and romanticism, it also presents its fair share of pitfalls and challenges.
The resounding success of generative models has captivated the imaginative minds of aspiring data scientists. In this article, I have compiled a list of common mistakes often observed among budding data scientists. The hope is that it will help them avoid such mistakes on their path to becoming professional data scientists.
1. Skipping underlying mathematics/theory
I often get asked about the shortest path to becoming a data scientist. The enthusiasm behind the question always excites me. However, the word ‘shortest’ almost never fails to make me squirm.
The most deceptive aspect here is that almost anyone can build a quick basic model for a given problem by doing a simple Google search. While there is nothing wrong with that, the problems arise when one is tasked with improving the accuracy of the model. Enhancing the model may involve engineering new features, creating an ensemble of models using data in various formats (images, text, etc.), building a custom loss function, and more. To effectively use any of these techniques, one needs a strong understanding of the fundamental concepts. Knowledge of underlying math greatly helps in solidifying these concepts.
Having said that, most of the math needed to understand data science concepts is not very difficult, and plenty of freely available videos today will give you an intuitive understanding of the mathematical theory.
Bottom line: entry-level data scientists may get away with not knowing the math, but if you want to make a mark in the field of data science, make math your good friend.
2. Using confidence score as the measure of accuracy
I have heard people, intending to guarantee the accuracy of predictions, say something like, "Let us use model predictions only if they come with a high confidence score, e.g., >90%." Can there be anything wrong with it? It sounds so logical! Well, this approach might work in an ideal case where clear decision boundaries exist between the classes. In practically every real scenario, though, some false positives will still be predicted with a high degree of confidence.
This confusion probably stems from a similar but subtly different risk-mitigation strategy, where data scientists adjust the confidence threshold to manage the precision-recall trade-off. Even that works well only in specific cases, where either the cost of false positives or the cost of false negatives is considerably higher than the other.
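To make this concrete, here is a minimal sketch on synthetic data (generated with scikit-learn's make_classification purely for illustration, not from any real project) showing that even predictions made with more than 90% confidence still contain misclassifications once the classes overlap:

```python
# Minimal sketch: high confidence does not guarantee correctness on overlapping classes.
# The data and model below are illustrative assumptions, not from any real system.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=20_000, n_features=10, n_informative=5,
                           class_sep=0.8, flip_y=0.03, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba = clf.predict_proba(X_test)
pred = proba.argmax(axis=1)           # predicted class
conf = proba.max(axis=1)              # model's confidence in that class

high_conf = conf > 0.90               # keep only the "confident" predictions...
wrong = (pred != y_test) & high_conf  # ...and count how many of them are still wrong
print(f"High-confidence predictions: {high_conf.sum()}")
print(f"Wrong despite >90% confidence: {wrong.sum()} ({wrong.sum() / high_conf.sum():.1%})")
```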
I have seen people confusing this concept at different levels and in different organizations. It is neither a rare nor a locally isolated phenomenon.
Bottom line: Higher confidence does not always mean more accurate results.
3. Focusing only on accuracy as a success criterion
This is wrong on so many levels. In many classification scenarios, looking at accuracy in isolation will mislead you. Consider an example of fraud detection. Typically, fraudulent events are rare occurrences. Let's assume a case where only 1 in 1000 transactions is fraudulent. If one creates a rule that predicts all transactions as non-fraudulent, 999 correct predictions out of 1000 result in 99.9% accuracy. However, despite the high accuracy, the model is completely useless for its intended purpose.
Several other metrics are better suited to different business scenarios. For example, recall measures how many of the actual fraud cases the model predicted correctly. In the above case, recall was 0%.
On the flip side, if a model predicts all transactions as fraudulent, then the recall is 1/1 = 100%. However, precision, which is the number of correctly predicted frauds divided by the total number of transactions predicted as fraudulent, is 1/1000 = 0.1%.
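For readers who like to see the arithmetic run, the snippet below reproduces these numbers with scikit-learn's metric functions on a synthetic set of 1,000 transactions containing exactly one fraud (an illustrative assumption, not real data):

```python
# Reproduce the 1-in-1000 fraud example: a do-nothing rule scores 99.9% accuracy
# with 0% recall, while a flag-everything rule scores 100% recall with 0.1% precision.
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

y_true = np.zeros(1000, dtype=int)
y_true[0] = 1                              # exactly 1 fraudulent transaction

always_legit = np.zeros(1000, dtype=int)   # rule 1: predict everything as non-fraudulent
always_fraud = np.ones(1000, dtype=int)    # rule 2: predict everything as fraudulent

print("All non-fraud:",
      f"accuracy={accuracy_score(y_true, always_legit):.1%},",
      f"recall={recall_score(y_true, always_legit, zero_division=0):.1%}")
print("All fraud:    ",
      f"recall={recall_score(y_true, always_fraud):.1%},",
      f"precision={precision_score(y_true, always_fraud):.1%}")
```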
While understanding the precision/recall trade-off is fundamental in data science, I have noticed that many young data scientists struggle to relate it to the business context. The precision/recall relationship is a very interesting topic, and every data scientist should thoroughly understand how it will impact their business outcomes to achieve success in their careers.
Bottom line: Choose the performance metric wisely and ensure that it aligns with your business goals.
4. Not giving enough attention to data quality
Very often, we see young data scientists spend all their energy trying out different models in the hope of getting good accuracy. The fact that data science is a “garbage in, garbage out” system is frequently ignored or overlooked. Even with the best-in-class model, the quality of the outcome will depend on the quality and integrity of the data.
Obtaining high-quality data from the system usually becomes a significant challenge for data scientists. In practice, data scientists need to spend an enormous amount of time and effort collecting and cleaning the data. Sounds like a lot of scut work? Well, there are no shortcuts or free lunches.
Bottom line: Analyzing false-positive and false-negative cases in the training/validation data will give you good insight into what in the data is confusing the model.
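As a starting point, here is a hedged sketch of that error analysis; `model`, `X_val` (a pandas DataFrame), and `y_val` are placeholders for whatever trained binary classifier and validation split you already have:

```python
# Sketch of error analysis: slice out the false positives and false negatives from a
# validation set so they can be inspected for data-quality issues (mislabeled records,
# missing values, inconsistent encodings, etc.). `model`, `X_val`, `y_val` are assumed.
import pandas as pd

def error_slices(model, X_val: pd.DataFrame, y_val: pd.Series):
    pred = model.predict(X_val)
    false_positives = X_val[(pred == 1) & (y_val == 0)]
    false_negatives = X_val[(pred == 0) & (y_val == 1)]
    return false_positives, false_negatives

# fp, fn = error_slices(model, X_val, y_val)
# fp.describe()   # compare these distributions against correctly classified rows
```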
5. Target leakage
Here, I want to emphasize the importance of understanding the business process for which you are building the model. It is crucial to ensure that none of the features used is generated after the target variable is determined.
There are a couple of reasons for that: a) such a feature will not be available to you at prediction time, and b) it may be influenced by the outcome recorded in the target variable. Let me explain with a story.
We were building a credit risk model for the loan collection process at a bank. The first model we built came out with an F1 score of 0.99. For an experienced data scientist, this would immediately raise a red flag. On closer inspection, we found that one of the variables, “# of contacts made with the borrower”, was causing the leakage. The fact is that if a loan goes delinquent, more than the usual number of contacts are made in an attempt to collect. Hence this variable was a source of target leakage. The same caution applies when performing target encoding.
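A quick, admittedly rough check we find useful is to retrain with and without the suspicious feature and compare cross-validated scores. In the sketch below, the column name num_contacts is a hypothetical stand-in for the “# of contacts made with the borrower” field, and X is assumed to be a pandas DataFrame of features with labels y:

```python
# Hedged leakage check: a big score drop when a single feature is removed suggests that
# the feature encodes information that only exists after the outcome is known.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def compare_with_without(X, y, suspect="num_contacts"):
    clf = RandomForestClassifier(n_estimators=200, random_state=0)
    with_feature = cross_val_score(clf, X, y, cv=5, scoring="f1").mean()
    without_feature = cross_val_score(clf, X.drop(columns=[suspect]), y, cv=5, scoring="f1").mean()
    print(f"F1 with '{suspect}':    {with_feature:.3f}")
    print(f"F1 without '{suspect}': {without_feature:.3f}")
    # A fall from ~0.99 to something far lower is a strong hint of target leakage.
```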
Bottom line: Unless you carefully filter out such features, they can catch you completely off guard.
6. Not being able to recognize overfitting
Just the other day, a young data scientist on my team reported a “peculiar” case. He had tried several neural nets for a document classification problem. In one of the experiments, he tried a bigger model (with more layers) and, to his amazement, the accuracy of the model on the validation set dropped! Applying appropriate regularization techniques resolved the issue.
Even though overfitting is taught rigorously in data science courses, my observation is that recognizing it takes practice. This may be because it is perceived as an insignificant problem until it starts to manifest itself much more often than one had expected.
Bottom line: There are many tell-tale signs of overfitting. Acknowledging that it will appear in practically every project and addressing it as soon as it is observed will solve a good chunk of the problems that data scientists face.
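One of those tell-tale signs is a training score that keeps climbing while the validation score stalls or drops as model capacity grows. The sketch below illustrates it on synthetic data, using tree depth as a stand-in for the “bigger model” in the anecdote:

```python
# Illustrative overfitting check: as capacity (max_depth) grows, watch the gap between
# training and validation scores widen. Synthetic data; numbers are for illustration only.
from sklearn.datasets import make_classification
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, n_informative=5, random_state=0)
depths = [2, 4, 8, 16, 32]
train_scores, val_scores = validation_curve(
    DecisionTreeClassifier(random_state=0), X, y,
    param_name="max_depth", param_range=depths, cv=5)

for d, tr, va in zip(depths, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    flag = "  <-- widening train/validation gap" if tr - va > 0.10 else ""
    print(f"max_depth={d:2d}  train={tr:.3f}  validation={va:.3f}{flag}")
```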
7. Not spending time to understand the domain
Data science competitions (like those on Kaggle) create misplaced expectations for both data scientists and the companies hiring them, because the data is handed over ready-made for the competition.
In real life, data scientists spend a substantial amount of time on data discovery and cleaning. Even if we assume there is a separate data engineering team, the data scientist still needs enough understanding of the domain and the problem statement to select the appropriate fields for extraction.
Furthermore, domain knowledge is also essential for performance measurement. Depending on the business problem, one may have to look at the model output differently. For instance, if the business objective is identifying risk or fraud, one may want to train the model for higher recall. Conversely, for resource-optimization problems, higher precision is typically desired.
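One hedged way to encode that business preference, assuming you already have validation labels y_val and predicted probabilities proba_val from your own model, is to pick the operating threshold from the precision-recall curve rather than defaulting to 0.5:

```python
# Sketch: choose a decision threshold that honours the business objective, either a
# minimum recall (risk/fraud detection) or a minimum precision (resource optimization).
# `y_val` and `proba_val` are assumed to come from your own validation split.
import numpy as np
from sklearn.metrics import precision_recall_curve

def pick_threshold(y_val, proba_val, min_recall=None, min_precision=None):
    precision, recall, thresholds = precision_recall_curve(y_val, proba_val)
    precision, recall = precision[:-1], recall[:-1]       # align with `thresholds`
    if min_recall is not None:
        ok = recall >= min_recall
        return thresholds[ok][np.argmax(precision[ok])]   # best precision at acceptable recall
    if min_precision is not None:
        ok = precision >= min_precision
        return thresholds[ok][np.argmax(recall[ok])]      # best recall at acceptable precision
    raise ValueError("Specify min_recall or min_precision")

# t_fraud = pick_threshold(y_val, proba_val, min_recall=0.95)    # risk/fraud: favour recall
# t_ops   = pick_threshold(y_val, proba_val, min_precision=0.90) # optimization: favour precision
```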
Conclusion:
In conclusion, navigating the world of data science requires awareness of the potential pitfalls and challenges. By avoiding the common mistakes mentioned in this article, aspiring data scientists can enhance their chances of success. Emphasizing the right performance metrics and investing time in understanding the domain context are essential steps toward becoming a proficient data scientist.
References
Dickens, C. (1859). A Tale of Two Cities. London: Chapman and Hall.
Master Math Skills for Data Science. Coursera.
Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2nd ed.). New York: Springer.
Bishop, C. M. (2006). Pattern Recognition and Machine Learning. New York: Springer.
Target Leakage. DataRobot.