Predicting Heart Disease using Machine Learning? Don’t!
Credits: https://unsplash.com

I was recently invited to judge a data science competition. The students were given the 'heart disease prediction' dataset, perhaps an improvised version of the one available on Kaggle. I had seen this dataset before, and I often come across self-proclaimed data science gurus teaching naïve people how to predict heart disease through machine learning.

I believe “Predicting Heart Disease using Machine Learning” is a classic example of how not to apply machine learning to a problem, especially one where a lot of domain experience is required.

Let me unpack the various problems in applying machine learning to this data set.

Dive straight into the problem syndrome – This is the first mistake many people make: jumping straight into the problem and asking which machine learning algorithm to apply. Doing EDA et al. as part of this process is not *thinking* about the problem. Rather, it is a sign that you have already accepted the notion that the problem needs a data science solution. Instead, one of the pertinent questions to ask before starting any analysis is, “Is this problem even predictable through the application of machine learning?”

Blind faith in data – This is an extension of the first point. Diving straight into the problem means you have blind faith in the data. People assume the data to be true and make no effort to scrutinize it. For example, the dataset only provided systolic blood pressure. If you spoke to any doctor or even a paramedic, they would tell you that systolic blood pressure alone does not give the full picture; the diastolic level is important too. Many don't even ask the question, "Are the features enough to predict the outcome, or are more features needed?"

Chart depicting insufficient data.

Not enough data per patient: Let’s take a look at the data set above. Notice that there is only one data point under each feature for a patient. The fundamental problem here is that features like blood pressure, cholesterol and heart rate are not static. They vary. A person's blood pressure changes from hour to hour and from day to day, and so does heart rate. So when it comes to the prediction problem, there is no telling whether a blood pressure of 135 mmHg was one of the factors that caused the heart disease, or whether it was 140 mmHg, all while the data set might be reporting 130 mmHg. Ideally, multiple measurements should be taken for each feature for each patient.
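To make the point concrete, here is a minimal sketch. The readings and the 135 mmHg cut-off are invented for illustration and are not taken from the competition dataset. It shows how a single recorded value can hide the spread of one patient's measurements.

```python
from statistics import mean, stdev

# Hypothetical systolic readings (mmHg) for ONE patient over a week.
# These numbers are made up purely for illustration.
readings = [128, 135, 141, 130, 138, 126, 140]

# What a one-row-per-patient dataset might end up recording.
single_snapshot = readings[3]

avg = mean(readings)
spread = stdev(readings)
lo, hi = min(readings), max(readings)

print(f"snapshot={single_snapshot}, mean={avg:.1f}, sd={spread:.1f}, range={lo}-{hi}")

# Illustrative cut-off (an assumption, not a clinical guideline):
# the same patient lands on either side of it depending on the day.
CUTOFF = 135
flags = [r >= CUTOFF for r in readings]
print(f"days at or above cut-off: {sum(flags)} of {len(readings)}")
```

With only the snapshot value in the table, a model never sees that this hypothetical patient crosses the cut-off on some days and not on others.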

Now let’s come to the crux of the matter.

Applying algorithms without domain experience - One of the reasons for the high failure rate of data science applications in healthcare is that the data scientists applying the algorithms do not have adequate medical knowledge.

Secondly, in healthcare, causality is taken very seriously. Many rigorous clinical and statistical tests are conducted to infer causality.

In the case study, any machine learning algorithm is just trying to map the input to the output while reducing some error metric. Also, machine learning algorithms by themselves are not classifiers; we turn them into classifiers by setting some cut-off or threshold. Again, this cut-off is not chosen to deduce causality but just to get "favorable metrics".
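The role of the cut-off can be sketched in a few lines. The scores and labels below are hypothetical, and `precision_recall` is a helper written for this illustration, not part of any library. The same model output yields very different "metrics" depending only on the threshold chosen.

```python
# Hypothetical model scores (predicted probability of disease) and true labels.
scores = [0.10, 0.35, 0.40, 0.55, 0.60, 0.80, 0.90, 0.30]
labels = [0, 0, 1, 0, 1, 1, 1, 0]


def precision_recall(scores, labels, threshold):
    """Turn scores into 0/1 predictions at `threshold`, then score them."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# The model is unchanged; only the cut-off moves.
for t in (0.3, 0.5, 0.7):
    p, r = precision_recall(scores, labels, t)
    print(f"threshold={t}: precision={p:.2f}, recall={r:.2f}")
```

Nothing in this loop says anything about why a patient has the disease; the threshold is simply a dial that trades precision against recall.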

Aggravating this problem is the use of low-code libraries. This case study is a case in point for why low-code libraries can be dangerous. Low-code libraries fit a dozen or more algorithms. Most users are not even aware of how some of these algorithms work! They just pick the 'best' algorithm based on metrics like F1, precision, recall and accuracy.

Low-code libraries that fixate on accuracy metrics invite 'Goodhart's law': "When a measure becomes a target, it ceases to be a good measure."

Goodhart's law

Image credits: https://sketchplanations.com/goodharts-law
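A small sketch of how chasing accuracy alone can mislead. It uses hypothetical, imbalanced labels (10 patients with the disease out of 100, numbers chosen only for illustration) and a "model" that always predicts "no disease".

```python
# Hypothetical, imbalanced labels: 10 of 100 patients have the disease.
labels = [1] * 10 + [0] * 90

# A "model" that has learned nothing: it always predicts "no disease".
preds = [0] * len(labels)

accuracy = sum(p == y for p, y in zip(preds, labels)) / len(labels)
true_pos = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
recall = true_pos / sum(labels)

print(f"accuracy={accuracy:.2f}, recall={recall:.2f}")
```

Under these assumptions the accuracy is 90% while every single patient with the disease is missed, which is exactly the kind of result that looks good on a leaderboard and is useless in a clinic.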

If you are predicting, you are implying causation. In healthcare, mere prediction is not enough; one needs to prove causation. Machine learning classification algorithms do not answer the ‘causation’ part.

Believing they have solved a real healthcare problem – Last but not least, many believe that by fitting an ML algorithm to a *healthcare* data set and getting some accuracy metrics, they have solved a real healthcare problem. Nothing could be further from the truth, especially in the healthcare domain.

In conclusion:

There are perhaps thousands of business problems that genuinely warrant data science / machine learning solutions. But at the same time, one should not fall into the trap of “To a person with a hammer, everything looks like a nail”. Seeing every problem as a nail (a data science problem) and machine learning algorithms as the hammer can be very counterproductive. Much of the 80% failure rate in data science applications to business problems could be attributed to this.

Good data scientists are like good doctors. Good doctors suggest conservative treatments first, before prescribing heavy doses of medicine or surgery. Similarly, a good data scientist should ask certain pertinent questions first, before blindly applying a dozen ML algorithms to the problem.

Doctor : Surgery :: Data Scientist : Machine Learning

Your comments and opinions are welcome.

Anirudh Upadhyay

ML Engineer @ Capgemini Engineering | Statistical Modelling | NLP | GenAI | Azure

3 years ago

This is wisdom, which is scarce nowadays. Thanks for sharing.

Aishwarya Said

Ex-Amazon | 3+ Years Experience | Stamp 4 Holder

3 years ago

Thank you for breaking the myths. There are hundreds of algorithms; knowing which to apply and when to apply it is actually the job of a data scientist.

Dipankar Dey

Machine Learning | Deep Learning | NLP | Tableau | PowerBI | SQL | Apache NiFi | RASA | SpaCy | Web Scraping | Django | React | MS Excel VBA Macro

3 years ago

I see only a few people reacted to this post, but it points to a potential misapplication of machine learning. I remember that in one of his lectures Prof. Andrew Ng also pointed out a very basic thing (which seemed trivial to me at the time, but as I learn day by day it is becoming a little clearer, though not clear enough): he said something similar, that first we need to decide whether ML can be applied to a certain use case or not. I feel that the two approaches are different: statistics relies more on assumptions, while in ML the approach is more data driven, with less importance given to assumptions. I might be incorrect, but that's what I feel.
