Predicting Heart Disease using Machine Learning? Don't!

Venkat Raman

Co-Founder & CEO at Aryma Labs | Building Marketing ROI Solutions For a Privacy First Era | Statistician |

Published: November 3, 2020

I was recently invited to judge a data science competition. The students were given the 'heart disease prediction' dataset, perhaps an improvised version of the one available on Kaggle. I had seen this dataset before, and I often come across self-proclaimed data science gurus teaching naive people how to predict heart disease through machine learning.

I believe "Predicting Heart Disease using Machine Learning" is a classic example of how not to apply machine learning to a problem, especially one where a lot of domain experience is required.

Let me unpack the various problems in applying machine learning to this data set.

Dive-straight-into-the-problem syndrome: This is the first mistake many people make. They jump straight into the problem and start thinking about which machine learning algorithm to apply. Doing EDA and the like as part of this process is not *thinking* about the problem. Rather, it is a sign that you have already accepted the notion that the problem needs a data science solution. Instead, one of the pertinent questions to ask before starting any analysis is, "Is this problem even predictable through the application of machine learning?"

Blind faith in data: This is an extension of the first point. Diving straight into the problem means you have blind faith in the data. People assume the data to be true and make no effort to scrutinize it. For example, the dataset provided only systolic blood pressure. Any doctor, or even a paramedic, would tell you that systolic blood pressure alone does not give the full picture; the diastolic level matters too. Many don't even ask the question, "Are these features enough to predict the outcome, or are more features needed?"

Not enough data per patient: Let's take a look at the dataset. If you notice, there is only one data point under each feature for a patient. The fundamental problem here is that features like blood pressure, cholesterol and heart rate are not static. They vary over a range. A person's blood pressure varies from hour to hour and from day to day, and so does heart rate. So when it comes to the prediction problem, there is no telling whether a blood pressure of 135 mmHg was one of the factors that caused the heart disease, or whether it was 140 mmHg, all while the dataset might be reporting 130 mmHg. Ideally, multiple measurements should be taken for each feature for each patient.
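The point about single measurements can be made concrete with a small simulation. This is only a sketch: the patient, the 135 mmHg average, the 8 mmHg within-person spread and the assumption that readings are independent and normally distributed are all made up for illustration and are not taken from the heart disease dataset.

```python
import math
import random
import statistics


def standard_error(within_sd: float, n_readings: int) -> float:
    """Uncertainty (in mmHg) of the average of n independent readings."""
    return within_sd / math.sqrt(n_readings)


# Hypothetical patient: the numbers below are assumptions for illustration,
# not values taken from the heart disease dataset.
TRUE_MEAN_SYSTOLIC = 135.0   # long-run average, mmHg
WITHIN_PERSON_SD = 8.0       # assumed hour-to-hour variation, mmHg

rng = random.Random(0)  # fixed seed so the run is repeatable
readings = [rng.gauss(TRUE_MEAN_SYSTOLIC, WITHIN_PERSON_SD) for _ in range(16)]

single_reading = readings[0]          # what a one-row-per-patient table stores
averaged = statistics.mean(readings)  # what repeated measurement would give

print(f"single reading:       {single_reading:.1f} mmHg")
print(f"mean of 16 readings:  {averaged:.1f} mmHg")
print(f"observed range:       {min(readings):.1f} to {max(readings):.1f} mmHg")
print(f"standard error, n=1:  {standard_error(WITHIN_PERSON_SD, 1):.1f} mmHg")
print(f"standard error, n=16: {standard_error(WITHIN_PERSON_SD, 16):.1f} mmHg")
```

Under these assumptions, a single reading carries an uncertainty of 8 mmHg, which is enough to blur 130, 135 and 140 mmHg together. Averaging 16 readings would cut that to 2 mmHg.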

Now let's come to the crux of the matter.

Applying algorithms without domain experience: One of the reasons for the high failure rate of data science applications in healthcare is that the data scientists applying the algorithms do not have adequate medical knowledge.

Secondly, in healthcare, causality is taken very seriously. Many rigorous clinical and statistical tests are conducted to infer causality.

In this case study, any machine learning algorithm is just trying to map the input to the output while reducing some error metric. Also, machine learning algorithms are not classifiers by themselves; we turn them into classifiers by setting some cut-off or threshold. Again, this cut-off is not chosen to deduce causality but just to get "favorable metrics".
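To illustrate the point about cut-offs, here is a minimal sketch in plain Python. The scores and labels are invented, and `classify` and `precision_recall` are hypothetical helper names, not functions from any particular library. The model's output is only a score; the choice of threshold is what turns it into a "classifier" and what moves the metrics.

```python
def classify(scores, threshold):
    """Turn continuous scores into 0/1 labels using a cut-off."""
    return [1 if s >= threshold else 0 for s in scores]


def precision_recall(y_true, y_pred):
    """Compute precision and recall for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up model scores and true outcomes for eight hypothetical patients.
scores = [0.10, 0.30, 0.35, 0.45, 0.55, 0.60, 0.80, 0.90]
y_true = [0, 0, 1, 0, 1, 0, 1, 1]

for threshold in (0.3, 0.5, 0.7):
    precision, recall = precision_recall(y_true, classify(scores, threshold))
    print(f"threshold={threshold}: precision={precision:.2f}, recall={recall:.2f}")
```

With these made-up numbers, moving the threshold from 0.5 to 0.7 raises precision from 0.75 to 1.0 while recall drops from 0.75 to 0.5. Nothing about the patients changed; only the cut-off did.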

Aggravating this problem is the use of low-code libraries. This case study is a case in point for why low-code libraries can be dangerous. They fit a dozen or more algorithms, and most users are not even aware of how some of these algorithms work! They just pick the 'best' algorithm based on metrics like F1, precision, recall and accuracy.

Low-code libraries that fixate on accuracy metrics invite Goodhart's law: "When a measure becomes a target, it ceases to be a good measure."

Image credits: https://sketchplanations.com/goodharts-law
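A small sketch of how selecting the 'best' model on a single metric can go wrong. The data are invented (5 patients with disease out of 100), the two candidate "models" are just hard-coded prediction lists, and the selection step is a stand-in for what an automated model-comparison tool might do, not the behaviour of any specific library.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred):
    """Fraction of true positives (label 1) that were detected."""
    positives = sum(y_true)
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    return tp / positives if positives else 0.0


# Hypothetical imbalanced data: 5 patients with disease, 95 without.
y_true = [1] * 5 + [0] * 95

# Two hypothetical candidates, represented only by their predictions.
candidates = {
    # Predicts "no disease" for everyone.
    "always_healthy": [0] * 100,
    # Catches 4 of 5 sick patients at the cost of 10 false alarms.
    "cautious_model": [1, 1, 1, 1, 0] + [1] * 10 + [0] * 85,
}

# Pick the winner purely on accuracy, as a leaderboard would.
best_by_accuracy = max(candidates, key=lambda name: accuracy(y_true, candidates[name]))

for name, y_pred in candidates.items():
    print(f"{name}: accuracy={accuracy(y_true, y_pred):.2f}, "
          f"recall={recall(y_true, y_pred):.2f}")
print("selected on accuracy:", best_by_accuracy)
```

In this toy setup, the model that never flags anyone wins on accuracy (0.95 versus 0.89) while detecting no sick patients at all. Once accuracy becomes the target, it stops measuring what matters.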

If you are predicting, you are implying causation. In healthcare, mere prediction is not enough; one needs to prove causation. Machine learning classifier algorithms do not answer the 'causation' part.

Believing they have solved a real healthcare problem: Last but not least, many believe that by fitting an ML algorithm to a *healthcare* dataset and getting some accuracy metrics, they have solved a real healthcare problem. Nothing could be further from the truth, especially in the healthcare domain.

In conclusion:

There are perhaps thousands of business problems that genuinely warrant data science / machine learning solutions. But at the same time, one should not fall into the trap of "To a person with a hammer, everything looks like a nail". Seeing every problem as a nail (a data science problem) and machine learning algorithms as the hammer can be very counterproductive. Much of the 80% failure rate in applying data science to business problems could be attributed to this.

Good data scientists are like good doctors. Good doctors suggest conservative treatments first before prescribing heavy doses of medicine or surgery. Similarly, a good data scientist should ask certain pertinent questions first before blindly applying a dozen ML algorithms to the problem.

Doctor : Surgery :: Data Scientist : Machine Learning

Your comments and opinions are welcome.

Anirudh Upadhyay

ML Engineer @ Capgemini Engineering | Statistical Modelling | NLP | GenAI | Azure

3 years ago

This is wisdom, which is scarce nowadays. Thanks for sharing.

Aishwarya Said

Ex-Amazon | 3+ Years Experience | Stamp 4 Holder

3 years ago

Thank you for breaking the myths. There are hundreds of algorithms; knowing which to apply and when to apply it is actually the job of a data scientist.

Dipankar Dey

3 years ago

I see only a few people reacted to this post, but it points to a potential misapplication of machine learning. I remember that in one of his lectures Prof. Andrew Ng also pointed out a very basic thing (which seemed trivial to me, but as I learn day by day it's becoming a little clearer, though not clear enough): he said something similar, that first we need to decide whether ML can be applied to a certain use case or not. I feel that the two approaches are different: statistics relies more on assumptions, while in ML the approach is more data driven, with less importance given to assumptions. I might be incorrect, but that's what I feel.
