Research On Clustering And Classification Algorithm Applied To Heart Disease Data
1. Introduction
1.1 Background of heart attack
Heart attack, also known as myocardial infarction, is a condition in which the flow of oxygenated blood to the heart is severely reduced (American Heart Association, 2023; Hawaii State Department of Health, 2012). Atherosclerosis, the process by which the coronary arteries are narrowed by accumulations of plaque, fat, and cholesterol, is the cause of heart attack and is recognized as coronary artery disease (American Heart Association, 2023; Centers for Disease Control and Prevention, 2021). During this accumulation, a blood clot may form if the plaque ruptures, obstructing blood flow and leading to permanent, irreversible damage to the heart muscle (American Heart Association, 2023). Heart attacks are divided into two types, ST-segment elevation myocardial infarction (STEMI) and non-ST-segment elevation myocardial infarction (NSTEMI), determined by the electrocardiogram (ECG) (see Figure 1) (Hawaii State Department of Health, 2012).
1.2 Problem statement
Heart attack, if left untreated, can lead to a wide range of serious, potentially lethal complications including arrhythmia, heart failure, cardiogenic shock, and heart rupture (NHS, 2023). According to the British Heart Foundation (2024), heart attack is the leading cause of death worldwide, accounting for 1 in every 3 deaths, and an estimated 20.5 million people were killed by heart attack in 2021. In Hong Kong, heart attack ranks 4th among causes of death; data show that it caused 3,995 deaths in 2022, with approximately 10.9 people dying from coronary heart disease each day (Healthy HK, 2023). Taking everything into account, heart attack is a serious global public health issue.
Three different supervised machine learning algorithms, namely decision trees, naive Bayes, and k-NN, were used to analyze heart disease data (Chandna, 2014). The data were evaluated using 10-fold cross-validation and the results of the different models were compared. The decision tree classifier is a supervised method that is generally relatively simple: it selects nodes and splits samples according to the information gain of attributes, requires no domain knowledge or advance parameter setting, and can handle high-dimensional data. Naive Bayes is a statistical classifier that assumes no dependencies between attributes; it attempts to maximize the posterior probability of the predicted class, and such classifiers perform well in many complex real-world situations. The k-nearest neighbor algorithm (k-NN) classifies objects based on the nearest training data in the feature space; the distance metric and the number of nearest neighbors consulted can both be chosen. For high-dimensional feature spaces, its classification performance degrades because distances cannot be measured well.
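The comparison described above can be sketched with scikit-learn. This is a minimal illustration only: it uses synthetic stand-in data from `make_classification` rather than the heart disease dataset, and the default hyperparameters (e.g. k=5 for k-NN) are assumptions, not values from the cited study.

```python
# Hedged sketch: comparing the three classifiers with 10-fold cross-validation.
# Synthetic data stands in for the real heart disease dataset.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=13, random_state=0)

models = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    "naive Bayes": GaussianNB(),
    "k-NN (k=5)": KNeighborsClassifier(n_neighbors=5),
}
mean_scores = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=10)  # 10-fold CV accuracy
    mean_scores[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f}")
```

In practice the same loop would be run on the 14 selected attributes of the Cleveland data, with the mean accuracies compared across the three models.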
1.3 Data source
The source of this dataset is the website Kaggle, and it was published by a data engineer named Naresha Bhat. The dataset URL is:
The dataset contains data related to heart disease such as age, sex, type of chest pain, serum cholesterol level, ECG results, etc. It also records the likelihood of a heart attack in the “target” column, whose integer values are divided into “0” and “1”: “0” indicates a lower risk of heart attack, while “1” indicates a higher risk.
1.4 Methods
The database contains 76 attributes, but all published experiments use a subset of 14 of them. In particular, the Cleveland database is the only one that has been used by ML researchers. The “target” field indicates whether the patient has heart disease: it is an integer value where 0 = no/less chance of heart attack and 1 = more chance of heart attack. The purpose of our experiment is to predict, with higher accuracy, whether a heart attack will occur based on the attribute values after processing the feature data.
Our project revolved around exploring different variables, such as age, sex, type of chest pain, serum cholesterol level, and ECG results, and then finding out how these variables relate to the incidence of heart attack. To predict the likelihood of a heart attack, we utilized both the K-means clustering model and the regression model. By inputting the gathered data into these models, we aim to analyze the relationship between the variables and the possibility of heart attack, and hence list the 3 variables most highly associated with heart attack.
2. Data Cleansing
2.1 Initial exploration of data to examine the data quality
The following code was used to import the dataset into the program. Line 1 imports the package “pandas”, line 3 ensures the output displays all columns, line 4 reads the CSV file of the dataset, and line 5 prints the first 5 rows of the dataset for checking. Figure 3 shows the results of running the code in Figure 2.
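A sketch of the import step is shown below. Since the Kaggle CSV is not bundled here, a few stand-in rows are embedded via `io.StringIO`; in the project itself the call would be `pd.read_csv("heart.csv")` (the filename is an assumption).

```python
import io
import pandas as pd

pd.set_option("display.max_columns", None)  # show every column in the output

# Stand-in rows with the dataset's column names; the real project reads the
# Kaggle CSV instead, e.g. df = pd.read_csv("heart.csv").
csv_text = """age,sex,cp,trestbps,chol,target
63,1,3,145,233,1
37,1,2,130,250,1
41,0,1,130,204,1
56,1,1,120,236,1
57,0,0,120,354,1
57,1,0,140,192,0
"""
df = pd.read_csv(io.StringIO(csv_text))
print(df.head())  # first 5 rows, as a quick sanity check
```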
We can use the code for data grouping: the variables (age, resting blood pressure, cholesterol) were sorted into 5 groups, and the average value of the “target” column was computed for each group, indicating the risk of heart attack in that group. The results show that the lower age groups, the lower resting blood pressure groups, and the higher cholesterol groups have a higher risk of heart attack (see Figure 4).
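The grouping step can be sketched with `pd.cut` plus a groupby; the code below uses randomly generated stand-in columns (the column names `age`, `trestbps`, `chol`, `target` follow the dataset, the values do not).

```python
import numpy as np
import pandas as pd

# Stand-in data in the dataset's value ranges; not the real Kaggle rows.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.integers(29, 78, 200),
    "trestbps": rng.integers(94, 200, 200),
    "chol": rng.integers(126, 400, 200),
    "target": rng.integers(0, 2, 200),
})

for col in ["age", "trestbps", "chol"]:
    groups = pd.cut(df[col], bins=5)  # 5 equal-width bins
    # Mean of "target" per bin = observed heart attack rate in that group
    risk = df.groupby(groups, observed=False)["target"].mean()
    print(risk, "\n")
```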
Count graph of risk of heart attack
The code starts by importing matplotlib.pyplot for plotting. It then counts the number of values in the “target” column and uses the counts to create a count bar chart. The plot shows 165 people with a higher risk of heart disease and 138 people with a lower risk (see Figure 5).
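A minimal version of this count plot is sketched below, using a stand-in `target` series built from the counts reported in the text (165 higher-risk, 138 lower-risk).

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in target column reproducing the counts reported above
target = pd.Series([1] * 165 + [0] * 138, name="target")
counts = target.value_counts()

ax = counts.plot(kind="bar")
ax.set_xlabel("target (0 = lower risk, 1 = higher risk)")
ax.set_ylabel("count")
plt.savefig("target_counts.png")
print(counts.to_dict())
```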
Visualizing the relationship between the risk of heart attack and gender gives a more intuitive view of the impact of gender. First, we obtain the value counts for heart attack and group them by sex, and set the plot background to whitegrid. In the figure, orange bars indicate male, blue bars indicate female, and “0” and “1” indicate lower and higher risk of heart attack, respectively. The results show that men had a higher risk of heart attack than women (see Figure 6).
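The grouped count plot described above can be sketched with seaborn's `countplot` and a `hue` split; the eight rows below are illustrative stand-ins, not rows from the dataset.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

sns.set_style("whitegrid")  # the whitegrid background mentioned above

# Stand-in rows; in the project these columns come from the dataset
df = pd.DataFrame({
    "sex": [0, 0, 0, 1, 1, 1, 1, 1],     # 0 = female, 1 = male
    "target": [1, 1, 0, 0, 1, 1, 1, 0],  # 0 = lower risk, 1 = higher risk
})
ax = sns.countplot(data=df, x="target", hue="sex")
ax.set_xlabel("target (0 = lower risk, 1 = higher risk)")
plt.savefig("risk_by_sex.png")

counts = df.groupby(["sex", "target"]).size()
print(counts)
```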
Correlations of the variables
The image below visualizes the dependencies between the variables. The first step is to import the library “seaborn”. Next, we draw a heat map of the correlations between the columns. The results show that heart attack risk correlates most strongly with chest pain type, followed by maximum heart rate and, third, by “slope”, the slope of the peak exercise ST segment in the dataset.
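A correlation heat map of this kind is sketched below over a random stand-in frame; only the column names (`cp`, `thalach`, `slope`, `target`) follow the dataset.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Random stand-in data; the project computes this on the full dataset
rng = np.random.default_rng(1)
df = pd.DataFrame(rng.normal(size=(100, 4)),
                  columns=["cp", "thalach", "slope", "target"])

corr = df.corr()  # pairwise Pearson correlations between columns
sns.heatmap(corr, annot=True, cmap="coolwarm")
plt.savefig("corr_heatmap.png")

# Sorting the "target" column surfaces the strongest correlates
print(corr["target"].sort_values(ascending=False))
```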
3. Data Pre-processing
Since the feature columns of the given open dataset have no missing values, our preprocessing first removes duplicate rows from the data. The following code detects duplicate rows, keeping the first occurrence and removing the remaining duplicates (He, Li, & Zhang, 2010).
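With pandas this is a one-liner via `drop_duplicates(keep="first")`; the three-row frame below is an illustrative stand-in.

```python
import pandas as pd

# Stand-in frame: rows 0 and 1 are exact duplicates
df = pd.DataFrame({
    "age": [63, 63, 41],
    "chol": [233, 233, 204],
    "target": [1, 1, 1],
})
# keep="first" retains the first occurrence and drops later identical rows
deduped = df.drop_duplicates(keep="first").reset_index(drop=True)
print(deduped)
```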
Contradictory data are rows with exactly the same feature values but different label values. They can send false guidance signals during model training, which affects the convergence and performance of the model; they also increase the model's error rate on unseen data and reduce its generalization ability, preventing accurate predictions on new data. We therefore remove the contradictory data during preprocessing.
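One way to implement this (a sketch, not necessarily the project's exact code) is to count distinct labels per feature combination and keep only combinations that map to a single label:

```python
import pandas as pd

# Stand-in frame: the first two rows share features but disagree on the label
df = pd.DataFrame({
    "age": [63, 63, 41],
    "chol": [233, 233, 204],
    "target": [1, 0, 1],
})
features = [c for c in df.columns if c != "target"]
# A feature combination is contradictory if it maps to more than one label
label_counts = df.groupby(features)["target"].transform("nunique")
clean = df[label_counts == 1].reset_index(drop=True)
print(clean)
```

An alternative design choice is to keep the majority label per combination instead of dropping the rows outright; dropping is the stricter option.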
Standardization transforms the data into a distribution with mean 0 and standard deviation 1 (Rosenson & Cannon, 2020). Its main purpose is to eliminate the differences in scale between features, making it easier for the model to learn the relationships between them and improving training effectiveness and generalization. Standardization ensures that the weights of different features influence the result in a balanced way during training, avoiding situations where some features dominate the model simply because of their large numerical range. It also helps accelerate model convergence, reduces training time, and improves model stability.
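With scikit-learn this is `StandardScaler`, which subtracts each column's mean and divides by its standard deviation; the small matrix below is a stand-in for the feature columns.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Stand-in columns (e.g. resting blood pressure, cholesterol)
X = np.array([[120.0, 200.0],
              [140.0, 250.0],
              [160.0, 300.0]])

scaler = StandardScaler()
X_std = scaler.fit_transform(X)  # each column now has mean 0, std 1
print(X_std.mean(axis=0), X_std.std(axis=0))
```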
Data balancing and oversampling are strategies for solving the class imbalance problem. When the numbers of samples in different classes differ greatly, the model may be biased toward predicting the majority class while ignoring the minority class, degrading performance (Chawla, Bowyer, Hall, & Kegelmeyer, 2002). Through data balancing and oversampling, the class sample counts can be made relatively even, improving the model's ability to recognize the minority class. Oversampling balances the dataset by replicating minority-class samples or synthesizing new ones, so that the model can better learn the features of the minority class during training, improving generalization and prediction accuracy. We use the SMOTE oversampling method.
4. Model Planning and Building
Although we roughly know that overweight people are a main group at risk of heart attack, we were curious whether some of the other variables correlate with heart attack risk. Hence, we decided to investigate the relationship between three candidate variables (resting blood pressure, serum cholesterol, and age) and the risk of heart attack, using the K-means algorithm and regression models.
Firstly, the K-means algorithm groups n observations into k clusters: for a chosen value of k, it identifies clusters of objects based on their proximity to the centres of the k groups (Sharma, 2024). We hypothesized that people with (1) high resting blood pressure, (2) high serum cholesterol, or (3) older age have a higher risk of heart attack. Our prediction was that such people would be grouped together and concentrated in the area representing the target value (the chance of heart attack: 0 = less chance, 1 = more chance). We applied the K-means method using one variable together with the outcome value, one variable at a time.
Furthermore, regression models are used to predict the outcome for an unseen case. We predicted that people with higher resting blood pressure, higher serum cholesterol, and older age might have a higher chance of suffering from heart disease. We processed the data using both a simple regression model and a multiple regression model.
As there are no missing values in any of the mentioned variables, we only need to standardize the data before clustering: all values of resting blood pressure, serum cholesterol, age, and target were converted to z-scores for the K-means model.
Processing: K-Means
In order to find the best and most suitable value of k (the number of clusters), we used the KMeans class in the scikit-learn library (Figure 7). Eventually, we found that 2 (Figure 29) is the most suitable number of clusters for the clustering that follows.
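Selecting k is commonly done with the elbow method: fit `KMeans` for a range of k and watch where the inertia (within-cluster sum of squares) stops dropping sharply. The sketch below uses two synthetic blobs as stand-in data, for which the elbow lands at k=2.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Two synthetic blobs standing in for the standardized feature columns
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(5, 1, (100, 2))])
X = StandardScaler().fit_transform(X)

inertias = {}
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = km.inertia_  # within-cluster sum of squared distances
    print(k, round(km.inertia_, 1))
# The "elbow" — the k after which inertia stops dropping sharply — is chosen
```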
As shown in Figure 8, the maximum number of iterations is set to 50, and the influence of each attribute column on the prediction accuracy is measured by using each single attribute as the feature in turn. Among the attributes, thal, thalach, and cp have the greatest influence on accuracy.
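One plausible reading of this step (an assumption, since the original code is only shown as a figure) is to cluster on each attribute alone with `max_iter=50` and score the cluster labels against the target. The column names `thal`, `thalach`, and `cp` come from the dataset; the values below are synthetic.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-ins: three columns, each weakly shifted by the target label
rng = np.random.default_rng(0)
names = ["thal", "thalach", "cp"]
target = rng.integers(0, 2, 150)
X = rng.normal(size=(150, 3)) + target[:, None]

accs = {}
for i, name in enumerate(names):
    km = KMeans(n_clusters=2, max_iter=50, n_init=10, random_state=0)
    labels = km.fit_predict(X[:, [i]])  # cluster on one attribute at a time
    acc = (labels == target).mean()
    accs[name] = max(acc, 1 - acc)  # cluster ids are arbitrary; take best match
    print(name, round(accs[name], 3))
```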
Processing: Regression Model
We first split the dataset in df2 into training data (80%) and test data (20%), using random_state = 3. We then used the training data to calculate the coefficient and intercept values for a simple linear regression with “resting blood pressure” as the predictor variable and “target” as the criterion variable. After that, we plotted a scatter of the training data and drew the regression line. Finally, we calculated the mean absolute error (MAE) and, using the test data, obtained the mean squared error (MSE), root mean squared error (RMSE), and R-squared values of the model. The multiple regression model follows a similar process to the simple regression model, with changes in some parts.
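The split-fit-score pipeline above can be sketched as follows. The random predictor and target columns are stand-ins (so the metric values here are meaningless); only the 80/20 split and `random_state=3` mirror the text.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Stand-in columns: resting blood pressure as predictor, binary target
rng = np.random.default_rng(0)
X = rng.integers(94, 200, size=(303, 1)).astype(float)
y = rng.integers(0, 2, size=303).astype(float)

# 80% train / 20% test, with random_state = 3 as in the project
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=3)

model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)

mae = mean_absolute_error(y_test, pred)
mse = mean_squared_error(y_test, pred)
rmse = mse ** 0.5
r2 = r2_score(y_test, pred)
print("coef:", model.coef_, "intercept:", model.intercept_)
print("MAE:", mae, "MSE:", mse, "RMSE:", rmse, "R2:", r2)
```

The multiple regression variant differs only in passing several predictor columns in `X` instead of one.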
5. Results and key findings
In general, even though the two methods may not be able to give a fully clear and accurate result, we can still find that resting blood pressure, serum cholesterol, and age are correlated with the risk of heart attack (the ‘target’) using simple and multiple regression models. We also discovered that the K-means method is more suitable than the regression models for finding which group of people has a higher risk of heart attack.
Concerning the results of the K-means method, it provides useful information in that it can tell us which group of people is at the highest risk of suffering a heart attack. For example, we can see that people identified as target number ‘1’ are grouped together in the area representing the value of serum cholesterol (chol) in Figure 9 and Figure 10.
Based on this figure, we can suggest that people with serum cholesterol levels of about 200-300 have a higher chance of suffering a heart attack. Figures 9-14 illustrate the clustering results obtained with the K-means method. Nevertheless, although the figures show some clear and significant groups based on the different values of the variables after clustering, we cannot guarantee that the higher-risk group's risk is due to higher resting blood pressure, higher serum cholesterol, or older age. We suggest that the function of the K-means method is to cluster data with similar values and/or attributes together, rather than to demonstrate the correlation between two or more variables; that is, it cannot show that the outcome variable changes because one variable goes up or down.
Regarding the results of the regression methods, we can roughly identify a significant correlation between the outcome and the other three variables. However, the results did not match our expectation that higher resting blood pressure, higher serum cholesterol, and older age would increase the risk of heart attack. No matter which predictor variable is used, there is a negative correlation between the variable and the outcome, and all R-squared values are negative. Even when the correlation is examined with the multiple regression model, the result is the same as with the simple regression model. The regression models do show that the three predictor variables (resting blood pressure, serum cholesterol, age) are negatively correlated with the criterion variable (target); nevertheless, regression may not be suitable for finding the correlation between these predictors and the criterion variable, as the outcome variable (target) is represented by a single binary value (‘1’ or ‘0’), which may cause inaccurate results. Generally, when the predictor and criterion variables take actual continuous values, the points do not concentrate in one main area of the regression plot. Figures 15-17 illustrate the regression results, and the table below lists the MAE, MSE, RMSE, R-squared, coefficient, and intercept values of the regression models.
The MAE, MSE, RMSE, R-squared, coefficient, and intercept values for each variable against the outcome variable (target):
Note: all values are rounded to three significant figures.
* The average value of the coefficients of the three predictor variables
6. Conclusion
In our research, we proposed a complete data analysis pipeline, covering the whole process from data preprocessing to model construction. During the experiment, we gained a more comprehensive understanding of the heart disease dataset and analyzed the questions we set out to answer. Our work adopts two main methods: K-means and regression. K-means is an important clustering algorithm, mainly used for unsupervised learning; here we use it to identify people at high risk of heart disease. First, we calculate the distance between each sample and the centres of the existing clusters and assign new samples to the nearest cluster. Based on the characteristics of the newly added samples and the original samples, the cluster centres are updated and used in the next round of iterations. When predicting a new sample, the distances between its feature values and the other samples can be computed directly to obtain a predicted heart disease label. The final clustering results revealed new clinical intervention points, which can assist subsequent prevention and treatment. The K-means method provides useful information by telling us which group of people is at the highest risk of a heart attack. While the experiment shows some clear and significant groups based on the different values of the clustered variables, we cannot guarantee that the groups with higher heart attack risk owe that risk to higher resting blood pressure, serum cholesterol, or age. We suggest that the function of the K-means method is to cluster data with similar values and/or attributes together, rather than to show the correlation between two or more variables; that is, it cannot explain the rise or fall of the outcome variable in terms of changes in one variable. Our experiments also employ regression models.
For the results of the regression method, we can roughly find a significant correlation between the outcome and the other three variables. However, the results were not in line with our expectation that high resting blood pressure, high serum cholesterol, and increasing age contribute to an increased risk of heart attack. Regardless of the predictor variable, there is a negative relationship between the variable and the outcome, and all R-squared values are negative. This finding may indicate complex interactions among the risk factors for heart attack, or that the current models do not adequately capture the true relationships between these variables.
7. Weaknesses and Prospects
Our research still has certain shortcomings. In the future, we may consider applying more advanced machine learning algorithms, such as random forests or deep learning frameworks, to the heart disease dataset used here. By training a more accurate model, we hope that our work can help proactively recommend lifestyle changes, such as adjusting diet and increasing physical activity, to patients who are older or have higher resting blood pressure. Through further research, we can better identify high-risk groups and formulate effective prevention strategies, which we hope will provide valuable guidance for future research directions and clinical practice.
References