Data Analytics Capstone Project: Part 1, Studying Heart Disease Symptoms to derive insights from a public medical data.
Satyaki Ray
General Manager | Custodian | Geoscience | Technical Assurance at PETRONAS | Geomodeller | Petroleum Engineer | E&P | P. Geo | M.S.P.E | M.Sc
This article is part of a Capstone project by Satyaki Ray, that was conducted while he enrolled in the Texas A&M, Kingsville, Data Analytics Program recently. This is part 1 where the author investigates a heart disease symptoms public dataset to understand the factors that impact the chances of the disease.
Introduction The Capstone Project in TAMK Data Analytics course “Working in organizations and industry†required the download from Kaggle site and study of several datasets. Before starting with the hands-on exercise, we downloaded the data sets that are used for analysis and creating visualizations in Excel and Power BI from the given link: https://www.kaggle.com/datasets
Public Datasets Used:
The following two datasets were downloaded as csv/ xls files from the Kaggle portal.
1. https://www.kaggle.com/ronitf/heart-disease-uci
2. https://www.kaggle.com/gustavomartinez92/student-performance
Data Analytics Steps
Example 1: Heart Disease symptoms Imported csv data into excel and performed data analysis.
Step-1: Launched Excel and opened the heart.csv file. This is a data set for heart disease related symptoms or attributes. Once it opens, checked columns and data. The dataset has 303 records (rows) and 14 attributes (columns) per row. Created headers with meaningful names (for example trestbps is “resting blood pressure mm Hgâ€, Gender 1 is male, Gender is female, fbs is “fasting blood sugar†for which 1 is true and 2 is false and so on. Then created histograms on all columns after selecting all data per column and Insert > charts > All Charts > Histogram. We then resized the histogram and made chart label ( Figure 1). We would like to see if there are any relationship of gender to the symptoms such as blood pressure and cholesterol.
Figure 1: Creating histograms per column
Step-2: Create histograms for all 14 attributes (columns) by selecting the data (drag-select). Visualize the data at top of each column to create insights on heart disease symptom variability by Age, Gender, Chest pain, Blood Pressure level, cholesterol level, fasting blood sugar, resting ECG, max heart rate, exercise induced angina, old peak ST depression by exercise, slope of ST depression, Major vessel color, defect and target.
Figure 2: Get Data into Excel, compute histograms on all columns to see the variability of attributes or heart disease symptoms.
Step-3: Analyze the data closely. The histograms below show age variability of studied 303 data points range between 33 to 72 years. Two genders (male and female) have chest pain variabilities of 3 types. Blood pressure range between 94 – 167 with a peak of 112 in 70 cases. Serum cholesterol has peak value of 200 in 75 cases. Very high cholesterol cases are low in number 2-5 cases.
Figure 3: Data analysis by age, gender, chest pain, blood pressure, cholesterol.
Step-4: Analyze the data closely. The histograms below show heart disease symptom variability of 303 data points. The blood sugar true cases are high (250) and false cases are low. The maximum heart rates range between 143-150. There are 200 exercise induced angina cases.
领英推è
Figure 4: Data analysis of blood sugar, ECG, max heart rate and angina cases
Step-5: Analysis of data by ST depression due to exercise, slope of ST depression, fluoroscopy major vessel color change (0-30), Normal and abnormal defects and target. There are nearly 140 cases of ST depression by exercise, 3 types of fluoroscopy, dominated by type 0, and 160 cases of dominant defects.
Figure 5: Data analysis of ST depression, slope of ST, major vessel color (fluoroscopy), defect and target.
Step-6: Now we examine the gender counts and percentages of the candidates from the 303 cases. We use the =COUNTIF(B14:B316,"0") and =COUNTIF(B14:B316,"1"), to identify number of Females (96 cases) and Males (207 cases). We use the =SUM(Q14:Q15) to verify that the total cases add to 303.Next we calculate the percentages by dividing Female cases / Total cases and Male cases/ Total cases. There are 31.68% females and 68.32% males in this study. We use these calculated tables to generate a bar graph and a pie chart for visualization of gender count and percentages (See Figure 6)
Figure 6: Bar Graph and Pie chart of gender count and percentages.
Step-7: In this step we examined the average “at rest blood pressure†by gender (Female average and Male average. To do that we used the =AVERAGE(D14:D316), =AVERAGEIFS(D14:D316,B14:B316,"0") and =AVERAGEIFS(D14:D316,B14:B316,"1") to extract the 3 averages. The first is the Aveg trestbps (overall average blood pressure, 131.62), the second is Female average blood pressure (133.08) and the third is male average blood pressure 130.95. It appears that females suffer from higher blood pressure in this study ( Figure 7). Blood pressure analysis based on gender.
Figure 7: At rest blood pressure analysis by gender (male, female).
Step-8: In this step we study the average cholesterol variability by gender. We use AVERAGE and AVERAGEIFS functions to achieve the results. It appears that female cases show higher average (261.30) versus male cases (239.29). The overall average cholesterol in this dataset is 246.25.?
Figure 8: Average cholesterol by gender
This completes the part 1 of my Capstone Project. It shows a simple analysis. Next article will be on part 2....please stay tuned. Any suggestions are welcome.
Author: Satyaki Ray, Houston, Texas, 2021, raysatyaki305@gmail.com