Heart Attack Prediction in the U.S. Using Databricks and AWS

Episode 2: Exploratory Data Analysis

Introduction

In this episode, Mike and I focus on loading and preparing our dataset for analysis in Databricks. We explore different methods for importing data from Kaggle into Databricks. After setting up secure access to Amazon S3, we conduct an exploratory data analysis (EDA) to assess data quality, identify trends, and evaluate key factors contributing to heart attack risk.

Let’s get started!

Contents:

  • Load Dataset from Kaggle into Databricks
  • Exploratory Data Analysis
  • Conclusion
  • Next!

Load Dataset from Kaggle into Databricks

There are many ways to get data into Databricks. We evaluated three, comparing ease of setup, repeatability in a pipeline, and access from Python.

Secrets

A note about sensitive values in code: for items where we needed keys or tokens to access APIs, we used Databricks' Secrets feature. Using a notebook just for this one-time Secrets setup, we loaded secrets for Kaggle, AWS, and even the Databricks API. We did this because these values should not be stored in code or in GitHub; instead, they belong in encrypted vaults and are retrieved at runtime, never logged to disk and never otherwise visible to a person.
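For illustration, a minimal sketch of retrieving such values at runtime in a notebook; the scope and key names are hypothetical, so substitute the ones created during your own setup:

import os

# dbutils.secrets.get fetches from the encrypted scope; values are redacted
# in notebook output, so they never appear in logs or saved cells.
os.environ["KAGGLE_USERNAME"] = dbutils.secrets.get(scope="heart-poc", key="kaggle-username")
os.environ["KAGGLE_KEY"] = dbutils.secrets.get(scope="heart-poc", key="kaggle-key")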


Loading from Kaggle

Kaggle offers a CLI and an API, so it is attractive to pull data from Kaggle directly into our Databricks environment. That gives us a dataset in local storage without having to browser-download the file to our workstations and then upload it to Databricks. A simple Python snippet logs in to Kaggle and downloads our dataset's files, and from there we can work with them for as long as our compute session stays alive.
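A sketch of such a snippet using the official kaggle package, assuming it is installed (%pip install kaggle) and that KAGGLE_USERNAME / KAGGLE_KEY are already set from Secrets; the dataset slug is a placeholder for the real one on the dataset's Kaggle page:

from kaggle.api.kaggle_api_extended import KaggleApi

api = KaggleApi()
api.authenticate()  # picks up credentials from the environment

# Download and unzip the dataset files to ephemeral local storage.
api.dataset_download_files("owner/heart-attack-dataset", path="/tmp/kaggle", unzip=True)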


Uploading to Databricks Catalog

A Databricks Catalog offers a way to store the data persistently. We can download the file from Kaggle, create a catalog, and do a simple upload to make the file accessible, letting us create our dataframe from a reference string.


Pipeline using S3 Integration

Loading from AWS S3 is another convenient way to access external content, and it applies to more than just Kaggle datasets. Files can be staged there from anywhere, and an S3 bucket is a common integration point. With our POC environment being a trial version of Databricks, outside our normal Databricks environment, we didn't have the full UI features, but we were able to write code to set up an External Location. In a pipeline, this means programmatically creating the Storage Credential (wrapping our AWS credentials for reaching the S3 bucket), the External Location, a Databricks Catalog, and a schema to reference our data, as sketched under the headings below. It was straightforward to chain together a script that downloads the file from Kaggle, uploads it to S3, and creates the catalog, which lets us work with this data many times without re-downloading.

Load to S3
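A sketch of this staging step with boto3; the bucket, key, local path, and secret names are placeholders:

import boto3

# Credentials come from Databricks Secrets, never hard-coded.
s3 = boto3.client(
    "s3",
    aws_access_key_id=dbutils.secrets.get(scope="heart-poc", key="aws-access-key-id"),
    aws_secret_access_key=dbutils.secrets.get(scope="heart-poc", key="aws-secret-access-key"),
)

# Upload the file downloaded from Kaggle to the integration bucket.
s3.upload_file(
    Filename="/tmp/kaggle/heart_attack_dataset.csv",
    Bucket="my-heart-poc-bucket",
    Key="raw/heart_attack_dataset.csv",
)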


Set up Databricks to AWS Integration - Service Credential


Set up Databricks to AWS Integration - Storage Credential and External Location
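A minimal sketch of scripting this step with the Databricks SDK for Python (databricks-sdk); the credential name, external location name, role ARN, and bucket path are all placeholders:

from databricks.sdk import WorkspaceClient
from databricks.sdk.service import catalog

w = WorkspaceClient()

# Storage Credential: wraps the IAM role Databricks assumes to reach S3.
cred = w.storage_credentials.create(
    name="heart_poc_s3_cred",
    aws_iam_role=catalog.AwsIamRoleRequest(
        role_arn="arn:aws:iam::<account-id>:role/heart-poc-s3-role"
    ),
)

# External Location: binds the bucket path to that credential.
loc = w.external_locations.create(
    name="heart_poc_s3_loc",
    url="s3://my-heart-poc-bucket/raw",
    credential_name=cred.name,
)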

Create Tables from Files in S3 Buckets
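With the external location in place, the catalog objects can be created and the table registered from the staged file. A sketch, with catalog, schema, and path names as placeholders:

# Create the catalog and schema, then register the CSV as a managed table.
spark.sql("CREATE CATALOG IF NOT EXISTS heart_poc")
spark.sql("CREATE SCHEMA IF NOT EXISTS heart_poc.default")
spark.sql("""
    CREATE TABLE IF NOT EXISTS heart_poc.default.heart_attack_dataset AS
    SELECT * FROM read_files(
        's3://my-heart-poc-bucket/raw/heart_attack_dataset.csv',
        format => 'csv',
        header => true
    )
""")

# So two-level names like default.heart_attack_dataset resolve in this catalog.
spark.sql("USE CATALOG heart_poc")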

The dataset is now available with this code.

# Load the dataset
df = spark.table("default.heart_attack_dataset")        

Option Pros and Cons

The Kaggle method was very easy to implement with Kaggle's docs, is all in Python, and is definitely repeatable. It isn't the most efficient, though, since network bandwidth is spent fetching the file every time.

The Catalog method with a manual upload is, by definition, not a pipeline. It was easy to do and a quick way to get started, but the manual step is click-ops rather than Python code.

The S3 method is a pipeline that brings the best of the other worlds together. It was not the easiest to set up, but it is doable, it follows a true pipeline model, and it can be achieved entirely within a Databricks Python notebook.

Because we wanted a process that could be stored securely in GitHub and revived whenever we want to work on our project again, we chose the S3 method.

Exploratory Data Analysis (EDA)

Before diving into the analysis, we first examined the dataset statistics, including:

  • Distribution of values per variable
  • Missing data (if any)
  • Data types and ranges

Data Sample

A sample of the dataset shows various demographic and lifestyle attributes along with heart attack outcomes.


Data Distribution

The distributions below are calculated for the numerical values (a profiling sketch follows the list). We observe three types of data:

  • Continuous values (e.g., Age)
  • Binary values (e.g., Smoker)
  • Binned values (e.g., Physical Activity)
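A minimal sketch of producing these summaries with built-in helpers:

# Numeric summary stats: count, mean, stddev, min, quartiles, max per column.
display(df.summary())

# Databricks' interactive per-column profile, including missing-value counts.
dbutils.data.summarize(df)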



Heart attack rate per hereditary and lifestyle factor

One of our key questions is whether hereditary and lifestyle factors contribute equally to heart attack risk. Below, we analyze heart attack rates based on these factors.
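A sketch of how such a rate can be computed for one factor; the column and value names are assumptions about the schema:

from pyspark.sql import functions as F

# Heart attack rate per family-history group: mean of a 0/1 outcome indicator.
rate_by_family_history = (
    df.groupBy("Family_History")
      .agg(F.avg(F.when(F.col("Heart_Attack_Outcome") == "Yes", 1).otherwise(0))
            .alias("heart_attack_rate"))
)
display(rate_by_family_history)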

Heart Attack Rate by Family History


Heart Attack Rate by Physical Activity Level


Key Takeaways

The heart attack rate across hereditary and lifestyle factors appears nearly equal, which seems counterintuitive. This could be due to the omission of key variables in this visualization. Analyzing specific subgroups—such as individuals with the same ethnicity, family history, or pre-existing conditions—might reveal more meaningful differences.

To validate the point above, we examined heart attack rates based on alcohol consumption among white individuals with a history of cancer living in a specific residential area. Surprisingly, as shown in the bar chart below, the heart attack rates remain almost equal across all levels of alcohol consumption. Even more puzzling, the highest level of alcohol consumption is associated with slightly lower heart attack rates than the other levels. This contradiction suggests potential data quality issues or missing confounding factors, raising concerns about the dataset's reliability.
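A sketch of that subgroup slice; the column names and category values are assumptions about the schema:

from pyspark.sql import functions as F

# Restrict to the subgroup described above.
subgroup = df.filter(
    (F.col("Ethnicity") == "White")
    & (F.col("Cancer_History") == "Yes")
    & (F.col("Residential_Area") == "Urban")
)

# Heart attack rate per alcohol-consumption level within the subgroup.
alcohol_rates = (
    subgroup.groupBy("Alcohol_Consumption")
            .agg(F.avg(F.when(F.col("Heart_Attack_Outcome") == "Yes", 1).otherwise(0))
                  .alias("heart_attack_rate"))
)
display(alcohol_rates)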


Employment Status vs. Heart Attack Risk

From the data, let's analyze the percentage of heart attacks within each employment category:

Employment Status    No Heart Attack (%)                Heart Attack (%)
Unemployed           (62532 / 124706) * 100 ≈ 50.1%     (62174 / 124706) * 100 ≈ 49.9%
Retired              (62206 / 124373) * 100 ≈ 50.0%     (62167 / 124373) * 100 ≈ 50.0%
Employed             (61920 / 123895) * 100 ≈ 50.0%     (61975 / 123895) * 100 ≈ 50.0%
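These counts can be reproduced directly from the dataframe. A minimal sketch, assuming the column names Employment_Status and Heart_Attack_Outcome (both assumptions about the schema):

# Count outcomes per employment category, one column per outcome value.
counts = (
    df.groupBy("Employment_Status")
      .pivot("Heart_Attack_Outcome", ["No", "Yes"])
      .count()
)
display(counts)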


Key Observations

  1. No Strong Correlation: The percentage of heart attacks is nearly 50% across all employment statuses. This suggests that employment status alone does not significantly impact heart attack occurrence.
  2. Balanced Distribution: Each category has nearly an equal split between heart attack and no heart attack cases.
  3. Possible External Factors: Since employment status doesn't show a clear trend, other factors (e.g., age, lifestyle, pre-existing conditions) likely play a bigger role in determining heart attack risk.


Key Takeaways

Looking at the numbers, I expected employment to contribute to heart attacks, maybe from work stress or even having too much fun at the office. But surprise! The data says otherwise!

Employment status does not appear to be a strong predictor of heart attack occurrence in this dataset. Further analysis incorporating additional health and lifestyle variables is needed for deeper insights.


Correlation Between Health/Life Factors and Heart Attack Outcome

Beyond simple comparisons, we conducted a correlation analysis to measure how strongly each factor is associated with heart attack occurrence.
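A sketch of one simple way to run such a check: Pearson correlation between each numeric factor and the binarized outcome. Column names are assumptions about the schema:

from pyspark.sql import functions as F

# Encode the binary outcome as 0.0/1.0 so it can be correlated numerically.
df_num = df.withColumn(
    "outcome_num",
    F.when(F.col("Heart_Attack_Outcome") == "Yes", 1.0).otherwise(0.0),
)

# Correlate each numeric factor with the outcome (illustrative subset).
for col_name in ["Age", "BMI", "Cholesterol_Level"]:
    print(col_name, df_num.stat.corr(col_name, "outcome_num"))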


The correlation analysis between various input factors and heart attack outcomes presents key insights:

Key Observations

  1. Positive Correlations (Higher Risk Factors): hereditary conditions, smoking, alcohol consumption, and pre-existing conditions such as diabetes and high blood pressure.
  2. Negative Correlations (Potential Protective Factors): stress management, medication adherence, and socioeconomic stability.

Key Takeaways

  • The results align with established medical research: hereditary conditions, unhealthy lifestyle choices (smoking, alcohol consumption), and pre-existing conditions (diabetes, high blood pressure) increase heart attack risk.
  • Factors such as stress management, medication adherence, and socioeconomic stability may have protective effects.
  • Further analysis, including causation studies and deeper statistical modeling, would be beneficial to confirm and refine these findings.

Conclusion

We created a storage credential for connecting to AWS S3, enabling seamless data access for analysis. However, upon examining the dataset, we found significant data quality issues, suggesting it may be artificially generated rather than derived from real-world cases. This became evident in our analysis, where key risk factors—such as hereditary conditions, lifestyle choices, and employment status—showed nearly identical effects on heart attack occurrence, regardless of their severity. Such inconsistencies undermine the dataset’s reliability for meaningful medical insights. Therefore, we do not recommend using this dataset for realistic heart attack analysis, as it may lead to misleading conclusions.

Next!

The challenge with working with such datasets when developing models is that feature engineering becomes impractical due to the similar correlation across all features. Additionally, achieving high accuracy is extremely difficult, as the best-case scenario often results in predictions no better than random guessing (50/50), making evaluation challenging. For this reason, we have decided to find a more realistic dataset for our analysis.

Authors:

Noor Sabahi and Mike Olivieri

#HeartAttackPrediction #DataScience #MachineLearning #Databricks #AWS #PySpark #BigData #AI
