Heart Attack Prediction in the U.S. Using Databricks and AWS

Episode 2: Exploratory Data Analysis

Introduction

In this episode, Mike and I focus on loading and preparing our dataset for analysis in Databricks. We explore different methods for importing data from Kaggle into Databricks. After setting up secure access to Amazon S3, we conduct an exploratory data analysis (EDA) to assess data quality, identify trends, and evaluate key factors contributing to heart attack risk.

Let’s get started!

Contents:

  • Load Dataset from Kaggle into Databricks
  • Exploratory Data Analysis
  • Conclusion
  • Next!

Load Dataset from Kaggle into Databricks

There are many ways to get data into Databricks. We evaluated three, comparing ease of setup, repeatability in a pipeline, and access from Python.

Secrets

A note about sensitive values in code: for items where we needed keys or tokens to access APIs, we used Databricks' Secrets feature. Using a notebook just for this one-time Secrets setup, we loaded secrets for Kaggle, AWS, and even the Databricks API. We did this because these values should not be stored in code or in GitHub; instead, they belong in encrypted vaults and are retrieved at runtime, never logged to disk and never otherwise visible to a person.
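For illustration, a minimal sketch of retrieving such values at runtime in a notebook; the scope and key names are hypothetical, so substitute the ones created during your own setup:

import os

# dbutils.secrets.get fetches from the encrypted scope; values are redacted
# in notebook output, so they never appear in logs or saved cells.
os.environ["KAGGLE_USERNAME"] = dbutils.secrets.get(scope="heart-poc", key="kaggle-username")
os.environ["KAGGLE_KEY"] = dbutils.secrets.get(scope="heart-poc", key="kaggle-key")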


Loading from Kaggle

Kaggle offers a CLI and an API, so it is attractive to pull data from Kaggle directly into our Databricks environment. That gives us a dataset in local storage without having to browser-download the file to our workstations and then upload it to Databricks. A simple Python snippet logs in to Kaggle and downloads our dataset's files, and from there we can work with them for as long as our compute session stays alive.
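A sketch of such a snippet using the official kaggle package, assuming it is installed (%pip install kaggle) and that KAGGLE_USERNAME / KAGGLE_KEY are already set from Secrets; the dataset slug is a placeholder for the real one on the dataset's Kaggle page:

from kaggle.api.kaggle_api_extended import KaggleApi

api = KaggleApi()
api.authenticate()  # picks up credentials from the environment

# Download and unzip the dataset files to ephemeral local storage.
api.dataset_download_files("owner/heart-attack-dataset", path="/tmp/kaggle", unzip=True)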


Uploading to Databricks Catalog

A Databricks Catalog offers a way to store the data persistently. We can download the file from Kaggle, create a catalog, and do a simple upload to make the file accessible, letting us create our dataframe from a reference string.


Pipeline using S3 Integration

Loading from AWS S3 is another convenient way to access external content, and it applies to more than just Kaggle datasets. Files can be staged there from anywhere, and an S3 bucket is a common integration point. With our POC environment being a trial version of Databricks, outside our normal Databricks environment, we didn't have the full UI features, but we were able to write code to set up an External Location. In a pipeline, this means programmatically creating the Storage Credential (wrapping our AWS credentials for reaching the S3 bucket), the External Location, a Databricks Catalog, and a schema to reference our data, as sketched under the headings below. It was straightforward to chain together a script that downloads the file from Kaggle, uploads it to S3, and creates the catalog, which lets us work with this data many times without re-downloading.

Load to S3
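A sketch of this staging step with boto3; the bucket, key, local path, and secret names are placeholders:

import boto3

# Credentials come from Databricks Secrets, never hard-coded.
s3 = boto3.client(
    "s3",
    aws_access_key_id=dbutils.secrets.get(scope="heart-poc", key="aws-access-key-id"),
    aws_secret_access_key=dbutils.secrets.get(scope="heart-poc", key="aws-secret-access-key"),
)

# Upload the file downloaded from Kaggle to the integration bucket.
s3.upload_file(
    Filename="/tmp/kaggle/heart_attack_dataset.csv",
    Bucket="my-heart-poc-bucket",
    Key="raw/heart_attack_dataset.csv",
)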


Set up Databricks to AWS Integration - Service Credential


Set up Databricks to AWS Integration - Storage Credential and External Location
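A minimal sketch of scripting this step with the Databricks SDK for Python (databricks-sdk); the credential name, external location name, role ARN, and bucket path are all placeholders:

from databricks.sdk import WorkspaceClient
from databricks.sdk.service import catalog

w = WorkspaceClient()

# Storage Credential: wraps the IAM role Databricks assumes to reach S3.
cred = w.storage_credentials.create(
    name="heart_poc_s3_cred",
    aws_iam_role=catalog.AwsIamRoleRequest(
        role_arn="arn:aws:iam::<account-id>:role/heart-poc-s3-role"
    ),
)

# External Location: binds the bucket path to that credential.
loc = w.external_locations.create(
    name="heart_poc_s3_loc",
    url="s3://my-heart-poc-bucket/raw",
    credential_name=cred.name,
)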

Create Tables from Files in S3 Buckets
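With the external location in place, the catalog objects can be created and the table registered from the staged file. A sketch, with catalog, schema, and path names as placeholders:

# Create the catalog and schema, then register the CSV as a managed table.
spark.sql("CREATE CATALOG IF NOT EXISTS heart_poc")
spark.sql("CREATE SCHEMA IF NOT EXISTS heart_poc.default")
spark.sql("""
    CREATE TABLE IF NOT EXISTS heart_poc.default.heart_attack_dataset AS
    SELECT * FROM read_files(
        's3://my-heart-poc-bucket/raw/heart_attack_dataset.csv',
        format => 'csv',
        header => true
    )
""")

# So two-level names like default.heart_attack_dataset resolve in this catalog.
spark.sql("USE CATALOG heart_poc")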

The dataset is now available with this code.

# Load the dataset
df = spark.table("default.heart_attack_dataset")        

Option Pros and Cons

The Kaggle method was very easy to implement with Kaggle's docs, is all in Python, and is definitely repeatable. It isn't the most efficient, though, since network bandwidth is spent fetching the file every time.

The Catalog method with a manual upload is, by definition, not a pipeline. It was easy to do and a quick way to get started, but the manual step is click-ops rather than Python code.

The S3 method is a pipeline that brings the best of the other worlds together. It was not the easiest to set up, but it is doable, it follows a true pipeline model, and it can be achieved entirely within a Databricks Python notebook.

Because we wanted a process that could be stored securely in GitHub and revived whenever we want to work on our project again, we chose the S3 method.

Exploratory Data Analysis (EDA)

Before diving into the analysis, we first examined the dataset statistics, including:

  • Distribution of values per variable
  • Missing data (if any)
  • Data types and ranges

Data Sample

A sample of the dataset shows various demographic and lifestyle attributes along with heart attack outcomes.


Data Distribution

The distributions below are calculated for the numerical values (a profiling sketch follows the list). We observe three types of data:

  • Continuous values (e.g., Age)
  • Binary values (e.g., Smoker)
  • Binned values (e.g., Physical Activity)
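A minimal sketch of producing these summaries with built-in helpers:

# Numeric summary stats: count, mean, stddev, min, quartiles, max per column.
display(df.summary())

# Databricks' interactive per-column profile, including missing-value counts.
dbutils.data.summarize(df)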



Heart attack rate per hereditary and lifestyle factor

One of our key questions is whether hereditary and lifestyle factors contribute equally to heart attack risk. Below, we analyze heart attack rates based on these factors.
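A sketch of how such a rate can be computed for one factor; the column and value names are assumptions about the schema:

from pyspark.sql import functions as F

# Heart attack rate per family-history group: mean of a 0/1 outcome indicator.
rate_by_family_history = (
    df.groupBy("Family_History")
      .agg(F.avg(F.when(F.col("Heart_Attack_Outcome") == "Yes", 1).otherwise(0))
            .alias("heart_attack_rate"))
)
display(rate_by_family_history)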

Heart Attack Rate by Family History


Heart Attack Rate by Physical Activity Level


Key Takeaways

The heart attack rate across hereditary and lifestyle factors appears nearly equal, which seems counterintuitive. This could be due to the omission of key variables in this visualization. Analyzing specific subgroups—such as individuals with the same ethnicity, family history, or pre-existing conditions—might reveal more meaningful differences.

To validate the point above, we examined heart attack rates based on alcohol consumption among white individuals with a history of cancer living in a specific residential area. Surprisingly, as shown in the bar chart below, the heart attack rates remain almost equal across all levels of alcohol consumption. Even more puzzling, the highest level of alcohol consumption is associated with slightly lower heart attack rates than the other levels. This contradiction suggests potential data quality issues or missing confounding factors, raising concerns about the dataset's reliability.
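A sketch of that subgroup slice; the column names and category values are assumptions about the schema:

from pyspark.sql import functions as F

# Restrict to the subgroup described above.
subgroup = df.filter(
    (F.col("Ethnicity") == "White")
    & (F.col("Cancer_History") == "Yes")
    & (F.col("Residential_Area") == "Urban")
)

# Heart attack rate per alcohol-consumption level within the subgroup.
alcohol_rates = (
    subgroup.groupBy("Alcohol_Consumption")
            .agg(F.avg(F.when(F.col("Heart_Attack_Outcome") == "Yes", 1).otherwise(0))
                  .alias("heart_attack_rate"))
)
display(alcohol_rates)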


Employment Status vs. Heart Attack Risk

From the data, let's analyze the percentage of heart attacks within each employment category:

Employment Status    No Heart Attack (%)                Heart Attack (%)
Unemployed           (62532 / 124706) * 100 ≈ 50.1%     (62174 / 124706) * 100 ≈ 49.9%
Retired              (62206 / 124373) * 100 ≈ 50.0%     (62167 / 124373) * 100 ≈ 50.0%
Employed             (61920 / 123895) * 100 ≈ 50.0%     (61975 / 123895) * 100 ≈ 50.0%
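These counts can be reproduced directly from the dataframe. A minimal sketch, assuming the column names Employment_Status and Heart_Attack_Outcome (both assumptions about the schema):

# Count outcomes per employment category, one column per outcome value.
counts = (
    df.groupBy("Employment_Status")
      .pivot("Heart_Attack_Outcome", ["No", "Yes"])
      .count()
)
display(counts)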


Key Observations

  1. No Strong Correlation: The percentage of heart attacks is nearly 50% across all employment statuses. This suggests that employment status alone does not significantly impact heart attack occurrence.
  2. Balanced Distribution: Each category has nearly an equal split between heart attack and no heart attack cases.
  3. Possible External Factors: Since employment status doesn't show a clear trend, other factors (e.g., age, lifestyle, pre-existing conditions) likely play a bigger role in determining heart attack risk.


Key Takeaways

Looking at the numbers, I expected employment to contribute to heart attacks, maybe from work stress or even having too much fun at the office. But surprise! The data says otherwise!

Employment status does not appear to be a strong predictor of heart attack occurrence in this dataset. Further analysis incorporating additional health and lifestyle variables is needed for deeper insights.


Correlation Between Health/Life Factors and Heart Attack Outcome

Beyond simple comparisons, we conducted a correlation analysis to measure how strongly each factor is associated with heart attack occurrence.
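A sketch of one simple way to run such a check: Pearson correlation between each numeric factor and the binarized outcome. Column names are assumptions about the schema:

from pyspark.sql import functions as F

# Encode the binary outcome as 0.0/1.0 so it can be correlated numerically.
df_num = df.withColumn(
    "outcome_num",
    F.when(F.col("Heart_Attack_Outcome") == "Yes", 1.0).otherwise(0.0),
)

# Correlate each numeric factor with the outcome (illustrative subset).
for col_name in ["Age", "BMI", "Cholesterol_Level"]:
    print(col_name, df_num.stat.corr(col_name, "outcome_num"))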


The correlation analysis between various input factors and heart attack outcomes presents key insights:

Key Observations

  1. Positive Correlations (Higher Risk Factors): hereditary conditions, smoking, alcohol consumption, and pre-existing conditions such as diabetes and high blood pressure.
  2. Negative Correlations (Potential Protective Factors): stress management, medication adherence, and socioeconomic stability.

Key Takeaways

  • The results align with established medical research: hereditary conditions, unhealthy lifestyle choices (smoking, alcohol consumption), and pre-existing conditions (diabetes, high blood pressure) increase heart attack risk.
  • Factors such as stress management, medication adherence, and socioeconomic stability may have protective effects.
  • Further analysis, including causation studies and deeper statistical modeling, would be beneficial to confirm and refine these findings.

Conclusion

We created a storage credential for connecting to AWS S3, enabling seamless data access for analysis. However, upon examining the dataset, we found significant data quality issues, suggesting it may be artificially generated rather than derived from real-world cases. This became evident in our analysis, where key risk factors—such as hereditary conditions, lifestyle choices, and employment status—showed nearly identical effects on heart attack occurrence, regardless of their severity. Such inconsistencies undermine the dataset’s reliability for meaningful medical insights. Therefore, we do not recommend using this dataset for realistic heart attack analysis, as it may lead to misleading conclusions.

Next!

The challenge with working with such datasets when developing models is that feature engineering becomes impractical due to the similar correlation across all features. Additionally, achieving high accuracy is extremely difficult, as the best-case scenario often results in predictions no better than random guessing (50/50), making evaluation challenging. For this reason, we have decided to find a more realistic dataset for our analysis.

Authors:

Noor Sabahi and Mike Olivieri

#HeartAttackPrediction #DataScience #MachineLearning #Databricks #AWS #PySpark #BigData #AI
