Heart Attack Prediction in the U.S. Using Databricks and AWS

Episode 1: Setting Up the Foundation – Data, Tools, and Environment

Introduction

In collaboration with Mike Olivieri, we are analyzing heart attack data in the U.S. and developing a predictive model based on various health and lifestyle factors.

In this short series, we will walk you through the required tools, setup process, and full implementation. The complete code will be shared on GitHub and Kaggle for reference.

This is an easy-to-follow, step-by-step tutorial designed to guide you through the entire analysis process.


Contents

  • Dataset
  • Technology Stack
  • Setup
  • Areas of Analysis
  • Next Steps


Dataset

We selected the Heart Attack dataset from Kaggle, which consists of 373,000 rows of data collected across the U.S. This dataset includes information on various health and lifestyle factors that may contribute to heart attacks. Key attributes include age, cholesterol levels, blood pressure, smoking habits, and heart attack occurrence.

Our goal is to analyze this data to identify potential risk factors and trends that can enhance heart health awareness and prevention strategies.


Technology Stack

  • Databricks – A cloud-based analytics platform that provides scalable computing and storage resources. We will use Databricks on AWS for this project.
  • PySpark – We will use PySpark for big data processing, analysis, and model training.
  • Amazon S3 – The dataset will be stored in an Amazon S3 bucket, and Databricks will be configured to read from it.
  • GitHub Repository – A central repository to store and version control the Databricks notebooks, scripts, and configurations for collaboration and reproducibility.


Setup

Step 1: Create a Databricks account and link it to your AWS account.

Create a Databricks account – Sign up at Databricks.


Choose AWS as the cloud provider.


  1. Verify your email – You'll receive a validation email to start your trial.
  2. Set up a password for your Databricks account.
  3. Select a subscription plan.
  4. Create a workspace – Choose a name and region.

Databricks automatically runs a pre-populated AWS CloudFormation template to set up the necessary IAM roles and Amazon S3 bucket while deploying your workspace.

You can track the resource deployment process in AWS CloudFormation.
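If you prefer the terminal, the same deployment can be tracked with the AWS CLI. The stack name below is a placeholder; Databricks generates the actual name when it deploys your workspace:

```shell
# List CloudFormation stacks that are being created or finished creating
aws cloudformation list-stacks \
    --stack-status-filter CREATE_IN_PROGRESS CREATE_COMPLETE

# Inspect a specific stack (replace the placeholder with your stack's name)
aws cloudformation describe-stacks \
    --stack-name <your-databricks-workspace-stack>
```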


Step 2: Create a GitHub Repository

Step 3: Syncing Code Between GitHub and Databricks

Go to Repos under Workspace and click “Create Git Folder”.

Populate the Git repository URL (we are using a public repository).

You can select specific folders by checking the “Sparse checkout mode” option.


Repository files, including a notebook we created and pushed earlier, are loaded into Databricks.


To commit changes from your Databricks notebook to a GitHub repository, follow these steps:

  1. Open the Git integration dialog: Next to the notebook’s name, click the Git branch button.
  2. Select the changes to commit.
  3. Enter commit details.
  4. Commit and push the changes: Click the Commit & Push button.


If you encounter the error message "Error pushing changes", navigate to User Settings → Git Integration and authenticate using either your GitHub-linked account or a Personal Access Token (PAT). This enables Git integration, allowing you to push changes seamlessly.



Step 4: Share the notebook for real-time collaboration.

To share "Heart-Attack-Prediction-in-the-US", you must share the parent Git folder (Heart-Attack-Prediction-in-the-US). Permissions will apply to all contents of the Git folder.

Note:

The Databricks GitHub app must be installed on the GitHub account that owns the repository so that all collaborators can commit changes.


Areas of Analysis

When thinking about heart disease, one might view it in the context that Peter Attia describes in his 2023 book Outlive: The Science and Art of Longevity (not an affiliate link). Cardiovascular disease is one of the four major categories of illness that shorten not only lifespan but "healthspan", the span of years when a person is healthy enough to enjoy life. His perspective is that many diseases need to be prevented with lifestyle changes years before symptoms start to show.

This analysis should not be construed as medical advice (consult your medical doctor) but rather is an interesting topic to explore in data science that mirrors real life.

The dataset from Ankush Panday, "Heart Attack Prediction US", offers opportunities to see some of these leading indicators. With so many variables to choose from, our analysis starts with those of a hereditary or lifestyle-choice nature. This excludes from the input the many factors that are medical symptoms or point-in-time measurements, such as heart rate or blood pressure. One can't control heredity, but one can control the choices one makes. The factors we will explore are listed below.

Heredity

  • Family History
  • Ethnicity
  • Thalassemia

Lifestyle Choices

  • Smoker
  • Alcohol Consumption
  • Physical Activity
  • Diet
  • Marital Status
  • Residence

This analysis will identify which combinations of these factors result in a higher likelihood of heart attack. The hope is that awareness of these factors brings about changes that lead to longer life and "healthspan" for all.


Next Steps

Now that we’ve completed the setup, the next episode will focus on data analysis using PySpark ML, where we will explore patterns and build a predictive model.


Dataset on Kaggle

https://www.kaggle.com/datasets/ankushpanday2/heart-attack-prediction-in-united-states/data


Authors:

Noor Sabahi and Mike Olivieri

#HeartAttackPrediction #DataScience #MachineLearning #Databricks #AWS #PySpark #BigData #AI
