Heart Attack Prediction in the U.S. Using Databricks and AWS
Episode 1: Setting Up the Foundation – Data, Tools, and Environment
Introduction
In collaboration with Mike Olivieri, we are analyzing heart attack data in the U.S. and developing a predictive model based on various health and lifestyle factors.
In this short series, we will walk you through the required tools, setup process, and full implementation. The complete code will be shared on GitHub and Kaggle for reference.
This is an easy-to-follow, step-by-step tutorial designed to guide you through the entire analysis process.
Content
Dataset
We selected the Heart Attack dataset from Kaggle, which consists of 373,000 rows of data collected across the U.S. This dataset includes information on various health and lifestyle factors that may contribute to heart attacks. Key attributes include age, cholesterol levels, blood pressure, smoking habits, and heart attack occurrence.
Our goal is to analyze this data to identify potential risk factors and trends that can enhance heart health awareness and prevention strategies.
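Before any modeling, a quick sanity check on the downloaded file can confirm the row count and column names. The sketch below uses only Python's standard library. The column names in the sample are placeholders chosen for illustration, not the dataset's confirmed schema.

```python
import csv
import io


def summarize_csv(handle):
    """Return (column_names, row_count) for an open CSV file-like object."""
    reader = csv.reader(handle)
    header = next(reader, [])           # first line holds the column names
    row_count = sum(1 for _ in reader)  # every remaining line is a data row
    return header, row_count


# Tiny in-memory stand-in so the sketch runs without the real download.
# These column names and values are hypothetical placeholders.
sample = io.StringIO(
    "Age,Smoker,HeartAttack\n"
    "54,1,0\n"
    "61,0,1\n"
    "47,1,0\n"
)
columns, n_rows = summarize_csv(sample)
print(columns, n_rows)
```

With the real file, the same function can be called on a handle opened with open(path, newline=""), where path is wherever you saved the Kaggle CSV. The row count should then be close to the 373,000 rows mentioned above.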
Technology Stack
Setup
Step 1: Create a Databricks account and link it to your AWS account.
Create a Databricks account – Sign up at Databricks.
Choose AWS as the cloud provider.
Databricks automatically runs a pre-populated AWS CloudFormation template to set up the necessary IAM roles and Amazon S3 bucket while deploying your workspace.
You can track the resource deployment process in AWS CloudFormation.
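Besides watching the AWS console, the deployment status can also be checked from a script. The following is a minimal sketch, assuming boto3 is installed and AWS credentials are configured. The stack name "databricks-workspace-stack" is a hypothetical placeholder; use the name shown in your own CloudFormation console.

```python
def get_stack_status(cf_client, stack_name):
    """Return the current status string of one CloudFormation stack."""
    response = cf_client.describe_stacks(StackName=stack_name)
    return response["Stacks"][0]["StackStatus"]


def is_deployed(status):
    """True once the stack has finished creating successfully."""
    return status == "CREATE_COMPLETE"


def main():
    # Imported here so the helpers above stay usable without boto3 installed.
    import boto3

    client = boto3.client("cloudformation")
    # Hypothetical stack name; replace it with the one in your AWS account.
    status = get_stack_status(client, "databricks-workspace-stack")
    print(status, "deployed" if is_deployed(status) else "not finished yet")
```

Calling main() from a machine with AWS credentials prints the stack's current status, for example CREATE_IN_PROGRESS while resources are still being created.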
Step 2: Create a GitHub Repository
Step 3: Syncing Code Between GitHub and Databricks
Go to Repos under Workspace and click “Create Git Folder”.
Enter the Git repository URL (public repository).
To sync only specific folders, check the “Sparse checkout mode” option.
The repository files, including a notebook we created and pushed to the repository, are then loaded into Databricks.
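The same Git folder can, in principle, be created programmatically instead of through the UI. The sketch below builds a request for the Databricks Repos REST endpoint (POST /api/2.0/repos) with an optional sparse checkout pattern list. This is not the approach used in this tutorial; the host, token, owner, user, and folder pattern are all placeholders, and the field names should be checked against the current Databricks API documentation before use.

```python
import json
import urllib.request


def build_repo_payload(git_url, workspace_path, sparse_patterns=None):
    """Build the JSON body for creating a Git folder in a workspace."""
    payload = {"url": git_url, "provider": "gitHub", "path": workspace_path}
    if sparse_patterns:
        # Only the listed folders are checked out into the workspace.
        payload["sparse_checkout"] = {"patterns": list(sparse_patterns)}
    return payload


def create_git_folder(host, token, payload):
    """Send the request; host and token come from your own workspace."""
    request = urllib.request.Request(
        url=f"{host}/api/2.0/repos",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Placeholder values for illustration; nothing is sent to a server here.
example_payload = build_repo_payload(
    "https://github.com/<owner>/Heart-Attack-Prediction-in-the-US.git",
    "/Repos/<user>/Heart-Attack-Prediction-in-the-US",
    sparse_patterns=["notebooks"],  # hypothetical folder name
)
print(json.dumps(example_payload, indent=2))
```

Only build_repo_payload runs in this example. create_git_folder is shown for completeness and needs a real workspace URL and access token.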
To commit changes from your Databricks notebook to a GitHub repository, follow these steps:
If you encounter the error message "Error pushing changes", navigate to User Settings → Git Integration and authenticate using either your GitHub-linked account or a Personal Access Token (PAT). This enables Git integration, allowing you to push changes seamlessly.
Step 4: Share the notebook for real-time collaboration.
To share "Heart-Attack-Prediction-in-the-US", you must share the parent Git folder (Heart-Attack-Prediction-in-the-US). Permissions will apply to all contents of the Git folder.
Note:
The Databricks application must be installed on the GitHub account that owns the repository. This enables proper sharing and lets all collaborators commit changes.
Areas of Analysis
When thinking about heart disease, one might view it in the context that Peter Attia describes in his 2023 book Outlive: The Science and Art of Longevity. (We are not affiliate marketers.) Cardiovascular disease is one of the four major categories of illness that shorten not only lifespan but also "healthspan", the span of years when a person is healthy enough to enjoy life. His perspective is that many diseases need to be prevented through lifestyle changes years before symptoms start to show.
This analysis should not be construed as medical advice (consult your doctor); rather, it is an interesting data science topic that mirrors real life.
The dataset from Ankush Panday, "Heart Attack Prediction US", offers opportunities to see some of these leading indicators. With so many variables to choose from, our analysis starts with those that are hereditary or reflect lifestyle choices. This excludes from the inputs the many factors that are medical symptoms or point-in-time measures, such as heart rate or blood pressure. One can’t control heredity, but one can control the choices one makes. The factors we will explore include the following.
Heredity
Lifestyle Choices
This analysis will identify which values of these factors, in combination with others, are associated with a higher likelihood of heart attack. The hope is that awareness of these combinations brings about change, resulting in a longer life and "healthspan" for all.
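To make the idea of factor combinations concrete, here is a minimal sketch that groups records by the values of several factors and computes the share of heart attacks in each group. It uses plain Python as a stand-in for the PySpark analysis planned for the next episode. The column names and the rows are made up for illustration and are not drawn from the Kaggle dataset.

```python
from collections import defaultdict


def attack_rate_by_combination(rows, factors, outcome):
    """Map each combination of factor values to (attack_rate, group_size)."""
    totals = defaultdict(int)
    attacks = defaultdict(int)
    for row in rows:
        key = tuple(row[f] for f in factors)  # one key per combination
        totals[key] += 1
        attacks[key] += int(row[outcome])
    return {key: (attacks[key] / totals[key], totals[key]) for key in totals}


# Made-up rows with hypothetical column names, for illustration only.
rows = [
    {"FamilyHistory": 1, "Smoker": 1, "HeartAttack": 1},
    {"FamilyHistory": 1, "Smoker": 1, "HeartAttack": 0},
    {"FamilyHistory": 0, "Smoker": 1, "HeartAttack": 0},
    {"FamilyHistory": 0, "Smoker": 0, "HeartAttack": 0},
]
rates = attack_rate_by_combination(rows, ["FamilyHistory", "Smoker"], "HeartAttack")
for combo, (rate, count) in sorted(rates.items()):
    print(combo, f"rate={rate:.2f}", f"n={count}")
```

Returning the group size alongside the rate matters in practice, because a high rate in a very small group is weak evidence on its own.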
Next Steps
Now that we’ve completed the setup, the next episode will focus on data analysis using PySpark ML, where we will explore patterns and build a predictive model.
Dataset on Kaggle
Authors:
Noor Sabahi and Mike Olivieri
#HeartAttackPrediction #DataScience #MachineLearning #Databricks #AWS #PySpark #BigData #AI