Heart Attack Prediction in the U.S. Using Databricks and AWS

Episode 1: Setting Up the Foundation – Data, Tools, and Environment

Introduction

In collaboration with Mike Olivieri, we are analyzing heart attack data in the U.S. and developing a predictive model based on various health and lifestyle factors.

In this short series, we will walk you through the required tools, setup process, and full implementation. The complete code will be shared on GitHub and Kaggle for reference.

This is an easy-to-follow, step-by-step tutorial designed to guide you through the entire analysis process.


Contents

  • Dataset
  • Technology Stack
  • Setup
  • Areas of Analysis
  • Next Steps


Dataset

We selected the Heart Attack dataset from Kaggle, which consists of 373,000 rows of data collected across the U.S. This dataset includes information on various health and lifestyle factors that may contribute to heart attacks. Key attributes include age, cholesterol levels, blood pressure, smoking habits, and heart attack occurrence.

Our goal is to analyze this data to identify potential risk factors and trends that can enhance heart health awareness and prevention strategies.


Technology Stack

  • Databricks – A cloud-based analytics platform that provides scalable computing and storage resources. We will use Databricks on AWS for this project.
  • PySpark – We will use PySpark for big data processing, analysis, and model training.
  • Amazon S3 – The dataset will be stored in an Amazon S3 bucket, and Databricks will be configured to read from it.
  • GitHub Repository – A central repository to store and version control the Databricks notebooks, scripts, and configurations for collaboration and reproducibility.


Setup

Step 1: Create a Databricks account and link it to your AWS account.

Create a Databricks account – Sign up at Databricks.


Choose AWS as the cloud provider.


  1. Verify your email – You'll receive a validation email to start your trial.
  2. Set up a password for your Databricks account.
  3. Select a subscription plan.
  4. Create a workspace – Choose a name and region.

Databricks automatically runs a pre-populated AWS CloudFormation template to set up the necessary IAM roles and Amazon S3 bucket while deploying your workspace.

You can track the resource deployment process in AWS CloudFormation.
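If you prefer the terminal, the same deployment can be tracked with the AWS CLI. The stack name below is a placeholder; Databricks generates the actual name when it deploys your workspace:

```shell
# List CloudFormation stacks that are being created or finished creating
aws cloudformation list-stacks \
    --stack-status-filter CREATE_IN_PROGRESS CREATE_COMPLETE

# Inspect a specific stack (replace the placeholder with your stack's name)
aws cloudformation describe-stacks \
    --stack-name <your-databricks-workspace-stack>
```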


Step 2: Create a GitHub Repository

Step 3: Syncing Code Between GitHub and Databricks

Go to Repos under Workspace and click “Create Git Folder”.

Populate the Git repository URL (we are using a public repository).

You can select specific folders by checking the “Sparse checkout mode” option.


Repository files, including a notebook we created and pushed earlier, are loaded into Databricks.


To commit changes from your Databricks notebook to a GitHub repository, follow these steps:

  1. Open the Git integration dialog: Next to the notebook’s name, click the Git branch button.
  2. Select the changes to commit.
  3. Enter commit details.
  4. Commit and push the changes: Click the Commit & Push button.


If you encounter the error message "Error pushing changes", navigate to User Settings → Git Integration and authenticate using either your GitHub-linked account or a Personal Access Token (PAT). This enables Git integration, allowing you to push changes seamlessly.



Step 4: Share the notebook for real-time collaboration.

To share "Heart-Attack-Prediction-in-the-US", you must share the parent Git folder (Heart-Attack-Prediction-in-the-US). Permissions will apply to all contents of the Git folder.

Note:

The Databricks GitHub app must be installed on the GitHub account that owns the repository so that all collaborators can commit changes.


Areas of Analysis

When thinking about heart disease, one might view it in the context that Peter Attia describes in his 2023 book Outlive: The Science and Art of Longevity (not an affiliate link). Cardiovascular disease is one of the four major categories of illness that shorten not only lifespan but "healthspan", the span of years when a person is healthy enough to enjoy life. His perspective is that many diseases need to be prevented with lifestyle changes years before symptoms start to show.

This analysis should not be construed as medical advice (consult your medical doctor) but rather is an interesting topic to explore in data science that mirrors real life.

The dataset from Ankush Panday, "Heart Attack Prediction US", offers opportunities to see some of these leading indicators. With so many variables to choose from, our analysis starts with those of a hereditary or lifestyle-choice nature. This excludes from the input the many factors that are medical symptoms or point-in-time measurements, such as heart rate or blood pressure. One can't control heredity, but one can control the choices one makes. The factors we will explore are listed below.

Heredity

  • Family History
  • Ethnicity
  • Thalassemia

Lifestyle Choices

  • Smoker
  • Alcohol Consumption
  • Physical Activity
  • Diet
  • Marital Status
  • Residence

This analysis will identify which combinations of these factors result in a higher likelihood of heart attack. The hope is that awareness of these factors brings about changes that lead to longer life and "healthspan" for all.


Next Steps

Now that we’ve completed the setup, the next episode will focus on data analysis using PySpark ML, where we will explore patterns and build a predictive model.


Dataset on Kaggle

https://www.kaggle.com/datasets/ankushpanday2/heart-attack-prediction-in-united-states/data


Authors:

Noor Sabahi and Mike Olivieri

#HeartAttackPrediction #DataScience #MachineLearning #Databricks #AWS #PySpark #BigData #AI
