Predicting upcoming illness with Whoop biometric data and AI
The author's well-worn Whoop band


By Ed Medina, Wednesday, June 26, 2024

What if you could predict when you’re going to be sick? Given one week’s notice, would you reschedule a vacation? Or would you try to head it off by getting some extra rest?

It’s possible. I just built a system that uses my personal Whoop biometric data and artificial intelligence to do exactly that: predict whether I’m going to get sick within the next seven days, with roughly 90 percent accuracy.

Read on to see how I used data science and AI to do it, and please DM me with questions or thoughts ...


Summary:

  • The author can predict their upcoming illness based on biometric Whoop data.
  • A pruned Random Forest model yielded the best results.
  • Accuracy of better than 90% was observed on Altair RapidMiner (chosen for rapid prototyping) and OpenAI's ChatGPT (used for validation).
  • Mean Weighted Recall was lower, but still above 70%. This was due to the limited number of days when the author was sick — the target variable — and the resulting imbalance in the classes of this attribute.
  • The SMOTE technique was used to create synthetic data and correct this imbalance, though more real data would naturally improve this number. SMOTE improved Mean Weighted Recall to above 90% in the Random Forest model.
  • Injury may also be similarly predictable, but more testing is needed.


WHOOP is a tech company that makes a powerful wearable device that I’ve worn since 2019. In addition to heart rate, the device records other biometrics like respiratory rate, blood oxygen, skin temperature, and sleep duration.

My thinking was that I might have recorded enough biometric data over the years to train a machine learning classification model to determine whether I was sick, based purely on biometric data. And if that were possible, then it might also be possible to predict whether I was about to become sick.

In any case, it would be a good opportunity to test what I’ve learned over the past year as a student in two AI-related courses.

Both of these courses are available through the Great Learning online platform, and I have been taking these courses to help transition my career to the AI field.


Disclaimers and Background

Before I get started, a few important facts ...

What I did worked for me; I am not claiming my models will work for anyone else. Understand that this was a personal project based on my own personal data, and I am not a physician.

Also, I don’t work for Whoop or claim to represent them in any way. They didn’t pay me or sponsor this post. I am, however, a big fan. The power and potential of the Whoop device will become obvious here, but there are also other great wearables out there, like the Apple Watch, which I have also worn in the past. I just happened to wear the Whoop consistently for years.

Finally, to understand the story the data is telling, it helps to know a little about the test subject (me). I have been blessed with good health. I’m a middle-aged male, 51 years old, with a slightly smaller build at 155 pounds, and I don’t have any chronic diseases or health issues. Whoop generally markets its devices to athletes, and I’m an avid cyclist. I didn’t get sick often, which, as we will see, created a data weakness.

In the end, this is an informal health case study of four years of detailed biometric data for one person, and by definition, this is a very biased data set.


Getting the Data

To get started, I needed to get my hands on my Whoop data, which Whoop, helpfully, provides.

To get it, you have to request it through the Whoop app.

1. In the WHOOP app, go to More > App Settings > Data Export.

2. Confirm your email.

3. Select Create Export.

4. Receive your data export via email within 24 hours.

You can request one export every 24 hours, according to Whoop.

There’s also an API, but I ignored that for now. (I’ll likely use the API to deploy the model in the future.) https://developer.whoop.com/api/

Whoop returns a .zip file containing four separate CSV files, which roughly correspond to the four main areas that it tracks:

  • journal_entries.csv - A record of optional user entries
  • sleeps.csv - Sleep records
  • workouts.csv - A record of workouts
  • physiological_cycles.csv - A collection of other biometrics like resting heart rate, skin temp, etc.


Data Cleanup

The separate CSV files posed my first challenge. I needed to merge them into one, so that I had only one record for each day in a single row that contained all the columns from all the CSV files combined. Thankfully, Whoop included a (mostly) common value — “Cycle start time” — in the first column of each file, so I was able to use this field to merge the files in Excel using the Power Query Editor.

There’s a great explanation of this technique here: https://youtu.be/325GKIXPsSI?si=DOtzKt9eZ3uZvztw
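For readers who prefer code to Power Query, roughly the same merge can be done in pandas. This is only a sketch: it assumes the four files have already been cleaned up (see below) and that they share a matching "Cycle start time" column.

```python
import pandas as pd
from functools import reduce

# The four CSV files from the Whoop export, after cleanup.
files = [
    "physiological_cycles.csv",
    "sleeps.csv",
    "workouts.csv",
    "journal_entries.csv",
]
frames = [pd.read_csv(f) for f in files]

# Outer-join everything on the shared "Cycle start time" column,
# so each day ends up as a single row with all columns combined.
merged = reduce(
    lambda left, right: pd.merge(left, right, on="Cycle start time", how="outer"),
    frames,
)
merged.to_csv("whoop_merged.csv", index=False)
```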

But before I did this, I had to solve a bigger problem: the journal_entries.csv file, which was not in good shape. For each day, Whoop created a separate row for each journal question, with the question in one column and a True/False answer in the next. These rows included the basis for the target variable, “Feeling sick or ill?”

Also, my daily journal “notes” entry (a brief recap I wrote of the previous day) was repeated on each of these rows. It’s hard to envision, so I’m including a screen grab below. Needless to say, this all had to be cleaned up.

The Whoop journal_entries.csv, before cleanup

I decided to exercise my Python programming skills and write a script to clean up the journal_entries.csv file. I also had to write a couple of short scripts to clean up the dates in sleeps.csv and workouts.csv so they didn’t have seconds in the cycle_start_time values. This allowed me to use this value to match the other CSV files and merge it all using the Excel technique mentioned above.

The scripts gave me a good opportunity to work with the Pandas library and the Python datetime module, read and write files, iterate over lists, and work with functions and data frames … basic programming stuff. In the end, it all worked. In the future I’ll probably merge these scripts into one that can process all four files in a folder at once and spit out one clean CSV file.

Part of the Python file showing the iteration over the data frame

Overall, cleaning up the data was a good, real-world experience in prepping data for a model. My teachers at MIT had warned us all that we would spend a lot of time on this step, and they were right. It took more than a week to write the scripts, clean it all up, and merge the files into a format that would work with the models. You can see how I iterated through the data frame above, to clean up the journals.
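My actual scripts are longer, but a condensed sketch of the same two steps (pivoting the duplicated journal rows into one row per day, and stripping seconds from the cycle timestamps) might look like the following. The column names "Question text", "Answered yes", and "Notes" are assumptions standing in for the export's real headers.

```python
import pandas as pd

# --- journal_entries.csv: one row per (day, question) -> one row per day ---
journal = pd.read_csv("journal_entries.csv")

# Pivot so each journal question ("Feeling sick or ill?", etc.) becomes its own column.
journal_wide = journal.pivot_table(
    index="Cycle start time",
    columns="Question text",
    values="Answered yes",
    aggfunc="first",
).reset_index()

# The free-text "Notes" entry was repeated on every row, so keep one copy per day.
notes = journal.groupby("Cycle start time")["Notes"].first().reset_index()
journal_clean = journal_wide.merge(notes, on="Cycle start time", how="left")
journal_clean.to_csv("journal_entries_clean.csv", index=False)

# --- sleeps.csv / workouts.csv: drop seconds so the timestamps match for merging ---
for name in ["sleeps.csv", "workouts.csv"]:
    df = pd.read_csv(name)
    df["Cycle start time"] = (
        pd.to_datetime(df["Cycle start time"]).dt.strftime("%Y-%m-%d %H:%M")
    )
    df.to_csv(name.replace(".csv", "_clean.csv"), index=False)
```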

Here’s what the journal_entries.csv looked like after cleanup.



Data Pre-Processing & Exploratory Data Analysis (EDA)

After this was done, I had a single CSV file that was 869 rows and 69 columns (attributes), or 59,961 data fields.

The 69 attributes contained a mix of true/false binomials, date fields, a text field of daily notes, and many integers representing various biometrics. I’m not going to list all of them here, as is customary in case studies. There are just too many for this recap, but if anyone wants a list, DM me.

A few basic pre-processing steps included disregarding the dates and changing the boolean fields into numbers — 1 for TRUE, 0 for FALSE — as is common. There were also more than 8,000 missing values, which I either removed or programmed the models to disregard.
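A minimal pandas sketch of those pre-processing steps, assuming the merged file from earlier (the exact column names will differ in a real export), could be:

```python
import pandas as pd

df = pd.read_csv("whoop_merged.csv")

# Drop the date/time columns, which the models don't use directly
# (adjust the filter to match your actual column names).
df = df.drop(columns=[c for c in df.columns if "time" in c.lower()])

# Encode True/False fields as 1/0.
bool_cols = df.select_dtypes(include="bool").columns
df[bool_cols] = df[bool_cols].astype(int)

# Review missing values, then drop very sparse columns and leave the rest
# for the models (or an imputer) to handle.
print(df.isna().sum().sort_values(ascending=False).head(10))
df = df.dropna(axis=1, thresh=int(0.8 * len(df)))  # drop columns that are >20% empty
```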

The most important of all values was the target variable, defined as a True/False class value for the “Sick” field. My personal definition of whether I was sick was if I felt flu-ish. I had to have more than “the sniffles” to check that “Sick” box on Whoop’s app. Admittedly, this is very subjective, as another Whoop user would have a different personal definition of whether they were sick or not, and a wider study would need to address this.

Anyway, I didn’t want to simply classify whether or not I was sick. I wanted to know if I was about to get sick in the near future.

To do this I created two new True/False variables that represented a delta of two days before I said I was sick (SD2) and a similar seven-day delta (SD7). Originally I had planned to write a script to set these in the data based on the date of the entry. But because there weren’t that many records, I decided to do this by hand in Excel because it gave me an opportunity to review the data and notes for that day. So for instance, if I said I was sick on a Wednesday, then I manually set Monday, Tuesday, and Wednesday to SD2:True (within a two-day delta of being sick). Likewise, I did the same for the SD7 attribute.
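I ended up doing this by hand, but the script I had originally planned would have been short. Here is a rough sketch, assuming the merged file has a daily "Date" column and a 0/1 "Sick" column; my export actually keys on "Cycle start time", so those names are assumptions for illustration.

```python
import pandas as pd

df = pd.read_csv("whoop_merged.csv", parse_dates=["Date"]).sort_values("Date")

# Dates on which I reported being sick (Sick encoded as 1/0).
sick_dates = df.loc[df["Sick"] == 1, "Date"]

def within_delta(date, sick_dates, days):
    """True if a sick day falls on `date` or within `days` days after it."""
    return any((sick - date).days in range(0, days + 1) for sick in sick_dates)

# SD2: within two days of a sick day; SD7: within seven days.
df["SD2"] = df["Date"].apply(lambda d: int(within_delta(d, sick_dates, 2)))
df["SD7"] = df["Date"].apply(lambda d: int(within_delta(d, sick_dates, 7)))
```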

My models would use the SD7 attribute as the target variable, as this would give me a week’s heads up about a possible sickness. After all, when you’re one week from being sick, it might be useful to know.

Other important variables included some well-known Whoop biometrics such as HRV (heart rate variability), Recovery Score, Day Strain, Sleep performance, heart rate zones, as well as lots of basic biometrics like Blood oxygen, REM duration, Respiratory Rate, etc. Whoop has an overview here: https://www.whoop.com/us/en/the-data/

There is also a field called “Notes” of unstructured data: my daily text journal descriptions. I did not use these in my model, because I wanted to limit this case study to biometric data only. However, I did run these models with NLP text vectorization techniques out of curiosity, and they did improve the models’ accuracy and recall. This was especially true for the Neural Network. So, perhaps I did subconsciously signal that I was about to get sick in my writings over the years. It also suggests there is potential value in alternative sources of data in the health sciences, but as I mentioned, my focus for this informal case study was on the biometrics.

My data went back to 2019, but I had to discard everything before 2020 because I didn’t track sickness in 2019. As mentioned above, I also discarded various columns with empty values, and whatever duplicates I could, in order to reduce the noise in my models. A few insights from the univariate and bivariate analysis ...

There were only 54 instances where I was sick. The lack of instances of the target variable was the biggest weakness in this case study, and it is one of the key reasons this model will, at best, only work for me: the positive class has very few examples. A bigger study of more people with more diverse backgrounds would likely create more data with which the model could train and validate in the test set.

A simple count of the "Sick" attribute/target variable, showing a low count in the TRUE class


The creation of the SD7 target variable somewhat overcame this limitation by creating more data: in theory, up to seven new instances of the target variable for each day I was sick. In reality I didn’t always wear the Whoop on each of the seven preceding days. In all I was able to create 106 SD7:True instances, about 12% of the 869 rows.


A count of the SD7 target variable, showing more values in the TRUE class

Some variables appeared correlated to SD7 (as we will see the models reveal below), including HRV, Respiratory Rate, Day Strain, and Awake Duration.

Below is a simple count.

The Correlation Matrix showed this a little more clearly. Note how even the values with the highest correlation — for example, Day Strain — didn’t show strong correlations. This suggested it might be a challenge to create a model that could accurately predict SD7 from such weakly correlated values.

Correlation matrix for the entire data set



Model Building and Evaluation

I targeted all three variables (Sick, SD2 and SD7) during my tests, and created the models to classify these variables as True/False.

To compare models, I used SD7 as the target variable. The SD2 and “Sick” columns were not used to train the models that classify SD7; in fact, I never used the other two sickness variables to help predict the target sickness variable in any model.

I used overall Accuracy and Recall as key metrics for evaluating the models, and also Weighted Mean Recall.

I figured a False Negative was more costly for a health model: we wouldn’t want the model telling us we are fine when we are, in fact, about to get sick. Because Recall reveals False Negatives, I decided to keep a close eye on this value, especially the class recall for the “true TRUE” class (the rows where the target was actually TRUE); see below.

Recall = True Positive / (True Positive + False Negative)
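For reference, the per-class recalls and their average (what Rapid Miner reports as Mean Weighted Recall, which in these results behaves as a simple mean of the two class recalls) can be computed from a confusion matrix like this:

```python
from sklearn.metrics import confusion_matrix, recall_score

def recall_report(y_true, y_pred):
    """y_true / y_pred are 0/1 labels for the SD7 target (1 = about to be sick)."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    recall_true = tp / (tp + fn)    # class recall for "true TRUE"; penalized by false negatives
    recall_false = tn / (tn + fp)   # class recall for "true FALSE"
    mean_recall = recall_score(y_true, y_pred, average="macro")  # average of the two
    return recall_true, recall_false, mean_recall
```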

To build the models I used Rapid Miner, and I chose it for a few different reasons. First, I used Rapid Miner during the MIT class, and I was familiar with this platform. We built many similar models with the instructors during our case studies.

Rapid Miner is also very visual and easy to use, which makes it an excellent rapid prototyping tool. I built all of these models within a few days and was quickly able to adjust and compare them so I could decide on the best one. I also compared some of them with similar models on Dataiku and OpenAI in order to validate the work I did in Rapid Miner.

I chose not to code these tests for the initial prototyping; however, as a next step I will code this in Python in order to connect the best-performing model to the Whoop API and deploy it for everyday use in a real-world blind test.

Overall, quickly determining which model performed best was a great use case for Rapid Miner as well as the techniques taught by MIT.


Basic Classification (Decision Tree)

Part of Rapid Miner Decision Tree AI model

A simple Decision Tree was the first model I tried, with a maximum depth of 10. I split the Training and Test data sets at 70/30, the same split I used for all the models here.
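Rapid Miner builds this visually; an approximate scikit-learn equivalent would look like the sketch below. It assumes the merged CSV has already been reduced to numeric columns plus the three sickness labels, and the stratified split is my choice for illustration, not necessarily Rapid Miner's behavior.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("whoop_merged.csv")
y = df["SD7"]
X = df.drop(columns=["Sick", "SD2", "SD7"])  # the other sickness labels are excluded

# 70/30 Training/Test split; stratified so the rare TRUE class appears in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42
)

tree = DecisionTreeClassifier(max_depth=10, random_state=42)
tree.fit(X_train, y_train)
print("Training accuracy:", tree.score(X_train, y_train))
print("Test accuracy:", tree.score(X_test, y_test))
```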

The simple Decision Tree was fairly accurate on the Test data set (85.06% Accuracy). It was also noticeably overfit, with a Training Accuracy of 99.02%, a gap of nearly 14 percentage points.

The Test set also had 91.7% “true FALSE” class recall, which wasn’t bad. However, the class recall for “true TRUE” dropped to 37.5%, for a Mean Weighted Recall of 64.6%, neither of which was great. Essentially, this model was much better at predicting when I wasn’t about to be sick than it was at predicting when I was actually on the verge.

Likely this was because there aren’t many “true TRUE” values in the data (only 106 SD7:True rows, as noted in the EDA above), and even fewer in the Test set.

Either way, this indicates overfitting, and I would need to address it.

It is worth noting that I did apply pruning to this tree, which removes some branches after the tree is generated. Pruning can reduce complexity, improve accuracy, and limit overfitting, so this is about as optimized as I could make this particular model.

The bottom line is I just wasn’t sick often enough to create enough data to train this particular model very well. Obviously if this case study included more users, we would have more data.

I’ve included the confusion matrices for both Training and Test sets here.

Confusion matrix for the training data set
Confusion matrix for the test set

It’s also worth looking at the decision tree itself. Its first split was on Heart Rate Variability (HRV). Whoop puts a lot of emphasis on HRV in its app, and it’s validating to see how important it is in predicting whether or not I’m about to get sick.

These are the Attribute Weights for the decision tree, in descending order of weight, for the top 20 attributes.
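In the scikit-learn sketch above, the analogous weights come directly from the fitted tree (this continues the earlier decision-tree sketch and reuses its `tree` and `X`):

```python
import pandas as pd

# Feature importances from the decision-tree sketch above, largest first.
weights = pd.Series(tree.feature_importances_, index=X.columns)
print(weights.sort_values(ascending=False).head(20))
```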



Random Forest Pruned

Part of the Pruned Random Forest model in Rapid Miner

The Pruned Random Forest model was the most accurate of all the Rapid Miner models.

Random Forest models are more resistant to overfitting and are generally more accurate than single decision trees.

Accuracy for this model increased to 92.72% on the Test set, class recall for “true FALSE” was 99.13%, and Mean Weighted Recall improved to 73%. Class recall for “true TRUE” also improved to 47% (better than the 37.5% from the simple decision tree, but still not great).

This model was also a better fit, with less than a seven-point difference between Training and Test Accuracy, and it generated 67 decision trees.
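Continuing the scikit-learn sketch from the decision-tree section, the random forest analogue looks like this; here ccp_alpha stands in for Rapid Miner's pruning option, and its value is purely illustrative.

```python
from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(
    n_estimators=67,   # matches the 67 trees the Rapid Miner model generated
    max_depth=10,
    ccp_alpha=0.001,   # cost-complexity pruning; illustrative value only
    random_state=42,
)
forest.fit(X_train, y_train)
print("Test accuracy:", forest.score(X_test, y_test))
```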

It was also useful to run this model because Dataiku and OpenAI used Random Forest models as well, which made for better comparisons when validating the work.

Confusion matrix for the training set
Confusion matrix for the test set

Neural Network

Part of the Neural Network Rapid Miner model showing epochs and activation function

The Neural Network did not perform as well as the Pruned Random Forest.

Accuracy for this model was 86.21% on the Test set (lower than the Random Forest model’s 92.72%), and “true FALSE” class recall was still quite good at 96.07%. However, “true TRUE” class recall dropped to 15.62%, which was very low, and the Weighted Mean Recall for this model was 55.8%. Fit also worsened slightly, with the difference between Training and Test Set Accuracy exceeding 10 percentage points.

This model was set at 10 epochs and used the TANH activation function.
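An approximate scikit-learn version is sketched below (not the Rapid Miner network itself): MLPClassifier's max_iter plays the role of the epoch count for its stochastic solvers, and the features are standardized first because neural networks are sensitive to scale.

```python
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Standardize inputs, then train a small network with tanh activation for 10 epochs.
nn = make_pipeline(
    StandardScaler(),
    MLPClassifier(activation="tanh", max_iter=10, random_state=42),
)
nn.fit(X_train, y_train)
print("Test accuracy:", nn.score(X_test, y_test))
```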

Confusion Matrix for the test data set
Confusion Matrix for the training set

SD7 vs SD2 and “Sick” Target Variables

To compare the differences among the Sick, SD2, and SD7 values, I used the Pruned Random Forest model described above and selected each of the three variables as the target in turn. Remember that in the models above the target variable was SD7, and the other two values — Sick and SD2 — were excluded from the data.

As a baseline, with SD7 as the target, the Pruned Random Forest reached 92.72% Accuracy on the Test set, with 99.13% class recall for “true FALSE,” 47% class recall for “true TRUE,” and a Mean Weighted Recall of 73%.

Setting the target variable to SD2 and “Sick” did not produce a huge change in Accuracy, though it did improve to 94.62% for the Sick target variable. Though this change is small, it may signal that the biometric phenomenon being observed strengthens with the onset of sickness.

However, Recall for the “true TRUE” class fell to 18.18% for SD2, with a Mean Weighted Recall of 58%. It fell further for Sick: 12.50%, with a Mean Weighted Recall of 56.25%. This likely reflects the shrinking number of “true TRUE” records, which was only 22 for SD2 and just 16 for Sick. The much higher Recall for the “true FALSE” class for both target variables suggests that more data would improve this number.

For SD2, Accuracy was 92.69%, and Weighted Mean Recall was 58%:

Confusion Matrix for the model with the SD2 target variable


For “Sick,” Accuracy was 94.62%, and Weighted Mean Recall was 56.25%:

Confusion Matrix for the model with the “Sick” target variable

Cross-model Validation

To validate my approach in Rapid Miner, I used ChatGPT and Dataiku to do similar classification tasks, and I saw similar results.

Dataiku chose Random Forest and Logistic Regression models for me. The Random Forest was the closest analogue to my Rapid Miner setup. It showed 93.10% Accuracy, 70% Precision, and 43.75% Recall. I didn’t tune the hyperparameters in Dataiku or add pruning, but the result was close to Rapid Miner’s 92.72% Accuracy, and to the simple decision tree as well.

ChatGPT (the 3.5 model, accessed through OpenAI) offered more insight.

Using a Random Forest Classifier, ChatGPT achieved an Accuracy of 90.4%, a Precision of 100%, and a Recall of 18.75%. ChatGPT also noted the low Recall.

“These results suggest that while the model is very reliable in confirming when you are not going to be sick, it needs improvement in detecting upcoming sickness to reduce missed detections,” the chatbot concluded, and suggested I “consider balancing the dataset or adjusting the model to improve recall if predicting all potential sickness events is crucial.”

Specifically, it was referring to the SD7 class balance: there were 763 instances of SD7:False (not about to get sick) versus 106 instances of SD7:True (about to get sick). It suggested various techniques to rebalance the classes; the one I chose was synthetic data generation with SMOTE (Synthetic Minority Over-sampling Technique).

ChatGPT returned this definition for SMOTE sampling technique

After asking the model to apply the SMOTE technique to balance the SD7 classes, and also asking it to apply pruning, ChatGPT 3.5 returned a better result: an Accuracy of 87.74%, Precision of 50%, and a Recall of 37.5% (an improvement of nearly 19 percentage points). Its Mean Weighted Recall was 87%. The drop in Precision (from 72.73% to 50%) suggests the rebalanced model also identified more false positives.
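In Python, the same rebalancing is usually done with the imbalanced-learn library. A sketch, continuing the earlier split; the important detail is that SMOTE is fit on the training data only, never the test set.

```python
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier

# Oversample the minority SD7:True class in the training set only.
X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)

forest = RandomForestClassifier(n_estimators=67, max_depth=10, random_state=42)
forest.fit(X_res, y_res)
print("Test accuracy:", forest.score(X_test, y_test))
```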

Overall, these results were in line with the closest model from Rapid Miner (the Random Forest with Pruning and Text Vectorization model).

Finally, many of the top features for the GPT model were similar to the features that the Rapid Miner simple decision tree identified. These features are statistically weighted based on information gain. Features such as Respiratory rate, HRV, Day Strain, and Deep (SWS) duration appear in both lists of top features, and this shows that, while the models may differ, there is likely a real relationship between these variables and the SD7 target variable.



Classifying Injury

Over the years I also indicated in the journal entries whether I was injured. So, I decided to see if I could use the biometric data to determine whether or not I was injured — to classify it as True or False, as I had been doing with sickness.

I set the target variable to Injured for the Pruned Random Forest model, and got some encouraging results.

Accuracy for this model was 98.08% in the test set, class recall for “true FALSE” was 99.60% and Mean Weighted Recall was 79.8% for the test set. Also, class recall for “true TRUE” was 60%.

This model was a good fit, too, with less than a 2% difference between test and training Accuracy. This model generated 29 decision trees.

This result shows that it is likely possible to determine whether a person is injured based on biometric data.

Confusion matrix for the training set
Confusion matrix for the test set

Conclusions

Using my personalized biometric information, it is possible to predict an oncoming sickness with high Accuracy and better-than-average Recall.

Models including ChatGPT 3.5’s Random Forest, Dataiku’s Random Forest, and a custom Rapid Miner Pruned Random Forest all showed a strong ability to predict the SD7 variable with 90% Accuracy or better.

Recall for all models was lower, ranging from 37.5% (for ChatGPT 3.5) to 47% class recall for the “true TRUE” class in the Rapid Miner Pruned Random Forest. This is likely due to too few instances of sick days in the data and an imbalance between true and false in this class. More data would help address this weakness.

However, Rapid Miner showed better ability to predict the “true FALSE” class, and this is important when considering this model is meant to warn of potential sickness. It is quite good at determining when I’m not about to get sick, so I won’t be encouraged to go run a marathon when I should be resting. Similarly, I should take its warnings of upcoming illness with a grain of salt, knowing they are far from perfect.

Mean Weighted Recall was higher for the Pruned Random Forest models, with ChatGPT 3.5 at 87% and Rapid Miner at 73%. This number averages both class recalls, and the strong “true FALSE” predictions drove it higher.

Again, the success of the “true FALSE” class was likely due to the abundance of days in the data when I was not sick.

Predictably, adding the SMOTE Upsampling operator to the Pruned Random Forest model in Rapid Miner improved Recall for the “true TRUE” class to 90.83%, which again suggests a bigger data set would improve this number. It also shows how helpful OpenAI can be when building these models, as it identified an important improvement.

Confusion matrix (test set) for the Rapid Miner Pruned Random Forest model after applying the SMOTE Upsampling operator

Sleep seems to be very important in predicting oncoming illness, as multiple sleep-related variables appear in both lists of weighted variables. Day Strain is also important. Clustering analysis and its data visualization might have been useful here to identify specific patterns or profiles like this. Encountering a new pathogen is inherently random; however, there is also some level of predictability that can be gleaned from the complicated relationship between these biometric markers. This relationship might appear more clearly with Clustering analysis.

Also, as the data is temporal, my choice of SD7 and SD2 was quite arbitrary. Would three days have been more accurate? What about four or five? Also, looking at more days out (say 10, 20 or 30 days from sickness onset) should show decreasing accuracy and further validate that the biometric phenomenon is real. A temporal data model might have more accurately identified the correct timespan needed to predict oncoming sickness, and it could also show these upward/downward trends over time.

Overall, I believe this exercise demonstrates the promise of personalized medicine and the importance of access to our individual health data. There are already many other well-documented use cases of machine learning diagnosing different types of disease. To name just a few, machine learning has been used to detect cancer in mammograms and diabetic retinopathy in retinal images, and to predict disease outbreaks in epidemiology.

Overall, though sickness is somewhat random, it is fascinating to consider that there also appears to be a significant element of predictability that can be observed using biometric data.


Next Steps

As it appears Injury can be classified based on biometrics, and considering that illness can be predicted, it seems reasonable to wonder whether it’s possible to predict when a person is entering the danger zone for a possible injury. Creating synthetic “prediction” delta variables for Injury — as I did with the SD2 and SD7 attributes — and running models to classify these would be a logical next step to test this hypothesis.

Adding a temporal data model for both injury and sickness target variables would help further validate these models. It would also help identify the correct prediction delta, and I will likely create these models soon.

After this I plan to encode the Pruned Random Forest model in Python and connect it with the Whoop API, so that I can begin comparing my own real-world experience with the model’s predictions. I will likely run a six-month blind test and compare my notes with the model’s daily prediction results.
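A sketch of what that daily check might look like once the model is saved; the fetch_today_metrics() helper is hypothetical and stands in for whatever the Whoop API integration ends up being.

```python
import joblib
import pandas as pd

# Load the trained Pruned Random Forest, saved earlier with joblib.dump().
model = joblib.load("sd7_random_forest.joblib")

def fetch_today_metrics() -> pd.DataFrame:
    """Hypothetical placeholder: return today's biometrics as a one-row DataFrame
    with the same columns the model was trained on."""
    raise NotImplementedError("Wire this up to the Whoop API.")

today = fetch_today_metrics()
sick_probability = model.predict_proba(today)[0, 1]
print(f"Estimated chance of getting sick within 7 days: {sick_probability:.0%}")
```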

With the model encoded, I may eventually connect it with OpenAI’s API in Microsoft Azure, so that I can connect my data with the growing body of health insights on ChatGPT. This will give me more experience working with the tools I’ve acquired this spring and allow me the opportunity to tackle some data privacy issues. It will also give me a place to add data not gathered by Whoop, such as health data from doctor’s visits, etc.

And speaking of doctors, I’d love to discuss this project with any of you physicians or health experts out there, so please contact me with thoughts or questions.

I’m also looking for employment, or new projects, so please connect with me here.

Comments

Shelly Calkins

Consultant / Prospect Research Manager at Marymount University

4 months ago

Hello Ed, I love this concept! I've been wearing my Amazfit Bip 3 Pro for 1 1/2 years and I am amazed by the sleep tracking! I'm always so much better with proper sleep. And I have it notifying me when my heart rate exceeds 100 and I love to stop and reflect on what I'm doing at the time that is making my heart so excited lol. For me, low sleep foreshadows poor health and poor performance the next day for sure. I have trouble making myself go to bed because I'm always doing something. So I will go back to sleep in the morning to ensure I can have a smooth ride for the day. Better for me to get a little bit of a late start, refreshed, rather than crawl through the day on a crippling 5 hrs. Just not worth it. I love measuring data, & I could use something like this. So I'm guessing that you are working to bring this to the masses? Put me on the list for one. Noice! Shelly ??

Andy Lam

Fractional Product Design Leader / Outcomes-driven UX / Confidence in the right ideas / Delivering value with each release

4 months ago

Ah yes I tried that recently, got a glimmer of hope but it really hasn’t done many things different other than telling me sleep is important. I was hoping it would take in all my data and BE the coach where I don’t have to learn how to train it. The personal assistant that knows you better than you. I’d love to just get notifications telling me when is ideal to go to bed each day based on my activity, when do I have the most optimal energy and should train that week or day, or like what you’ve done.. look at my past illness occurrences and match it to strain and recovery patterns and tell me to chill out Is that so much to ask? ??

Andy Lam

Fractional Product Design Leader / Outcomes-driven UX / Confidence in the right ideas / Delivering value with each release

4 months ago

This is so cool, Ed! Impressive! I have yet to find a lot of value wearing my WHOOP every day, but this predictive info would be helpful as I write this being sick right now ?? From a UX perspective of using the tech, I guess I was expecting to suggest to me through notifications and based on past data how to optimize my upcoming week, what to do, and how hard to do it… am I just using it wrong??

Aaron Kurland

Software is everything

5 months ago

Wow. Great work. Is your code publicly available?
