Using CyclicBoosting to predict my heart rate
Maybe you have heard that Blue Yonder recently open-sourced CyclicBoosting (https://github.com/Blue-Yonder-OSS/cyclic-boosting). The current state is something like a beta release, meant to gather early feedback.
The algorithm is mainly used for demand prediction at Blue Yonder, but it is general purpose. I haven't used it in a long time, as I am working on very different things at Blue Yonder these days. So I wanted to play around a bit and reactivate my data modeling skills.
I have to admit it is not yet in a shape where you can start right away, but even in its current state it was quite easy to get it working. To install it, just do:
pip install cyclic-boosting
Finding a dataset
First I need a dataset to play with. Of course, there are the boring house prices in Boston, some bike rides or cabs in NY. But having worked at Blue Yonder for so long, this algorithm is a rather emotional thing for me, so I needed something personal. Then I thought: hey, I have my fitness tracker, the Xiaomi Mi Band 5, and it turns out that, thanks to GDPR, you can download all of your data as a nice set of CSV files. Perfect!
Let's have a look at the data. There is the continuous measurement of the heart rate:
date,time,heartRate
2021-02-15,13:49,55
2021-02-15,13:51,64
2021-02-15,13:54,71
2021-02-15,13:59,64
2021-02-15,14:01,70
2021-02-15,14:02,82
...
And the data of the step counter:
date,time,steps
2021-02-15,14:23,22
2021-02-15,14:24,1
2021-02-15,14:25,18
2021-02-15,14:27,34
2021-02-15,14:57,19
...
First, we load both files as pandas DataFrames and resample them to an equidistant one-hour sample rate, averaging the heart rate and summing up the step counts within each one-hour window; then we concatenate both DataFrames.
import pandas as pd
# heart rate: average per one-hour window
df = pd.read_csv('data/HEARTRATE_AUTO_1672265608333.csv', parse_dates=[['date', 'time']])
df = df.set_index('date_time').resample("1h").mean()
# step counts: sum per one-hour window
df_step = pd.read_csv('data/ACTIVITY_MINUTE_1672265604979.csv', parse_dates=[['date', 'time']])
df_step = df_step.set_index('date_time').resample("1h").sum()
# combine both into one dataframe
df = pd.concat([df, df_step], axis=1)
This gives us a dataset with over 15k rows, which should be fine to train a decent model. Here is a histogram of my heart rate. If you are a cardiologist and you think there is a problem here, please send me a message; I have no clue whether this looks good or bad or just normal.
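If you want to reproduce such a histogram yourself, it is a one-liner with pandas; a minimal sketch, assuming matplotlib is installed on top of the packages used above:
import matplotlib.pyplot as plt
# histogram of the hourly averaged heart rate
df['heartRate'].hist(bins=50)
plt.xlabel('heart rate [bpm]')
plt.ylabel('number of hours')
plt.show()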
Let's have a look at the correlation between the step count and the heart rate. As you can see, there is a correlation, but it is not as strong as I would have expected. In principle you can say: at night I usually don't walk, and when I sleep my heart rate is low.
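A quick way to look at this is a correlation coefficient plus a scatter plot; again a sketch, assuming matplotlib on top of pandas:
import matplotlib.pyplot as plt
# Pearson correlation between hourly steps and mean heart rate
print(df[['steps', 'heartRate']].corr())
# scatter plot of the two quantities
df.plot.scatter(x='steps', y='heartRate', alpha=0.3)
plt.show()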
Enough exploration, let's build the model!
First we need to massage the data a bit. The idea is to take the data available "now" and predict the heart rate of the next hour.
By far the most frequent mistake in forecasting is the leakage of future (target) information into the training data. So I hope I am doing a decent job here, feedback welcome:
# this is the target
df['hr_next_hour'] = df['heartRate'].shift(-1, freq="1h")
# last three hours of the heart rate measurement
df['hr_past_hour_1'] = df['heartRate'].shift(1, freq="1h")
df['hr_past_hour_2'] = df['heartRate'].shift(2, freq="1h")
df['hr_past_hour_3'] = df['heartRate'].shift(3, freq="1h")
# last three hours of the step counts
df['stp_past_hour_1'] = df['steps'].shift(1, freq="1h")
df['stp_past_hour_2'] = df['steps'].shift(2, freq="1h")
df['stp_past_hour_3'] = df['steps'].shift(3, freq="1h")
# drop samples with NaN in the target column
df.dropna(subset=['hr_next_hour'], inplace=True)
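If you are, like me, never quite sure about the direction of shift, a tiny toy example (independent of the actual data) helps to convince yourself that no future information ends up in the features:
import pandas as pd
s = pd.Series([1.0, 2.0, 3.0],
              index=pd.date_range("2021-02-15 00:00", periods=3, freq="1h"))
# shift(-1, freq="1h") moves each value one hour back in time: at 00:00
# we now see the value measured at 01:00, i.e. the value of the next hour,
# exactly what we want as a target.
print(s.shift(-1, freq="1h"))
# shift(1, freq="1h") moves each value one hour forward: at 01:00 we see
# the value measured at 00:00, i.e. only past information, safe as a feature.
print(s.shift(1, freq="1h"))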
For the actual training, we need to split the target column from the feature dataframe and calculate some more features, in this case:
- Month of the year (for seasonality)
- Day of the week (for weekly patterns)
- Hour of the day (for daily patterns)
import numpy as np

def prepare_data(df):
    # calendar features for seasonal, weekly and daily patterns
    df['dayofweek'] = df['date_time'].dt.dayofweek
    df['month'] = df['date_time'].dt.month
    df['timeofday'] = df['date_time'].dt.hour
    df = df.drop(columns='date_time')

    # split the target from the features
    y = np.asarray(df['hr_next_hour'])
    X = df.drop(columns='hr_next_hour')
    return X, y
X, y = prepare_data(df.reset_index())
Now that we have the data prepared, we need to define which features we want to use and of which type each feature is. In CyclicBoosting, this is done like this:
from cyclic_boosting import flags

def feature_properties():
    fp = {}
    # calendar features: ordered categorical values
    fp['dayofweek'] = flags.IS_ORDERED
    fp['month'] = flags.IS_ORDERED
    fp['timeofday'] = flags.IS_ORDERED
    # measurements: continuous values that may be missing
    fp['heartRate'] = flags.IS_CONTINUOUS | flags.HAS_MISSING
    fp['hr_past_hour_1'] = flags.IS_CONTINUOUS | flags.HAS_MISSING
    fp['hr_past_hour_2'] = flags.IS_CONTINUOUS | flags.HAS_MISSING
    fp['hr_past_hour_3'] = flags.IS_CONTINUOUS | flags.HAS_MISSING
    fp['steps'] = flags.IS_CONTINUOUS | flags.HAS_MISSING
    fp['stp_past_hour_1'] = flags.IS_CONTINUOUS | flags.HAS_MISSING
    fp['stp_past_hour_2'] = flags.IS_CONTINUOUS | flags.HAS_MISSING
    fp['stp_past_hour_3'] = flags.IS_CONTINUOUS | flags.HAS_MISSING
    return fp
fp = feature_properties()
That's quite neat, isn't it? It's like a composable ML type system. And not having to fix up the data to get rid of NaNs is a big relief; it is much better if the algorithm, like CyclicBoosting, takes care of missing values itself.
Now, to define the CyclicBoosting model, I have to admit I just copied it from an example somewhere else, because I don't know exactly what is going on here. There is a chance I am doing something very stupid:
from sklearn.pipeline import Pipeline
from cyclic_boosting import binning, CBFixedVarianceRegressor, observers, common_smoothers

def cb_model(fp):
    # observer that collects plots for the analysis of the last iteration
    plobs = [observers.PlottingObserver(iteration=-1)]
    est = CBFixedVarianceRegressor(
        feature_properties=fp,
        feature_groups=fp.keys(),
        observers=plobs,
        maximal_iterations=50,
        smoother_choice=common_smoothers.SmootherChoiceGroupBy(
            use_regression_type=True,
            use_normalization=False,
        ),
    )
    # discretize all features into at most 100 bins before the regression
    binner = binning.BinNumberTransformer(n_bins=100, feature_properties=fp)
    ml_est = Pipeline([("binning", binner), ("CB", est)])
    return ml_est
Looks like there is a lot of stuff we could configure here: binners, smoothers, oh boy. I am really looking forward to improved documentation.
But now we are mostly done; only training and evaluation are left. Let's go!
# restrict the features to the ones we defined properties for
X = X[list(fp.keys())]
ml_est = cb_model(fp)
# training
ml_est.fit(X.copy(), y)
# prediction (in-sample, on the training data)
df['yhat'] = ml_est.predict(X)
If you are familiar with scikit-learn, you will immediately feel at home, as CyclicBoosting implements the scikit-learn interface.
The training is done in less than 200 ms, which is quite OK for 15k samples, I would say.
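If you want to check the timing on your machine, a simple sketch like this is enough (exact numbers will of course differ):
import time
start = time.perf_counter()
ml_est.fit(X.copy(), y)
print(f"training took {time.perf_counter() - start:.3f} s")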
CyclicBoosting has some support for evaluating and analyzing the training performance, but that is something for another post. Instead of looking at KPIs like the MAD, let's do what every good data scientist does first: look at the target and the prediction.
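The comparison plot is again straightforward with pandas; a sketch, assuming matplotlib, with the MAD thrown in for those who cannot resist a KPI after all:
import numpy as np
import matplotlib.pyplot as plt
# overlay the measured target and the prediction
df[['hr_next_hour', 'yhat']].plot(figsize=(14, 4))
plt.ylabel('heart rate [bpm]')
plt.show()
# mean absolute deviation of the (in-sample) prediction
print(f"MAD: {np.mean(np.abs(df['hr_next_hour'] - df['yhat'])):.1f} bpm")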
Hey, for a first try, with little fine-tuning and, honestly, not much clue what I am doing, that doesn't look too bad, I would say!
Fun fact: Do you see this gap, where it seems that instead of resting at night I am constantly at max heart rate? An error, maybe? Nope. If you look at the dates, those are the 3 days of PyCon.DE & PyData Berlin 2022, and I was one of the organizers. (Just buy a fitness tracker and a ticket for pycon.de and check out whether you can reproduce this.)
Conclusion
Yes, the documentation needs to be improved (contributions are more than welcome), and there are some rough edges here and there. But overall: a good, robust and fast result with minimal effort.
Try it out yourself (feel free to open issues if you have questions). I am looking forward to any feedback or interesting results.
Comments
Great to see that CyclicBoosting has found its way into open source. Sebastian Neubauer: I am not a cardiologist, but looking at the heart rate diagram I would suggest you add some exercise to your timetable. Pushing your heart rate above 130 bpm every now and then is a good thing to do.
I like this code review. Felix Wick might need these services in the future.
And maybe one comment to avoid confusion: Cyclic Boosting is not a dedicated time series algorithm, but uses the typical i.i.d. assumption of ML for the different time steps. So, sure, you can use it for time series, but the autocorrelation is only included via the features themselves.
By the way, the flags already define the smoothers to be used in the different smoothing functions of the GAM model.
There was a small interface change just before the pre-release: CBFixedVarianceRegressor is now CBNBinomRegressor.