Just Used Machine Learning in My Workout!
Antonello Calamea
IT Mentor | Coach | Tech Recruiter | Fractional CTO | 25+ years of experience in the industry | Another Brain Academy maker
I’m a big fan of the bodyweight approach and of working out in general, but I don’t much like going to the gym.
Besides, in this time of forced lockdown due to the Coronavirus, it could be useful to try a different way of approaching fitness and training.
So I asked myself: is there a way to use Machine Learning in this area? Can I combine two passions to make something useful?
One of the main problems is having a way to validate the correctness of an exercise, so I ran some experiments, tried an approach, and found that…
Ok, I don’t want to spoil anything, so just keep reading to find out!
Frame the problem
As always, let’s start by framing the problem. What we want to achieve is a way to assess the correctness of an exercise, using a video as input.
The ideal would be to use a live stream, but let’s keep it simple and use a file for now; after all, we want to validate the approach before building anything on top of it.
So, having a video of a (hopefully) proper execution should be the first step, as it can be used as a baseline against which to compare other executions.
Here’s the first video, shot in my underground fatigue room :)
The baseline ok execution
My first thought was to use CNNs to build a classifier but, besides the number of examples that would be needed, I wasn’t sure that sequences of image pixels would be enough to train a model on what is right and what is wrong in an exercise execution.
So, I did some research to see whether a video could yield a different kind of features, and found a great library, OpenPose, a “Real-time multi-person keypoint detection library for body, face, hands, and foot estimation”.
Watching the demo videos, I understood it could be very useful, so I tried to apply it to my problem and got this…
Using OpenPose
(I’ll cover all the necessary setup steps later, in the Appendix.)
As you can see in the video, the library does a very good job of tracking the different body parts (I used the COCO configuration with 18 keypoints).
The cool thing is that it can also output a json file with all the positions, frame by frame, so it’s possible to obtain an alternative, numeric representation of an exercise.
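To give an idea of what that output looks like, here is a minimal sketch of reading a single frame’s file; the file name is hypothetical, and the 'pose_keypoints' key matches the (older) OpenPose format used by the code in the Appendix, while newer releases call it 'pose_keypoints_2d'.

import json

# hypothetical file name: OpenPose writes one json per frame, with a zero-padded frame number
with open('ok1_000000000000_keypoints.json') as f:
    frame = json.load(f)

# each detected person carries a flat list of (x, y, confidence) triplets,
# one triplet per keypoint (18 keypoints with the COCO model)
keypoints = frame['people'][0]['pose_keypoints']
triplets = [keypoints[i:i + 3] for i in range(0, len(keypoints), 3)]
for index, (x, y, confidence) in enumerate(triplets):
    print(index, x, y, confidence)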
So, with a few helper functions and Plotly, this is how the exercise looks when considering the y-axis movements (I skip the x-axis, as it is less useful given the camera position).
Let’s call it “ok1”
Breakdown analysis of good exercise ok1
Nice. The next step is to find a way to compare two different executions and spot whether there are significant differences.
Let’s first make a visual comparison based on these metrics, and let’s call this execution “fail1”
fail1
Let’s compare the graphs of the movements
Comparison between ok1 and fail1
There are evident differences.
Let’s try with another failed performance (“fail2”)
fail2
and let’s compare with the baseline proper execution ok1
Comparison between ok1 and fail2
Let’s now try comparing two good performances (let’s call the second one “ok2”)
ok2
Comparison between ok1 and ok2
The curves look very similar, so the approach seems empirically sound.
Now the question is: is there a way to evaluate the similarity between these univariate time-series curves, considering they could have different timescales too?
It turns out there is something called Dynamic Time Warping that can be used “for measuring similarity between two temporal sequences”. More here.
Is there an implementation in Python? Of course, in tslearn.metrics.
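As a quick sanity check of the metric itself, here is a minimal sketch with two made-up series: an identical pair scores 0, while a delayed copy with a slightly lower peak scores a small positive value, because the warping absorbs the time shift and only the amplitude difference remains.

import numpy as np
from tslearn.metrics import dtw

s1 = np.array([0.0, 1.0, 2.0, 3.0, 2.0, 1.0, 0.0])
s2 = np.array([0.0, 0.0, 1.0, 2.0, 2.5, 2.0, 1.0, 0.0])  # delayed and with a lower peak

print(dtw(s1, s1))  # 0.0 -> identical sequences
print(dtw(s1, s2))  # small positive value -> similar despite different length and timing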
So let’s crunch some numbers.
First, let’s compare “ok1” with itself
dtw_value for feature nose_y is 0.0
dtw_value for feature right_shoulder_y is 0.0
dtw_value for feature right_elbow_y is 0.0
dtw_value for feature right_wrist_y is 0.0
dtw_value for feature left_shoulder_y is 0.0
dtw_value for feature left_elbow_y is 0.0
dtw_value for feature left_wrist_y is 0.0
dtw_value for feature right_hip_y is 0.0
dtw_value for feature right_knee_y is 0.0
dtw_value for feature right_ankle_y is 0.0
dtw_value for feature left_hip_y is 0.0
dtw_value for feature left_knee_y is 0.0
dtw_value for feature left_ankle_y is 0.0
dtw_value for feature right_eye_y is 0.0
dtw_value for feature left_eye_y is 0.0
dtw_value for feature right_ear_y is 0.0
dtw_value for feature left_ear_y is 0.0
dtw_value for feature background_y is 0.0
So 0 means maximum similarity, and the lower the score, the more similar the executions.
Let’s now measure ok1 against fail1
dtw_value for feature nose_y is 188.00378744123748
dtw_value for feature right_shoulder_y is 155.97642562435527
dtw_value for feature right_elbow_y is 156.39925059973916
dtw_value for feature right_wrist_y is 17.982641407757672
dtw_value for feature left_shoulder_y is 13.5329438534267
dtw_value for feature left_elbow_y is 158.0005797757085
dtw_value for feature left_wrist_y is 27.544745106825722
dtw_value for feature right_hip_y is 12.151614599714703
dtw_value for feature right_knee_y is 191.94638493339747
dtw_value for feature right_ankle_y is 223.23781654997444
dtw_value for feature left_hip_y is 263.0165952996121
dtw_value for feature left_knee_y is 195.8379463587177
dtw_value for feature left_ankle_y is 227.95958454954243
dtw_value for feature right_eye_y is 288.64055642788685
dtw_value for feature left_eye_y is 192.9321060365538
dtw_value for feature right_ear_y is 192.15753964939807
dtw_value for feature left_ear_y is 190.20149442225735
dtw_value for feature background_y is 189.09276308989186
I found it useful to adopt an overall value to get more condensed information, such as the median
dtw_median : 189.6471287560746
Comparison between ok1 and fail2
dtw_value for feature nose_y is 65.28319682858675
dtw_value for feature right_shoulder_y is 38.87442004120449
dtw_value for feature right_elbow_y is 37.75683113715981
dtw_value for feature right_wrist_y is 18.907807197028447
dtw_value for feature left_shoulder_y is 19.50736795264806
dtw_value for feature left_elbow_y is 45.031636992674414
dtw_value for feature left_wrist_y is 36.101698713495466
dtw_value for feature right_hip_y is 13.248353503737741
dtw_value for feature right_knee_y is 39.45295418596681
dtw_value for feature right_ankle_y is 49.27277845829276
dtw_value for feature left_hip_y is 65.78598402395453
dtw_value for feature left_knee_y is 38.59586190254078
dtw_value for feature left_ankle_y is 44.54850474482842
dtw_value for feature right_eye_y is 64.17832564035923
dtw_value for feature left_eye_y is 50.02819053653649
dtw_value for feature right_ear_y is 50.233695101993064
dtw_value for feature left_ear_y is 45.21480605000976
dtw_value for feature background_y is 42.15576012017812
dtw_median : 43.35213243250327
Comparison between ok1 and ok2
dtw_value for feature nose_y is 16.023831603583467
dtw_value for feature right_shoulder_y is 11.24889546622242
dtw_value for feature right_elbow_y is 11.94796246520719
dtw_value for feature right_wrist_y is 20.509653605070962
dtw_value for feature left_shoulder_y is 19.65007578484111
dtw_value for feature left_elbow_y is 14.486468134089847
dtw_value for feature left_wrist_y is 7.208783392501132
dtw_value for feature right_hip_y is 14.17544715061928
dtw_value for feature right_knee_y is 25.759515076957445
dtw_value for feature right_ankle_y is 43.123581089700735
dtw_value for feature left_hip_y is 83.91171946754521
dtw_value for feature left_knee_y is 23.860467116131673
dtw_value for feature left_ankle_y is 44.80603683656928
dtw_value for feature right_eye_y is 91.27560108813313
dtw_value for feature left_eye_y is 31.263050533657154
dtw_value for feature right_ear_y is 25.735729785455852
dtw_value for feature left_ear_y is 12.39151408383979
dtw_value for feature background_y is 11.887661376402017
dtw_median : 20.079864694956036
So it seems this value can be used as an indicator to compare the correctness of two executions, based on a threshold yet to be found.
As an empirical counter-check, let’s try other examples starting from this value
ok1 and check1 -> median 82.22671018607622
ok2 and check2 -> median 196.313312415643
ok and check3 -> median 25.03920782168309
It seems that a median lower than 30 could be a starting threshold
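Just to make the idea concrete, a naive pass/fail check could look like the sketch below; it relies on the evaluate_dtw_values helper shown in the Appendix, the DataFrame names are hypothetical, and the threshold of 30 is only the empirical starting point found above.

import numpy as np

DTW_MEDIAN_THRESHOLD = 30  # empirical starting point, to be tuned

def is_correct_execution(df_baseline, df_candidate, threshold=DTW_MEDIAN_THRESHOLD):
    # evaluate_dtw_values (see Appendix) returns one DTW value per y-axis feature
    dtw_values = evaluate_dtw_values(df_baseline, df_candidate)
    median_value = float(np.median(dtw_values.values))
    return median_value <= threshold, median_value

# hypothetical DataFrames built with the Appendix helpers
correct, median_value = is_correct_execution(df_ok1, df_candidate)
print("ok" if correct else "check your form", median_value)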
Let’s see them on video
No jumps allowed!
Incomplete
Ok!
Conclusion
This is just the beginning of this experiment: assuming this is the right approach, there are a lot of open points, such as:
- What about different people with different heights? Do they need a personal baseline too, or can the approach be generalized?
- What about a different camera position?
- How can the threshold be inferred?
- How to give more detailed suggestions about what was wrong in the execution?
- How to process the relevant part of an exercise during a continuous video stream?
- Can exercises with tools such as dumbbells be tracked? (hint: yes but with specific object detection libraries too)
I have some ideas to check, and I’ll explore them in the future, because the possibilities are fantastic.
Imagine a workstation with a camera that:
- recognizes you when you step in, using face identification
- loads your “wod” (workout of the day)
- checks the correctness of the exercises, giving hints
- signals a bad execution to a trainer who’s present, or perhaps running a remote session with dozens of people, allowing him/her to take corrective action.
Even the training itself could be customized on the fly, based on previous sessions and the person’s overall condition.
As always, I’m amazed by what it’s possible to achieve and imagine with these technologies, and it’s great fun to use them.
In the meantime, happy workout and stay safe.
Appendix
Docker+OpenPose
Instead of installing OpenPose directly with all the necessary dependencies, I opted for a Docker approach. You can find the image here: https://hub.docker.com/r/garyfeng/docker-openpose/
Keep in mind that a container is probably not the right solution for a real-time approach, as there is a lot of lag, but I haven’t tried other solutions so I can’t say for sure.
Before running it, you need to enable containers to use the GPU, otherwise OpenPose will not start. Here are all the instructions to do it (with NVIDIA GPUs): https://github.com/NVIDIA/nvidia-docker
In the command you’ll see the “privileged” flag and the -e DISPLAY=$DISPLAY -v /tmp/.X11-unix:/tmp/.X11-unix parts, which are used to access the camera and display inside the container if you need them.
Before launching the docker command, be sure to execute:
xhost +
so the container can connect.
Then, just launch
docker run --privileged --gpus all -v <host path to share>:/data -e DISPLAY=$DISPLAY -v /tmp/.X11-unix:/tmp/.X11-unix -it garyfeng/docker-openpose:latest
After a while, you’ll get a bash shell inside the container.
If you check the OpenPose documentation there are a lot of parameters, but let’s see a couple of examples.
build/examples/openpose/openpose.bin --face
It should turn on the camera and start detecting the keypoints on your face.
The command I used to create the data used before:
build/examples/openpose/openpose.bin --video /data/<input file> --write_video /data/<output file> --no_display --write_keypoint_json /data/<folder with json output files>
Notice the “data” folder that was mounted when launching the container. If you change it, be sure to adapt the command accordingly.
Python code
Let’s now look at some Python code to deal with the data used in the article.
import pandas as pd
import os
import numpy as np

def read_pose_values(path, file_name):
    try:
        path, dirs, files = next(os.walk(path))
        df_output = pd.DataFrame()
        # one json file per frame, with a zero-padded frame number in the name
        for i in range(len(files)):
            if i <= 9:
                pose_sample = pd.read_json(path_or_buf=path + '/' + file_name + '_00000000000' + str(i) + '_keypoints.json', typ='series')
            elif i <= 99:
                pose_sample = pd.read_json(path_or_buf=path + '/' + file_name + '_0000000000' + str(i) + '_keypoints.json', typ='series')
            else:
                pose_sample = pd.read_json(path_or_buf=path + '/' + file_name + '_000000000' + str(i) + '_keypoints.json', typ='series')
            df_output = df_output.append(pose_sample, ignore_index=True)
        return df_output
    except Exception as e:
        print(e)
This is used to return a DataFrame with all the json files found in an OpenPose output path (beware: it will break if there are 1,000+ files, definitely something to fix :)
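As a possible way to remove that limitation, here is a sketch that builds the file list with glob instead of composing the zero-padded index by hand; it assumes the usual <file_name>_<zero-padded frame>_keypoints.json naming seen above, and the function name is mine.

import glob
import os
import pandas as pd

def read_pose_values_glob(path, file_name):
    # collect every keypoints file for this video, however many frames there are;
    # sorting the zero-padded names keeps them in frame order
    pattern = os.path.join(path, file_name + '_*_keypoints.json')
    frames = [pd.read_json(path_or_buf=f, typ='series') for f in sorted(glob.glob(pattern))]
    return pd.DataFrame(frames).reset_index(drop=True)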
'''
COCO keypoint indexes:
Nose - 0, Neck - 1, Right Shoulder - 2, Right Elbow - 3, Right Wrist - 4,
Left Shoulder - 5, Left Elbow - 6, Left Wrist - 7, Right Hip - 8, Right Knee - 9,
Right Ankle - 10, Left Hip - 11, Left Knee - 12, Left Ankle - 13, Right Eye - 14,
Left Eye - 15, Right Ear - 16, Left Ear - 17, Background - 18
'''
from sklearn.preprocessing import MinMaxScaler

def transform_and_transpose(pose_data, label):
    output = pd.DataFrame()
    # one row per frame: take the first detected person's keypoints
    for i in range(pose_data.shape[0] - 1):
        if len(pose_data.people[i]) > 0:
            output = output.append(pd.DataFrame(pose_data.people[i][0]['pose_keypoints']).T)

    # drop confidence detection
    for y in range(2, output.shape[1], 3):
        output.drop(columns=[y], inplace=True)

    # rename columns
    output.columns = ['nose_x', 'nose_y', 'right_shoulder_x', 'right_shoulder_y', 'right_elbow_x', 'right_elbow_y',
                      'right_wrist_x', 'right_wrist_y', 'left_shoulder_x', 'left_shoulder_y', 'left_elbow_x',
                      'left_elbow_y', 'left_wrist_x', 'left_wrist_y', 'right_hip_x', 'right_hip_y', 'right_knee_x',
                      'right_knee_y', 'right_ankle_x', 'right_ankle_y', 'left_hip_x', 'left_hip_y', 'left_knee_x',
                      'left_knee_y', 'left_ankle_x', 'left_ankle_y', 'right_eye_x', 'right_eye_y', 'left_eye_x',
                      'left_eye_y', 'right_ear_x', 'right_ear_y', 'left_ear_x', 'left_ear_y', 'background_x',
                      'background_y']

    # interpolate 0 values
    output.replace(0, np.nan, inplace=True)
    output.interpolate(method='linear', limit_direction='forward', inplace=True)
    return output
Here we rename the columns based on the COCO setup and do a basic interpolation where there are 0 values (for example, when the nose is behind the pull-up bar).
def model_exercise(json, name, label):
    df_raw = read_pose_values(json, name)
    return transform_and_transpose(df_raw, label)

df_exercise_1 = model_exercise('<path to json>', '<file_name>', '<label>')
Putting it all together, this is the function to use to get the final DataFrame.
Let’s see some graphs now:
import plotly.graph_objects as go
from plotly.subplots import make_subplots

def plot_y_features(df):
    fig = make_subplots(rows=3, cols=6, start_cell="top-left")
    r = 1
    c = 1
    X = pd.Series(range(df.shape[0]))
    for feature in df.columns:
        if '_y' in feature:
            fig.add_trace(go.Scatter(x=X, y=df[feature], name=feature), row=r, col=c)
            fig.update_xaxes(title_text=feature, row=r, col=c)
            if c < 6:
                c = c + 1
            else:
                c = 1
                r = r + 1
    fig.update_layout(title_text="Exercise y-axis movements breakdown", width=2000, height=1000)
    fig.show()

plot_y_features(df_exercise_1)
Drawing the subplots for all the positions.
Now drawing the comparison for two exercises:
def plot_comparison_y_features(df1, df2):
    fig = make_subplots(rows=3, cols=6, start_cell="top-left")
    r = 1
    c = 1
    X1 = pd.Series(range(df1.shape[0]))
    X2 = pd.Series(range(df2.shape[0]))
    for feature in df1.columns:
        if '_y' in feature:
            fig.add_trace(go.Scatter(x=X1, y=df1[feature], name=feature + '_ok'), row=r, col=c)
            fig.add_trace(go.Scatter(x=X2, y=df2[feature], name=feature + '_fail'), row=r, col=c)
            fig.update_xaxes(title_text=feature, row=r, col=c)
            if c < 6:
                c = c + 1
            else:
                c = 1
                r = r + 1
    fig.update_layout(title_text="Exercise y-axis movements breakdown comparison", width=2000, height=1000)
    fig.show()

plot_comparison_y_features(df_exercise_1, df_ok2)
Finally the Dynamic Time Warping part:
from tslearn.metrics import dtw

def evaluate_dtw(df1, df2, feature, plot=False):
    # compare the same feature (e.g. 'nose_y') across the two executions
    y1 = df1[feature].values
    y2 = df2[feature].values
    dtw_value = dtw(y1, y2)
    print("dtw_value for feature {} is {}".format(feature, dtw_value))
    return dtw_value

def evaluate_dtw_values(df1, df2, plot=False):
    # one DTW value per y-axis feature
    dtw_values = []
    for feature in df1.columns:
        if '_y' in feature:
            dtw_values.append(evaluate_dtw(df1, df2, feature, plot))
    return pd.DataFrame(dtw_values)
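The dtw_median values reported earlier in the article are not computed in this snippet; a minimal sketch of how they can be derived from evaluate_dtw_values (the function name below is mine) could be:

def evaluate_dtw_median(df1, df2):
    # condense the per-feature DTW values into the single score used in the article
    dtw_values = evaluate_dtw_values(df1, df2)
    dtw_median = float(dtw_values.median().iloc[0])
    print("dtw_median : {}".format(dtw_median))
    return dtw_median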
That’s all! Thank you.
Home gym?