Activity Recognition Through Temporal Templates
I really enjoyed my computer vision final project, so I wanted to share:
At each frame, a random forest, used as a discriminative classifier, identifies the current action (top left). Its features come from transforming the motion history image (MHI, top right) and the motion energy image (MEI, bottom left) into Hu moments, which are invariant to translation, rotation, and, most importantly, scale.
The math is a bit verbose, but pretty straightforward.
The "motion" between images is their subtracted value subject to some threshold θ set to a value τ.
Then the MEI & MHI can be calculated respectively:
Now that the action is captured, the invariants are calculated:
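These are Hu's seven invariants, built from the scale-normalized central moments of each template:

$$
\bar{x} = \frac{M_{10}}{M_{00}}, \quad \bar{y} = \frac{M_{01}}{M_{00}}, \quad
\mu_{pq} = \sum_x \sum_y (x - \bar{x})^p (y - \bar{y})^q \, I(x, y), \quad
\eta_{pq} = \frac{\mu_{pq}}{\mu_{00}^{\,1 + (p+q)/2}}
$$

$$
\begin{aligned}
h_1 &= \eta_{20} + \eta_{02} \\
h_2 &= (\eta_{20} - \eta_{02})^2 + 4\eta_{11}^2 \\
h_3 &= (\eta_{30} - 3\eta_{12})^2 + (3\eta_{21} - \eta_{03})^2 \\
h_4 &= (\eta_{30} + \eta_{12})^2 + (\eta_{21} + \eta_{03})^2 \\
h_5 &= (\eta_{30} - 3\eta_{12})(\eta_{30} + \eta_{12})\left[(\eta_{30} + \eta_{12})^2 - 3(\eta_{21} + \eta_{03})^2\right] \\
    &\quad + (3\eta_{21} - \eta_{03})(\eta_{21} + \eta_{03})\left[3(\eta_{30} + \eta_{12})^2 - (\eta_{21} + \eta_{03})^2\right] \\
h_6 &= (\eta_{20} - \eta_{02})\left[(\eta_{30} + \eta_{12})^2 - (\eta_{21} + \eta_{03})^2\right] + 4\eta_{11}(\eta_{30} + \eta_{12})(\eta_{21} + \eta_{03}) \\
h_7 &= (3\eta_{21} - \eta_{03})(\eta_{30} + \eta_{12})\left[(\eta_{30} + \eta_{12})^2 - 3(\eta_{21} + \eta_{03})^2\right] \\
    &\quad - (\eta_{30} - 3\eta_{12})(\eta_{21} + \eta_{03})\left[3(\eta_{30} + \eta_{12})^2 - (\eta_{21} + \eta_{03})^2\right]
\end{aligned}
$$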
I know... kind of ugly, but it works well! That last vector is fed into a model, in this case a random forest, which recognizes the action over many frames. I ended up with 99% accuracy in training and 71% under cross-validation. Clearly the model was overfit; still, the main objective was achieved: showing that action can be represented by the MHI, MEI, and Hu moments.
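For anyone curious how this looks in code, here's a minimal sketch of the per-frame update and feature extraction, assuming OpenCV, NumPy, and scikit-learn; the names and parameter values here are illustrative, not exactly what I used in the project:

```python
import cv2
import numpy as np
from sklearn.ensemble import RandomForestClassifier

THETA = 30  # frame-difference threshold (theta); illustrative value
TAU = 20    # MHI duration in frames (tau); illustrative value

def update_templates(prev_gray, gray, mhi):
    """One step of the temporal-template update for a new grayscale frame."""
    # Binary motion image D: thresholded absolute frame difference
    motion = cv2.absdiff(gray, prev_gray) >= THETA
    # MHI: stamp tau where motion occurred, decay by 1 everywhere else
    mhi = np.where(motion, float(TAU), np.maximum(mhi - 1.0, 0.0))
    # MEI: union of the last tau motion images == any pixel with nonzero MHI
    mei = (mhi > 0).astype(np.float32)
    return mhi.astype(np.float32), mei

def hu_features(mhi, mei):
    """14-D feature vector: Hu moments of the MHI and the MEI."""
    return np.concatenate([
        cv2.HuMoments(cv2.moments(mhi)).flatten(),
        cv2.HuMoments(cv2.moments(mei)).flatten(),
    ])

# Training: X holds one hu_features() row per labeled frame, y the action labels.
# clf = RandomForestClassifier(n_estimators=100).fit(X, y)
# pred = clf.predict(hu_features(mhi, mei).reshape(1, -1))
```

The MHI starts as an all-zeros float array the size of the frame, and each new frame only needs the previous grayscale frame plus the running MHI.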
This work followed Bobick and Davis's 2001 paper; however, newer methods involving HMMs (source) and deep-learning approaches like 3D CNNs (source) are where things seem to be today.