Computational Time-Lapse
Whenever you've seen a time-lapse video, it probably looked fairly 'jumpy' from one frame to the next, with objects and people 'popping' in and out of the field of view. This is because the goal of time-lapse is to take a long video and compress it down to a more digestible form (perfect for our limited attention spans!). To do that, most time-lapse videos sample the original footage at a fixed interval, keeping only one of every 10-15 frames. As shown below, the streets of NYC are filmed over several hours and then compressed into a 2:42 video, which gives the viewer a nice overall summary.
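For reference, this kind of uniform sampling is trivial to do in code. Here's a minimal sketch, assuming OpenCV is available and using a stride of 10; both are my illustrative choices, not anything from the paper:

```python
import cv2

def uniform_sample(video_path, stride=10):
    """Keep one frame out of every `stride`, discarding the rest."""
    capture = cv2.VideoCapture(video_path)
    kept = []
    index = 0
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        if index % stride == 0:
            kept.append(frame)
        index += 1
    capture.release()
    return kept
```

The stride is blind: it keeps frame 10 and drops frames 1-9 whether anything interesting happened in them or not, which is exactly where the 'popping' comes from.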
However, what if there was a way to compress the video to the same length while preserving as much motion as possible? Put another way, can we remove the choppiness seen in the NYC time-lapse above? Why would you want to do that? Maybe you're creating a time-lapse of someone cooking in the kitchen. The person is in and out of the kitchen, grabbing supplies and ingredients as needed. It would be great if an algorithm could cut out all of the video without motion, i.e. when no one was in the kitchen cooking, and focus on preserving all of the frames where someone was actively moving in the video.
This was the question tackled by Eric Bennett and Leonard McMillan in their paper Computational Time-Lapse Video. The authors explored two approaches to time-lapse video. The first is a non-uniform sampling method that maximizes a user-defined visual objective; in my case, the objective was to preserve as much of the motion between frames as possible. The second, which I did not focus on implementing, is a virtual shutter that extends the effective exposure time of time-lapse frames. This is how you create photos similar to the cover photo for this blog post.
In order to create a motion-preserving video, you need to define an objective function to optimize. To that end, the authors define a min-error metric that computes the cost of jumping from frame $i$ to frame $j$:

$$E(i, j) = \sum_{t=i}^{j} \sum_{x,y} \left| I_t^{(x,y)} - \left( A_{ij}^{(x,y)} + B_{ij}^{(x,y)} \, t \right) \right|$$

where $I_t^{(x,y)}$ is the value of pixel $(x, y)$ in frame $t$, $A_{ij}^{(x,y)}$ is the y-intercept for each pixel, and $B_{ij}^{(x,y)}$ is the slope for each pixel of a line fit across the skipped span. The gist is that we are summing up the amount of 'motion' that we miss between each pair of frames that we jump to. For example, if we jump from frame 1 to frame 5, our error is the sum of the metric above over frames 1 to 5.
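To make the metric concrete, here is a sketch of how the cost for a single $(i, j)$ pair could be computed. It assumes the frames are already loaded as a `(T, H, W)` NumPy array of grayscale values and that residuals are measured with an absolute difference; this is my reading of the metric, not code from the paper:

```python
import numpy as np

def min_error(frames, i, j):
    """Cost of jumping from frame i to frame j (requires j > i).

    Fits a per-pixel line A + B*t across the skipped span via least
    squares, then sums the absolute residuals: the 'motion' that a
    straight cut from i to j fails to represent.
    """
    ts = np.arange(i, j + 1, dtype=float)
    span = frames[i : j + 1].astype(float)        # shape (n, H, W)
    t_mean = ts.mean()
    span_mean = span.mean(axis=0)                 # per-pixel mean, (H, W)
    t_dev = (ts - t_mean)[:, None, None]          # (n, 1, 1)
    # Closed-form least-squares slope and intercept, per pixel.
    B = (t_dev * (span - span_mean)).sum(axis=0) / (t_dev ** 2).sum(axis=0)
    A = span_mean - B * t_mean
    predicted = A[None] + B[None] * ts[:, None, None]
    return float(np.abs(span - predicted).sum())
```

A span where nothing moves fits its per-pixel lines almost perfectly, so it is cheap to skip; a span with lots of motion fits poorly and becomes expensive to jump over.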
In order to solve the problem of which frames to choose, the authors use a dynamic-programming technique, outlined below. This approach, the function D(s, M), produces the optimal sampling set v: the min-error reconstruction of your set of input frames s. To implement it, I created a matrix D with rows equal to the length of s (the number of input frames I had) and columns equal to M (the number of frames I wanted to sample for the time-lapse). Each entry of D holds the minimum cost of a sampling that ends at that frame using that many samples, so filling in the table column by column and then backtracking from the lowest-cost entry in the final column gave me the series of frames that minimized the given error metric (min-error in this case, as described above).
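Here is a sketch of that table-filling and backtracking step, assuming the pairwise costs have been precomputed with the metric above; the variable and function names are mine, not the paper's:

```python
import numpy as np

def select_frames(costs, M):
    """Choose M frame indices that minimize the total min-error cost.

    costs is a (T, T) matrix where costs[i, j] is the error of jumping
    from input frame i to input frame j. D[t, m] holds the minimum cost
    of a sampling that uses m + 1 output frames and ends at frame t.
    """
    T = costs.shape[0]
    D = np.full((T, M), np.inf)
    parent = np.zeros((T, M), dtype=int)
    D[0, 0] = 0.0                          # always keep the first frame
    for m in range(1, M):
        for t in range(m, T):
            # Consider every possible previously sampled frame p.
            for p in range(m - 1, t):
                candidate = D[p, m - 1] + costs[p, t]
                if candidate < D[t, m]:
                    D[t, m] = candidate
                    parent[t, m] = p
    # Backtrack from the cheapest placement of the final sample.
    t = int(np.argmin(D[:, M - 1]))
    path = [t]
    for m in range(M - 1, 0, -1):
        t = int(parent[t, m])
        path.append(t)
    return path[::-1]
```

The triple loop makes this O(T²M); precomputing the full cost matrix is the expensive part in practice, since each entry requires a per-pixel line fit over the skipped span.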
You can see the results of my implementation in the video below.
All of the code for this assignment lives on my GitHub.
Thanks for reading!