Making Videos Searchable at Scale

I have been meaning to write an article about the interesting parts of my graduation project for quite a while now, and I finally got the chance! So here we go.

The Story behind Videra

First, let's quickly go through one of Videra's use cases with an example. Say you have a system of CCTV cameras monitoring a parking lot. Such a system naturally produces an enormous amount of data on a daily basis (24 hours a day multiplied by the number of cameras in the system), and for obvious reasons you would want to keep a history of this footage for some time in case you need to go back and look for something.

The nature of such data is that 99.99% of it is uninteresting; you usually only care about certain events. In our parking lot example, you might want to know when a car entered the parking lot, what make it was, what its color and plate number were, and so on, or when a car left the parking lot. That is to say, at some point in the future you might want to search for that red Chevy that pulled in around midday last week and stayed overnight, and you obviously don't want to watch hours of video to find it.

So it would be nice to be able to filter those many hours of video down to the scenes you are really interested in. This functionality extends to numerous other use cases, such as watching all the scenes featuring your favorite character in a show, or searching for the highlights of a sports event.

The Great Misconception

By now you, like almost every other person I have tried to explain Videra to, probably think that we came up with a one-size-fits-all computer vision model that can search inside any video in existence. I can see why you would think that, but as a matter of fact, we did not. We didn't build any computer vision models at all!

What most people overlook is that there are already lots of great computer vision models out there designed for various use cases; what they all lack is a mainstream way to automate their use to make videos searchable and, most importantly, to do so at scale.

Enter Videra

Videra was designed to streamline the process of using a computer vision model to make videos streamable and searchable based on the results produced by that model, and to keep doing so even as the amount of data grows (which is pretty much always the case).

How does that work? (the short version)

A normal usage can be divided into two sub-flows that can overlap: the ingestion flow and the search flow. A rough sketch of what these flows might look like through the SDK follows the two lists below.

An ingestion flow goes like:

  • A video is uploaded from the client side (using our SDK).
  • This video gets ingested into the system.
  • The specified computer vision model is run on that video and the classification results get indexed into a data store.

A search flow goes like:

  • A list of available filters is shown (filters are tags that the computer vision model is able to produce) along with time range filters to narrow down the search.
  • The user selects a combination of filters and hits search.
  • A grid of all matching videos is shown, and the user can pick whichever video they want to view the results for.
  • When a video is clicked, it is streamed to the client along with a list of all scenes matching the search criteria. Clicking any item in that list seeks the video directly to that specific scene.
  • A digest of the search results can optionally be downloaded as well.
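
To make this more concrete, here is a minimal sketch of what those two flows could look like from the client side. The import path, client constructor, and method names (NewClient, UploadVideo, Search) are hypothetical placeholders for illustration only, not the actual SDK API:

```go
package main

import (
	"fmt"
	"log"
	"time"

	"example.com/videra" // hypothetical import path; not the real SDK package
)

func main() {
	// Hypothetical client pointing at the Name Node's endpoint.
	client, err := videra.NewClient("http://name-node:8080")
	if err != nil {
		log.Fatal(err)
	}

	// Ingestion flow: upload the video together with the model and the custom
	// code/config that the executors will later use to process it.
	videoID, err := client.UploadVideo("parking-lot-cam-01.mp4",
		videra.WithModel("car-detector-model"),
		videra.WithConfig("config.yml"),
	)
	if err != nil {
		log.Fatal(err)
	}

	// Search flow: look for scenes tagged "red car" within a time range and
	// get back the matching videos and scene timestamps.
	results, err := client.Search(videra.Query{
		Tags: []string{"red car"},
		From: time.Now().Add(-7 * 24 * time.Hour),
		To:   time.Now(),
	})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("uploaded %s, found %d matching scenes\n", videoID, len(results))
}
```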

How does it work? (the long version)

[Figure: overall system architecture]

To understand how this works in depth, let's first go through the architecture of Videra. The system consists of three main modules: the Storage module, the Ingestion module, and the Streaming module. The Storage module is responsible for storing the user data involved in the process (videos, computer vision models, custom code, configuration files, ...). The Ingestion module is responsible for applying the specified computer vision model to the uploaded videos and thus making them searchable. Finally, the Streaming module is responsible for applying the video encoding transformations needed to make a video streamable and for actually streaming it to a client-side dashboard.

Now let's go deeper into the design of each of these modules, highlight its design challenges, and see how it addresses availability and scalability concerns.

Storage Module

storage module architecture

The storage module consists of two different types of nodes: a Name Node, which acts as the orchestrator of the storage cluster, and any number of Data Nodes, each acting as a storage unit in the system.

When the storage cluster is booted, the Name Node comes up and the Data Nodes start communicating with it to join the storage cluster. Once a Data Node joins the cluster, the Name Node starts tracking its liveness by regularly performing health checks and issuing an alert if it loses connection to a Data Node for x consecutive health checks.
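
As an illustration, here is a minimal sketch of what such a liveness check on the Name Node side could look like. The /health endpoint, interval, and threshold are assumptions for the sake of the example, not the actual implementation:

```go
package main

import (
	"log"
	"net/http"
	"time"
)

// monitorDataNode regularly hits a Data Node's (assumed) /health endpoint and
// raises an alert once it has failed `threshold` consecutive checks, mirroring
// the "x consecutive health checks" rule described above.
func monitorDataNode(addr string, interval time.Duration, threshold int) {
	failures := 0
	for range time.Tick(interval) {
		resp, err := http.Get("http://" + addr + "/health")
		healthy := err == nil && resp.StatusCode == http.StatusOK
		if resp != nil {
			resp.Body.Close()
		}
		if healthy {
			failures = 0
			continue
		}
		failures++
		if failures == threshold {
			log.Printf("ALERT: lost connection to Data Node %s", addr)
		}
	}
}

func main() {
	// One monitoring goroutine per Data Node that joined the cluster.
	go monitorDataNode("data-node-1:9000", 5*time.Second, 3)
	select {} // keep the Name Node process alive
}
```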

Aside from maintaining cluster integrity, the Name Node is the entry point for any data access operation in the cluster: store/read requests first go through the Name Node, since it knows where every file is stored. It also acts as the cluster's load balancer, balancing storage across all the Data Nodes and balancing which nodes serve which requests to avoid creating hot spots that would overload certain Data Nodes. Likewise, when data is needed (either for download or streaming), the Name Node looks up where the file is stored and chooses a Data Node that has it to serve the request.

So when a storage request is received from the SDK:

  • The Name Node decides which Data Nodes will serve the request: it responds with a connection to a target Data Node for primary storage and also designates target Data Nodes for replication.
  • The SDK then starts sending data to the chosen primary Data Node (authentication happens before any data is accepted from the SDK, to prevent unauthorized uploads to the system).
  • The upload uses a resumable format where data is broken into small packets. This lets us support pause/resume and retry failed packets with minimal overhead. During the upload, when the primary Data Node receives a data packet, it synchronously replicates it to the replica Data Nodes designated by the Name Node, which guarantees that an upload is not considered done unless multiple replicas of the same file exist (the number of replicas is controlled by the replication ratio). Further replication may take place asynchronously. The idea is that synchronous replication provides durability: if we lose the primary node, we still have replicas on other nodes. Asynchronous replication, on the other hand, lessens the overhead of synchronous replication and mainly addresses availability concerns. A sketch of this packet-by-packet replication follows this list.
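
Here is a minimal sketch of the packet loop on a primary Data Node, assuming a hypothetical /replicate endpoint on the replicas and leaving out authentication, resumability bookkeeping, and retries for brevity:

```go
package main

import (
	"bytes"
	"io"
	"net/http"
	"os"
)

const packetSize = 1 << 20 // assumed 1 MB packets

// handleUpload appends incoming packets to local storage and synchronously
// forwards each packet to the replica Data Nodes designated by the Name Node
// before acknowledging the upload.
func handleUpload(w http.ResponseWriter, r *http.Request, replicas []string) {
	name := r.URL.Query().Get("file")
	file, err := os.OpenFile("/data/"+name, os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0o644)
	if err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}
	defer file.Close()

	buf := make([]byte, packetSize)
	for {
		n, err := r.Body.Read(buf)
		if n > 0 {
			packet := buf[:n]
			if _, werr := file.Write(packet); werr != nil {
				http.Error(w, werr.Error(), http.StatusInternalServerError)
				return
			}
			// Synchronous replication: a packet is not acknowledged until every
			// designated replica has confirmed it, which is what keeps the data
			// durable even if the primary is lost right afterwards.
			for _, replica := range replicas {
				resp, rerr := http.Post("http://"+replica+"/replicate?file="+name,
					"application/octet-stream", bytes.NewReader(packet))
				if rerr != nil || resp.StatusCode != http.StatusOK {
					http.Error(w, "replication failed", http.StatusBadGateway)
					return
				}
				resp.Body.Close()
			}
		}
		if err == io.EOF {
			break // all packets of this request received and replicated
		}
		if err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}
	}
	w.WriteHeader(http.StatusOK)
}

func main() {
	replicas := []string{"data-node-2:9000", "data-node-3:9000"}
	http.HandleFunc("/upload", func(w http.ResponseWriter, r *http.Request) {
		handleUpload(w, r, replicas)
	})
	http.ListenAndServe(":9000", nil)
}
```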

Ingestion Module

[Figure: ingestion module architecture]

The catch here is that the data is expected to be very large, which makes it impractical to move it around the system for processing. So the idea is to bring the computation to where the data is stored. In other words, we don't move a 10GB video from a Data Node to some other node to be processed; instead, we schedule the compute unit on the Data Node where the video is stored.

And thus the Ingestion module consists of two main components: the orchestrator, which acts as the coordinator of the ingestion process, and the executors, which are basically the workers that do the actual processing (applying the computer vision model to the video).

The other main idea here is that a video can be processed in parallel. Theoretically, we can apply the computer vision model to frames[1..10] while at the same time applying it to frames[11..20], and so on. Of course, some models, like those that work in a sliding-window fashion, require some overlap; for example, frames[1..10] and frames[8..17] have an overlap of 3 frames, and the overlap size can be set in a user-provided configuration file.

But the idea stands (in most cases): we can split the video into much smaller chunks, process them in parallel, and scale horizontally (add more executor replicas) when needed. This also comes with the added benefit of minimizing failure overhead, because if a processing batch fails, only that batch is affected and it can be retried. The system can also recover from fatal failures and continue processing from where it left off without repeating any already-processed batches.
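
Concretely, here is a minimal sketch of how such job boundaries could be computed; the function and parameter names are illustrative, not the actual implementation. Note how frames[1..10] and frames[8..17] from the example above fall out of a 10-frame job size with a 3-frame overlap:

```go
package main

import "fmt"

// Job describes a half-open frame range [Start, End) that one executor
// replica can process independently.
type Job struct {
	Start, End int
}

// splitIntoJobs computes job boundaries only; no video data is read here.
// The overlap accommodates sliding-window models that need to see a few
// frames from the previous chunk.
func splitIntoJobs(startFrame, totalFrames, jobSize, overlap int) []Job {
	last := startFrame + totalFrames // one past the final frame
	var jobs []Job
	for start := startFrame; start < last; start += jobSize - overlap {
		end := start + jobSize
		if end > last {
			end = last
		}
		jobs = append(jobs, Job{Start: start, End: end})
		if end == last {
			break
		}
	}
	return jobs
}

func main() {
	// A 30-frame video split into 10-frame jobs with a 3-frame overlap:
	// [{1 11} {8 18} {15 25} {22 31}], i.e. frames 1-10, 8-17, 15-24, 22-30.
	fmt.Println(splitIntoJobs(1, 30, 10, 3))
}
```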

So when a video is successfully uploaded to some Data Node, it triggers an ingestion process that, when scheduled, does the following:

  • The orchestrator spawns as many executor replicas as needed (a configurable number that can be scaled up) and monitors their liveness by performing regular health checks, automatically spawning new replicas in case of a replica failure.
  • It loads the video metadata provided by the Data Node (things like how many frames there are, which frame to start processing from, paths to the custom code and configuration files, etc.).
  • It splits the video into smaller jobs using that metadata. It's important to note that the actual video is not read from disk at this point; the orchestrator just computes the job boundaries. For example, if we start processing a 10-frame video from frame 1 with a job size of 5, it enqueues 2 jobs, i.e. job{1, 6} and job{6, 10}.
  • Each executor replica pulls a job from the messaging queue (we built our own messaging queue for this task), loads that tiny slice of the video (using the custom code provided by the user, for maximum configurability), applies the computer vision model to it, and finally indexes the resulting {timestamp: tag} pairs in a database. A sketch of such an executor loop follows this list.
  • The orchestrator monitors execution progress and automatically detects when jobs fail and need retrying. When it detects that all jobs have executed successfully, it automatically terminates.
  • Multiple ingestion processes can be scheduled concurrently on the same machine. The thing here is that ingestion processes usually require a GPU-enabled machine. This takes us back to the Storage module, which is smart enough to choose a primary machine that has a GPU (so the ingestion process can be scheduled on it) while choosing other, cheaper machines for replication.
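
To tie these steps together, here is a minimal sketch of an executor replica's main loop. The Queue, Model, and Index interfaces are hypothetical stand-ins for the actual messaging queue, the user-provided model code, and the results data store:

```go
package main

import (
	"fmt"
	"log"
)

// Job is the unit of work enqueued by the orchestrator: a frame range of a
// particular video stored on this Data Node.
type Job struct {
	VideoPath  string
	Start, End int
}

// Queue, Model, and Index are hypothetical stand-ins for the real components.
type Queue interface {
	Pull() (Job, bool) // blocks until a job is available; false when drained
	Ack(Job)           // mark the job as successfully processed
	Fail(Job)          // return the job to the queue for retry
}

type Model interface {
	// Classify runs the computer vision model on a frame range and returns
	// {timestamp: tag} pairs, as described above.
	Classify(videoPath string, start, end int) (map[float64]string, error)
}

type Index interface {
	Store(videoPath string, results map[float64]string) error
}

// runExecutor is the worker loop: pull a job, load just that slice of the
// video via the user-provided code, classify it, and index the results.
func runExecutor(q Queue, m Model, idx Index) {
	for {
		job, ok := q.Pull()
		if !ok {
			return // no more jobs: the orchestrator can terminate the run
		}
		results, err := m.Classify(job.VideoPath, job.Start, job.End)
		if err != nil {
			log.Printf("job %+v failed: %v", job, err)
			q.Fail(job) // only this batch is retried; others are unaffected
			continue
		}
		if err := idx.Store(job.VideoPath, results); err != nil {
			q.Fail(job)
			continue
		}
		q.Ack(job)
		fmt.Printf("indexed %d tags for frames [%d, %d)\n", len(results), job.Start, job.End)
	}
}

func main() {
	// Wiring of concrete Queue/Model/Index implementations is omitted here.
	_ = runExecutor
}
```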

Streaming Module

[Figure: dashboard]

When a file is successfully uploaded, the Name Node chooses one of the replicas and schedules a streaming transformation process on it. We chose to run the streaming transformations on a replica rather than on the primary because the primary is already used to run the ingestion process (which requires a GPU).

At this step, we apply HLS (HTTP Live Streaming) to the video. This transforms the video into a format we can use for streaming: it controls video quality and, most importantly, constructs a map that ties certain intervals of the video to certain segments of it, i.e. when the user seeks to seconds 1-5, that range maps to a tiny part of the video that gets streamed. This comes with the added benefit of lazily loading parts of the video on demand, which is essential here because the user is unlikely to stream the entire video; they are already searching for specific events, so it makes sense to stream on demand. HLS also supports buffering for a seamless experience.
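
For illustration, the playlist produced by such a transformation for a video-on-demand asset typically looks something like the following (the segment names and durations here are made up); a player seeking to second 15 only needs to fetch the second segment:

```
#EXTM3U
#EXT-X-VERSION:3
#EXT-X-TARGETDURATION:10
#EXT-X-MEDIA-SEQUENCE:0
#EXTINF:10.0,
segment_000.ts
#EXTINF:10.0,
segment_001.ts
#EXTINF:10.0,
segment_002.ts
#EXT-X-ENDLIST
```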

So when a streaming request is received from the dashboard, the Name Node chooses a Data Node that has the video in the streamable format, and that Data Node acts as a file server to stream the video.
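
Since the HLS output is just a playlist plus a set of segment files, a Data Node can serve it with a plain file server. Here is a minimal sketch of that, with a hypothetical output directory and port:

```go
package main

import (
	"log"
	"net/http"
)

func main() {
	// Hypothetical directory where the streaming transformation wrote the HLS
	// output (the .m3u8 playlist plus the .ts segments) for this node's videos.
	hlsDir := "/data/hls"

	// Serving these as static files is all HLS needs: the player fetches the
	// playlist first and then requests segments on demand, which gives us the
	// lazy, seek-friendly loading described above.
	http.Handle("/stream/", http.StripPrefix("/stream/", http.FileServer(http.Dir(hlsDir))))

	log.Fatal(http.ListenAndServe(":8090", nil))
}
```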

As seen in the previous picture, an entire episode of Silicon Valley is streamed along with a list of all scenes where the character Gilfoyle appears (the model is trained to detect Gilfoyle, so that's the only searchable tag in this example). This was done by just uploading a video and a computer vision model and waiting for a while (search results are available in real time; we don't really have to wait for the entire process to complete).

Show me the numbers!

A main selling point of Videra is its ability to parallelize workloads and easily scale horizontally as the amount of data grows. This adds a lot of overhead to the system's design and operation, but it's bound to deliver significant time savings. The following table highlights some of the experiments we ran. It compares the baseline time (obtained by running the computer vision model as it was originally intended to run) versus the Videra time (obtained by processing the same video through Videra).

[Figure: benchmarks]

We can see from this data that using Videra offers around a 60% reduction in the total time needed for ingestion. It's worth noting the following:

  • The more replicas we use, the better the performance, up to a point where performance starts degrading due to resource bottlenecks (mainly CPU). This can be solved by using more powerful machines, which would be the case in any production system.
  • The larger the job size, the better, because it minimizes the overhead of scheduling jobs. But larger jobs also mean more overhead during failure recovery, because now larger jobs have to be repeated.
  • Using large jobs also increases the memory footprint, which is undesirable because it limits the system's ability to process more jobs in parallel.

The cherry on top

Videra ships with an SDK (currently Go only) that handles all the needed communication with the Videra backend and can easily be included directly in the client application code.

Shut up and Show me the Code

Conclusion

We now have a standard tool that can take any model (we currently support certain categories of models, but adding support for more is quite easy) and any video (this can easily extend to cover audio as well) and make it streamable and searchable, at scale!
