Video Analytics in Natural Language

György Balogh, CTO, Ultinous

‘Please, show me the moment when a forty-something man gets out of a red sports car.’ Imagine giving your computer just that order and receiving the right image from millions of minutes of video recordings in return. Pretty futuristic! Or is it? Let’s give this idea a reality check!

THE EXPERIMENT

As in my previous experiment, I put the largest neural network, GPT-3, to the test. This time, I wanted to see how close we are to realizing the scenario described above. Technically speaking, I checked GPT-3’s ability to generate SQL queries over video analytics datasets.

In order for GPT-3 to solve such problems, I first had to give it an example and some explanation. The input in blue below includes data schemas, helper functions and syntax hints. This part is not seen by the end user, but it is necessary for GPT-3 to understand what it needs to do. Here I described a typical use case for many Ultinous partners: an observed area where we monitor how many people are not wearing a helmet.

Once the code was in, I engineered a question, shown in bold below, asking the model to count and show those 5-second windows in which 10 or more people are not wearing a helmet within the observed area. The question is quite complex, and if you are a programmer you might also notice that many parts of the code are loosely defined. Despite all that, GPT-3 generated a solid SQL query! The answer, in green below, can be displayed in a simple data or chart format for the end user.

Let’s see the code

[Image: the full prompt (data schema, helper functions and syntax hints in blue; the engineered question in bold) and GPT-3’s generated SQL query in green]
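
Since the screenshot itself is hard to read here, a rough sketch of what the prompt and the generated query looked like follows. Treat it as a reconstruction rather than a verbatim copy: the table name detections, the exact field names, the FV and RECT types and the area coordinates are illustrative placeholders; only the millisecond timestamps, the ‘person’ type, FvMatch, the ‘wearing helmet’ label and the ‘->’ operator are fixed by the analysis below.

-- Prompt sketch (the part shown in blue): schema and hints.
-- All names below are illustrative placeholders, not the original prompt.
--
-- detections: one row per detected object
--   time        BIGINT  -- detection time in milliseconds
--   obj_type    TEXT    -- e.g. 'person', 'car'
--   attributes  FV      -- feature vector of visual attributes
--   bbox        RECT    -- bounding box; nested fields are accessed with '->'
--
-- FvMatch(fv, label): true if the feature vector matches the given label.

-- A generated query of this shape (the part shown in green) matches the
-- behaviour analyzed under RESULTS:
SELECT time / 5000 AS window,                    -- 5-second buckets (time is in ms)
       COUNT(*)    AS people_without_helmet
FROM detections
WHERE obj_type = 'person'                        -- object type filter
  AND NOT FvMatch(attributes, 'wearing helmet')  -- attribute filter
  AND bbox->x1 >= 100 AND bbox->y1 >= 100        -- inside the observed area, given
  AND bbox->x2 <= 500 AND bbox->y2 <= 400        -- as top-left / bottom-right corners
GROUP BY time / 5000
HAVING COUNT(*) >= 10;                           -- windows with 10+ such people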

RESULTS

The output is quite impressive! Let’s look at some non-trivial details it figured out.

  1. Time is in milliseconds; the model can only know this from the comment in the data structure. We asked for 5-second windows, so time / 5000 is correct (assuming integer division). If we change the comment to second resolution, the output is still correct! (See the sketch after this list.)
  2. It separated the object type (person) from the object attribute (wearing a helmet) and applied separate filters for the two.
  3. FvMatch is only loosely defined, but the model understood that it is the missing piece to use for attribute matching. The ‘wearing helmet’ constant was not mentioned anywhere, yet it was quite a good guess that we need something like that to match against.
  4. GPT-3 assumes we are specifying the rectangle by its top-left and bottom-right coordinates. Under this assumption, the rectangle arithmetic is correct!
  5. It correctly used the ‘->’ operator to access nested fields. This was not shown in the examples, only explained in a short sentence. (It also works correctly if we change it to ‘.’.)
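
To make the windowing arithmetic in point 1 concrete: with integer division, only the divisor depends on the time unit declared in the schema comment. Using the same hypothetical detections table as in the sketch above:

-- Comment says 'time in milliseconds': dividing by 5000 gives 5-second buckets.
SELECT time / 5000 AS window, COUNT(*) AS n
FROM detections
GROUP BY time / 5000;

-- Comment changed to 'time in seconds': the correct divisor becomes 5,
-- and GPT-3 adjusted its output accordingly.
SELECT time / 5 AS window, COUNT(*) AS n
FROM detections
GROUP BY time / 5;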

THE FUTURE

In a real-life scenario, the user could interact with the system using voice or text. The more examples we upload to the model, the more complex analyses we will be able to run. Past a certain threshold, GPT-3 might even be able to interpret any type of question. Even finding that man with the red sports car!? In essence, clients would no longer need engineers to develop a new application every time they want to analyze a different event. Although GPT-3 still has its limitations, its ability to solve such a difficult use case suggests that the topic is worth digging into deeper. :)

Gabriel R.

International Business Development

2y

Using everyday language when making complex forensic searches in video is certainly a game changer
