Experiments in Automated Brand Flipping in Videos
Bharath Kumar Mohan
Couch Potato | CEO @ Sensara.tv | PhD in Recommender Systems | ex-Google
The video industry has been under pressure for the last couple of years to monetize better. Pure SVoD and pure AVoD represent the two extremes: the former delivers an excellent, uninterrupted viewing experience but is expensive, while the latter inserts many ads to make up for subsidized (or zero-fee) plans. Digital brand insertion (or in-content advertising) has emerged as a new format that strikes a balance between the two. Brand ads and banners can be inserted seamlessly into the storyline without interrupting the flow of the story, and the viewer's high attention during the story ensures a strong brand uplift.
Identifying suitable spots inside the storyline and placing a brand banner or object there is a big technical challenge in in-content advertising. It slows down the process and makes it heavily manual.
To understand this better, let's look at how entertainment video is actually made. A scene is an incident in the storyline that happens at a given place and time. However, a scene is shot with several cameras giving different perspectives of the people and the location, and is often broken down into "shots". Shots are frequently interspersed: every time a person speaks, stops speaking, or another person starts, the shot changes.
Even inside a shot, the camera or the subjects move around.
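Any automated system therefore has to know where one shot ends and the next begins before it can track anything. As a minimal sketch (not our production pipeline; the bin count and threshold here are illustrative assumptions), a cut can be flagged wherever the gray-level histogram changes abruptly between consecutive frames:

```python
import numpy as np

def shot_boundaries(frames, bins=16, threshold=0.5):
    """Return indices i where frame i starts a new shot, judged by an
    abrupt change in the normalized gray-level histogram."""
    cuts = []
    prev_hist = None
    for i, frame in enumerate(frames):
        hist, _ = np.histogram(frame, bins=bins, range=(0, 256))
        hist = hist / max(hist.sum(), 1)
        if prev_hist is not None:
            # Half the L1 distance between normalized histograms lies in [0, 1].
            if 0.5 * np.abs(hist - prev_hist).sum() > threshold:
                cuts.append(i)
        prev_hist = hist
    return cuts

# Example: three dark frames followed by three bright frames.
frames = [np.full((8, 8), 10)] * 3 + [np.full((8, 8), 200)] * 3
print(shot_boundaries(frames))  # → [3]
```

Real cut detectors are more careful (gradual dissolves, flashes, and camera motion all defeat a plain histogram test), but the per-shot structure this recovers is what the rest of the pipeline operates on.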
In realistic scenarios, and given that it is very easy for the human eye to spot abnormalities, in-content advertising is challenging.
These constraints, together with time and quality demands, make in-content advertising an expert's job. Companies like Whisper Media (also a partner of Sensara.tv) have been doing this for several years and have an effective (human) process to identify such contexts and create photo-realistic brand placements.
In the world of personalized ad targeting and programmatic advertising, brands ask for dynamic replacement of such banners and objects to suit target segments, time of watching, device form factors, distribution channels and geographies. An already time-pressed, hard problem only gets harder because multiple variants need to be produced in almost no time, sometimes dynamically.
Automated brand flipping can help. At Sensara.tv, we like hard problems, and we set out to solve this one. Here's the story of our journey so far.
We approached this problem as a generalized "find & replace" problem in video: identify an existing object (simplified to a flat 2D rectangular area) and replace it in every perspective and shot, irrespective of occlusion, panning or zooming. A "video modifier" system is told what it is supposed to look for and what it is supposed to replace it with. It analyzes the original video, reverse-engineers the "extents" of the original poster across shots, pans and occlusions, produces a mask, and does a pixel-to-pixel replacement with the new poster.
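The geometric core of replacing a flat rectangle under panning and zooming is a planar homography: the poster's corners in one frame are related to its corners in any other view by a 3×3 projective transform. Below is a minimal NumPy sketch of the estimation step using the standard DLT method; a real system would sit robust feature matching and RANSAC on top of this, and the function names are our own illustrations, not the actual product API:

```python
import numpy as np

def estimate_homography(src, dst):
    """DLT: solve for the 3x3 H with dst ~ H @ src (in homogeneous
    coordinates), given N >= 4 point correspondences as (N, 2) arrays."""
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        rows.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    # H is the null vector of this 2N x 9 system (last right-singular vector).
    _, _, vt = np.linalg.svd(np.asarray(rows, dtype=float))
    H = vt[-1].reshape(3, 3)
    return H / H[2, 2]

def project(H, pts):
    """Apply H to (N, 2) points, dividing out the projective scale."""
    pts_h = np.hstack([pts, np.ones((len(pts), 1))])
    mapped = pts_h @ H.T
    return mapped[:, :2] / mapped[:, 2:3]

# Poster corners seen in one frame vs. the same corners a pan later.
src = np.array([[0.0, 0], [100, 0], [100, 60], [0, 60]])
dst = src + [20, 5]                       # a pure translation, for illustration
H = estimate_homography(src, dst)
print(np.allclose(project(H, src), dst))  # → True
```

Once H is known per frame, the new poster can be warped through it and composited under the tracked mask, which is what makes the replacement follow pans and perspective changes.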
Once implemented, a campaign studio accepts these pairs (find-replace) and the video, and renders the new video. Here are some sample renders from our labs.
Case 1: A simple scene/shot in which a poster in the background has been replaced by a movie poster. Realistic render, even though the camera is panning around.
Case 2: A simple scene/shot in which a poster in the background has been replaced by a video itself! Realistic render, even though the camera is panning around.
This type of transformation makes it appear as if an LED screen were placed there.
Case 3: A test of perspective. The wind makes the curtains flow. The replacement should follow the original orientation naturally.
Notice how even the shadows are perfectly aligned. In fact, the only way to tell that the right side is fake is to observe the reflection of the original poster in the windows.
Case 4: Static Occlusion. What if the poster is occluded by some other object? What if the whole video is still panning around?
This case is rather tough. The main poster of interest is occluded by another, smaller one. The camera movement is also jerky, following the footsteps of the camera holder as he takes the shot, and the panning drastically changes the perspective.
The render is still good, although you will notice glitches in some frames. We are working these out.
Case 5: Dynamic Human occlusion. What if a human actually comes in front of the poster?
We have some distortion in the replaced poster here: the foreground object (the hand) needs to be segmented pixel-perfectly.
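The compositing step itself is simple once the hard part, a pixel-accurate foreground mask, is available. A toy sketch, assuming per-frame boolean masks for the tracked poster region and for the occluder (both would come from tracking and segmentation models in practice):

```python
import numpy as np

def composite(frame, replacement, region_mask, occluder_mask):
    """Paste `replacement` into `frame` inside the tracked poster region,
    skipping pixels covered by a foreground occluder (e.g. a hand)."""
    paste = region_mask & ~occluder_mask
    out = frame.copy()
    out[paste] = replacement[paste]
    return out

# Toy 4x4 gray frame: the poster occupies the left half, a hand covers one pixel.
frame = np.zeros((4, 4), dtype=np.uint8)
poster = np.full((4, 4), 255, dtype=np.uint8)
region = np.zeros((4, 4), dtype=bool); region[:, :2] = True
hand = np.zeros((4, 4), dtype=bool); hand[1, 1] = True
out = composite(frame, poster, region, hand)
print(out[1, 1], out[0, 0])  # → 0 255
```

Every error in the occluder mask shows up directly as the kind of edge distortion described above, which is why segmentation quality dominates this case.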
We've reached a stage where non-occlusive and certain occlusive cases can be photo-realistically rendered with new posters - even if there is panning and movement. The ones related to massive occlusion need to be improved upon.
The Automated Brand Flipping system is available in early beta for interested parties. As the examples show, it works extremely well on simple cases and acceptably well in some tough conditions, while the toughest conditions need review and some editing. The system is also perspective- and shot-invariant.
Here's a summary table of the capabilities we support in automated brand flipping.
This work has been a year-long effort from our research team: Anirudh Koti and Abhay Garani. Thanks, and looking forward to more improvements and innovation.
We've employed the best of computer vision and deep learning techniques to get to this stage. We are exploring new advances in vision transformers to make this even more realistic. We'd love to learn about other efforts in automated in-content advertising - especially covering the hard corner cases.
Media Tech Leader
9 months ago: Great work. Kudos to you and your team.
Your Wellbeing Coach | Career : Happiness : Wellbeing | xGoogle
9 months ago: Looks interesting, Bharath! Though not directly related to the specific problem you guys are solving, I am dumping some info below in case it interests your future intern stars :) Film editor Walter Murch, in his book "In the Blink of an Eye", while talking about what makes an ideal cut while editing, lists six criteria that need to be satisfied: 1) it is true to the emotion of the moment; 2) it advances the story; 3) it occurs at a moment that is rhythmically interesting and "right"; 4) it acknowledges what you might call "eye-trace", the concern with the location and movement of the audience's focus of interest within the frame; 5) it respects "planarity", the grammar of three dimensions transposed by photography to two (the questions of stage-line, etc.); 6) and it respects the three-dimensional continuity of the actual space (where people are in the room and in relation to one another). He also says, "Emotion, at the top of the list, is the thing that you should try to preserve at all costs. If you find you have to sacrifice certain of those six things to make a cut, sacrifice your way up, item by item, from the bottom." (Comment 1/2)
Your Wellbeing Coach | Career : Happiness : Wellbeing | xGoogle
9 months ago: I guess your brand flipping is targeting criteria 5 and 6 from the above list by picking a safe zone with which you can play, reducing the complexity of the problem. But at the same time, you are also in contradiction with item 4, since you might end up sacrificing it if you want the audience's eye to trace the new brand being inserted when it was not part of the eye trace in the original video. Item 3 is an interesting option to explore if we can recognize the tempo/rhythm of the scene from visual cuts or even audio cuts (beats/peaks, etc.). For example, if it is a music video with fast editing, it may be emotionally easy to insert brand images for a very short duration. Lastly, for item 1, if we can recognize the scene's emotion and place the brand to match that emotion, we can ignore the rest of the criteria. (Comment 2/2)