How long until #deepfakes are real?

A friend sent me an article about a Florida politician who created a misleading ad by altering the audio and manipulating the video. It was pretty amateurish, actually, and didn’t use a whole lot of technology, but my friend knows that the technology behind creating much more convincing video (called deepfakes, a combination of deep learning and fake video) is advancing rapidly, and he asked me:

“How far in the future is this?”

Others have asked, and my snap answer has been 2-3 years. For those who don’t want to read another 1500 words, the TL;DR is that my previous answer was likely far too aggressive, and this technology is likely about 10 years away, in my view.

Now for the serious readers!

One of the reasons that people are so worried about video deepfakery is that audio fakes are imminent. Anyone who saw the Google demo in May knows that computer-synthesized voice that is more or less impossible to distinguish from human speech is here now. Making that voice sound exactly like Donald Trump or Hillary Clinton (not-so-random examples, see image at top!) will be slightly more difficult, but not much. Prediction: “perfect” audio deepfakes will be possible within two years, maybe even 12 months. Yikes.

Video is harder than audio, Part I, technology

A digital version of the human speaking voice is measured in kilobits per second. Meanwhile, reasonable quality video (more on this below) is measured in megabits per second. That means faking video is at least 1,000 times harder than faking audio purely from a computing viewpoint. For some technical reasons, it is likely even harder than that, but let’s stick with 1,000 times harder for now. That is three orders of magnitude, which is a useful way to talk about these kinds of problems. Many things are measured in orders of magnitude, including human hearing (decibels), earthquakes, and so on.

Video is harder than audio, Part II, biology

Humans have two ears and two eyes…but our sense of hearing is not nearly as good as our sense of sight. We are visual creatures, but making the problem of creating convincing video deepfakes even harder is that we have a superpower: a portion of our brain (temporal cortex) has a special set of neurons responsible for faces: recognition and interpretation.

How many people do you know whose voice you can reliably recognise over the phone, i.e. with no visual cues? I would guess for me there are 10 I would be really good at, and perhaps 20-30 where I would be willing to hazard a guess, and forgo the “Who is this?” Meanwhile, there are literally hundreds if not thousands of people whose face I recognise instantly with no other cue. Worse, for those trying to fake it, we are unusually good at reading faces in motion. I can interact with someone face to face (notice we don’t say voice to voice) and pick up that they are tired, sick, sad, happy, drunk or on drugs. And I can do it in a few seconds.

The challenge for those who want to make a deepfake of Trump or Clinton isn’t just creating a still image of their face that looks right; it is animating that face in such a way that the average person (more on that below) would be convinced. I have not seen any research on this, but at a guess our visual and facial superpowers make the deepfake video challenge ANOTHER 1,000 times harder than audio alone. Which means the combined technological and biological hurdles make video roughly 1 million times harder than audio, or six orders of magnitude.
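As a rough check on the arithmetic above: multiplying difficulty factors is the same as adding their orders of magnitude. The sketch below uses the two factors of 1,000 from the text, both of which are estimates rather than measurements, and the variable names are purely illustrative.

```python
import math

# Both factors are the article's rough estimates, not measurements.
technology_factor = 1_000  # megabits vs. kilobits per second
biology_factor = 1_000     # guessed penalty for our face-reading ability

combined_factor = technology_factor * biology_factor

# Multiplying factors adds their orders of magnitude (powers of ten).
orders_of_magnitude = round(math.log10(combined_factor))

print(f"{combined_factor:,} times harder = {orders_of_magnitude} orders of magnitude")
```

If either guess is off by a factor of ten, the total moves by one order of magnitude, which is why the estimate is best read as a ballpark.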

So you say you want a resolution: video quality matters

When we discuss digital video, the most common resolutions used on YouTube are 240p, 360p, 720p, 1080p (HD) and 2160p (4K). The numbers refer to the number of horizontal lines in the image, and more is higher quality, but also requires more bits. In other words, a 1080p deepfake video is at least 10 times harder than 240p technically, and when we throw in the biological angle again, likely closer to 100 times harder (spotting the microcues that allow us to “read” faces requires pretty detailed images). Prediction: low res video deepfakes will arrive much sooner (years sooner) than high resolution deepfakes. Secondary prediction: except for those who are predisposed to believe a video (see below), 240p deepfakes viewed on a smartphone are NOT what we are talking about. 720p will be a minimum, and full HD video will be the new standard for believability if there is a video of Trump taking orders from Putin or Hillary and Bill Clinton discussing their combined pizza/child porn/murder conspiracy.
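To see where “at least 10 times harder” comes from, here is a minimal sketch comparing raw pixel counts. It assumes 16:9 frames of 426x240 and 1920x1080 (the frame widths are my assumption; the text only gives line counts), and it treats pixel count as a crude proxy for difficulty. Compressed bitrates do not scale exactly with pixel count.

```python
# Assumed 16:9 frame sizes (width, height); only the line counts
# (240 and 1080) come from the article.
RESOLUTIONS = {"240p": (426, 240), "1080p": (1920, 1080)}


def pixel_count(label):
    """Return the number of pixels in one frame at the given resolution."""
    width, height = RESOLUTIONS[label]
    return width * height


ratio = pixel_count("1080p") / pixel_count("240p")
print(f"1080p has about {ratio:.0f}x the pixels of 240p")
```

Under these assumptions each 1080p frame carries roughly 20 times as many pixels as a 240p frame, which is consistent with the “at least 10 times” figure.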

What counts as a “believable” deepfake? The bias problem

In 2018, my example above was not chosen randomly. In a highly partisan society, there will be those who will view ANY deepfake video, even one using puppets, as convincing “proof” that their deepest suspicions about those they oppose are true. So when we talk about this issue, we need to be clear that we are ONLY talking about videos that convince the average sceptical person. The bar will be much higher for those folks. And I think that many people are becoming more sceptical about what they read online, which raises the bar even higher.

Fake Duncan videos will be surprisingly hard

The deep in deepfake is for deep learning, an AI technique (part of machine learning) that has achieved exponential improvements in the last two years, and powered the Google voice demo that is freaking everyone out. The problem is that in order to use machine learning to create a fake Duncan, you need hundreds of hours of video of me to train the model on. And there aren’t hundreds of hours of me available. So creating a deepfake video of me saying that I really do think VR is going to be the next big thing is not imminent. :)

Fake Trump or Clinton videos: uh oh

But there are not hundreds of hours of high quality video of either 2016 Presidential candidate…there are thousands of hours. Maybe tens of thousands. And that is grist for the machine learning mill. Making a 100% convincing video of a famous politician (or movie star, or anyone with thousands of hours of talking head high quality video) WILL happen one day.

So back to the question at the beginning: “How far in the future is this?”

The pace of machine learning: more than Moore’s!

The reason we’re talking about deepfake videos is that the progress in machine learning in the last 24 months has been astonishing. The best ever human voice synthesis sounded highly robotic in late 2016, and by mid-2018 is basically perfect. It is hard to measure exactly, but machine learning is about 1,000 times “better” than it was two years ago. Ballpark. Three orders of magnitude, in two years, compared with Moore’s Law, which would see a mere doubling of performance in the same time span.

If I look at my comments above, video deepfakes are roughly one million times harder than audio, or six orders of magnitude. At three orders of magnitude every two years, that would mean high definition deepfakes will be fully convincing in roughly four years’ time, or 2022. But that is not my prediction.

The problem is that I am pretty sure the last 24 months were a kind of Cambrian explosion for machine learning. There has indeed been a tremendous increase in the abilities of machine learning and an even bigger drop in price: what used to cost $100,000 to do in machine learning now costs $10 through the cloud. (Not an exaggeration.)

But these gains were kind of low-hanging fruit. New and specialised chips did not exist in 2016, but do now. There will be further evolutions of those chips, and things will improve, likely at a faster rate than Moore’s Law would suggest. But machine learning will be perhaps 100 times better in two years, or even more likely only ten times better, not a thousand times.

If it is 100x better from 2018 to 2020, and then 10x better every two years from then on, it will be 1,000x better than today in 2022, 10,000x in 2024, 100,000x in 2026, and 1 million times better only by 2028.
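The two timelines can be reproduced with a few lines of arithmetic. This is a minimal sketch, assuming improvement arrives in discrete two-year steps starting from 2018; the function name and parameters are hypothetical, chosen only for illustration.

```python
def improvement_timeline(target, first_step, later_step,
                         start_year=2018, period_years=2):
    """List (year, cumulative factor) pairs until the target is reached.

    first_step is the gain in the first period; later_step is the gain
    in every period after that. All inputs are the article's guesses.
    """
    year, factor, step = start_year, 1, first_step
    timeline = []
    while factor < target:
        factor *= step
        year += period_years
        timeline.append((year, factor))
        step = later_step
    return timeline


ONE_MILLION = 10**6

# Scenario 1: the recent pace continues (1,000x every two years).
steady = improvement_timeline(ONE_MILLION, 1_000, 1_000)

# Scenario 2: 100x by 2020, then 10x every two years.
slowing = improvement_timeline(ONE_MILLION, 100, 10)

for year, factor in slowing:
    print(f"{year}: {factor:,}x better than 2018")
```

With these inputs the steady scenario reaches one million in 2022 and the slowing scenario reaches it in 2028, matching the figures in the text. Lowering `later_step` below 10 pushes the date out further still.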

Prediction: we will see HD 1080p deepfake videos capable of convincing even sceptical viewers in about 10 years, not in about two or three years.

I could easily be wrong. If machine learning keeps improving by three orders of magnitude every couple of years, this could happen by 2022. But I think even my assumption of a 10x improvement in the technology every two years after 2020 is likely too high: no technology has ever achieved that level of exponential growth over that long a period. So if I had to bet on the over/under, I think it is MUCH more likely it takes longer than ten years, rather than shorter.

To be clear, this is NOT what you will read in the media. It is much more interesting to announce that deepfakes are imminent and will destroy trust and society and all of civilisation than it is to figure out when it actually happens. But don’t believe me: go look at the deepfake samples on the web (be careful, a lot of current deepfakes are porn) such as this one. Low resolution, and notice how the mouth and nose and eyes don’t look right? This wouldn’t really convince anyone who didn’t want to be convinced.

Making a really convincing fake is not imminent, in my view. But it will happen one day.

Blockchain to the rescue

But once it happens, I predict that this will be one of the more transformative use cases for blockchain. When videos of politicians are captured, everything will go onto the blockchain. Who is in the picture, where was it taken, exact date and time, and every change that is ever made will all be authenticated in near real time. All video will need to have a blockchain verified Certificate of Authenticity. And those that don’t will be assumed to be deepfakes.

Peter T.

President at TAYPE International Business Services Inc.

6 years ago

Life is about to get that much more complicated. To paraphrase Malcolm X: a man who believes in nothing falls for everything. Staving off nihilism and cynicism in a world where it is difficult to believe your own eyes and ears will be a key challenge of the early 21st century. We may have to revert to heavier reliance on textual information and *gasp* start reading again.
