Why Do A.I. Image Generators Have Problems Creating Hands?


Welcome to my bi-weekly newsletter, “I’ll Keep This Short,” where I navigate the less-traveled paths of AI, building new insight beyond the banal, mainstream chatter.

Horrific Hands

You have likely noticed, earlier this year or last year, generative images with hands in odd poses, with the wrong number of appendages, or even, as a well-circulated meme showed, two hands growing out of an independent "stick arm."


Of course, you might not see these images quite so commonly anymore. What happened? Was the hand problem "solved," as the Washington Post seemed to allude to in March of 2023? From the article:

As recently as a few weeks ago, Farid said, spotting poorly created hands was a reliable way to tell if an image was deep-faked. That is becoming harder to do [.]

Today I would like to walk you through the details of how image generation works, explain why hands are so hard to draw, and talk about whether the "drawing hands problem" has really been solved.

What is a Transform?

The underlying mechanism that powers chatbots like ChatGPT is essentially the same as the one that powers image generation: neural networks. Neural networks are what is known as a Mathematical Transform. The simplest example of a Mathematical Transform would be to take a set of numbers such as [1, 2, 3] and put it through some predetermined way of operating on those numbers in sequence, say, "multiply each number by 2," from which you would get the output [2, 4, 6].
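
To make this concrete, here is a minimal sketch in Python; the list and the "multiply each number by 2" rule are just the toy example from above:

# A "transform" is a predetermined rule applied to a sequence of numbers.
def multiply_by_two(numbers):
    return [n * 2 for n in numbers]

print(multiply_by_two([1, 2, 3]))  # prints [2, 4, 6]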

There are many different types of transforms. From a blog called EquationFreak, you can run a "Reflection Transform" to get the mirror image, a "Rotation Transform," or a "Translation Transform," which is like sliding.

Transforms from EquationFreak https://equationfreak.blogspot.com/

What Are Neural Nets?

Neural nets, however, are extremely complex transforms with many steps and many layers (which are basically sequential steps) of operations. So rather than just doing Reflect → Rotate → Slide, you might have tens or hundreds of steps.

To make thinking about neural nets simpler, you can picture an Excel spreadsheet: the first tab has a few input numbers, the next tab has thousands of columns and rows of different weights that get multiplied together in some huge, confusing mess that no human is ever going to untangle, and the last tab holds the final output of that behemoth spreadsheet, reduced back down to something not so complex.
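
Here is a minimal sketch of that "spreadsheet" picture in Python with NumPy; the layer sizes are arbitrary and the weights are random placeholders, not a trained model:

import numpy as np

rng = np.random.default_rng(0)

# "First tab": a few input numbers.
x = np.array([1.0, 2.0, 3.0])

# "Middle tab": layers of weights that get multiplied through,
# with a simple non-linearity in between.
W1 = rng.normal(size=(3, 8))   # 3 inputs -> 8 hidden values
W2 = rng.normal(size=(8, 2))   # 8 hidden values -> 2 outputs

hidden = np.maximum(0, x @ W1)  # ReLU: keep positive values, zero out the rest
output = hidden @ W2            # "Last tab": reduced back down to two numbers

print(output)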

You can also create an application with multiple neural networks working in series, one after another, continually transforming whatever the input is into some completely different output, or you could combine neural networks with other kinds of operations.

So How do Image Generators Use Neural Nets?

There are two neural networks at work when using an image generator such as DALL-E 2. The first is the "Large Language Model," or LLM, which interprets the human input, and the second is the "Contrastive Language-Image Pre-Training" model, or CLIP.


A Contrastive Image Pre-Training model is basically a neural network that has been trained on millions of images tagged with phrases. So for example you might have images of chicken nuggets, each tagged with the text phrase "chicken nuggets." When you go into an image generator and search for something fundamental like that, you will see what might reasonably be construed as some of the originally tagged images.


Likewise, you can do it with other terms you might like to combine, such as "police" or "dressed up as police," and "criminals" or "dressed up as criminals."


So let's accept the examples above as being close to the fundamental images that were used within the training process of DALL-E. It's important to note that all of these images, at a fundamental level, are just grids of thousands or millions of pixels, each with a particular color or grayscale value. So if you were to look at an image in some kind of pixel viewer and show it in terms of how it is encoded, it would literally be a grid of hexadecimal numbers like the following:

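If you want to see this for yourself, here is a small sketch using the Pillow library; the file name "nuggets.png" is just a placeholder for any image you have on disk:

from PIL import Image

img = Image.open("nuggets.png").convert("RGB")  # placeholder file name

# Print the top-left 4x4 corner of the image as hexadecimal color codes.
for y in range(4):
    row = []
    for x in range(4):
        r, g, b = img.getpixel((x, y))
        row.append(f"#{r:02X}{g:02X}{b:02X}")
    print(" ".join(row))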

So then what the contrastive image pre-training does is, for every single pixel of the image, I, combine it with a sort of averaged-out, embedded pointer to the text, T, to create a new type of image with both the T and the I combined together in every single pixel. So basically, every single pixel gets "mapped" together with the text phrase in question.


Of course, with the millions of images that are tagged and used in the training process, there is a vast amount of information that gets combined which goes beyond just the objects within the picture. The information may also include styling, color, orientation, camera type, or any number of properties that could have been tagged to a particular image.


Hence, when we enter a phrase such as "Police chicken nuggets" or "Criminal chicken nuggets," you end up with a combined image like the following, which has been generated from the first step of the LLM interpreting "Police chicken nuggets" and the second step of the Contrastive Image Pre-Training model mapping that interpretation onto images.


This allows the language model to identify which pictures to pick and encode together to form new pictures.

Feature Space

The feature space in contrastive image pre-training refers to the high-dimensional space where images are represented as vectors, and the aim of contrastive learning is to organize this space in a way that similar images are close and dissimilar images are far apart.

From the Stanford AI Lab Blog:


What the Stanford AI Lab image shows is that you basically have a "Space," with all of the images of cats and dogs mapped out by distance from one another. Sometimes a fancy version of a letter, e.g. ℝ, is used to denote the space, rather than a plain R. That image is merely an example, but basically, by tagging images with words (either "dog" or "cat") and then having a bunch of Transforms go through and look at the pixels in those images, an algorithm can be built that maps out the distance of these images from one another.

But how do you calculate the, “distance,” between two images? How does that even make sense?

I put together a YouTube video toward the end of 2022 about plagiarism detection, "How I Defeated A.I. Plagiarism," which touches on this topic to a certain extent.


Basically, an image, as we mentioned above, is merely a grid of hexadecimal numbers, which you could also think of as a grid of binary (ones and zeros) numbers. These grids can be lined up in one long line by taking each row and connecting them together, so you get something like this:

Row1 [0 0 1 0 1 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 1 0 1 0 1 0 1]
Row2 [0 0 0 1 1 1 1 1 1 1 1 1 0 0 0 0 1 0 1 0 1 0 1 0 1 0 0 0 0 1 1 1]        

When you calculate the "distance" between two points, you use the Pythagorean theorem; that is, for a right triangle with legs a and b and hypotenuse c:

c² = a² + b², so c = √(a² + b²)

There's another "distance" measurement, called the Cosine Distance, which is based on the angle between two vectors; you can picture each vector as the hypotenuse of a right triangle.

The Cosine Distance can be calculated from the coordinates of the endpoints of the two vectors. Essentially, if A and B have known coordinates, let's say (1, 2) and (2, 3), then the Cosine Distance can be found with the following equation:

cos(θ) = (A · B) / (|A| × |B|), and the Cosine Distance is 1 − cos(θ)

This same type of distance measurement works on vectors of arbitrary length, so if we plug in Row1 and Row2 from above, we can get a Cosine Distance between those, using the exact same formula.
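
Here is that calculation carried out on Row1 and Row2 with NumPy, as a minimal sketch (SciPy's scipy.spatial.distance.cosine would give the same number):

import numpy as np

row1 = np.array([0,0,1,0,1,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,1,0,1,0,1,0,1])
row2 = np.array([0,0,0,1,1,1,1,1,1,1,1,1,0,0,0,0,1,0,1,0,1,0,1,0,1,0,0,0,0,1,1,1])

# Cosine similarity: dot product divided by the product of the vector lengths.
cos_sim = row1 @ row2 / (np.linalg.norm(row1) * np.linalg.norm(row2))
cos_dist = 1 - cos_sim  # Cosine Distance

print(cos_dist)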

So given that we have the ability to mathematically compare different images, we can now plot them out on a chart that shows how far away any given image is from any other given image, assuming a big set, or "Space," of images, ℝ.
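
As a sketch of what that "chart" amounts to in practice, here is a pairwise Cosine Distance matrix over a small, made-up set of image vectors; the vectors are random stand-ins, not real images:

import numpy as np

rng = np.random.default_rng(1)

# Pretend these are four flattened images, each with 32 pixel values between 0 and 1.
images = rng.random(size=(4, 32))

# Normalize each row to unit length, then Cosine Distance = 1 - cosine similarity.
unit = images / np.linalg.norm(images, axis=1, keepdims=True)
distance_matrix = 1 - unit @ unit.T

print(np.round(distance_matrix, 2))  # a 4x4 grid: how far each "image" is from every other one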

Negative Learning

In the context of images, the aim is to ensure that the neural network learns to generate similar feature vectors for similar (or "positive") image pairs and different feature vectors for dissimilar (or "negative") image pairs.

Positive pairs typically consist of two augmentations of the same image, while negative pairs are usually composed of two completely different images. The goal is to minimize the distance between positive pairs in the feature space and maximize the distance between negative pairs.

Spoken in plainer language: "We try to plot the images tagged with 'cat' close to each other using Positive Pairing, but we also push them very far away from houses or cars using Negative Pairing, though not quite so far away from dogs, since cats and dogs are both animals rather than objects."
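
Here is a highly simplified sketch of that idea in code; this is a toy, margin-based contrastive loss on made-up two-dimensional "embeddings," not the actual objective used by any particular model:

import numpy as np

def contrastive_loss(a, b, is_positive_pair, margin=1.0):
    """Toy contrastive loss: pull positive pairs together, push negative pairs apart."""
    dist = np.linalg.norm(a - b)
    if is_positive_pair:
        return dist ** 2                 # positives: any distance at all is penalized
    return max(0.0, margin - dist) ** 2  # negatives: only penalized if closer than the margin

cat_1 = np.array([0.9, 0.1])   # made-up embedding vectors
cat_2 = np.array([0.8, 0.2])
house = np.array([0.1, 0.9])

print(contrastive_loss(cat_1, cat_2, is_positive_pair=True))   # small: the two cats are already close
print(contrastive_loss(cat_1, house, is_positive_pair=False))  # zero: the house is already past the margin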

In-painting Coherency

It's one thing to identify images that are similar to one another; it's another thing to combine them together in a way that's not an absolute mess or just noise. The objective we're shooting for is called "Coherency," and the method of actually combining these images together is called "In-Painting."

There are different methods of combining all of these different types of similar images together, but they all tend to follow odd, complex patterns, like the one diagrammed in this paper.


The main thing to take away from this is that it's an iterative attempt to reduce "loss," or basically, the loss of a coherent image. There are many, many papers out there about how to reduce loss. I will summarize a few of the methods used here:

  1. Reconstruction Loss: One simple approach is to use a reconstruction loss, such as mean squared error (MSE) or mean absolute error (MAE), between the generated image and the original image. This encourages the model to generate content that is similar to the original content, which can help to enforce coherency. However, this approach might not work well if the model needs to generate new content that wasn't in the original image.
  2. Contextual Loss: This is a more sophisticated approach that encourages the model to generate content that is consistent with the surrounding context. This might involve comparing the features of the generated content and the surrounding context at various levels of abstraction, and penalizing the model if these features are dissimilar. This can encourage the model to generate content that is not only similar in appearance to the surrounding context, but also shares similar high-level features.
  3. Adversarial Loss: This is another approach that can help to enforce coherency. In this approach, a separate discriminator model is trained to distinguish between real images and images generated by the model. The generator model is then trained to fool the discriminator. This can encourage the generator model to generate content that is not only plausible in its own right, but also coherent with the surrounding context, since any inconsistencies might give the game away to the discriminator.

Each of these approaches involves a different mathematical formulation. For example, MSE and MAE involve calculating the difference between the generated and original pixel values, contextual loss might involve calculating the cosine similarity between feature vectors, and adversarial loss might involve calculating the binary cross-entropy between the discriminator's predictions and the actual labels.
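
To make those formulations concrete, here is a minimal sketch of all three written with plain NumPy on made-up arrays; real systems compute these over batches of images and feature maps inside a deep learning framework, so treat this only as an illustration of the math:

import numpy as np

def mse_reconstruction_loss(generated, original):
    """Reconstruction loss: mean squared difference between pixel values."""
    return np.mean((generated - original) ** 2)

def contextual_loss(generated_features, context_features):
    """Contextual loss (one simple variant): 1 - cosine similarity between feature vectors."""
    cos = generated_features @ context_features / (
        np.linalg.norm(generated_features) * np.linalg.norm(context_features)
    )
    return 1.0 - cos

def adversarial_loss(discriminator_probs, real_labels):
    """Adversarial loss: binary cross-entropy between the discriminator's predictions and the labels."""
    eps = 1e-7
    p = np.clip(discriminator_probs, eps, 1 - eps)
    return -np.mean(real_labels * np.log(p) + (1 - real_labels) * np.log(1 - p))

# Made-up example values, just to show the three functions running.
print(mse_reconstruction_loss(np.array([0.2, 0.8, 0.5]), np.array([0.0, 1.0, 0.5])))
print(contextual_loss(np.array([1.0, 2.0, 3.0]), np.array([1.1, 1.9, 3.2])))
print(adversarial_loss(np.array([0.9, 0.2]), np.array([1.0, 0.0])))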

DALL-E, Midjourney, and Stable Diffusion each rely on combinations of loss functions like these to train their models to generate coherent images. Such combinations have been shown to be effective at generating high-quality, realistic images, but there are still gaps here and there, depending upon how those loss functions are being tweaked.

So Why Are Hands So Difficult?

So basically, something goes wrong in the In-Painting step: hands are so complex, with so many parts and so many ways of being held, that it's difficult to maintain coherency and reduce loss, because there is just too big a landscape of potential ways that hands can look in pictures.

If little of the above made sense to you, the simplest explanation might be, “Well have you ever tried drawing hands? They are frickin’ hard to draw!”

To be a bit more precise, there are a couple of reasons.

  1. Close Negative Distances: All photos of hands, particularly human hands, are close to one another in terms of mathematical distance, since any two different photos of hands can be quite difficult to describe as unique permutations.
  2. Complex Coherency Goal: Hands have a lot of parts and a lot of different configurations in which those parts can be set. The human hand has 27 bones and perhaps 30 distinct "geometric parts," and those parts can be arranged in many different gestures and permutations.

Ideally, to train an image generator that generates hands really well, there would need to be a way to tag photos of hands in every possible configuration hands could be in, which could be millions, and then to ensure that those tagged configurations somehow do not overlap when they are called upon by a user generating an image with words. It would also be helpful if the user could specify, with a much greater degree of certainty, precisely how they want the hands to appear - but alas, we just don't have that massive and precise a vocabulary in any human written language to represent all of the millions of possible configurations that hands could be in. We might be able to say "the OK symbol" or "the middle finger," but… well, what about someone giving the middle finger, camera pointing straight on at the hand, but with the middle finger slightly angled backwards by about 7 degrees?

Describing the entire space of how hands could potentially appear is just far too complex for a user, and so we're left with the server making these decisions for us; if there isn't some proper training set that ensures all hand gesture configurations are a sufficient mathematical distance away from one another, they are going to get smeared together.

So Why Did AI Advance And, “Solve,” The Hands Problem?

Short answer: they didn't. Not in a conventional way - that is to say, not in the same way that image generation in general was solved with the methods above. Rather, hands have been, to use an analogous phrase, "censored," through blurring and through heavily restricting the forms and images in which they appear.

This is similar to how high quality 3D images are being generated by a particular startup that I covered in a previous article. Essentially, if you want higher quality generative, "stuff," just reduce the number of possible things that can be generated, which makes the training task less difficult.

More advanced image generation software such as Midjourney V5 attempts to solve this problem by either blurring out hands or being trained with hands in fewer possible configurations. There are varying reports of how successful Midjourney V5 actually is at producing hands in a wide variety of configurations, as well as advice on how to write prompts that actively cut hands out or make them simpler and closer to the original fundamental photos on which the model was trained.

Solving For Hands

At some point in the future, there will be a concerted effort to take gazillions of photos of human hands in every possible shape, and build a Contrastive Image Pre-Training model that can deal with hands specifically, because people typically don't like to see that uncanny valley of weird hands. This may well solve the hands problem.

On the other hand, some believe that we really just don't have a good way of explaining, in human language, what Neural Networks are doing, and therefore it's too difficult to work with them in a way that solves problems like these meaningfully. In computer programming, different types of programming languages have been invented over the decades to give humans ways of interacting with computers and telling them what to do. Some posit that we lack this type of "language," or "interface," with neural networks, and that is why solving problems like the hand problem is so difficult right now.

One way that this problem is being addressed, which might well result in better AI solutions overall, rather than needing to brute force the problem with better tagging and training, is through, "Interpretability," as termed on the MIT CSAIL Research Group website.

Some people go beyond Interpretability and tack on the term "Mechanistic Interpretability," but both terms seem to be used interchangeably. Some go further and tie Mechanistic Interpretability to building safety within AI, e.g. attempting to reduce the potential harm that AI can cause, and some take it all the way and say that it's a potential way to prevent AI from destroying us all. At this point I wouldn't go that far; I would just say it's an interesting topic that might potentially be one path to making generated images of hands better.
