Why Do A.I. Image Generators Have Problems Creating Hands?


Welcome to my bi-weekly newsletter, “I’ll Keep This Short,” where I navigate the less-traveled paths of AI, building new insight beyond the banal, mainstream chatter.

Horrific Hands

You have likely noticed, earlier this year or last year, generative images with hands in odd poses, with the wrong number of appendages, or even, as a well-circulated meme showed, two hands growing out of an independent "stick arm."


Of course, you might not see these images quite so commonly anymore. What happened? Was the hand problem "solved," as the Washington Post seemed to allude to in March of 2023? From the article:

As recently as a few weeks ago, Farid said, spotting poorly created hands was a reliable way to tell if an image was deep-faked. That is becoming harder to do [.]

Today I would like to walk you through the details of how image generation works, explain why hands are so hard to draw, and talk about whether the "drawing hands problem" has really been solved.

What is a Transform?

The underlying mechanism that powers chatbots like ChatGPT is essentially the same as the one that powers image generation: neural networks. Neural networks are what is known as a Mathematical Transform. The simplest example of a Mathematical Transform would be to take a set of numbers such as [1, 2, 3] and put it through some predetermined way of operating on those numbers in sequence, say, "multiply each number by 2," from which you would get the output [2, 4, 6].
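
To make this concrete, here is a minimal sketch in Python; the list and the "multiply each number by 2" rule are just the toy example from above:

# A "transform" is a predetermined rule applied to a sequence of numbers.
def multiply_by_two(numbers):
    return [n * 2 for n in numbers]

print(multiply_by_two([1, 2, 3]))  # prints [2, 4, 6]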

There are many different types of transforms. From a blog called EquationFreak, you can run a "Reflection Transform" to get the mirror image, a "Rotation Transform," or a "Translation Transform," which is like sliding.

Transforms from EquationFreak https://equationfreak.blogspot.com/

What Are Neural Nets?

Neural nets, however, are extremely complex transforms with many steps and many layers (which are basically sequential steps) of operations. So rather than just doing Reflect → Rotate → Slide, you might have tens or hundreds of steps.

To make thinking about neural nets simpler, you can picture an Excel spreadsheet: the first tab has a few input numbers, the next tab has thousands of columns and rows of different weights that get multiplied together in some huge, confusing mess that no human is ever going to untangle, and the last tab holds the final output of that behemoth spreadsheet, reduced back down to something not so complex.
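
Here is a minimal sketch of that "spreadsheet" picture in Python with NumPy; the layer sizes are arbitrary and the weights are random placeholders, not a trained model:

import numpy as np

rng = np.random.default_rng(0)

# "First tab": a few input numbers.
x = np.array([1.0, 2.0, 3.0])

# "Middle tab": layers of weights that get multiplied through,
# with a simple non-linearity in between.
W1 = rng.normal(size=(3, 8))   # 3 inputs -> 8 hidden values
W2 = rng.normal(size=(8, 2))   # 8 hidden values -> 2 outputs

hidden = np.maximum(0, x @ W1)  # ReLU: keep positive values, zero out the rest
output = hidden @ W2            # "Last tab": reduced back down to two numbers

print(output)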

You can also create an application with multiple neural networks working in series, one after another, continually transforming whatever the input is into some completely different output, or you could combine neural networks with other kinds of operations.

So How do Image Generators Use Neural Nets?

There are two neural networks at work when using an image generator such as DALL-E 2. The first is the "Large Language Model," or LLM, which interprets the human input, and the second is the "Contrastive Language-Image Pre-Training" model, or CLIP.


A Contrastive Image Pre-Training model is basically a neural network that has been trained on millions of images tagged with phrases. So for example you might have images of chicken nuggets, each tagged with the text phrase "chicken nuggets." When you go into an image generator and search for something fundamental like that, you will see what might reasonably be construed as some of the originally tagged images.


Likewise, you can do it with other terms you might like to combine, such as "police" or "dressed up as police," and "criminals" or "dressed up as criminals."


So let's accept the examples above as being close to the fundamental images that were used within the training process of DALL-E. It's important to note that all of these images, at a fundamental level, are just grids of thousands or millions of pixels, each with a particular color or grayscale value. So if you were to look at an image in some kind of pixel viewer and show it in terms of how it is encoded, it would literally be a grid of hexadecimal numbers like the following:

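If you want to see this for yourself, here is a small sketch using the Pillow library; the file name "nuggets.png" is just a placeholder for any image you have on disk:

from PIL import Image

img = Image.open("nuggets.png").convert("RGB")  # placeholder file name

# Print the top-left 4x4 corner of the image as hexadecimal color codes.
for y in range(4):
    row = []
    for x in range(4):
        r, g, b = img.getpixel((x, y))
        row.append(f"#{r:02X}{g:02X}{b:02X}")
    print(" ".join(row))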

So then what the contrastive image pre-training does is, for every single pixel of the image, I, combine it with a sort of averaged-out, embedded pointer to the text, T, to create a new type of image with both the T and the I combined together in every single pixel. So basically, every single pixel gets "mapped" together with the text phrase in question.


Of course, with the millions of images that are tagged and used in the training process, there is a vast amount of information that gets combined which goes beyond just the objects within the picture. The information may also include styling, color, orientation, camera type, or any number of properties that could have been tagged to a particular image.


Hence, when we enter a phrase such as "Police chicken nuggets" or "Criminal chicken nuggets," you end up with a combined image like the following, which has been generated from the first step of the LLM interpreting "Police chicken nuggets" and the second step of the Contrastive Image Pre-Training model mapping that interpretation onto images.


This allows the language model to identify which pictures to pick and encode together to form new pictures.

Feature Space

The feature space in contrastive image pre-training refers to the high-dimensional space where images are represented as vectors, and the aim of contrastive learning is to organize this space in a way that similar images are close and dissimilar images are far apart.

From the Stanford AI Lab Blog:


What the Stanford AI Lab image shows is that you basically have a "Space," with all of the images of cats and dogs mapped out by distance from one another. Sometimes a fancy version of a letter, e.g. ℝ, is used to denote the space, rather than a plain R. That image is merely an example, but basically, by tagging images with words (either "dog" or "cat") and then having a bunch of Transforms go through and look at the pixels in those images, an algorithm can be built that maps out the distance of these images from one another.

But how do you calculate the, “distance,” between two images? How does that even make sense?

I put together a YouTube video toward the end of 2022 about plagiarism detection, "How I Defeated A.I. Plagiarism," which touches on this topic to a certain extent.


Basically, an image, as we mentioned above, is merely a grid of hexadecimal numbers, which you could also think of as a grid of binary (ones and zeros) numbers. These grids can be lined up in one long line by taking each row and connecting them together, so you get something like this:

Row1 [0 0 1 0 1 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 1 0 1 0 1 0 1]
Row2 [0 0 0 1 1 1 1 1 1 1 1 1 0 0 0 0 1 0 1 0 1 0 1 0 1 0 0 0 0 1 1 1]        

When you calculate the "distance" between two points, you use the Pythagorean theorem; that is, for a right triangle with legs a and b and hypotenuse c:

c² = a² + b², so c = √(a² + b²)

There's another "distance" measurement, called the Cosine Distance, which is based on the angle between two vectors; you can picture each vector as the hypotenuse of a right triangle.

The Cosine Distance can be calculated from the coordinates of the endpoints of the two vectors. Essentially, if A and B have known coordinates, let's say (1, 2) and (2, 3), then the Cosine Distance can be found with the following equation:

cos(θ) = (A · B) / (|A| × |B|), and the Cosine Distance is 1 − cos(θ)

This same type of distance measurement works on vectors of arbitrary length, so if we plug in Row1 and Row2 from above, we can get a Cosine Distance between those, using the exact same formula.
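
Here is that calculation carried out on Row1 and Row2 with NumPy, as a minimal sketch (SciPy's scipy.spatial.distance.cosine would give the same number):

import numpy as np

row1 = np.array([0,0,1,0,1,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,1,0,1,0,1,0,1])
row2 = np.array([0,0,0,1,1,1,1,1,1,1,1,1,0,0,0,0,1,0,1,0,1,0,1,0,1,0,0,0,0,1,1,1])

# Cosine similarity: dot product divided by the product of the vector lengths.
cos_sim = row1 @ row2 / (np.linalg.norm(row1) * np.linalg.norm(row2))
cos_dist = 1 - cos_sim  # Cosine Distance

print(cos_dist)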

So given that we have the ability to mathematically compare different images, we can now plot them out on a chart that shows how far away any given image is from any other given image, assuming a big set, or "Space," of images, ℝ.
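
As a sketch of what that "chart" amounts to in practice, here is a pairwise Cosine Distance matrix over a small, made-up set of image vectors; the vectors are random stand-ins, not real images:

import numpy as np

rng = np.random.default_rng(1)

# Pretend these are four flattened images, each with 32 pixel values between 0 and 1.
images = rng.random(size=(4, 32))

# Normalize each row to unit length, then Cosine Distance = 1 - cosine similarity.
unit = images / np.linalg.norm(images, axis=1, keepdims=True)
distance_matrix = 1 - unit @ unit.T

print(np.round(distance_matrix, 2))  # a 4x4 grid: how far each "image" is from every other one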

Negative Learning

In the context of images, the aim is to ensure that the neural network learns to generate similar feature vectors for similar (or "positive") image pairs and different feature vectors for dissimilar (or "negative") image pairs.

Positive pairs typically consist of two augmentations of the same image, while negative pairs are usually composed of two completely different images. The goal is to minimize the distance between positive pairs in the feature space and maximize the distance between negative pairs.

Spoken in plainer language: "We try to plot the images tagged with 'cat' close to each other using Positive Pairing, but we also push them very far away from houses or cars using Negative Pairing, though not quite so far away from dogs, since cats and dogs are both animals rather than objects."
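
Here is a highly simplified sketch of that idea in code; this is a toy, margin-based contrastive loss on made-up two-dimensional "embeddings," not the actual objective used by any particular model:

import numpy as np

def contrastive_loss(a, b, is_positive_pair, margin=1.0):
    """Toy contrastive loss: pull positive pairs together, push negative pairs apart."""
    dist = np.linalg.norm(a - b)
    if is_positive_pair:
        return dist ** 2                 # positives: any distance at all is penalized
    return max(0.0, margin - dist) ** 2  # negatives: only penalized if closer than the margin

cat_1 = np.array([0.9, 0.1])   # made-up embedding vectors
cat_2 = np.array([0.8, 0.2])
house = np.array([0.1, 0.9])

print(contrastive_loss(cat_1, cat_2, is_positive_pair=True))   # small: the two cats are already close
print(contrastive_loss(cat_1, house, is_positive_pair=False))  # zero: the house is already past the margin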

In-painting Coherency

It's one thing to identify images that are similar to one another; it's another thing to combine them together in a way that's not an absolute mess or just noise. The objective we're shooting for is called "Coherency," and the method of actually combining these images together is called "In-Painting."

There are different methods of combining all of these different types of similar images together, but they all tend to follow odd, complex patterns, like the one diagrammed in this paper.


The main thing to take away from this is that it's an iterative attempt to reduce "loss," or basically, the loss of a coherent image. There are many, many papers out there about how to reduce loss. I will summarize a few of the methods used here:

  1. Reconstruction Loss: One simple approach is to use a reconstruction loss, such as mean squared error (MSE) or mean absolute error (MAE), between the generated image and the original image. This encourages the model to generate content that is similar to the original content, which can help to enforce coherency. However, this approach might not work well if the model needs to generate new content that wasn't in the original image.
  2. Contextual Loss: This is a more sophisticated approach that encourages the model to generate content that is consistent with the surrounding context. This might involve comparing the features of the generated content and the surrounding context at various levels of abstraction, and penalizing the model if these features are dissimilar. This can encourage the model to generate content that is not only similar in appearance to the surrounding context, but also shares similar high-level features.
  3. Adversarial Loss: This is another approach that can help to enforce coherency. In this approach, a separate discriminator model is trained to distinguish between real images and images generated by the model. The generator model is then trained to fool the discriminator. This can encourage the generator model to generate content that is not only plausible in its own right, but also coherent with the surrounding context, since any inconsistencies might give the game away to the discriminator.

Each of these approaches involves a different mathematical formulation. For example, MSE and MAE involve calculating the difference between the generated and original pixel values, contextual loss might involve calculating the cosine similarity between feature vectors, and adversarial loss might involve calculating the binary cross-entropy between the discriminator's predictions and the actual labels.
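
To make those formulations concrete, here is a minimal sketch of all three written with plain NumPy on made-up arrays; real systems compute these over batches of images and feature maps inside a deep learning framework, so treat this only as an illustration of the math:

import numpy as np

def mse_reconstruction_loss(generated, original):
    """Reconstruction loss: mean squared difference between pixel values."""
    return np.mean((generated - original) ** 2)

def contextual_loss(generated_features, context_features):
    """Contextual loss (one simple variant): 1 - cosine similarity between feature vectors."""
    cos = generated_features @ context_features / (
        np.linalg.norm(generated_features) * np.linalg.norm(context_features)
    )
    return 1.0 - cos

def adversarial_loss(discriminator_probs, real_labels):
    """Adversarial loss: binary cross-entropy between the discriminator's predictions and the labels."""
    eps = 1e-7
    p = np.clip(discriminator_probs, eps, 1 - eps)
    return -np.mean(real_labels * np.log(p) + (1 - real_labels) * np.log(1 - p))

# Made-up example values, just to show the three functions running.
print(mse_reconstruction_loss(np.array([0.2, 0.8, 0.5]), np.array([0.0, 1.0, 0.5])))
print(contextual_loss(np.array([1.0, 2.0, 3.0]), np.array([1.1, 1.9, 3.2])))
print(adversarial_loss(np.array([0.9, 0.2]), np.array([1.0, 0.0])))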

DALL-E, Midjourney, and Stable Diffusion each rely on combinations of loss functions like these to train their models to generate coherent images. Such combinations have been shown to be effective at generating high-quality, realistic images, but there are still gaps here and there, depending upon how those loss functions are being tweaked.

So Why Are Hands So Difficult?

So basically, something goes wrong in the In-Painting step: hands are so complex, with so many parts and so many ways of being held, that it's difficult to maintain coherency and reduce loss, because there is just too big a landscape of potential ways that hands can look in pictures.

If little of the above made sense to you, the simplest explanation might be, “Well have you ever tried drawing hands? They are frickin’ hard to draw!”

To be a bit more precise, there are a couple of reasons.

  1. Close Negative Distances: All photos of hands, particularly human hands, are close to one another in terms of mathematical distance, since any two different photos of hands can be quite difficult to describe as unique permutations.
  2. Complex Coherency Goal: Hands have a lot of parts and a lot of different configurations in which those parts can be set. The human hand has 27 bones and perhaps 30 distinct "geometric parts," and those parts can be arranged in many different gestures and permutations.

Ideally, to train an image generator that generates hands really well, there would need to be a way to tag photos of hands in every possible configuration hands could be in, which could be millions, and then to ensure that those tagged configurations somehow do not overlap when they are called upon by a user generating an image with words. It would also be helpful if the user could specify, with a much greater degree of certainty, precisely how they want the hands to appear - but alas, we just don't have that massive and precise a vocabulary in any human written language to represent all of the millions of possible configurations that hands could be in. We might be able to say "the OK symbol" or "the middle finger," but… well, what about someone giving the middle finger, camera pointing straight on at the hand, but with the middle finger slightly angled backwards by about 7 degrees?

Describing the entire space of how hands could potentially appear is just far too complex for a user, and so we're left with the server making these decisions for us; if there isn't some proper training set that ensures all hand gesture configurations are a sufficient mathematical distance away from one another, they are going to get smeared together.

So Why Did AI Advance And, “Solve,” The Hands Problem?

Short answer: they didn't. Not in a conventional way - that is to say, not in the same way that image generation in general was solved with the methods above. Rather, hands have been, to use an analogous phrase, "censored," through blurring and through heavily restricting the forms and images in which they appear.

This is similar to how high quality 3D images are being generated by a particular startup that I covered in a previous article. Essentially, if you want higher quality generative, "stuff," just reduce the number of possible things that can be generated, which makes the training task less difficult.

More advanced image generation software such as Midjourney V5 attempts to solve this problem by either blurring out hands or being trained with hands in fewer possible configurations. There are varying reports of how successful Midjourney V5 actually is at producing hands in a wide variety of configurations, as well as advice on how to write prompts that actively cut hands out or make them simpler and closer to the original fundamental photos on which the model was trained.

Solving For Hands

At some point in the future, there will be a concerted effort to take gazillions of photos of human hands in every possible shape, and build a Contrastive Image Pre-Training model that can deal with hands specifically, because people typically don't like to see that uncanny valley of weird hands. This may well solve the hands problem.

On the other hand, some believe that we really just don't have a good way of explaining, in human language, what Neural Networks are doing, and therefore it's too difficult to work with them in a way that solves problems like these meaningfully. In computer programming, different types of programming languages have been invented over the decades to give humans ways of interacting with computers and telling them what to do. Some posit that we lack this type of "language," or "interface," with neural networks, and that is why solving problems like the hand problem is so difficult right now.

One way that this problem is being addressed, which might well result in better AI solutions overall, rather than needing to brute force the problem with better tagging and training, is through, "Interpretability," as termed on the MIT CSAIL Research Group website.

Some people go beyond Interpretability and tack on the term "Mechanistic Interpretability," but both terms seem to be used interchangeably. Some go further and tie Mechanistic Interpretability to building safety within AI, e.g. attempting to reduce the potential harm that AI can cause, and some take it all the way and say that it's a potential way to prevent AI from destroying us all. At this point I wouldn't go that far; I would just say it's an interesting topic that might potentially be one path to making generated images of hands better.
