How AI Generates Images

Interactive chat LLMs like ChatGPT, Gemini, Llama, and Claude are well on their way to becoming mainstream, and their underlying technology is nothing short of amazing. An equally fascinating counterpart to AI-generated text is AI-generated images. While LLMs are built on a transformer architecture, image generation products like Midjourney and DALL-E are built on a diffusion architecture, the same family of techniques behind the aptly named Stable Diffusion model. Let's take a look at the basics of AI-generated images and become more familiar with stable diffusion.

The Basics of Diffusion

In a way, stable diffusion reminds me of a Bob Ross painting that starts completely blurry and slowly comes into sharp focus over time. But that's where the similarity ends. A painter continuously adds detail to the canvas to produce a final detailed work of art. Stable diffusion, by contrast, mathematically filters noise out of a digital image until a coherent picture emerges.

Bob Ross In 1985. "The Joy Of Painting," The Bob Ross name and images are trademarks of Bob Ross Inc. All Rights Reserved.

The concept of stable diffusion is at the center of image generation AI products like Midjourney and DALL-E. It's based on the real-world principle of diffusion, whereby particles spread out from areas of high concentration to areas of low concentration. To illustrate, imagine the smell of freshly baked cookies spreading through a house. Initially, the delightful smell is strongest in the kitchen, where the cookies come right out of the oven. As the smell gradually spreads (or diffuses) into other rooms, it becomes less concentrated. The rate of this gradual spread of smell-carrying air particles can be calculated by a mathematical expression.

A diffusion equation can predict how the smell of fresh baked cookies diffuses into the air.
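
To make the idea concrete, here's a minimal sketch of discrete diffusion in plain Python (a toy example of my own, not from any library): a "smell" concentration spreads along a row of rooms until it is evenly mixed, while the total amount stays constant.

```python
def diffuse(concentration, rate=0.25, steps=50):
    """One-dimensional discrete diffusion: each cell exchanges a
    fraction of its difference with its neighbors every step."""
    c = list(concentration)
    for _ in range(steps):
        nxt = c[:]
        for i in range(len(c)):
            left = c[i - 1] if i > 0 else c[i]
            right = c[i + 1] if i < len(c) - 1 else c[i]
            # Discrete form of the diffusion equation: flow is
            # proportional to the local concentration difference.
            nxt[i] = c[i] + rate * (left - 2 * c[i] + right)
        c = nxt
    return c

# All of the "smell" starts in the kitchen (cell 0)...
house = [1.0, 0.0, 0.0, 0.0, 0.0]
after = diffuse(house, steps=200)
# ...and ends up nearly uniform, with the total amount conserved.
```

Run it for more steps and the concentration flattens further, exactly the even dispersal described above.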

Since we can calculate the diffusion of something, like the smell of cookies, we can train an AI model to accurately predict that spread. Even more interesting, we can also train the AI model to do the reverse: start with an entire house of smells and concentrate them all back into the kitchen. And that's how stable diffusion creates images.

Image Generation

Image generation by stable diffusion involves starting with a completely crude and noisy image and then progressively refining it to produce a clear, detailed image. Think of it as taking the dispersed cookie smell from all over the house and concentrating it back into the kitchen until the smell is just as strong as when the cookies were first baked. That's the goal of stable diffusion in image generation: taking random noise and transforming it into a coherent and high-quality image.

The process begins with a neural network model trained to understand how to add and remove noise from images. During training, the model learns the forward diffusion process by starting with clear images and progressively adding noise until the images become random and unrecognizable. This is done with thousands of images. It's similar to gradually spreading the cookie smell throughout the house until it is evenly dispersed - over and over, with different types of cookies in differently shaped houses. The model learns to predict the amount and pattern of the noise added at each step, building a detailed map of how images degrade into noise.
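
The forward noising step has a convenient closed form in the standard DDPM-style formulation (the values below are illustrative, not any product's actual settings): a clean image can be pushed straight to any noise level by blending it with Gaussian noise.

```python
import math
import random

rng = random.Random(42)

def forward_diffuse(pixels, a_bar):
    """Noise a clean 'image' (a flat list of pixel values) to the level
    set by a_bar, the surviving signal fraction (1.0 = untouched,
    0.0 = pure noise): x_t = sqrt(a_bar)*x0 + sqrt(1 - a_bar)*eps."""
    return [math.sqrt(a_bar) * p + math.sqrt(1 - a_bar) * rng.gauss(0, 1)
            for p in pixels]

image = [0.9, 0.1, 0.5, 0.3]                         # four clean "pixels"
slightly_noisy = forward_diffuse(image, a_bar=0.99)  # an early step
mostly_noise = forward_diffuse(image, a_bar=0.01)    # a late step
```

At `a_bar` near 1 the image is barely disturbed; near 0 it is almost pure static, which is exactly the degradation the model is trained on.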

Stable diffusion takes random noise and transforms it into a coherent, high-quality image.

Once the AI model understands the forward diffusion process that turns coherent images into noise, it's trained to perform the reverse process to turn the noise into a coherent image. This process is commonly known as denoising. This reversal process is where the magic of image generation happens. The AI model starts with a random noisy image and applies its learned knowledge to iteratively remove the noise. The image is refined step by step until it becomes clear and detailed. This is like filtering out the non-cookie particles from the air in the house, concentrating the cookie smell back into the kitchen.
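
That learned knowledge boils down to predicting the noise hiding inside a noisy image. As a sketch (with made-up values for illustration): if the prediction were perfect, rearranging the forward formula would recover the clean image in one shot; real networks predict imperfectly, which is why denoising happens over many small steps.

```python
import math

# Suppose an image was noised with the closed form
# x_t = sqrt(a_bar)*x0 + sqrt(1 - a_bar)*eps.
a_bar = 0.25
x0 = [0.8, -0.2, 0.4]        # the original clean "pixels"
eps = [0.5, -1.0, 0.3]       # the noise that was actually added
x_t = [math.sqrt(a_bar) * p + math.sqrt(1 - a_bar) * e
       for p, e in zip(x0, eps)]

def estimate_x0(noisy, eps_pred, a_bar):
    """Invert the forward formula: given a noisy image and the noise
    the network predicts, estimate the clean image behind it."""
    return [(x - math.sqrt(1 - a_bar) * e) / math.sqrt(a_bar)
            for x, e in zip(noisy, eps_pred)]

# With a perfect noise prediction, the clean image comes back exactly.
recovered = estimate_x0(x_t, eps, a_bar)
```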

Forward diffusion trains the AI model, while reverse diffusion generates an image.

The key components that make stable diffusion work effectively include the diffusion process itself, the noise schedule, the denoising network, the loss function, and the sampling procedure. The diffusion process involves the controlled addition of noise to a coherent image throughout the AI model training process. This way the model learns how to reverse the noise. The noise schedule defines how much noise is added or removed at each step of the diffusion. It makes sure that the noise is added gradually and consistently throughout the forward diffusion (training) process and then removed smoothly during the reverse diffusion (image generation) process. For example, trying to concentrate the cookie smell back too quickly could leave the concentration uneven and miss some areas of the house. The noise schedule prevents this by controlling the rate of noise addition and removal.
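
A common choice is a linear "beta" schedule in the style of the original DDPM paper; the exact numbers here are illustrative, not any particular product's settings.

```python
# Linear beta schedule: the amount of noise injected at each step.
T = 1000
beta_start, beta_end = 1e-4, 0.02
betas = [beta_start + (beta_end - beta_start) * t / (T - 1)
         for t in range(T)]

# alpha_bars[t] tracks the fraction of the original signal that
# survives after t+1 noising steps.
alpha_bars = []
signal = 1.0
for beta in betas:
    signal *= 1.0 - beta
    alpha_bars.append(signal)

# The signal fades smoothly from nearly 1 (almost clean) to nearly 0
# (almost pure noise), so no single step is jarringly large.
```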

The denoising network is the neural network that performs the actual learning of how to remove noise from the images. It is typically a Convolutional Neural Network (CNN) because CNNs are great at image processing: they capture hierarchical features in images, starting from simple edges and textures and expanding to more complex patterns and objects. This hierarchical feature extraction is perfect for the denoising process. It allows the AI model to understand and reconstruct the fine details in the image as the noise is gradually removed.
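
Here's the basic CNN building block, a convolution, sketched in plain Python (deep-learning frameworks implement this far more efficiently): a small kernel slides across the image, and a simple edge-detecting kernel lights up exactly where pixel values change.

```python
def conv2d(image, kernel):
    """'Valid' 2D convolution (really cross-correlation, as in most
    deep-learning frameworks): slide the kernel across the image and
    take a weighted sum at each position."""
    kh, kw = len(kernel), len(kernel[0])
    rows = len(image) - kh + 1
    cols = len(image[0]) - kw + 1
    return [[sum(image[i + di][j + dj] * kernel[di][dj]
                 for di in range(kh) for dj in range(kw))
             for j in range(cols)]
            for i in range(rows)]

# A 4x4 "image" that is dark on the left and bright on the right.
image = [[0, 0, 1, 1] for _ in range(4)]

# A vertical-edge kernel: responds only where values jump left-to-right.
edge_kernel = [[-1, 1]]
feature_map = conv2d(image, edge_kernel)
```

Stacking many such filters, layer after layer, is what lets the network build up from edges to textures to whole objects.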

The loss function measures the difference between the latest generated image and the final target image.

The loss function is another critical component in training the denoising network. It measures the difference between the denoised image produced by the model and the original clean image. By minimizing this difference, the model learns to produce images that are as close as possible to the original ones. The loss function guides the model during training, ensuring it improves its denoising capability over time.
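
In practice, diffusion models are usually trained to predict the added noise, and the loss is typically a mean squared error between the predicted and actual noise. A minimal sketch:

```python
def mse_loss(predicted_noise, true_noise):
    """Mean squared error between the network's noise prediction and
    the noise that was actually added during forward diffusion."""
    n = len(true_noise)
    return sum((p - t) ** 2 for p, t in zip(predicted_noise, true_noise)) / n

# A perfect prediction scores zero; worse predictions score higher,
# and training nudges the network's weights toward lower scores.
perfect = mse_loss([0.3, -0.7], [0.3, -0.7])
worse = mse_loss([1.0, 0.0], [0.0, 1.0])
```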

The sampling procedure is used when the AI model generates new images. It starts with a random noisy image, and the model applies the learned denoising steps one after another. At each step of the sampling procedure, some of the noise is removed based on the noise schedule, and the image becomes clearer and clearer. This repeats until the final, clear image is generated. In our example, this is the gradual process of filtering out the non-cookie particles from the air and concentrating the desired smell step by step.
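
Putting the pieces together, the loop below is a simplified, deterministic (DDIM-style) sketch of sampling. To keep it self-contained and testable, the "network" is an oracle that returns the true noise, so the loop recovers the original image exactly; a real model's predictions are only approximate.

```python
import math
import random

rng = random.Random(0)

# Linear noise schedule (illustrative values).
T = 50
betas = [1e-4 + (0.02 - 1e-4) * t / (T - 1) for t in range(T)]
alpha_bars = []
signal = 1.0
for b in betas:
    signal *= 1.0 - b
    alpha_bars.append(signal)

x0 = [0.5, -1.0, 2.0]                  # a tiny three-pixel "image"
eps = [rng.gauss(0, 1) for _ in x0]    # the noise that corrupted it

# Start sampling from the fully noised image at the last step.
a_T = alpha_bars[-1]
x = [math.sqrt(a_T) * p + math.sqrt(1 - a_T) * e for p, e in zip(x0, eps)]

def predict_noise(x_t, t):
    # Stand-in for the trained denoising network: an oracle that
    # returns the true noise, so the loop reconstructs x0 exactly.
    return eps

for t in range(T - 1, -1, -1):
    a_t = alpha_bars[t]
    a_prev = alpha_bars[t - 1] if t > 0 else 1.0
    e_hat = predict_noise(x, t)
    # Estimate the clean image implied by the predicted noise...
    x0_hat = [(xi - math.sqrt(1 - a_t) * ei) / math.sqrt(a_t)
              for xi, ei in zip(x, e_hat)]
    # ...then re-noise that estimate to the slightly cleaner level t-1.
    x = [math.sqrt(a_prev) * x0i + math.sqrt(1 - a_prev) * ei
         for x0i, ei in zip(x0_hat, e_hat)]
```

After the last step the noise weight drops to zero and only the reconstructed image remains.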

Text to Image

Now that we can generate clear and coherent images from random noise, we can start generating completely new blended images from text descriptions. This is accomplished through natural language processing (NLP). When a user provides a text prompt, such as "a tree by the lake at sunset," the AI model first extracts the meaning behind the prompt and represents concepts like "tree," "lake," and "sunset" as numerical values. These values guide the image generation process, refining the noisy image into a coherent output that aligns with the text prompt.
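
As a deliberately toy illustration of that conditioning (real systems use a learned text encoder such as CLIP; the word vectors below are invented for the example):

```python
# Invented word vectors for illustration only; a real text encoder
# learns these representations from huge amounts of data.
word_vectors = {
    "tree":   [1.0, 0.0, 0.0],
    "lake":   [0.0, 1.0, 0.0],
    "sunset": [0.0, 0.0, 1.0],
}

def embed_prompt(prompt):
    """Average the vectors of recognized words: a stand-in for the
    numerical representation that steers the denoising process."""
    vecs = [word_vectors[w] for w in prompt.lower().split()
            if w in word_vectors]
    dims = len(next(iter(word_vectors.values())))
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dims)]

condition = embed_prompt("a tree by the lake at sunset")
# Each recognized concept contributes equally; a vector like this is
# fed into the denoiser at every step to steer the emerging image.
```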

Image generated by the text prompt "a tree by the lake at sunset"

As you can imagine, there's a significant amount of math, training images, and computing power that goes into an image generation product. During training, the model learns to recognize patterns and how they degrade when noise is added. The model also learns to effectively reverse the process during image generation. The results are amazing and continue to improve.

It's also worth mentioning that stable diffusion is not limited to image generation. The same concepts can be applied to other types of content, such as text, audio, and video. The process is the same: train an AI model to understand how the content degrades into noise, then learn to reverse the process to reconstruct clear, coherent content. For example, in audio generation, the model could begin with a noisy sound and progressively remove the noise to produce clear, high-quality audio.

For those interested in learning more about stable diffusion and its applications in image generation, I recommend getting to know convolutional neural networks better. Gaining hands-on experience with open-source repositories is a great way to learn more. And while you're doing that, I recommend enjoying some warm freshly baked cookies.


More articles by John Kanalakis