Turn a Generative AI Model into a Data Factory — Part One
Augmented image data through text and latent space of Diffusion Model


The availability of high-quality training data is critical to the success of any AI project. Unfortunately, collecting and labeling large volumes of data is time-consuming, expensive, and in some cases impractical. Traditional solutions to this problem, such as conventional data augmentation algorithms, synthetic data generation, and manual labeling, have their own limitations and drawbacks.

In this series of blog posts, I want to explore a relatively new approach to tackling the training data issue: using foundation (generative) models to mass-produce annotated datasets. I'll examine a simple, real-world example of this application: using generative models to complement and augment large-scale labeled datasets such as the CelebA dataset, which contains over 200K celebrity images, each with 40 attribute annotations. The code shown later in the post is built on top of existing work from the SEGA repo, which explores latent space editing on diffusion model output. You can find all the code from this post on GitHub: https://github.com/asrlhhh/diffusion-data-factory

This post will explore the following topics:

  • Existing data labeling solutions
  • How the diffusion model works
  • The concept of data distillation
  • The mass production of annotated datasets through the diffusion model’s text space
  • The mass production of annotated datasets through the diffusion model’s latent space

In upcoming posts, I'll assess the effectiveness of diffusion models in comparison to other methods, focusing on their ability to augment data and labels while maintaining existing features and preserving facial identity. I'll also examine the performance gains that these augmented datasets bring to the models trained on them.


Limitations of existing data labeling solutions

Despite the usefulness of existing solutions, they are not without limitations:

  • Data augmentation algorithms are usually too simplistic to add meaningful diversity to the dataset
  • Synthetic data generation can be computationally intensive and may not fully reflect the complexity of real-world data
  • Employing human labor for labeling is costly and time-consuming, and the results can be inconsistent in quality

These limitations make it challenging to obtain sufficient quantities of high-quality labeled data, which can impede the development of effective AI models. This is an especially big problem for individual ML engineers, small teams, and early-stage companies.


Clockwise from top left: 1) Illustration of traditional data augmentation techniques on image data [source]. 2) A sample of a human data labeling tool in action [Source] 3) NVIDIA Omniverse Replicator for DRIVE Sim — showcasing synthetic data generation capabilities [Source] 4) An example of image data generated by early generative models, such as StyleGAN.


Diffusion models and Stable Diffusion

Foundation models, such as diffusion models, address the challenge of insufficient labeled data by learning the distribution of the training data and generating high-quality, realistic samples from it. Diffusion models employ denoising score matching: data is corrupted with noise over multiple steps, and a learned denoising function then reverses the process, removing the noise step by step. This produces realistic samples that resemble the training data, making diffusion models well suited for creating complex datasets.
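To make the training objective concrete, here is a minimal sketch of the forward noising step and the denoising loss described above (illustrative only; the noise schedule values and the model are placeholders, not code from this post's repo):

import torch
import torch.nn.functional as F

# Noise schedule: alpha_bar[t] is the fraction of the original signal remaining at step t
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)

def forward_noise(x0, t):
   # Corrupt clean images x0 to timestep t in closed form: x_t = sqrt(a)*x0 + sqrt(1-a)*eps
   eps = torch.randn_like(x0)
   a = alpha_bar[t].view(-1, 1, 1, 1)
   return a.sqrt() * x0 + (1.0 - a).sqrt() * eps, eps

def denoising_loss(model, x0):
   # Train the model to predict the injected noise (the denoising score matching objective)
   t = torch.randint(0, T, (x0.shape[0],))
   x_t, eps = forward_noise(x0, t)
   return F.mse_loss(model(x_t, t), eps)

At inference time, the trained model runs this process in reverse, starting from pure noise and removing a little noise at each step until a realistic sample emerges.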

Diffusion process in training & inference time

Stable Diffusion offers an efficient alternative to pixel-space diffusion models by working in latent space. It uses an autoencoder to compress image data, which accelerates the diffusion process. The inner diffusion model conditions generation on text prompt embeddings through a cross-attention mechanism. This lets Stable Diffusion generate high-quality, annotated images by linking textual descriptions to visuals, enabling large-scale labeled datasets with minimal human intervention.
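For reference, text-to-image generation with Stable Diffusion takes only a few lines using the Hugging Face diffusers library (a minimal sketch; the checkpoint ID and generation settings are my own assumptions, not the exact configuration used later in this post):

import torch
from diffusers import StableDiffusionPipeline

# Assumed checkpoint; any Stable Diffusion weights compatible with diffusers will work
pipe = StableDiffusionPipeline.from_pretrained(
   "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

gen = torch.Generator(device="cuda").manual_seed(42)  # fixed seed for reproducibility
image = pipe(
   prompt="A professional colorful headshot of a smiling young male, blonde wavy hair",
   negative_prompt="blurry, deformed, multiple faces",
   guidance_scale=7.5,
   generator=gen,
).images[0]
image.save("sample.png")

The same pattern of a pipe object, a seeded generator, and a guidance_scale argument reappears in the production loop shown in Step Five.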

Stable Diffusion process

Data distillation

Data distillation, which involves extracting labeled datasets from large models to train smaller, specialized models, is a knowledge distillation methodology that has emerged alongside the development of large foundation models. Work on distilling data from pretrained transformers like GPT began in 2021, and since then, numerous studies have explored data distillation for training data augmentation in large language models. In computer vision, data distillation often focuses on extracting annotated image–label pairs or well-aligned image pairs for image-to-image translation. A 2021 paper initially proposed using synthetic face images generated by StyleGAN to train and test facial recognition models. Subsequently, various studies have investigated leveraging knowledge distillation for tasks like detection and segmentation in the field. [1] [2]


Simple use case: augmenting a benchmark facial image dataset

In this use case, we'll delve into how a large diffusion model, Stable Diffusion, can be employed to augment the widely used CelebA facial dataset through a guided data distillation process. CelebA is a large-scale dataset containing over 200,000 celebrity images, each annotated with 40 attributes along with facial landmarks. Stable Diffusion, as a member of the family of diffusion models, can generate a vast number of synthetic yet realistic-looking facial images, complete with attribute labels. This approach not only accelerates the creation of a high-quality labeled dataset but also mitigates the need for labor-intensive human labeling. By harnessing the power of diffusion models, we can rapidly develop smaller-scale AI models for applications such as facial recognition, attribute prediction, and even content generation in the entertainment industry.

Cover image from official CelebA dataset website


The label imbalance issue in the CelebA dataset

In machine learning, the notion of a flawless dataset is a fallacy. Regardless of how meticulously it is labeled, every dataset inevitably inherits limitations and biases from the real world. The CelebA dataset, for instance, comprises roughly 200,000 images, each annotated with 40 binary labels. Upon closer examination of these annotations, it becomes evident that most labels are imbalanced, skewing heavily toward either positive or negative. This typically reflects the unequal representation of those attributes in the real-world population from which the data were captured.

Stacked bar chart illustrating the annotation distribution of each label

A related issue is visible in the correlation heatmap below. As mentioned, real-world data often exhibit inherent statistical biases, such as the strong positive correlation between ‘Heavy_Makeup’ and ‘Wearing_Lipstick’ or the pronounced negative correlation between ‘Male’ and ‘Heavy_Makeup.’ A model can internalize these biases and generate inaccurate predictions; for instance, whenever it detects heavy makeup, it may reflexively assign a positive ‘Wearing_Lipstick’ label as well.
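Both the imbalance and the correlations are easy to verify directly from CelebA's attribute annotations. A minimal sketch, assuming the standard list_attr_celeba.txt file shipped with the dataset (image count on the first line, attribute names on the second, and ±1 values per image):

import pandas as pd

# Skip the image-count line; pandas uses the filename column as the index
attrs = pd.read_csv("list_attr_celeba.txt", sep=r"\s+", skiprows=1)
attrs = attrs.replace(-1, 0)  # map -1/1 annotations to 0/1 for easier aggregation

# Fraction of positive annotations per label -- most are far from a balanced 0.5
print(attrs.mean().sort_values())

# Pairwise label correlations, e.g. Heavy_Makeup vs. Wearing_Lipstick and Male
corr = attrs.corr()
print(corr.loc["Heavy_Makeup", ["Wearing_Lipstick", "Male"]])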

Heatmap distribution of the correlation score between labels

This is where generative models such as Stable Diffusion can help. By using their text prompts and latent inputs, we can augment the existing dataset and reestablish label balance. Below, I'll provide a step-by-step walkthrough of the codebase used to set up this “data factory.”


Step One: Setting up the base prompt template

As the first step, we create a text prompt template designed to facilitate the generation of labeled facial datasets with diverse attributes. The template covers 35 of the original 40 attribute categories in CelebA. I removed a few attributes that are either beyond the model's capacity (such as ‘5 o'clock shadow’) or likely to raise ethical concerns (such as ‘Attractive’). The rest are common features such as nose size, hair color, bangs, mouth size, cheeks, face shape, eyebrow shape, eye size, skin tone, facial hair, and apparel. For each category, we define an array of possible labels corresponding to different facial features or characteristics. The model then randomly assigns these labels to generate a diverse range of facial images.

prompt_template = {
   "base": "A professional colorful headshot of a{Smiling}{Young} {Male},",
   "hair": "{Wavy_Hair}{hair_color}{Receding_Hairline} hairs,",
   . . . more key, value pairs hidden . . .
   "apparels": " wearing{Wearing_Earrings}{Wearing_Hat}{Wearing_Lipstick}{Wearing_Necklace}{Wearing_Necktie}, "
}

prompt_map = {
   "Smiling": {0:"", 1:"smiling"},
   "Young": {0:"", 1:"young", -1:"old"},
   "Wavy_Hair": {0:"", 1:"wavy", -1:"straight"},
   "hair_color": {0:"", 1:"blonde", 2:"black", 3:"brown", 4:"gray"},
   . . . more key, value pairs hidden . . . 
   "Wearing_Earrings": {0:"", 1:"earrings"},"Wearing_Hat": {0:"", 1:"hat"},
   "Wearing_Lipstick": {0:"", 1:"lipstick"},"Wearing_Necklace": {0:"", 1:"necklace"},
   "Wearing_Necktie": {0:"", 1:"necktie"}
}        


Step Two: Random initialization of prompt

Then, we define a function called randomize_prompt to dynamically generate diverse facial images by randomizing the text prompt for each iteration. The function ensures that any combination of the 35 features can be included in the generated images by leveraging both positive and negative prompt directions.

To handle any potential mutual exclusions between attributes, the function calls handle_mutual_exlusion before constructing the final prompt. The resulting prompt string is then populated with the randomized values and string representations, ensuring a diverse range of facial features and characteristics in the generated images. By limiting the token count to a predefined threshold token_limit, the function ensures that the generated prompt remains within the model's input constraints.

def randomize_prompt():
   # randomly select a key value from each key of the prompt_map
   # make sure that there are half the keys that have non -1 values
   # and return the prompt
   . . . code hidden: builds prompt_map_random, prompt_map_random_val, and negative_prompts . . .
   # randomly select half of the keys for positive prompts and the other for negative prompts; save in prompt_template_keys dict
   prompt_final = ""
   prompt_template_keys = list(prompt_template.keys())
   prompt_template_keys.remove("base")
   random.shuffle(prompt_template_keys)
   prompt_template_keys = ["base"] + prompt_template_keys
   new_prompt_map_random = {}
   for i,prompt_key in enumerate(prompt_template_keys):
       container = PromptContainer(prompt_template[prompt_key])
       current_prompt_keys = container.get_all_variables()
       for key in current_prompt_keys:
           new_prompt_map_random[key] = prompt_map_random[key]
       tmp_prompt = container.populate(prompt_map_random_val)
       if len(token_counter(prompt_final+tmp_prompt)['input_ids']) > token_limit:
           break
       prompt_final += tmp_prompt
   if prompt_final[-1:] == ",":
       prompt_final = prompt_final[:-1] + "."
   prompt_final = prompt_final + " high quality, detailed."
   return prompt_final, negative_prompts, prompt_map_random        


Step Three: Label retrieval

We define the function get_labels to derive the attribute labels for each generated image from its randomized text prompt. This function is crucial for creating a labeled dataset that can be used to train AI models on various facial attributes.

def get_labels(prompt_map_random_val):
   # get the labels for the prompt
   labels = {}
   for key in prompt_map_random_val:
       if key == "hair_color":
           hair_map = hair_color_map[prompt_map_random_val[key]]
           labels.update(hair_map)
       elif key == "Wavy_Hair":
           hair_map = hair_style_map[prompt_map_random_val[key]]
           labels.update(hair_map)
       else:
           labels[key] = prompt_map_random_val[key]
   return labels        


Step Four: Filtering low-quality output images

Four of ten images randomly sampled from Stable Diffusion are unclean or corrupted

When generating facial images, it's essential to ensure that the output meets the desired quality standards. However, a large diffusion model can sometimes generate corrupted images, leading to distorted faces, partial faces, multiple faces, or even completely blank frames. To address this issue, I've designed a function called has_single_high_quality_face to filter out and discard low-quality outputs, along with their labels, during the production phase.

This function takes an input image and a minimum confidence threshold as arguments and checks whether the image contains exactly one high-quality face. It uses the DeepFace library for initial face detection, with the RetinaFace model as the detector backend. The function then filters faces by confidence score, ensuring that only images with a single face meeting the specified threshold are retained.

The function also uses Dlib's shape predictor to detect facial landmarks and calculates a bounding box that includes the hair. It then verifies that the entire bounding box lies within the image dimensions. This check helps guarantee the integrity of the resulting facial images by ensuring each face is complete and preventing multiple faces from falling within the same bounding box.

By employing this filtering step, the dataset production pipeline ensures that only high-quality facial images (those most similar to images of faces captured in the real world) are included in the final dataset, enhancing the effectiveness and reliability of the generated dataset for various AI applications.

import dlib
from deepface import DeepFace

def has_single_high_quality_face(img, min_confidence=0.9):
   """
   Check if the input image has one and only one high-quality face.
  
   Args:
        img (numpy.ndarray): Input image as a NumPy array.
       min_confidence (float): Minimum confidence score to consider a face as high-quality.
      
   Returns:
       bool: True if the image contains one and only one high-quality face, False otherwise.
   """

   # Detect faces using DeepFace
   face_detections = DeepFace.extract_faces(img, detector_backend='retinaface', enforce_detection=False)
  
   # Filter high-quality faces based on confidence
   high_quality_faces = [face for face in face_detections if face['confidence'] >= min_confidence]
   if len(high_quality_faces) != 1:
       return False

   # Use Dlib's shape predictor for facial landmarks
   predictor_path = "./shape_predictor_68_face_landmarks.dat"
   predictor = dlib.shape_predictor(predictor_path)
   detector = dlib.get_frontal_face_detector()
   dlib_rects = detector(img, 1)
   if len(dlib_rects) != 1:
       return False
   dlib_rect = dlib_rects[0]

   # Get facial landmarks
   shape = predictor(img, dlib_rect)
   padding_ratio = 10
  
   # Calculate the bounding box including hair
   min_x = max(0, dlib_rect.left() - int(padding_ratio * dlib_rect.width()))
   min_y = max(0, dlib_rect.top() - int(2 * padding_ratio * dlib_rect.height()))
   max_x = min(img.shape[1], dlib_rect.right() + int(padding_ratio * dlib_rect.width()))
   max_y = min(img.shape[0], dlib_rect.bottom() + int(padding_ratio * dlib_rect.height()))

   # Check if the entire bounding box is within the image dimensions
   if min_x >= 0 and min_y >= 0 and max_x <= img.shape[1] and max_y <= img.shape[0]:
       return True
   else:
       return False        


Step Five: Save the seed, prompts, and labels

As the final step of generating the original face images, we save each image locally along with its seed number, positive prompt, negative prompt, and labels. This allows us to reproduce each data point during the augmentation process later.

for seed in tqdm(range(20000)):
   gen.manual_seed(seed)
   initial_prompt, negative_prompts, prompt_map = randomize_prompt()
   negative_prompts = ", ".join(negative_prompts)
   labels = get_labels(prompt_map)
   out = pipe(prompt=initial_prompt, negative_prompt=negative_prompts, generator=gen, num_images_per_prompt=num_images_per_prompt, guidance_scale=guidance_scale)
   images = out.images
   image = images[0]
   image = pil_to_numpy(image)
   if_face = has_single_high_quality_face(image, min_confidence=0.9)
   if if_face:
       filtered_seeds.append({"seed":seed,"init prompt":initial_prompt,"negative prompt":negative_prompts, "label":labels})
       image_path = str(seed).zfill(4)+".png"
       image_path_full = os.path.join(selected_folder_full_path,image_path)
       save_image(image, image_path_full)
   else:
       image_path = str(seed).zfill(4)+".png"
       image_path_full = os.path.join(filtered_folder_full_path,image_path)
       save_image(image, image_path_full)  # keep rejected images in a separate folder for inspection

with open("selected_seeds.json", "w") as file:
   json.dump(filtered_seeds, file)        
Fifty samples of the generated images


Step Six: Traversing the latent space for a cleaner, smoother transition

The CelebA dataset contains 202,599 images of 10,177 unique identities, with most identities having multiple images (roughly 20 images per identity on average). Therefore, to recreate the dataset, it's necessary to generate faces of the same identities in different scenarios and possibly with different features.

Latent space interpolation in diffusion models achieves this by augmenting facial images with arbitrary features and varying strengths. By leveraging the principles of semantic guidance (SEGA) and isolating semantics within the latent space, it becomes possible to manipulate the image generation process in a controlled manner. The high-dimensional latent space can be understood as a composition of sub-spaces representing various semantic concepts. By interpolating between these sub-spaces, one can generate images with specific attributes or features.

For instance, let’s consider the task of augmenting a facial image with a specific hairstyle or facial feature. By identifying the dimensions within the latent space that encode the target concept, such as the desired hairstyle or feature, we can calculate the noise estimate conditioned on that concept description. The difference between the conditioned and unconditioned estimates can then be scaled and applied to the original image, effectively modifying the image to incorporate the desired feature.
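Conceptually, this happens at the noise-estimate level inside each denoising step. The sketch below illustrates the core idea under simplified assumptions (a single edit concept, with no warm-up, momentum, or per-dimension thresholding, all of which SEGA adds on top); unet, z_t, and the embedding tensors are placeholders rather than code from the repo:

def guided_noise_estimate(unet, z_t, t, uncond_emb, prompt_emb, edit_emb,
                          guidance_scale=7.5, edit_scale=5.0, reverse_edit=False):
   # Noise estimates conditioned on nothing, on the base prompt, and on the edit concept
   eps_uncond = unet(z_t, t, encoder_hidden_states=uncond_emb).sample
   eps_prompt = unet(z_t, t, encoder_hidden_states=prompt_emb).sample
   eps_edit = unet(z_t, t, encoder_hidden_states=edit_emb).sample

   # Standard classifier-free guidance toward the base prompt
   eps = eps_uncond + guidance_scale * (eps_prompt - eps_uncond)

   # Semantic guidance: shift the estimate along (or against) the edit concept direction
   direction = -1.0 if reverse_edit else 1.0
   return eps + direction * edit_scale * (eps_edit - eps_uncond)

The scaled difference between the edit-conditioned and unconditioned estimates is the "noise estimate conditioned on that concept description" mentioned above, and flipping its sign removes the concept instead of adding it.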

Comparison between manipulating images in latent space vs. in text space

This method enables precise control over the strength and combination of arbitrary features added to the generated image. Through latent space interpolation in diffusion models, we can generate a diverse array of facial images with varying attributes, greatly expanding possibilities for creating large-scale, annotated datasets for various AI applications.

Multi-label image transition through latent space


Step Seven: Setting up warm-up and cooldown steps for regional control

Some of the samples above, such as the open-mouth one, exhibit significant changes in overall appearance even under a latent-space transition. To preserve identity during augmentation, we need additional control over the transition process so that edits are more localized and precise.

By leveraging the warm-up parameter δ, it becomes possible to implement regional controls over the image generation process, allowing for more nuanced and targeted augmentation of images. The warm-up parameter δ determines the point in the diffusion process at which guidance γ is applied. By setting a higher value for δ, the guidance is applied later in the process, ensuring that only fine-grained details of the image are altered while preserving the overall composition.
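The code below relies on a pipeline that exposes SEGA's editing arguments. A minimal setup sketch, assuming diffusers' SemanticStableDiffusionPipeline (the SEGA repo ships its own equivalent pipeline, so treat the class name and checkpoint as assumptions):

import torch
from diffusers import SemanticStableDiffusionPipeline

# Assumed checkpoint; the repo may pin a different model
pipe = SemanticStableDiffusionPipeline.from_pretrained(
   "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
gen = torch.Generator(device="cuda")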

for attribute_ind,single_attribute in enumerate(attributes_lst):
       direction_lst = generate_random_bool_list(len(single_attribute))
       guidance_lst = generate_random_float_list(len(single_attribute), edit_guidance_scale_min, edit_guidance_scale_max)
       threshold_lst = generate_random_float_list(len(single_attribute), edit_threshold_min, edit_threshold_max)
       warmup_steps = []
       cooldown_steps = []
       attribute_prompts = []
       for attribute in single_attribute:
           warmup_steps.append(attributes_full[attribute]['warmups'])
           cooldown_steps.append(attributes_full[attribute]['cooldowns'])
           attribute_prompts.append(attributes_full[attribute]['prompt'])
       labels_from_latent = {}
       for attr, direction in zip(single_attribute,direction_lst):
           if attr in labels:
               if direction:
                   labels_from_latent[attr] = -1
               else:
                   labels_from_latent[attr] = 1
       gen.manual_seed(seed)
       out = pipe(prompt = initial_prompt, generator = gen,
                  num_images_per_prompt = num_images_per_prompt, guidance_scale=guidance_scale,
                  editing_prompt= attribute_prompts,negative_prompt=negative_prompt,
                  reverse_editing_direction=direction_lst, # Direction of guidance i.e. increase all concepts
                  edit_warmup_steps= warmup_steps, # Warmup period for each concept
                  edit_cooldown_steps= cooldown_steps,
                  edit_guidance_scale=guidance_lst, # Guidance scale for each concept
                  edit_threshold=threshold_lst, # Threshold for each concept. Threshold equals the percentile of the latent space that will be discarded. I.e. threshold=0.99 uses 1% of the latent dimensions
                  edit_momentum_scale=0.3, # Momentum scale that will be added to the latent guidance
                  edit_mom_beta=0.6, # Momentum beta
                   edit_weights=[1]*len(single_attribute) # Weights of the individual concepts against each other
                 )
       images = out.images
       update_labels = labels.copy()
       update_labels.update(labels_from_latent)
       image = images[0]
       tmp = pil_to_numpy(image)
       if_face = has_single_high_quality_face(tmp, min_confidence=0.9)
       if if_face:
           png_name = str(image_cnt).zfill(6)+".png"
           save_image(tmp, os.path.join(data_folder,png_name))
           annotate_data["path"].append(os.path.join(data_folder,png_name))
           annotate_data["labels"].append(update_labels)
           image_cnt = image_cnt + 1        

This warm-up technique lets users selectively modify specific regions or features within the generated image without affecting other parts. For example, one could apply regional control to change the hairstyle or add accessories to a facial image without altering the face’s underlying structure. The warm-up parameter provides an additional layer of control, allowing for more refined image augmentation and the creation of diverse, high-quality datasets with targeted attribute variations.

Comparison of latent space transitions with (on the left) and without (on the right) warm-up/cooldown controls.

The following examples demonstrate smooth latent transitions achieved using warm-up and cooldown steps.



Step Eight (optional): Enhancing output quality and correcting corruption using face restoration

As an optional final step, the quality of the generated dataset can be enhanced using a blind face restoration model, CodeFormer. This model restores the quality of the facial images produced by the large diffusion model, making them more similar to real-world datasets such as CelebA.

The blind face restoration model works by improving the overall quality and clarity of the generated images while ensuring that all the labeled features remain unchanged. This can be beneficial when the goal is to create a high-quality dataset for training or fine-tuning AI models, as it helps reduce potential biases and improve generalization capabilities.
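For reference, restoration can be run as a batch post-processing pass over the generated image folder using the CodeFormer repository's inference script. The sketch below follows the repo's documented usage, but the flag names and the 0.7 fidelity weight are assumptions that may differ between versions:

import subprocess

# -w trades identity fidelity (lower) against visual quality (higher)
subprocess.run([
   "python", "inference_codeformer.py",
   "-w", "0.7",
   "--input_path", "./generated_faces",          # assumed folder of generated images
   "--output_path", "./generated_faces_restored",
], check=True)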

However, using the blind face restoration model can sometimes result in the removal of some fine-grained details on the face. The trade-off between quality enhancement and detail preservation is a crucial aspect to consider when deciding whether to apply the restoration step. This step is made optional in the pipeline to allow users to balance their dataset’s quality based on their specific requirements and preferences. In some cases, retaining certain fine-grained details might be more critical than obtaining the highest possible image quality, making the optionality of this step valuable.

Comparisons of the original output image (left) with the version that underwent face restoration (right). CodeFormer can correct artifacts, such as in eyes and teeth, introduced by the diffusion model while retaining the facial identity.


Conclusion

In this post, I've walked through the process of setting up a diffusion model as a data factory, with an emphasis on generating high-quality facial datasets whose diverse annotations resemble, and in some respects improve on, the characteristics and distributions found in the CelebA dataset. When configured appropriately, foundation models can act as potent data-generation engines. Nonetheless, they have limitations that may require collaboration with other AI models or algorithms to produce accurate data and annotations.

In future posts, I’ll explore other functionalities of Stable Diffusion and compare diffusion and other generative models, investigating their ability to augment training and testing datasets. Stay tuned for upcoming articles in this series, and follow my LinkedIn account to stay informed about my ongoing journey in this field.


Coming next: Exploring Image Prompt for Augmenting Existing Data

In the next post, I’ll explore the use of the image prompt component of Stable Diffusion to facilitate granular editing on existing CelebA images, rather than generating new ones. By encoding the original CelebA images into latent vectors, we can employ the same semantic guidance methodology to modify these images for enhanced facial features.

