Fine-Tuning Stable Diffusion with Dreambooth
Dreambooth is a technique that lets you train your own model with just a few images of a subject or style. In the paper, the authors state:
“We present a new approach for ‘personalization’ of text-to-image diffusion models (specializing them to users’ needs).”
In this blog, we will explore how to train Dreambooth, discuss its hyperparameters, and look into how to train on images using captions. Let’s dive in!
How to train Dreambooth?
First of all, you need to prepare your training data. If you prefer collecting your own images, you can take 4–10 pictures of the specific subject. However, if you are training on a person’s face, it is better to gather a few more images. Alternatively, you can try training Dreambooth using the datasets available at this link.
Gathering more images is important, but obtaining high-quality images is even more crucial: the quality of the input images directly impacts the quality of the output images.
Secondly, you need to resize your images to 512x512 before providing them to the model. You can use this website to resize the images.
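If you prefer the command line, ImageMagick’s mogrify can center-crop and resize a whole folder in one step. This is a minimal sketch, assuming JPEG inputs and that ImageMagick is installed; mogrify edits files in place, so keep a backup of your originals:

```bash
# Center-crop and resize every JPEG in the current folder to 512x512.
# WARNING: mogrify overwrites files in place, so back up originals first.
mogrify -resize 512x512^ -gravity center -extent 512x512 *.jpg
```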
We will use 18 images of Elon Musk to train Dreambooth.
Before proceeding to train Dreambooth, let’s look into some hyperparameters.
We also added two hyperparameters to the Dreambooth script for validation: save_guidance_scale and save_infer_steps. Now, we can look at the run code for training and the results below.
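The main knobs in the stock train_dreambooth.py script are instance_prompt (a prompt containing a unique identifier for your subject), class_prompt together with with_prior_preservation (which mix generated class images back in to fight overfitting), learning_rate, and max_train_steps. Below is a minimal sketch of how such a run is launched; the base model, paths, prompts, identifier, and values shown are placeholder assumptions rather than our exact settings, and save_guidance_scale / save_infer_steps exist only in our modified copy of the script.

```bash
# Sketch of a Dreambooth training run. "sks" is a placeholder identifier,
# paths and values are placeholders, and save_guidance_scale /
# save_infer_steps are the custom flags from our modified script.
accelerate launch train_dreambooth.py \
  --pretrained_model_name_or_path="runwayml/stable-diffusion-v1-5" \
  --instance_data_dir="./data/elon_musk" \
  --instance_prompt="a photo of sks man" \
  --class_data_dir="./data/man" \
  --class_prompt="a photo of man" \
  --with_prior_preservation \
  --prior_loss_weight=1.0 \
  --num_class_images=200 \
  --resolution=512 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=1 \
  --learning_rate=2e-6 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --max_train_steps=800 \
  --save_guidance_scale=7.5 \
  --save_infer_steps=50 \
  --output_dir="./dreambooth-elon"
```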
As you can see, these images are not good enough.
After the initial training, we adjusted max_train_steps to 1600 and re-ran the same command. Let’s look at the results below.
Now we have better images compared to the previous ones.
How to train style images using captions?
This is a method that lets you train your style images with captions. First of all, we prepared the dataset that we will be using for training. We decided to train on a movie style, Spider-Verse. You can take a look at the images below to get a sense of this style.
We collected 34 images in the Spider-Verse style and crafted a caption for each one. Gathering more high-quality images could further improve the results.
For each training image, we have created a txt file with the same name as the image. The structure is as follows:
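As a hypothetical example (the file names here are placeholders), the folder looks something like this, with one .txt caption per image:

```
spiderverse_dataset/
├── 0001.png
├── 0001.txt
├── 0002.png
├── 0002.txt
└── ...
```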
We typically describe each image like this: “a middle-aged man, upper body, short brown hair, brown mustache, wearing blue and purple shirt, glasses, lightings in the background”
Now let’s look at the run code below. We adjusted the train_dreambooth.py script to support captions. Class images were not used. Additionally, we chose the unique identifier “spdrvrs” and set max_train_steps to 6000.
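This is roughly how the caption-based run is launched. Our modified train_dreambooth.py reads each image’s .txt file as its caption, so the stock script will not accept this data layout; the paths and the exact wording of instance_prompt are placeholder assumptions, while the “spdrvrs” identifier and max_train_steps=6000 are the settings described above. Note that there are no class or prior-preservation flags, since class images were not used.

```bash
# Sketch of the caption-based style run (paths are placeholders; caption
# support comes from our modified train_dreambooth.py, not the stock script).
accelerate launch train_dreambooth.py \
  --pretrained_model_name_or_path="runwayml/stable-diffusion-v1-5" \
  --instance_data_dir="./spiderverse_dataset" \
  --instance_prompt="spdrvrs style" \
  --resolution=512 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=1 \
  --learning_rate=2e-6 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --max_train_steps=6000 \
  --output_dir="./dreambooth-spiderverse"
```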
Let’s look at the results.
For comparison, we also trained a model using the same images but without captions, keeping the other hyperparameters the same. Let’s compare the results with and without captions.
As you can see in the examples, the main problem with Dreambooth is overfitting. In both cases, the model overfits the instance images; however, we obtain better results when using images with captions. Gathering more high-quality images can help reduce overfitting.
On the other hand, we wanted to try Dreambooth LoRA SDXL using the train_dreambooth_lora_sdxl.py script to see if there was any noticeable difference. Using SDXL enhances our results, while using LoRA reduces the total file size. We revised this script as well to support captions. Additionally, there are some hyperparameters in this script that we haven’t explained yet. Let’s explore them.
pretrained_vae_model_name_or_path: path to a pretrained VAE model (we’ll use the sdxl-vae-fp16-fix model)
rank: the inner dimension of the LoRA matrices
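Here is a sketch of how the Dreambooth LoRA SDXL run is launched. The VAE (madebyollin/sdxl-vae-fp16-fix) is the one mentioned above; the paths, rank, learning rate, and step count shown here are placeholder assumptions, and caption support again comes from our modified copy of the script.

```bash
# Sketch of a Dreambooth LoRA SDXL run (values are placeholders).
accelerate launch train_dreambooth_lora_sdxl.py \
  --pretrained_model_name_or_path="stabilityai/stable-diffusion-xl-base-1.0" \
  --pretrained_vae_model_name_or_path="madebyollin/sdxl-vae-fp16-fix" \
  --instance_data_dir="./spiderverse_dataset" \
  --instance_prompt="spdrvrs style" \
  --rank=4 \
  --resolution=1024 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=1 \
  --learning_rate=1e-4 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --mixed_precision="fp16" \
  --max_train_steps=6000 \
  --output_dir="./dreambooth-lora-sdxl-spiderverse"
```

Since LoRA only trains small low-rank adapter matrices, the output is a compact weights file that is loaded on top of the base SDXL model at inference time, rather than a full model checkpoint.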
If you look at the results below, we can say that we have significantly improved the Spider-Verse style images. Moreover, it is worth noting that this model shows no signs of overfitting. Training your own model with Dreambooth LoRA SDXL would be a good choice.
Wiro AI / Machine Learning Team