Comparing Stable Diffusion fine-tuned models for photographic image generation

To test the upcoming AP Workflow 6.0 for ComfyUI, today I want to compare the performance of four different open diffusion models in generating photographic content: SDXL 1.0, Realistic Stock Photo 1.0, RealVisXL 2.0, and CineVisionXL 1.5.

Let's start with the SDXL 1.0 Base model, enhanced by the SDXL Refiner, and compare what a simple prompt can do with four different negative prompts:

  1. No negative prompt
  2. A basic negative prompt
  3. A variant of the basic negative prompt, mainly used for Stable Diffusion 1.5
  4. A negative prompt optimized for photographic image generation

SDXL 1.0 Base+Refiner, with no negative prompt
SDXL 1.0 Base+Refiner, with a basic negative prompt
SDXL 1.0 Base+Refiner, with a variant of the basic negative prompt, usually used for Stable Diffusion 1.5
SDXL 1.0 Base+Refiner, with a negative prompt optimized for photographic image generation
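For anyone reproducing this comparison outside ComfyUI, the four variants boil down to the same generation call with a different negative prompt string. A minimal sketch, using the Hugging Face diffusers keyword names; all prompt strings below are illustrative placeholders, not the exact ones I used:

```python
# Sketch of the four-way negative-prompt comparison. Every prompt string
# here is a hypothetical placeholder, not the one used for the images.
PROMPT = "candid photo of a woman reading in a cafe, natural light"

NEGATIVE_VARIANTS = {
    "no_negative": "",
    "basic": "ugly, deformed, blurry",
    "sd15_variant": "lowres, bad anatomy, bad hands, worst quality, jpeg artifacts",
    "photo_optimized": "illustration, painting, render, 3d, cgi, cartoon, anime",
}

def build_runs(prompt, variants, steps=30, cfg=10.0):
    """One generation config per negative-prompt variant; everything else fixed."""
    return [
        {"label": name, "prompt": prompt, "negative_prompt": neg,
         "num_inference_steps": steps, "guidance_scale": cfg}
        for name, neg in variants.items()
    ]

for run in build_runs(PROMPT, NEGATIVE_VARIANTS):
    # Each dict maps 1:1 onto a diffusers pipeline call, e.g.
    # image = pipe(**{k: v for k, v in run.items() if k != "label"}).images[0]
    print(run["label"], "->", run["negative_prompt"] or "(none)")
```

The point of the structure: only `negative_prompt` varies, so any difference between the four images is attributable to it.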


As you can see, how you write your prompt matters immensely.

The winner, IMO, is the one with the negative prompt optimized for photographic image generation.

Let's use that as a baseline to see if the optimization technique known as Free Lunch (FreeU) makes a big difference in the generation of the image:

  1. Negative prompt for synthetic photos without Free Lunch
  2. Negative prompt for synthetic photos with Free Lunch v1
  3. Negative prompt for synthetic photos with Free Lunch v2
  4. Bonus: the Free Lunch v2 optimization applied to the SDXL 1.0 Base+Refiner image, but with the basic negative prompt optimized for Stable Diffusion 1.5

Again: SDXL 1.0 Base+Refiner, with a negative prompt optimized for photographic image generation.
SDXL 1.0 Base+Refiner, with a negative prompt for synthetic photos + Free Lunch v1
SDXL 1.0 Base+Refiner, with a negative prompt for synthetic photos + Free Lunch v2
SDXL 1.0 Base+Refiner, with a negative prompt for Stable Diffusion 1.5 + Free Lunch v2

While Free Lunch makes a difference, it doesn't necessarily make the picture better. So we'll stick with the original image, plus the negative prompt optimized for photographic image generation. That will be our baseline.
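For context on what Free Lunch (FreeU) actually does: at inference time it reweights the U-Net's internal features, amplifying the backbone features by a factor b while damping the low-frequency content of the skip connections by a factor s. A rough NumPy sketch of the idea, not the exact implementation (the real method uses stage-specific b1/b2, s1/s2 factors and a different masking scheme):

```python
import numpy as np

def freeu_reweight(backbone, skip, b=1.1, s=0.6):
    """Rough sketch of FreeU-style feature reweighting.

    backbone, skip: feature maps of shape (C, H, W).
    b > 1 amplifies the backbone contribution; s < 1 damps the
    low-frequency part of the skip connection in Fourier space.
    """
    backbone = backbone * b

    # Low-pass attenuation of the skip features: scale the central
    # (low) frequencies of the 2D spectrum by s.
    spectrum = np.fft.fftshift(np.fft.fft2(skip), axes=(-2, -1))
    _, h, w = skip.shape
    mask = np.ones((h, w))
    ch, cw = h // 2, w // 2
    mask[ch - 1:ch + 2, cw - 1:cw + 2] = s
    skip = np.fft.ifft2(np.fft.ifftshift(spectrum * mask, axes=(-2, -1))).real
    return backbone, skip

# Toy check on random features: shapes are preserved.
bb, sk = np.random.rand(4, 16, 16), np.random.rand(4, 16, 16)
new_bb, new_sk = freeu_reweight(bb, sk)
print(new_bb.shape, new_sk.shape)
```

In diffusers the same idea is exposed as `pipe.enable_freeu(s1=..., s2=..., b1=..., b2=...)`; in ComfyUI it's the FreeU / FreeU_V2 node used for the images above.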

Now, with all parameters equal, and prompt settings equal, let's compare SDXL 1.0 with its fine-tuned alternatives:

  1. SDXL 1.0 Base+Refiner
  2. Realistic Stock Photo 1.0
  3. RealVisXL 2.0
  4. CineVisionXL 1.5

Again: SDXL 1.0 Base+Refiner, with a negative prompt optimized for photographic image generation.
Realistic Stock Photo 1.0, with a negative prompt optimized for photographic image generation.
RealVisXL 2.0, with a negative prompt optimized for photographic image generation.
CineVisionXL 1.5, with a negative prompt optimized for photographic image generation.

As you can see, my parameters are fine for the SDXL 1.0 Base+Refiner, but the CFG Scale value is too high for the fine-tuned variants, so the images get "overcooked". Let's lower that value.

Each fine-tuned SDXL model has different Steps and CFG Scale values recommended by its creators, but a good compromise that maintains control over the generation is CFG Scale = 7.
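To see why a too-high CFG Scale "overcooks" an image: classifier-free guidance extrapolates the conditional noise prediction away from the unconditional one, and the scale multiplies that push. A minimal sketch:

```python
import numpy as np

def cfg_combine(noise_uncond, noise_cond, scale):
    """Classifier-free guidance: start from the unconditional prediction
    and push past the conditional one by `scale`."""
    return noise_uncond + scale * (noise_cond - noise_uncond)

# Toy 2-component "noise predictions"
uncond = np.array([0.0, 0.0])
cond = np.array([1.0, -1.0])

for scale in (1.0, 7.0, 10.0):
    print(scale, cfg_combine(uncond, cond, scale))
# scale = 1 reproduces the conditional prediction exactly; larger values
# extrapolate further past it, which is why checkpoints tuned for a low
# CFG saturate ("overcook") at CFG = 10.
```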

What happens in that case?

  1. SDXL 1.0 Base+Refiner with a CFG Scale = 10
  2. Realistic Stock Photo 1.0 with a CFG Scale = 7
  3. RealVisXL 2.0 with a CFG Scale = 7
  4. CineVisionXL 1.5 with a CFG Scale = 7

Again: SDXL 1.0 Base+Refiner, with a negative prompt optimized for photographic image generation, and CFG=10
Realistic Stock Photo 1.0, with a negative prompt optimized for photographic image generation, and CFG=7
RealVisXL 2.0, with a negative prompt optimized for photographic image generation, and CFG=7
CineVisionXL 1.5, with a negative prompt optimized for photographic image generation, and CFG=7

Much better. IMO, Realistic Stock Photo 1.0 is the fine-tuned SDXL variant that generates the most pleasant images.
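Scripted, this round is just a per-checkpoint CFG table with everything else held constant. The checkpoint filenames and step count below are hypothetical stand-ins, not my exact settings:

```python
# Hypothetical checkpoint filenames; the CFG values are the ones used above.
RUNS = [
    ("sdxl_base_1.0.safetensors", 10.0),  # the base model tolerates the higher scale
    ("realistic_stock_photo_1.0.safetensors", 7.0),
    ("realvisxl_2.0.safetensors", 7.0),
    ("cinevisionxl_1.5.safetensors", 7.0),
]

SHARED = {"steps": 30}  # illustrative; prompt, sampler, and seed also held constant

def comparison_grid(runs, shared):
    """One generation config per checkpoint; only `cfg` varies."""
    return [{"ckpt": ckpt, "cfg": cfg, **shared} for ckpt, cfg in runs]

for run in comparison_grid(RUNS, SHARED):
    print(run["ckpt"], "cfg =", run["cfg"])
```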

The question now is:

Would these fine-tuned models perform better with their own recommended settings?

The answer is no. At least not on my Apple Silicon system.

In every test I've done, the DPM++ 2S Ancestral sampler in the KSampler node, paired with the "Karras" scheduler, produces better results than DPM++ SDE, DPM++ 2M, and DPM++ 3M SDE, the samplers recommended for those models.
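In ComfyUI's API workflow format, those choices map onto the KSampler node's `sampler_name` and `scheduler` inputs. A sketch of the comparison loop, trimmed to the scalar inputs (the sampler identifiers are ComfyUI's internal names; seed/steps/cfg values are illustrative):

```python
# The samplers compared above, by their ComfyUI internal identifiers.
COMBOS = [
    ("dpmpp_2s_ancestral", "karras"),  # my winner on Apple Silicon
    ("dpmpp_sde", "karras"),
    ("dpmpp_2m", "karras"),
    ("dpmpp_3m_sde", "karras"),
]

def ksampler_node(sampler_name, scheduler, seed=0, steps=30, cfg=10.0):
    """KSampler node in API-workflow style, trimmed to its scalar inputs
    (model/positive/negative/latent connections omitted)."""
    return {
        "class_type": "KSampler",
        "inputs": {
            "seed": seed,       # fix the seed so only the sampler varies
            "steps": steps,
            "cfg": cfg,
            "sampler_name": sampler_name,
            "scheduler": scheduler,
            "denoise": 1.0,
        },
    }

for sampler, sched in COMBOS:
    node = ksampler_node(sampler, sched)
    print(node["inputs"]["sampler_name"], "+", node["inputs"]["scheduler"])
```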

Of course, NVIDIA cards may behave differently from my Apple MPS backend.

Plus, these generations are done with ComfyUI, which interprets prompt weights differently from A1111 & co.

On my system, SDXL 1.0 Base+Refiner still produces the best results (at least in this test).

So let's close by passing that image through the Face Detailer function of my AP Workflow:

SDXL 1.0 Base+Refiner, with a negative prompt optimized for photographic image generation, CFG=10, and face enhancements.

Much more could be done to this image, but Apple MPS is excruciatingly slow and this little comparison took hours. Imagine what I could do with an NVIDIA system...

If you are interested in the upcoming AP Workflow 6.0, keep an eye on https://perilli.com/ai/comfyui/

AP Workflow 5.0 Control Panel
Upcoming features in AP Workflow 6.0

The takeaways from all of this are:

  1. How you write your prompt matters immensely (and not just in text-to-image generation). Prompt engineering is still very much a thing, despite what some AI pundits keep telling you.
  2. If you want full control over how your image looks, AI image generation remains a dark art, where you have to master a huge number of concepts and tweak an impressive number of parameters.
  3. At least for now, commercial models like Midjourney v6 or DALL·E 3 produce very high-quality images, but it's close to impossible to control the image generation in deep detail. If you want maximum control, you can only count on open models like Stable Diffusion.

Alessandro
