Stable Diffusion v3 is the next generation of the Stable Diffusion family of latent diffusion image models. It outperforms state-of-the-art text-to-image
generation systems in typography and prompt adherence, based on human
preference evaluations. In contrast to previous versions, it is based
on the Multimodal Diffusion Transformer (MMDiT) text-to-image architecture, which
delivers greatly improved image quality, typography,
complex prompt understanding, and resource efficiency.
More details about the model can be found in the model
card,
research
paper
and Stability.AI blog
post. In this
tutorial, we will consider how to convert Stable Diffusion v3 for
running with OpenVINO. An additional part demonstrates how to run
optimization with NNCF to
speed up the pipeline. If you want to run previous Stable Diffusion
versions, please check our other notebooks.
Note: to run the model with this notebook, you will need to accept the license
agreement. You must be a registered user on the Hugging Face Hub.
Please visit the HuggingFace model
card,
carefully read the terms of usage and click the accept button. You will need
to use an access token for the code below to run. For more
information on access tokens, refer to this section of the
documentation.
You can log in to the Hugging Face Hub in the notebook environment using the
following code:
# uncomment these lines to login to huggingfacehub to get access to pretrained model
# from huggingface_hub import notebook_login, whoami

# try:
#     whoami()
#     print('Authorization token already provided')
# except OSError:
#     notebook_login()
We will use the
Diffusers
library integration for running the Stable Diffusion v3 model. You can find
more details in the Diffusers
documentation.
Additionally, we can apply optimizations to improve pipeline performance and
memory consumption:

Remove the T5 text encoder. Removing the memory-intensive 4.7B
parameter T5-XXL text encoder during inference can significantly
decrease the memory requirements for SD3 with only a slight loss in
performance. If you want to use this model in the pipeline, please select the
use t5 text encoder checkbox; a sketch of how the encoder can be dropped otherwise is shown below.
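For reference, this is roughly how the T5 encoder is dropped when loading SD3 through the Diffusers API (a minimal sketch, not the exact code used by the notebook helpers; the model id and dtype below are assumptions):

# A sketch of dropping the T5-XXL encoder in the Diffusers API (the notebook
# controls this through a checkbox instead). Model id and dtype are assumptions.
import torch
from diffusers import StableDiffusion3Pipeline

pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers",
    text_encoder_3=None,  # skip loading the 4.7B parameter T5-XXL encoder
    tokenizer_3=None,
    torch_dtype=torch.float16,
)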
Starting from the 2023.0 release, OpenVINO supports PyTorch models directly
via the Model Conversion API. The ov.convert_model function accepts an instance
of a PyTorch model and example inputs for tracing and returns an object of
the ov.Model class, ready to use or to save on disk using the ov.save_model
function.
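As an illustration, converting a single pipeline component looks roughly like this (a sketch with a placeholder example input; the actual inputs and components are handled by the helper described below):

# A generic sketch of PyTorch -> OpenVINO conversion; example inputs are placeholders.
import torch
import openvino as ov

def convert_component(pt_model: torch.nn.Module, example_input, xml_path: str):
    pt_model.eval()
    with torch.no_grad():
        # Trace the PyTorch model and get an ov.Model ready for inference
        ov_model = ov.convert_model(pt_model, example_input=example_input)
    # Serialize to OpenVINO IR (an .xml file plus a .bin file with the weights)
    ov.save_model(ov_model, xml_path)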
The pipeline consists of several key components:
CLIP and T5 text encoders to create a condition for generating an image
from a text prompt.
Transformer for step-by-step denoising of the latent image representation.
Autoencoder (VAE) for decoding the latent space to an image.
We will use the convert_sd3 helper function defined in
sd3_helper.py, which creates the original PyTorch models
and converts each part of the pipeline using ov.convert_model.
from sd3_helper import convert_sd3

# Uncomment the line below to see model conversion code
# ??convert_sd3
import torch

image = ov_pipe(
    "A raccoon trapped inside a glass jar full of colorful candies, the background is steamy with vivid colors",
    negative_prompt="",
    num_inference_steps=28 if not use_flash_lora.value else 4,
    guidance_scale=5 if not use_flash_lora.value else 0,
    height=512,
    width=512,
    generator=torch.Generator().manual_seed(141),
).images[0]
image
NNCF enables
post-training quantization by adding quantization layers into the model
graph and then using a subset of the training dataset to initialize the
parameters of these additional quantization layers. Quantized operations
are executed in INT8 instead of FP32/FP16, making model
inference faster.
According to the OVStableDiffusion3Pipeline structure, the
transformer model takes up a significant portion of the overall
pipeline execution time. Now we will show you how to optimize the transformer
part using NNCF to reduce
computation cost and speed up the pipeline. Quantizing the rest of the
pipeline does not significantly improve inference performance but can
lead to a substantial degradation of accuracy. That’s why we use 4-bit
weight compression for the rest of the pipeline to reduce its memory
footprint.
Please select below whether you would like to run quantization to
improve model inference speed.
NOTE: Quantization is a time- and memory-consuming operation.
Running the quantization code below may take some time.
We use a portion of the
google-research-datasets/conceptual_captions
dataset from Hugging Face as calibration data. We use the prompts below to
guide image generation and to determine what not to include in the
resulting image.
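For reference, calibration prompts of this kind can be pulled from the dataset roughly as follows (a sketch assuming the datasets library; the actual collection is handled by the helper in sd3_quantization_helper.py):

# A sketch of loading calibration prompts from conceptual_captions (streaming
# avoids downloading the whole dataset); not the exact helper implementation.
from itertools import islice

from datasets import load_dataset

dataset = load_dataset("google-research-datasets/conceptual_captions", split="train", streaming=True)
calibration_prompts = [sample["caption"] for sample in islice(dataset.shuffle(seed=42), 200)]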
To collect intermediate model inputs for calibration, we should customize
CompiledModel. We set the height and width of the image to
512 to reduce memory consumption during quantization.
%%skip not $to_quantize.value
from sd3_quantization_helper import collect_calibration_data, TRANSFORMER_INT8_PATH
# Uncomment the line to see calibration data collection code
# ??collect_calibration_data
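The collection logic typically wraps the compiled transformer so that every inference call caches its inputs; a minimal sketch of this idea is shown here (the actual implementation lives in sd3_quantization_helper.py and may differ):

# A sketch of caching model inputs by wrapping ov.CompiledModel; names and
# details are assumptions, the real logic is in sd3_quantization_helper.py.
import openvino as ov

class CompiledModelDecorator(ov.CompiledModel):
    def __init__(self, compiled_model: ov.CompiledModel, data_cache: list = None):
        super().__init__(compiled_model)
        self.data_cache = data_cache if data_cache is not None else []

    def __call__(self, *args, **kwargs):
        # Store the inputs of every transformer call for later calibration
        self.data_cache.append(*args)
        return super().__call__(*args, **kwargs)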
Quantization of the first Convolution layer impacts the generation
results. We recommend using IgnoredScope to keep accuracy-sensitive
layers in FP16 precision.
%%skip not $to_quantize.value

import nncf
import gc

import openvino as ov

core = ov.Core()

if not TRANSFORMER_INT8_PATH.exists():
    calibration_dataset_size = 200
    print("Calibration data collection started")
    # Run the FP16 pipeline on calibration prompts and cache transformer inputs
    unet_calibration_data = collect_calibration_data(
        ov_pipe,
        calibration_dataset_size=calibration_dataset_size,
        num_inference_steps=28 if not use_flash_lora.value else 4,
        guidance_scale=5 if not use_flash_lora.value else 0,
    )
    print("Calibration data collection finished")

    # Free the FP16 pipeline before quantization to reduce memory pressure
    del ov_pipe
    gc.collect()
    ov_pipe = None

    transformer = core.read_model(TRANSFORMER_PATH)
    quantized_model = nncf.quantize(
        model=transformer,
        calibration_dataset=nncf.Dataset(unet_calibration_data),
        subset_size=calibration_dataset_size,
        model_type=nncf.ModelType.TRANSFORMER,
        # Keep the accuracy-sensitive first convolution in FP16
        ignored_scope=nncf.IgnoredScope(names=["__module.model.base_model.model.pos_embed.proj.base_layer/aten::_convolution/Convolution"]),
    )
    ov.save_model(quantized_model, TRANSFORMER_INT8_PATH)
Quantizing the Text Encoders and the Autoencoder does not
significantly improve inference performance but can lead to a
substantial degradation of accuracy.
To reduce model memory consumption, we will use weight compression.
The Weights
Compression
algorithm is aimed at compressing the weights of models and can be
used to optimize the footprint and performance of large models
where the size of weights is relatively larger than the size of
activations, for example, Large Language Models (LLM). Compared to INT8
compression, INT4 compression improves performance even more, but
introduces a minor drop in prediction quality.
%%skip not $to_quantize.value
from sd3_quantization_helper import compress_models
compress_models()
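Under the hood, compression of an individual model can look roughly like this (a minimal sketch assuming INT4 weight compression with nncf.compress_weights; the exact parameters and paths used by compress_models may differ):

# A sketch of INT4 weight compression for a single converted model; the paths
# and compression parameters are assumptions, see sd3_quantization_helper.py.
import nncf
import openvino as ov

core = ov.Core()
model = core.read_model("text_encoder.xml")          # hypothetical IR path
compressed = nncf.compress_weights(model, mode=nncf.CompressWeightsMode.INT4_SYM)
ov.save_model(compressed, "text_encoder_int4.xml")   # hypothetical output path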
%%skip not $to_quantize.value

from sd3_quantization_helper import visualize_results

opt_image = optimized_pipe(
    "A raccoon trapped inside a glass jar full of colorful candies, the background is steamy with vivid colors",
    negative_prompt="",
    num_inference_steps=28 if not use_flash_lora.value else 4,
    guidance_scale=5 if not use_flash_lora.value else 0,
    height=512,
    width=512,
    generator=torch.Generator().manual_seed(141),
).images[0]

visualize_results(image, opt_image)
Compare inference time of the FP16 and optimized pipelines
To measure the inference performance of the FP16 and optimized
pipelines, we use mean inference time on 5 samples.
NOTE: For the most accurate performance estimation, it is
recommended to run benchmark_app in a terminal/command prompt
after closing other applications.
%%skip not $to_quantize.value
from sd3_quantization_helper import compare_perf
compare_perf(models_dict, opt_models_dict, device.value, use_flash_lora.value, validation_size=5)
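The comparison boils down to timing several end-to-end generations with each pipeline; a simplified sketch of such a measurement is shown here (the prompt, step count and sample count are assumptions; compare_perf in sd3_quantization_helper.py implements the full version):

# A simplified sketch of measuring mean pipeline latency; prompt, step count
# and sample count are assumptions, the real logic lives in compare_perf.
import time

def mean_latency(pipe, prompt="a photo of a cat", n_samples=5, num_inference_steps=28):
    timings = []
    for _ in range(n_samples):
        start = time.perf_counter()
        pipe(prompt, num_inference_steps=num_inference_steps, height=512, width=512)
        timings.append(time.perf_counter() - start)
    return sum(timings) / len(timings)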
from gradio_helper import make_demo

ov_pipe = init_pipeline(models_dict if not use_quantized_models.value else opt_models_dict, device.value, use_flash_lora.value)

demo = make_demo(ov_pipe, use_flash_lora.value)

# if you are launching remotely, specify server_name and server_port
# demo.launch(server_name='your server name', server_port='server port in int')
# if you have any issue to launch on your platform, you can pass share=True to launch method:
# demo.launch(share=True)
# it creates a publicly shareable link for the interface. Read more in the docs: https://gradio.app/docs/
try:
    demo.launch(debug=False)
except Exception:
    demo.launch(debug=False, share=True)