Stable Diffusion is a text-to-image latent diffusion model created by
the researchers and engineers from CompVis, Stability AI and LAION. It
is trained on 512x512 images from a subset of the LAION-5B database and
uses a frozen CLIP ViT-L/14 text encoder to condition the model on text
prompts. The model consists of an 860M-parameter UNet and a
123M-parameter text encoder. See the model card for more information.
General diffusion models are machine learning systems that are trained
to denoise random Gaussian noise step by step to arrive at a sample of
interest, such as an image. Diffusion models have been shown to achieve
state-of-the-art results for generating image data. However, one
downside of diffusion models is that the reverse denoising process is
slow. In addition, these models consume a lot of memory because they
operate in pixel space, which becomes unreasonably expensive when
generating high-resolution images. Therefore, it is challenging both to
train these models and to use them for inference. OpenVINO brings the
capability to run model inference on Intel hardware and opens the door
to the fantastic world of diffusion models for everyone!
The model's capabilities are not limited to text-to-image generation; it
can also solve additional tasks, for example text-guided image-to-image
generation and inpainting. This tutorial also covers how to run
text-guided image-to-image generation using Stable Diffusion.
This notebook demonstrates how to convert and run the Stable Diffusion
model using OpenVINO.
As you can see from the diagram, the only difference between
Text-to-Image and text-guided Image-to-Image generation is how the
initial latent state is produced. In the case of Image-to-Image
generation, an image encoded by the VAE encoder is additionally mixed
with the noise produced from the latent seed, while Text-to-Image uses
only noise as the initial latent state. The Stable Diffusion model takes
as input both a latent image representation of size 64×64 and a text
prompt that is transformed into text embeddings of size 77×768 via
CLIP’s text encoder.
Next, the U-Net iteratively denoises the random latent image
representations while being conditioned on the text embeddings. The
output of the U-Net, being the noise residual, is used to compute a
denoised latent image representation via a scheduler algorithm. Many
different scheduler algorithms can be used for this computation, each
having its pros and cons. For Stable Diffusion, it is recommended to use
one of:
PNDM scheduler (used by default)
DDIM scheduler
K-LMS scheduler
The theory behind how the scheduler algorithm works is out of scope for
this notebook. Nonetheless, in short, you should remember that you
compute the predicted denoised image representation from the previous
noise representation and the predicted noise residual. For more
information, refer to the recommended Elucidating the Design Space of
Diffusion-Based Generative Models.
The denoising process is repeated a given number of times (by default
50) to retrieve step by step better latent image representations. When
complete, the latent image representation is decoded by the decoder part
of the variational autoencoder.
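For illustration only, the loop below is a minimal, diffusers-style sketch of this denoising process; it is not the code used by the OpenVINO pipeline in this notebook, and it assumes that unet, vae, scheduler and text_embeddings objects and an initial latents tensor have already been prepared with the diffusers library.
import torch

# Illustrative sketch of the denoising loop; unet, vae, scheduler,
# text_embeddings and latents are assumed to be prepared beforehand.
scheduler.set_timesteps(50)
with torch.no_grad():
    for t in scheduler.timesteps:
        # the U-Net predicts the noise residual, conditioned on the text embeddings
        noise_pred = unet(latents, t, encoder_hidden_states=text_embeddings).sample
        # the scheduler computes the denoised latents for the next step
        latents = scheduler.step(noise_pred, t, latents).prev_sample
    # the VAE decoder turns the final latents back into an image
    image = vae.decode(latents / vae.config.scaling_factor).sample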
Load Stable Diffusion model and create text-to-image pipeline
We will load an optimized Stable Diffusion model from the Hugging Face
Hub and create a pipeline to run inference with OpenVINO Runtime using
Optimum Intel.
For running the Stable Diffusion model with Optimum Intel, we will use
the optimum.intel.OVStableDiffusionPipeline class, which represents the
inference pipeline. OVStableDiffusionPipeline is initialized by the
from_pretrained method. It supports on-the-fly conversion of models from
PyTorch using the export=True parameter. A converted model can be saved
on disk with the save_pretrained method and reused in subsequent runs.
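As a minimal sketch (the model identifier and output directory below are placeholders, not necessarily the ones used later in this notebook), loading and saving could look like this:
from optimum.intel import OVStableDiffusionPipeline

# placeholder model id; any Stable Diffusion checkpoint from the Hugging Face Hub can be used
model_id = "stable-diffusion-v1-5/stable-diffusion-v1-5"
# export=True converts the PyTorch weights to OpenVINO IR on the fly
ov_pipe = OVStableDiffusionPipeline.from_pretrained(model_id, export=True)
# save the converted model so the next run can skip the conversion
ov_pipe.save_pretrained("stable-diffusion-ov")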
When Stable Diffusion models are exported to the OpenVINO format, they
are decomposed into three components (with the VAE split into an encoder
and a decoder), giving four models that are combined into the pipeline
during inference:
The text encoder
The text encoder is responsible for transforming the input
prompt (for example, “a photo of an astronaut riding a horse”) into
an embedding space that can be understood by the U-Net. It is
usually a simple transformer-based encoder that maps a sequence of
input tokens to a sequence of latent text embeddings.
The U-Net
The U-Net model predicts the sample state for the next step.
The VAE encoder
The encoder is used to convert the image into a low-dimensional
latent representation, which will serve as the input to the U-Net
model.
The VAE decoder
The decoder transforms the latent representation back into an
image.
Select a device from the dropdown list for running inference using OpenVINO.
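One common way to build such a dropdown (a sketch; the notebook may create this widget with a helper function) is to query the devices available to the OpenVINO runtime:
import openvino as ov
import ipywidgets as widgets

core = ov.Core()
# list the devices detected by OpenVINO and add AUTO for automatic device selection
device = widgets.Dropdown(
    options=core.available_devices + ["AUTO"],
    value="AUTO",
    description="Device:",
    disabled=False,
)
device
The selected value, device.value, can then be used when compiling the pipeline for inference.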
Now, you can define a text prompt for image generation and run the
inference pipeline.
Note: Consider increasing the number of steps to get more precise results.
A suggested value is 50, but it will take longer to process.
import ipywidgets as widgets

sample_text = (
    "cyberpunk cityscape like Tokyo New York with tall buildings at dusk golden hour cinematic lighting, epic composition. "
    "A golden daylight, hyper-realistic environment. "
    "Hyper and intricate detail, photo-realistic. "
    "Cinematic and volumetric light. "
    "Epic concept art. "
    "Octane render and Unreal Engine, trending on artstation"
)
text_prompt = widgets.Text(value=sample_text, description="your text")
num_steps = widgets.IntSlider(min=1, max=50, value=20, description="steps:")
seed = widgets.IntSlider(min=0, max=10000000, description="seed: ", value=42)
widgets.VBox([text_prompt, num_steps, seed])
VBox(children=(Text(value='cyberpunk cityscape like Tokyo New York with tall buildings at dusk golden hour ci…
print("Pipeline settings")print(f"Input text: {text_prompt.value}")print(f"Seed: {seed.value}")print(f"Number of steps: {num_steps.value}")
Let’s generate an image and save the generation results. The pipeline
returns one or several results: the images field contains the final
generated image(s). To get more than one result, you can set the
num_images_per_prompt parameter.
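A minimal generation call, assuming the ov_pipe pipeline and the widgets defined above, could look like the sketch below; seeding NumPy's random generator is one way to make the result reproducible.
import numpy as np

# fix the random seed for reproducibility
np.random.seed(seed.value)
# run the text-to-image pipeline with the selected number of steps
result = ov_pipe(text_prompt.value, num_inference_steps=num_steps.value)
final_image = result["images"][0]
final_image.save("result.png")
The interactive Gradio demo below wraps the same generation call.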
import gradio as gr


def generate_from_text(text, seed, num_steps, _=gr.Progress(track_tqdm=True)):
    np.random.seed(seed)
    result = ov_pipe(text, num_inference_steps=num_steps)
    return result["images"][0]


with gr.Blocks() as demo:
    with gr.Tab("Text-to-Image generation"):
        with gr.Row():
            with gr.Column():
                text_input = gr.Textbox(lines=3, label="Text")
                seed_input = gr.Slider(0, 10000000, value=42, step=1, label="Seed")
                steps_input = gr.Slider(1, 50, value=20, step=1, label="Steps")
            out = gr.Image(label="Result", type="pil")
        btn = gr.Button()
        btn.click(generate_from_text, [text_input, seed_input, steps_input], out)
        gr.Examples([[sample_text, 42, 20]], [text_input, seed_input, steps_input])

try:
    demo.queue().launch()
except Exception:
    demo.queue().launch(share=True)
# if you are launching remotely, specify server_name and server_port
# demo.launch(server_name='your server name', server_port='server port in int')
# Read more in the docs: https://gradio.app/docs/
For running the Stable Diffusion model with Optimum Intel, we will use
the optimum.intel.OVStableDiffusionImg2ImgPipeline class, which
represents the inference pipeline. We will use the same model as for the
text-to-image pipeline. The model has already been downloaded from the
Hugging Face Hub and converted to OpenVINO IR format in the previous
steps, so we can simply load it.
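Loading could look like the following sketch; the local directory name and the device variable are assumptions based on the earlier steps.
from optimum.intel import OVStableDiffusionImg2ImgPipeline

# load the previously converted model from its local directory
# (the directory name is a placeholder; use the one passed to save_pretrained)
ov_pipe_i2i = OVStableDiffusionImg2ImgPipeline.from_pretrained("stable-diffusion-ov", device=device.value)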
In addition to a text prompt, Image-to-Image generation requires an
initial image. Optionally, you can also change the strength parameter, a
value between 0.0 and 1.0 that controls the amount of noise added to the
input image. Values approaching 1.0 allow for lots of variation but will
also produce images that are not semantically consistent with the input.
import PIL
import numpy as np


def scale_fit_to_window(dst_width: int, dst_height: int, image_width: int, image_height: int):
    """
    Preprocessing helper function for calculating image size for resize
    with preserving original aspect ratio and fitting image to specific window size

    Parameters:
      dst_width (int): destination window width
      dst_height (int): destination window height
      image_width (int): source image width
      image_height (int): source image height
    Returns:
      result_width (int): calculated width for resize
      result_height (int): calculated height for resize
    """
    im_scale = min(dst_height / image_height, dst_width / image_width)
    return int(im_scale * image_width), int(im_scale * image_height)


def preprocess(image: PIL.Image.Image):
    """
    Image preprocessing function. Takes image in PIL.Image format, resizes it to keep aspect ratio
    and fit the model input window 512x512, then converts it to np.ndarray and adds padding with zeros
    on the right or bottom side of the image (depending on aspect ratio), after that converts data to
    float32 data type and changes the range of values from [0, 255] to [-1, 1]. The function returns
    the preprocessed input tensor and padding size, which can be used in postprocessing.

    Parameters:
      image (PIL.Image.Image): input image
    Returns:
      image (np.ndarray): preprocessed image tensor
      meta (Dict): dictionary with preprocessing metadata info
    """
    src_width, src_height = image.size
    dst_width, dst_height = scale_fit_to_window(512, 512, src_width, src_height)
    image = np.array(image.resize((dst_width, dst_height), resample=PIL.Image.Resampling.LANCZOS))[None, :]
    pad_width = 512 - dst_width
    pad_height = 512 - dst_height
    pad = ((0, 0), (0, pad_height), (0, pad_width), (0, 0))
    image = np.pad(image, pad, mode="constant")
    image = image.astype(np.float32) / 255.0
    image = 2.0 * image - 1.0
    return image, {"padding": pad, "src_width": src_width, "src_height": src_height}


def postprocess(image: PIL.Image.Image, orig_width: int, orig_height: int):
    """
    Image postprocessing function. Takes image in PIL.Image format and metrics of original image.
    Image is cropped and resized to restore initial size.

    Parameters:
      image (PIL.Image.Image): input image
      orig_width (int): original image width
      orig_height (int): original image height
    Returns:
      image (PIL.Image.Image): postprocessed image
    """
    src_width, src_height = image.size
    dst_width, dst_height = scale_fit_to_window(src_width, src_height, orig_width, orig_height)
    image = image.crop((0, 0, dst_width, dst_height))
    image = image.resize((orig_width, orig_height))
    return image
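With these helpers in place, a single image-to-image call could look like the sketch below; the input image path, the prompt and the parameter values are placeholders, and the call mirrors the diffusers image-to-image interface.
from PIL import Image

# placeholder input image and prompt
init_image = Image.open("input.png").convert("RGB")
prompt = "amazing watercolor painting"

# resize, pad and normalize the image to the 512x512 model input window
preprocessed_image, meta = preprocess(init_image)
result = ov_pipe_i2i(prompt, preprocessed_image, num_inference_steps=20, strength=0.5)
# crop the padding away and restore the original resolution
final_image = postprocess(result["images"][0], meta["src_width"], meta["src_height"])
The interactive demo below uses the same preprocess and postprocess helpers.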
ifnotPath("gradio_helper.py").exists():download_file(url="https://raw.githubusercontent.com/openvinotoolkit/openvino_notebooks/latest/notebooks/stable-diffusion-text-to-image/gradio_helper.py")fromgradio_helperimportmake_demodemo=make_demo(ov_pipe_i2i,preprocess,postprocess,default_image_path)try:demo.queue().launch()exceptException:demo.queue().launch(share=True)# if you are launching remotely, specify server_name and server_port# demo.launch(server_name='your server name', server_port='server port in int')# Read more in the docs: https://gradio.app/docs/