Animating Open-domain Images with DynamiCrafter and OpenVINO#

This Jupyter notebook can be launched after a local installation only.


Animating a still image offers an engaging visual experience. Traditional image animation techniques mainly focus on animating natural scenes with stochastic dynamics (e.g., clouds and fluid) or domain-specific motions (e.g., human hair or body motions), which limits their applicability to more general visual content. To overcome this limitation, the DynamiCrafter team explores the synthesis of dynamic content for open-domain images, converting them into animated videos. The key idea is to utilize the motion prior of text-to-video diffusion models by incorporating the image into the generative process as guidance. Given an image, the DynamiCrafter team first projects it into a text-aligned rich context representation space using a query transformer, which helps the video model digest the image content in a compatible fashion. However, some visual details are still hard to preserve in the resulting videos. To supply more precise image information, the team additionally feeds the full image to the diffusion model by concatenating it with the initial noise. Experimental results show that the proposed method produces visually convincing, more logical and natural motions, as well as higher conformity to the input image.
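
To make the two injection paths concrete, below is a minimal tensor-shape sketch (not the DynamiCrafter API; the names are illustrative and the shapes are assumptions chosen to match the 256x256 pipeline converted later): the image contributes cross-attention context tokens alongside the text embedding, and its encoded latent is concatenated with the initial noise along the channel dimension.

import torch

# Illustrative shapes only: latent video of 16 frames with 32x32 latents (256x256 images)
b, c, t, h, w = 1, 4, 16, 32, 32
text_ctx = torch.rand(b, 77, 1024)               # CLIP text embedding (77 tokens)
img_ctx = torch.rand(b, 16 * t, 1024)            # image context from the query transformer (16 tokens per frame)
context = torch.cat([text_ctx, img_ctx], dim=1)  # 77 + 16*16 = 333 cross-attention tokens

noise = torch.randn(b, c, t, h, w)                            # initial noise
img_latent = torch.rand(b, c, 1, h, w).repeat(1, 1, t, 1, 1)  # encoded still image, repeated over time
unet_input = torch.cat([noise, img_latent], dim=1)            # 8-channel U-Net input (noise + image latent)
print(unet_input.shape, context.shape)                        # [1, 8, 16, 32, 32] and [1, 333, 1024]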

In this tutorial, we consider how to use DynamiCrafter with OpenVINO. An additional part demonstrates how to run optimization with NNCF to speed up the pipeline.

“bear playing guitar happily, snowing”

“boy walking on the street”


This is a self-contained example that relies solely on its own code.

We recommend running the notebook in a virtual environment. You only need a Jupyter server to start. For details, please refer to Installation Guide.

Prerequisites#

%pip install -q "openvino>=2024.2.0" "nncf>=2.11.0" "datasets>=2.20.0"
%pip install -q "gradio>=4.19" omegaconf einops pytorch_lightning kornia "open_clip_torch==2.22.0" transformers av opencv-python "torch==2.2.2" --extra-index-url https://download.pytorch.org/whl/cpu
Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.
from pathlib import Path
import requests


if not Path("cmd_helper.py").exists():
    r = requests.get(
        url="https://raw.githubusercontent.com/openvinotoolkit/openvino_notebooks/latest/utils/cmd_helper.py",
    )
    open("cmd_helper.py", "w").write(r.text)

r = requests.get(
    url="https://raw.githubusercontent.com/openvinotoolkit/openvino_notebooks/latest/utils/notebook_utils.py",
)
open("notebook_utils.py", "w").write(r.text)
24624
from cmd_helper import clone_repo


clone_repo("https://github.com/Doubiiu/DynamiCrafter.git", "26e665cd6c174234238d2ded661e2e56f875d360")
PosixPath('DynamiCrafter')

Load and run the original pipeline#

We will use the model for 256x256 resolution as an example. Models for 320x512 and 576x1024 resolutions are also available.
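
If you would like to try a different resolution, the checkpoint repository and inference config change accordingly. The mapping below is only a sketch: the repository IDs and config file names for the larger resolutions are assumptions based on the DynamiCrafter release and may need to be verified before use.

# Hedged sketch: resolution -> (Hugging Face checkpoint repo, inference config).
# The 320x512 and 576x1024 entries are assumptions, not values used in this notebook.
RESOLUTION_CONFIGS = {
    "256x256": ("Doubiiu/DynamiCrafter", "DynamiCrafter/configs/inference_256_v1.0.yaml"),
    "320x512": ("Doubiiu/DynamiCrafter_512", "DynamiCrafter/configs/inference_512_v1.0.yaml"),
    "576x1024": ("Doubiiu/DynamiCrafter_1024", "DynamiCrafter/configs/inference_1024_v1.0.yaml"),
}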

import os
from collections import OrderedDict

import torch
from huggingface_hub import hf_hub_download
from omegaconf import OmegaConf

from utils.utils import instantiate_from_config


def load_model_checkpoint(model, ckpt):
    def load_checkpoint(model, ckpt, full_strict):
        state_dict = torch.load(ckpt, map_location="cpu")
        if "state_dict" in list(state_dict.keys()):
            state_dict = state_dict["state_dict"]
            try:
                model.load_state_dict(state_dict, strict=full_strict)
            except Exception:
                ## rename the keys for 256x256 model
                new_pl_sd = OrderedDict()
                for k, v in state_dict.items():
                    new_pl_sd[k] = v

                for k in list(new_pl_sd.keys()):
                    if "framestride_embed" in k:
                        new_key = k.replace("framestride_embed", "fps_embedding")
                        new_pl_sd[new_key] = new_pl_sd[k]
                        del new_pl_sd[k]
                model.load_state_dict(new_pl_sd, strict=full_strict)
        else:
            ## deepspeed
            new_pl_sd = OrderedDict()
            for key in state_dict["module"].keys():
                new_pl_sd[key[16:]] = state_dict["module"][key]
            model.load_state_dict(new_pl_sd, strict=full_strict)

        return model

    load_checkpoint(model, ckpt, full_strict=True)
    print(">>> model checkpoint loaded.")
    return model


def download_model():
    REPO_ID = "Doubiiu/DynamiCrafter"
    if not os.path.exists("./checkpoints/dynamicrafter_256_v1/"):
        os.makedirs("./checkpoints/dynamicrafter_256_v1/")
    local_file = os.path.join("./checkpoints/dynamicrafter_256_v1/model.ckpt")
    if not os.path.exists(local_file):
        hf_hub_download(repo_id=REPO_ID, filename="model.ckpt", local_dir="./checkpoints/dynamicrafter_256_v1/", local_dir_use_symlinks=False)

    ckpt_path = "checkpoints/dynamicrafter_256_v1/model.ckpt"
    config_file = "DynamiCrafter/configs/inference_256_v1.0.yaml"
    config = OmegaConf.load(config_file)
    model_config = config.pop("model", OmegaConf.create())
    model_config["params"]["unet_config"]["params"]["use_checkpoint"] = False
    model = instantiate_from_config(model_config)
    model = load_model_checkpoint(model, ckpt_path)
    model.eval()

    return model


model = download_model()
Note: switching to '26e665cd6c174234238d2ded661e2e56f875d360'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by switching back to a branch.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -c with the switch command. Example:

  git switch -c <new-branch-name>

Or undo this operation with:

  git switch -

Turn off this advice by setting config variable advice.detachedHead to false

HEAD is now at 26e665c add dataset
/opt/home/k8sworker/ci-ai/cibuilds/jobs/ov-notebook/jobs/OVNotebookOps/builds/835/archive/.workspace/scm/ov-notebook/.venv/lib/python3.8/site-packages/huggingface_hub/file_download.py:1204: UserWarning: local_dir_use_symlinks parameter is deprecated and will be ignored. The process to download files to a local folder has been updated and do not rely on symlinks anymore. You only need to pass a destination folder as`local_dir`.
For more details, check out https://huggingface.co/docs/huggingface_hub/main/en/guides/download#download-files-to-local-folder.
  warnings.warn(
model.ckpt:   0%|          | 0.00/10.4G [00:00<?, ?B/s]
AE working on z of shape (1, 4, 32, 32) = 4096 dimensions.
/opt/home/k8sworker/ci-ai/cibuilds/jobs/ov-notebook/jobs/OVNotebookOps/builds/835/archive/.workspace/scm/ov-notebook/.venv/lib/python3.8/site-packages/timm/models/layers/__init__.py:48: FutureWarning: Importing from timm.models.layers is deprecated, please import via timm.layers
  warnings.warn(f"Importing from {__name__} is deprecated, please import via timm.layers", FutureWarning)
2024-12-09 23:43:19.000762: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable TF_ENABLE_ONEDNN_OPTS=0.
2024-12-09 23:43:19.034903: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-12-09 23:43:19.630734: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
>>> model checkpoint loaded.

Convert the model to OpenVINO IR#

Let’s define the conversion function for PyTorch modules. We use the ov.convert_model function to obtain the OpenVINO Intermediate Representation object and the ov.save_model function to save it as an XML file.

import gc

import openvino as ov


def convert(model: torch.nn.Module, xml_path: str, example_input, input_shape=None):
    xml_path = Path(xml_path)
    if not xml_path.exists():
        xml_path.parent.mkdir(parents=True, exist_ok=True)
        with torch.no_grad():
            if not input_shape:
                converted_model = ov.convert_model(model, example_input=example_input)
            else:
                converted_model = ov.convert_model(model, example_input=example_input, input=input_shape)
        ov.save_model(converted_model, xml_path, compress_to_fp16=False)

        # cleanup memory
        torch._C._jit_clear_class_registry()
        torch.jit._recursive.concrete_type_store = torch.jit._recursive.ConcreteTypeStore()
        torch.jit._state._clear_class_state()
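
As a quick illustration of how this helper is used (a toy module and output path, not part of the pipeline):

# Toy usage of convert(); the module and path below are for illustration only.
toy_module = torch.nn.Sequential(torch.nn.Linear(8, 4), torch.nn.ReLU())
convert(toy_module, "models/toy_linear.xml", example_input=torch.rand(1, 8))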

Flowchart of DynamiCrafter proposed in the paper:

Figure (schema): during inference, the model generates animation clips from noise, conditioned on the input still image.

Let’s convert models from the pipeline one by one.

Convert CLIP text encoder#

from lvdm.modules.encoders.condition import FrozenOpenCLIPEmbedder

MODEL_DIR = Path("models")

COND_STAGE_MODEL_OV_PATH = MODEL_DIR / "cond_stage_model.xml"


class FrozenOpenCLIPEmbedderWrapper(FrozenOpenCLIPEmbedder):
    def forward(self, tokens):
        z = self.encode_with_transformer(tokens.to(self.device))
        return z


cond_stage_model = FrozenOpenCLIPEmbedderWrapper(device="cpu")

if not COND_STAGE_MODEL_OV_PATH.exists():
    convert(
        cond_stage_model,
        COND_STAGE_MODEL_OV_PATH,
        example_input=torch.ones([1, 77], dtype=torch.long),
    )

del cond_stage_model
gc.collect();
WARNING:tensorflow:Please fix your imports. Module tensorflow.python.training.tracking.base has been moved to tensorflow.python.trackable.base. The old module will be deleted in version 2.11.
WARNING:tensorflow:Please fix your imports. Module tensorflow.python.training.tracking.base has been moved to tensorflow.python.trackable.base. The old module will be deleted in version 2.11.

Convert CLIP image encoder#

The FrozenOpenCLIPImageEmbedderV2 model accepts images of various resolutions, so we convert it with dynamic spatial dimensions (a quick shape check follows the conversion cell below).

EMBEDDER_OV_PATH = MODEL_DIR / "embedder_ir.xml"


dummy_input = torch.rand([1, 3, 767, 767], dtype=torch.float32)

model.embedder.model.visual.input_patchnorm = None  # fix error: visual model has no attribute 'input_patchnorm'
if not EMBEDDER_OV_PATH.exists():
    convert(model.embedder, EMBEDDER_OV_PATH, example_input=dummy_input, input_shape=[1, 3, -1, -1])


del model.embedder
gc.collect();
/opt/home/k8sworker/ci-ai/cibuilds/jobs/ov-notebook/jobs/OVNotebookOps/builds/835/archive/.workspace/scm/ov-notebook/.venv/lib/python3.8/site-packages/kornia/utils/image.py:226: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if input.numel() == 0:
/opt/home/k8sworker/ci-ai/cibuilds/jobs/ov-notebook/jobs/OVNotebookOps/builds/835/archive/.workspace/scm/ov-notebook/.venv/lib/python3.8/site-packages/kornia/geometry/transform/affwarp.py:573: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if size == input_size:
/opt/home/k8sworker/ci-ai/cibuilds/jobs/ov-notebook/jobs/OVNotebookOps/builds/835/archive/.workspace/scm/ov-notebook/.venv/lib/python3.8/site-packages/kornia/geometry/transform/affwarp.py:579: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  antialias = antialias and (max(factors) > 1)
/opt/home/k8sworker/ci-ai/cibuilds/jobs/ov-notebook/jobs/OVNotebookOps/builds/835/archive/.workspace/scm/ov-notebook/.venv/lib/python3.8/site-packages/kornia/geometry/transform/affwarp.py:581: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if antialias:
/opt/home/k8sworker/ci-ai/cibuilds/jobs/ov-notebook/jobs/OVNotebookOps/builds/835/archive/.workspace/scm/ov-notebook/.venv/lib/python3.8/site-packages/kornia/geometry/transform/affwarp.py:584: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  sigmas = (max((factors[0] - 1.0) / 2.0, 0.001), max((factors[1] - 1.0) / 2.0, 0.001))
/opt/home/k8sworker/ci-ai/cibuilds/jobs/ov-notebook/jobs/OVNotebookOps/builds/835/archive/.workspace/scm/ov-notebook/.venv/lib/python3.8/site-packages/kornia/geometry/transform/affwarp.py:589: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  ks = int(max(2.0 * 2 * sigmas[0], 3)), int(max(2.0 * 2 * sigmas[1], 3))
/opt/home/k8sworker/ci-ai/cibuilds/jobs/ov-notebook/jobs/OVNotebookOps/builds/835/archive/.workspace/scm/ov-notebook/.venv/lib/python3.8/site-packages/kornia/geometry/transform/affwarp.py:589: TracerWarning: Converting a tensor to a Python integer might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  ks = int(max(2.0 * 2 * sigmas[0], 3)), int(max(2.0 * 2 * sigmas[1], 3))
/opt/home/k8sworker/ci-ai/cibuilds/jobs/ov-notebook/jobs/OVNotebookOps/builds/835/archive/.workspace/scm/ov-notebook/.venv/lib/python3.8/site-packages/kornia/filters/gaussian.py:55: TracerWarning: torch.tensor results are registered as constants in the trace. You can safely ignore this warning if you use this function to create tensors out of constant variables that would be the same every time you call this function. In any other case, this might cause the trace to be incorrect.
  sigma = tensor([sigma], device=input.device, dtype=input.dtype)
/opt/home/k8sworker/ci-ai/cibuilds/jobs/ov-notebook/jobs/OVNotebookOps/builds/835/archive/.workspace/scm/ov-notebook/.venv/lib/python3.8/site-packages/kornia/filters/gaussian.py:55: TracerWarning: Converting a tensor to a Python float might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  sigma = tensor([sigma], device=input.device, dtype=input.dtype)
/opt/home/k8sworker/ci-ai/cibuilds/jobs/ov-notebook/jobs/OVNotebookOps/builds/835/archive/.workspace/scm/ov-notebook/.venv/lib/python3.8/site-packages/kornia/core/check.py:78: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if x_shape_to_check[i] != dim:
/opt/home/k8sworker/ci-ai/cibuilds/jobs/ov-notebook/jobs/OVNotebookOps/builds/835/archive/.workspace/scm/ov-notebook/.venv/lib/python3.8/site-packages/kornia/filters/kernels.py:92: TracerWarning: torch.tensor results are registered as constants in the trace. You can safely ignore this warning if you use this function to create tensors out of constant variables that would be the same every time you call this function. In any other case, this might cause the trace to be incorrect.
  mean = tensor([[mean]], device=sigma.device, dtype=sigma.dtype)
/opt/home/k8sworker/ci-ai/cibuilds/jobs/ov-notebook/jobs/OVNotebookOps/builds/835/archive/.workspace/scm/ov-notebook/.venv/lib/python3.8/site-packages/kornia/enhance/normalize.py:101: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if len(mean.shape) == 0 or mean.shape[0] == 1:
/opt/home/k8sworker/ci-ai/cibuilds/jobs/ov-notebook/jobs/OVNotebookOps/builds/835/archive/.workspace/scm/ov-notebook/.venv/lib/python3.8/site-packages/kornia/enhance/normalize.py:103: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if len(std.shape) == 0 or std.shape[0] == 1:
/opt/home/k8sworker/ci-ai/cibuilds/jobs/ov-notebook/jobs/OVNotebookOps/builds/835/archive/.workspace/scm/ov-notebook/.venv/lib/python3.8/site-packages/kornia/enhance/normalize.py:107: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if mean.shape and mean.shape[0] != 1:
/opt/home/k8sworker/ci-ai/cibuilds/jobs/ov-notebook/jobs/OVNotebookOps/builds/835/archive/.workspace/scm/ov-notebook/.venv/lib/python3.8/site-packages/kornia/enhance/normalize.py:108: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if mean.shape[0] != data.shape[1] and mean.shape[:2] != data.shape[:2]:
/opt/home/k8sworker/ci-ai/cibuilds/jobs/ov-notebook/jobs/OVNotebookOps/builds/835/archive/.workspace/scm/ov-notebook/.venv/lib/python3.8/site-packages/kornia/enhance/normalize.py:112: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if std.shape and std.shape[0] != 1:
/opt/home/k8sworker/ci-ai/cibuilds/jobs/ov-notebook/jobs/OVNotebookOps/builds/835/archive/.workspace/scm/ov-notebook/.venv/lib/python3.8/site-packages/kornia/enhance/normalize.py:113: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if std.shape[0] != data.shape[1] and std.shape[:2] != data.shape[:2]:
/opt/home/k8sworker/ci-ai/cibuilds/jobs/ov-notebook/jobs/OVNotebookOps/builds/835/archive/.workspace/scm/ov-notebook/.venv/lib/python3.8/site-packages/kornia/enhance/normalize.py:116: TracerWarning: torch.as_tensor results are registered as constants in the trace. You can safely ignore this warning if you use this function to create tensors out of constant variables that would be the same every time you call this function. In any other case, this might cause the trace to be incorrect.
  mean = torch.as_tensor(mean, device=data.device, dtype=data.dtype)
/opt/home/k8sworker/ci-ai/cibuilds/jobs/ov-notebook/jobs/OVNotebookOps/builds/835/archive/.workspace/scm/ov-notebook/.venv/lib/python3.8/site-packages/kornia/enhance/normalize.py:117: TracerWarning: torch.as_tensor results are registered as constants in the trace. You can safely ignore this warning if you use this function to create tensors out of constant variables that would be the same every time you call this function. In any other case, this might cause the trace to be incorrect.
  std = torch.as_tensor(std, device=data.device, dtype=data.dtype)
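
As an optional sanity check (a small sketch, not required for the pipeline), the saved embedder IR should expose dynamic spatial dimensions, as requested via input_shape=[1, 3, -1, -1]:

# Optional check: height and width of the embedder IR input stay dynamic.
if EMBEDDER_OV_PATH.exists():
    print(ov.Core().read_model(EMBEDDER_OV_PATH).inputs[0].partial_shape)  # expected: [1,3,?,?]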

Convert AE encoder#

ENCODER_FIRST_STAGE_OV_PATH = MODEL_DIR / "encoder_first_stage_ir.xml"


dummy_input = torch.rand([1, 3, 256, 256], dtype=torch.float32)

if not ENCODER_FIRST_STAGE_OV_PATH.exists():
    convert(
        model.first_stage_model.encoder,
        ENCODER_FIRST_STAGE_OV_PATH,
        example_input=dummy_input,
    )

del model.first_stage_model.encoder
gc.collect();
/opt/home/k8sworker/ci-ai/cibuilds/jobs/ov-notebook/jobs/OVNotebookOps/builds/835/archive/.workspace/scm/ov-notebook/notebooks/dynamicrafter-animating-images/DynamiCrafter/lvdm/modules/networks/ae_modules.py:67: TracerWarning: Converting a tensor to a Python integer might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  w_ = w_ * (int(c)**(-0.5))

Convert Diffusion U-Net model#

MODEL_OV_PATH = MODEL_DIR / "model_ir.xml"


class ModelWrapper(torch.nn.Module):
    def __init__(self, diffusion_model):
        super().__init__()
        self.diffusion_model = diffusion_model

    def forward(self, xc, t, context=None, fs=None, temporal_length=None):
        outputs = self.diffusion_model(xc, t, context=context, fs=fs, temporal_length=temporal_length)
        return outputs


if not MODEL_OV_PATH.exists():
    convert(
        ModelWrapper(model.model.diffusion_model),
        MODEL_OV_PATH,
        example_input={
            "xc": torch.rand([1, 8, 16, 32, 32], dtype=torch.float32),
            "t": torch.tensor([1]),
            "context": torch.rand([1, 333, 1024], dtype=torch.float32),
            "fs": torch.tensor([3]),
            "temporal_length": torch.tensor([16]),
        },
    )

out_channels = model.model.diffusion_model.out_channels
del model.model.diffusion_model
gc.collect();
/opt/home/k8sworker/ci-ai/cibuilds/jobs/ov-notebook/jobs/OVNotebookOps/builds/835/archive/.workspace/scm/ov-notebook/notebooks/dynamicrafter-animating-images/DynamiCrafter/lvdm/modules/networks/openaimodel3d.py:556: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if l_context == 77 + t*16: ## !!! HARD CODE here
/opt/home/k8sworker/ci-ai/cibuilds/jobs/ov-notebook/jobs/OVNotebookOps/builds/835/archive/.workspace/scm/ov-notebook/notebooks/dynamicrafter-animating-images/DynamiCrafter/lvdm/modules/networks/openaimodel3d.py:205: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if batch_size:
/opt/home/k8sworker/ci-ai/cibuilds/jobs/ov-notebook/jobs/OVNotebookOps/builds/835/archive/.workspace/scm/ov-notebook/notebooks/dynamicrafter-animating-images/DynamiCrafter/lvdm/modules/networks/openaimodel3d.py:232: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if self.use_temporal_conv and batch_size:
/opt/home/k8sworker/ci-ai/cibuilds/jobs/ov-notebook/jobs/OVNotebookOps/builds/835/archive/.workspace/scm/ov-notebook/notebooks/dynamicrafter-animating-images/DynamiCrafter/lvdm/modules/networks/openaimodel3d.py:76: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  assert x.shape[1] == self.channels
/opt/home/k8sworker/ci-ai/cibuilds/jobs/ov-notebook/jobs/OVNotebookOps/builds/835/archive/.workspace/scm/ov-notebook/notebooks/dynamicrafter-animating-images/DynamiCrafter/lvdm/modules/networks/openaimodel3d.py:99: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  assert x.shape[1] == self.channels

Convert AE decoder#

The decoder receives a bfloat16 tensor, which NumPy does not support. To avoid problems during conversion, let’s replace the decode method so that it casts bfloat16 to float32.

import types


def decode(self, z, **kwargs):
    z = self.post_quant_conv(z)
    z = z.float()
    dec = self.decoder(z)
    return dec


model.first_stage_model.decode = types.MethodType(decode, model.first_stage_model)
DECODER_FIRST_STAGE_OV_PATH = MODEL_DIR / "decoder_first_stage_ir.xml"


dummy_input = torch.rand([16, 4, 32, 32], dtype=torch.float32)

if not DECODER_FIRST_STAGE_OV_PATH.exists():
    convert(
        model.first_stage_model.decoder,
        DECODER_FIRST_STAGE_OV_PATH,
        example_input=dummy_input,
    )

del model.first_stage_model.decoder
gc.collect();

Compiling models#

Select the device from the dropdown list to run inference using OpenVINO.

from notebook_utils import device_widget

core = ov.Core()
device = device_widget()

device
Dropdown(description='Device:', index=1, options=('CPU', 'AUTO'), value='AUTO')
cond_stage_model = core.read_model(COND_STAGE_MODEL_OV_PATH)
encoder_first_stage = core.read_model(ENCODER_FIRST_STAGE_OV_PATH)
embedder = core.read_model(EMBEDDER_OV_PATH)
model_ov = core.read_model(MODEL_OV_PATH)
decoder_first_stage = core.read_model(DECODER_FIRST_STAGE_OV_PATH)

compiled_cond_stage_model = core.compile_model(cond_stage_model, device.value)
compiled_encode_first_stage = core.compile_model(encoder_first_stage, device.value)
compiled_embedder = core.compile_model(embedder, device.value)
compiled_model = core.compile_model(model_ov, device.value)
compiled_decoder_first_stage = core.compile_model(decoder_first_stage, device.value)

Building the pipeline#

Let’s create callable wrapper classes for the compiled models to allow interaction with the original pipeline. Note that all wrapper classes return torch.Tensor objects instead of np.ndarray arrays.

from typing import Any
import open_clip


class CondStageModelWrapper(torch.nn.Module):
    def __init__(self, cond_stage_model):
        super().__init__()
        self.cond_stage_model = cond_stage_model

    def encode(self, tokens):
        if isinstance(tokens, list):
            tokens = open_clip.tokenize(tokens[0])
        outs = self.cond_stage_model(tokens)[0]

        return torch.from_numpy(outs)


class EncoderFirstStageModelWrapper(torch.nn.Module):
    def __init__(self, encode_first_stage):
        super().__init__()
        self.encode_first_stage = encode_first_stage

    def forward(self, x):
        outs = self.encode_first_stage(x)[0]

        return torch.from_numpy(outs)

    def __call__(self, *args: Any, **kwargs: Any) -> Any:
        return self.forward(*args, **kwargs)


class EmbedderWrapper(torch.nn.Module):
    def __init__(self, embedder):
        super().__init__()
        self.embedder = embedder

    def forward(self, x):
        outs = self.embedder(x)[0]

        return torch.from_numpy(outs)

    def __call__(self, *args: Any, **kwargs: Any) -> Any:
        return self.forward(*args, **kwargs)


class CModelWrapper(torch.nn.Module):
    def __init__(self, diffusion_model, out_channels):
        super().__init__()
        self.diffusion_model = diffusion_model
        self.out_channels = out_channels

    def forward(self, xc, t, context, fs, temporal_length):
        inputs = {
            "xc": xc,
            "t": t,
            "context": context,
            "fs": fs,
        }
        outs = self.diffusion_model(inputs)[0]

        return torch.from_numpy(outs)

    def __call__(self, *args: Any, **kwargs: Any) -> Any:
        return self.forward(*args, **kwargs)


class DecoderFirstStageModelWrapper(torch.nn.Module):
    def __init__(self, decoder_first_stage):
        super().__init__()
        self.decoder_first_stage = decoder_first_stage

    def forward(self, x):
        x = x.float()  # ensure float32 before passing the tensor to the compiled decoder
        outs = self.decoder_first_stage(x)[0]

        return torch.from_numpy(outs)

    def __call__(self, *args: Any, **kwargs: Any) -> Any:
        return self.forward(*args, **kwargs)

And insert the wrapper instances into the pipeline:

model.cond_stage_model = CondStageModelWrapper(compiled_cond_stage_model)
model.first_stage_model.encoder = EncoderFirstStageModelWrapper(compiled_encode_first_stage)
model.embedder = EmbedderWrapper(compiled_embedder)
model.model.diffusion_model = CModelWrapper(compiled_model, out_channels)
model.first_stage_model.decoder = DecoderFirstStageModelWrapper(compiled_decoder_first_stage)

Run OpenVINO pipeline inference#

from einops import repeat, rearrange
import torchvision.transforms as transforms


transform = transforms.Compose(
    [
        transforms.Resize(min((256, 256))),
        transforms.CenterCrop((256, 256)),
    ]
)


def get_latent_z(model, videos):
    b, c, t, h, w = videos.shape
    x = rearrange(videos, "b c t h w -> (b t) c h w")
    z = model.encode_first_stage(x)
    z = rearrange(z, "(b t) c h w -> b c t h w", b=b, t=t)
    return z


def process_input(model, prompt, image, transform=transform, fs=3):
    text_emb = model.get_learned_conditioning([prompt])

    # img cond
    img_tensor = torch.from_numpy(image).permute(2, 0, 1).float().to(model.device)
    img_tensor = (img_tensor / 255.0 - 0.5) * 2

    image_tensor_resized = transform(img_tensor)  # 3,h,w
    videos = image_tensor_resized.unsqueeze(0)  # bchw

    z = get_latent_z(model, videos.unsqueeze(2))  # bc,1,hw
    frames = model.temporal_length
    img_tensor_repeat = repeat(z, "b c t h w -> b c (repeat t) h w", repeat=frames)

    cond_images = model.embedder(img_tensor.unsqueeze(0))  # blc
    img_emb = model.image_proj_model(cond_images)
    imtext_cond = torch.cat([text_emb, img_emb], dim=1)

    fs = torch.tensor([fs], dtype=torch.long, device=model.device)
    cond = {"c_crossattn": [imtext_cond], "fs": fs, "c_concat": [img_tensor_repeat]}
    return cond
import time
from PIL import Image
import numpy as np
from lvdm.models.samplers.ddim import DDIMSampler
from pytorch_lightning import seed_everything
import torchvision


def register_buffer(self, name, attr):
    if isinstance(attr, torch.Tensor):
        if attr.device != torch.device("cpu"):
            attr = attr.to(torch.device("cpu"))
    setattr(self, name, attr)


def batch_ddim_sampling(model, cond, noise_shape, n_samples=1, ddim_steps=50, ddim_eta=1.0, cfg_scale=1.0, temporal_cfg_scale=None, **kwargs):
    ddim_sampler = DDIMSampler(model)
    uncond_type = model.uncond_type
    batch_size = noise_shape[0]
    fs = cond["fs"]
    del cond["fs"]
    if noise_shape[-1] == 32:
        timestep_spacing = "uniform"
        guidance_rescale = 0.0
    else:
        timestep_spacing = "uniform_trailing"
        guidance_rescale = 0.7
    # construct unconditional guidance
    if cfg_scale != 1.0:
        if uncond_type == "empty_seq":
            prompts = batch_size * [""]
            # prompts = N * T * [""]  ## if is_imgbatch=True
            uc_emb = model.get_learned_conditioning(prompts)
        elif uncond_type == "zero_embed":
            c_emb = cond["c_crossattn"][0] if isinstance(cond, dict) else cond
            uc_emb = torch.zeros_like(c_emb)

        # process image embedding token
        if hasattr(model, "embedder"):
            uc_img = torch.zeros(noise_shape[0], 3, 224, 224).to(model.device)
            ## img: b c h w >> b l c
            uc_img = model.embedder(uc_img)
            uc_img = model.image_proj_model(uc_img)
            uc_emb = torch.cat([uc_emb, uc_img], dim=1)

        if isinstance(cond, dict):
            uc = {key: cond[key] for key in cond.keys()}
            uc.update({"c_crossattn": [uc_emb]})
        else:
            uc = uc_emb
    else:
        uc = None

    x_T = None
    batch_variants = []

    for _ in range(n_samples):
        if ddim_sampler is not None:
            kwargs.update({"clean_cond": True})
            samples, _ = ddim_sampler.sample(
                S=ddim_steps,
                conditioning=cond,
                batch_size=noise_shape[0],
                shape=noise_shape[1:],
                verbose=False,
                unconditional_guidance_scale=cfg_scale,
                unconditional_conditioning=uc,
                eta=ddim_eta,
                temporal_length=noise_shape[2],
                conditional_guidance_scale_temporal=temporal_cfg_scale,
                x_T=x_T,
                fs=fs,
                timestep_spacing=timestep_spacing,
                guidance_rescale=guidance_rescale,
                **kwargs,
            )
        # reconstruct from latent to pixel space
        batch_images = model.decode_first_stage(samples)
        batch_variants.append(batch_images)
    # batch, <samples>, c, t, h, w
    batch_variants = torch.stack(batch_variants, dim=1)
    return batch_variants


# monkey patching to replace the original method 'register_buffer' that uses CUDA
DDIMSampler.register_buffer = types.MethodType(register_buffer, DDIMSampler)


def save_videos(batch_tensors, savedir, filenames, fps=10):
    # b,samples,c,t,h,w
    n_samples = batch_tensors.shape[1]
    for idx, vid_tensor in enumerate(batch_tensors):
        video = vid_tensor.detach().cpu()
        video = torch.clamp(video.float(), -1.0, 1.0)
        video = video.permute(2, 0, 1, 3, 4)  # t,n,c,h,w
        frame_grids = [torchvision.utils.make_grid(framesheet, nrow=int(n_samples)) for framesheet in video]  # [3, 1*h, n*w]
        grid = torch.stack(frame_grids, dim=0)  # stack in temporal dim [t, 3, n*h, w]
        grid = (grid + 1.0) / 2.0
        grid = (grid * 255).to(torch.uint8).permute(0, 2, 3, 1)
        savepath = os.path.join(savedir, f"{filenames[idx]}.mp4")
        torchvision.io.write_video(savepath, grid, fps=fps, video_codec="h264", options={"crf": "10"})


def get_image(image, prompt, steps=5, cfg_scale=7.5, eta=1.0, fs=3, seed=123, model=model, result_dir="results"):
    if not os.path.exists(result_dir):
        os.mkdir(result_dir)

    seed_everything(seed)

    # torch.cuda.empty_cache()
    print("start:", prompt, time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(time.time())))
    start = time.time()
    if steps > 60:
        steps = 60
    model = model.cpu()
    batch_size = 1
    channels = model.model.diffusion_model.out_channels
    frames = model.temporal_length
    h, w = 256 // 8, 256 // 8
    noise_shape = [batch_size, channels, frames, h, w]

    # text cond
    with torch.no_grad(), torch.cpu.amp.autocast():
        cond = process_input(model, prompt, image, transform, fs=fs)

        ## inference
        batch_samples = batch_ddim_sampling(model, cond, noise_shape, n_samples=1, ddim_steps=steps, ddim_eta=eta, cfg_scale=cfg_scale)
        ## b,samples,c,t,h,w
        prompt_str = prompt.replace("/", "_slash_") if "/" in prompt else prompt
        prompt_str = prompt_str.replace(" ", "_") if " " in prompt else prompt_str
        prompt_str = prompt_str[:40]
        if len(prompt_str) == 0:
            prompt_str = "empty_prompt"

    save_videos(batch_samples, result_dir, filenames=[prompt_str], fps=8)
    print(f"Saved in {prompt_str}.mp4. Time used: {(time.time() - start):.2f} seconds")

    return os.path.join(result_dir, f"{prompt_str}.mp4")
image_path = "DynamiCrafter/prompts/256/art.png"
prompt = "man fishing in a boat at sunset"
seed = 234
image = Image.open(image_path)
image = np.asarray(image)
result_dir = "results"
video_path = get_image(image, prompt, steps=20, seed=seed, model=model, result_dir=result_dir)
Seed set to 234
/tmp/ipykernel_2173449/2451984876.py:25: UserWarning: The given NumPy array is not writable, and PyTorch does not support non-writable tensors. This means writing to this tensor will result in undefined behavior. You may want to copy the array to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at ../torch/csrc/utils/tensor_numpy.cpp:206.)
  img_tensor = torch.from_numpy(image).permute(2, 0, 1).float().to(model.device)
start: man fishing in a boat at sunset 2024-12-09 23:46:36
Saved in man_fishing_in_a_boat_at_sunset.mp4. Time used: 194.37 seconds
from IPython.display import HTML

HTML(
    f"""
    <video alt="video" controls>
        <source src="{video_path}" type="video/mp4">
    </video>
"""
)

Quantization#

NNCF enables post-training quantization by adding quantization layers into the model graph and then using a subset of the training dataset to initialize the parameters of these additional quantization layers. Quantized operations are executed in INT8 instead of FP32/FP16, making model inference faster.

According to the DynamiCrafter structure, the denoising U-Net model runs in a cycle, repeating inference on each diffusion step, while the other parts of the pipeline are executed only once per generation. Now we will show you how to optimize the pipeline using NNCF to reduce memory and computation cost.
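
Before touching the real pipeline, here is a minimal, generic illustration of the NNCF post-training quantization API on a toy model (not part of DynamiCrafter; the actual quantization below follows the same pattern, using collected U-Net inputs as the calibration dataset):

# Generic NNCF PTQ sketch on a toy model, for illustration only.
import numpy as np
import nncf
import openvino as ov
import torch

toy_ov_model = ov.convert_model(torch.nn.Linear(8, 4), example_input=torch.rand(1, 8))
calibration = nncf.Dataset([np.random.rand(1, 8).astype(np.float32) for _ in range(10)])
toy_int8 = nncf.quantize(toy_ov_model, calibration)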

Please select below whether you would like to run quantization to improve model inference speed.

NOTE: Quantization is a time- and memory-consuming operation. Running the quantization code below may take some time.

from notebook_utils import quantization_widget

to_quantize = quantization_widget()

to_quantize
Checkbox(value=True, description='Quantization')

Let’s load the skip magic extension to skip quantization if to_quantize is not selected.

# Fetch `skip_kernel_extension` module
import requests

r = requests.get(
    url="https://raw.githubusercontent.com/openvinotoolkit/openvino_notebooks/latest/utils/skip_kernel_extension.py",
)
open("skip_kernel_extension.py", "w").write(r.text)

int8_model = None
MODEL_INT8_OV_PATH = MODEL_DIR / "model_ir_int8.xml"
COND_STAGE_MODEL_INT8_OV_PATH = MODEL_DIR / "cond_stage_model_int8.xml"
DECODER_FIRST_STAGE_INT8_OV_PATH = MODEL_DIR / "decoder_first_stage_ir_int8.xml"
ENCODER_FIRST_STAGE_INT8_OV_PATH = MODEL_DIR / "encoder_first_stage_ir_int8.xml"
EMBEDDER_INT8_OV_PATH = MODEL_DIR / "embedder_ir_int8.xml"

%load_ext skip_kernel_extension

Prepare calibration dataset#

We use a portion of the google-research-datasets/conceptual_captions dataset from Hugging Face as calibration data.

from io import BytesIO


def download_image(url):
    try:
        response = requests.get(url)
        response.raise_for_status()
        img = Image.open(BytesIO(response.content))
        # Convert the image to a NumPy array
        img_array = np.array(img)
        return img_array
    except Exception as err:
        print(f"Error occurred: {err}")
        return None

To collect intermediate model inputs for calibration, we customize CompiledModel.

%%skip not $to_quantize.value

import datasets
from tqdm.notebook import tqdm
import pickle


class CompiledModelDecorator(ov.CompiledModel):
    def __init__(self, compiled_model, keep_prob, data_cache = None):
        super().__init__(compiled_model)
        self.data_cache = data_cache if data_cache else []
        self.keep_prob = np.clip(keep_prob, 0, 1)

    def __call__(self, *args, **kwargs):
        if np.random.rand() <= self.keep_prob:
            self.data_cache.append(*args)
        return super().__call__(*args, **kwargs)

def collect_calibration_data(model, subset_size):
    calibration_dataset_filepath = Path("calibration_data")/f"{subset_size}.pkl"
    calibration_dataset_filepath.parent.mkdir(exist_ok=True, parents=True)
    if not calibration_dataset_filepath.exists():
        original_diffusion_model = model.model.diffusion_model.diffusion_model
        modified_model = CompiledModelDecorator(original_diffusion_model, keep_prob=1)
        model.model.diffusion_model = CModelWrapper(modified_model, model.model.diffusion_model.out_channels)

        dataset = datasets.load_dataset("google-research-datasets/conceptual_captions", trust_remote_code=True, split="train", streaming=True).shuffle(seed=42).take(subset_size)

        pbar = tqdm(total=subset_size)
        channels = model.model.diffusion_model.out_channels
        frames = model.temporal_length
        h, w = 256 // 8, 256 // 8
        noise_shape = [1, channels, frames, h, w]
        for batch in dataset:
            prompt = batch["caption"]
            image_path = batch["image_url"]
            image = download_image(image_path)
            if image is None:
                continue

            cond = process_input(model, prompt, image)
            batch_ddim_sampling(model, cond, noise_shape, n_samples=1, ddim_steps=20, ddim_eta=1.0, cfg_scale=7.5)

            collected_subset_size = len(model.model.diffusion_model.diffusion_model.data_cache)
            if collected_subset_size >= subset_size:
                pbar.update(subset_size - pbar.n)
                break
            pbar.update(collected_subset_size - pbar.n)

        calibration_dataset = model.model.diffusion_model.diffusion_model.data_cache[:subset_size]
        model.model.diffusion_model.diffusion_model = original_diffusion_model
        with open(calibration_dataset_filepath, 'wb') as f:
            pickle.dump(calibration_dataset, f)
    with open(calibration_dataset_filepath, 'rb') as f:
        calibration_dataset = pickle.load(f)
    return calibration_dataset
%%skip not $to_quantize.value


if not MODEL_INT8_OV_PATH.exists():
    subset_size = 300
    calibration_data = collect_calibration_data(model, subset_size=subset_size)
0%|          | 0/300 [00:00<?, ?it/s]
Error occurred: 403 Client Error: Forbidden for url: http://1.bp.blogspot.com/-c2pSbigvVm8/T9JqOXKIrsI/AAAAAAAACWs/ASXRA3Mbd0A/s1600/upsidedownnile.jpg
Error occurred: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))
Error occurred: 400 Client Error: Bad Request for url: http://i2.wp.com/www.monsoonbreeze123.com/wp-content/uploads/2016/04/edited-5.jpg?resize=781%2C512
Error occurred: 403 Client Error: Forbidden for url: http://i.dailymail.co.uk/i/pix/2017/07/26/16/42B41FE900000578-4732576-It_seems_that_Emma_and_her_cat_have_an_extremely_close_bond_one_-a-50_1501083105178.jpg
Error occurred: HTTPSConnectionPool(host='thewondrous.com', port=443): Max retries exceeded with url: /wp-content/uploads/2013/04/Egg-on-the-Head-of-Jack-Dog-600x799.jpg (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self signed certificate in certificate chain (_ssl.c:1131)')))
Error occurred: HTTPSConnectionPool(host='captainsmanorinn.com', port=443): Max retries exceeded with url: /wp-content/uploads/2014/02/Museum-on-the-greengardenweb-5.jpg (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self signed certificate in certificate chain (_ssl.c:1131)')))
Error occurred: 400 Client Error: Bad Request for url: https://media.gettyimages.com/photos/drew-henson-of-the-michigan-wolverines-looks-to-pass-in-a-game-on-picture-id111493145?s=612x612
Error occurred: 403 Client Error: Forbidden for url: http://www.bostonherald.com/sites/default/files/styles/featured_big/public/media/ap/2017/11/25/3cca13b05ad041ba8681174958cba941.jpg?itok=t31VhFHJ
Error occurred: HTTPSConnectionPool(host='i.pinimg.com', port=443): Max retries exceeded with url: /736x/b6/79/e8/b679e809c4a2995777852c9d77f93e6e--royal-wedding-cakes-royal-weddings.jpg (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self signed certificate in certificate chain (_ssl.c:1131)')))

Run Quantization#

Quantization of the first and last Convolution layers impacts the generation results. We recommend using IgnoredScope to keep these accuracy-sensitive layers in the original precision. The FastBiasCorrection algorithm is disabled due to the minimal accuracy improvement it brings to SD-like models and the increased quantization time.

%%skip not $to_quantize.value

import nncf


if MODEL_INT8_OV_PATH.exists():
    print("Model already quantized")
else:
    ov_model_ir = core.read_model(MODEL_OV_PATH)
    quantized_model = nncf.quantize(
        model=ov_model_ir,
        subset_size=subset_size,
        calibration_dataset=nncf.Dataset(calibration_data),
        model_type=nncf.ModelType.TRANSFORMER,
        ignored_scope=nncf.IgnoredScope(names=[
            "__module.diffusion_model.input_blocks.0.0/aten::_convolution/Convolution",
            "__module.diffusion_model.out.2/aten::_convolution/Convolution",
        ]),
        advanced_parameters=nncf.AdvancedQuantizationParameters(disable_bias_correction=True)
    )
    ov.save_model(quantized_model, MODEL_INT8_OV_PATH)
INFO:nncf:NNCF initialized successfully. Supported frameworks detected: torch, tensorflow, onnx, openvino
Output()

Run Weights Compression#

Quantizing the remaining components of the pipeline does not significantly improve inference performance but can lead to a substantial degradation of accuracy. Therefore, weight compression is applied to them only for footprint reduction.

%%skip not $to_quantize.value

def compress_model_weights(fp_model_path, int8_model_path):
    if not int8_model_path.exists():
        model = core.read_model(fp_model_path)
        compressed_model = nncf.compress_weights(model)
        ov.save_model(compressed_model, int8_model_path)


compress_model_weights(COND_STAGE_MODEL_OV_PATH, COND_STAGE_MODEL_INT8_OV_PATH)
compress_model_weights(DECODER_FIRST_STAGE_OV_PATH, DECODER_FIRST_STAGE_INT8_OV_PATH)
compress_model_weights(ENCODER_FIRST_STAGE_OV_PATH, ENCODER_FIRST_STAGE_INT8_OV_PATH)
compress_model_weights(EMBEDDER_OV_PATH, EMBEDDER_INT8_OV_PATH)
INFO:nncf:Statistics of the bitwidth distribution:
┍━━━━━━━━━━━━━━━━┯━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┯━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┑
│   Num bits (N) │ % all parameters (layers)   │ % ratio-defining parameters (layers)   │
┝━━━━━━━━━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┥
│              8 │ 100% (97 / 97)              │ 100% (97 / 97)                         │
┕━━━━━━━━━━━━━━━━┷━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┷━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┙
Output()
INFO:nncf:Statistics of the bitwidth distribution:
┍━━━━━━━━━━━━━━━━┯━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┯━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┑
│   Num bits (N) │ % all parameters (layers)   │ % ratio-defining parameters (layers)   │
┝━━━━━━━━━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┥
│              8 │ 100% (39 / 39)              │ 100% (39 / 39)                         │
┕━━━━━━━━━━━━━━━━┷━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┷━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┙
Output()
INFO:nncf:Statistics of the bitwidth distribution:
┍━━━━━━━━━━━━━━━━┯━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┯━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┑
│   Num bits (N) │ % all parameters (layers)   │ % ratio-defining parameters (layers)   │
┝━━━━━━━━━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┥
│              8 │ 100% (31 / 31)              │ 100% (31 / 31)                         │
┕━━━━━━━━━━━━━━━━┷━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┷━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┙
Output()
INFO:nncf:Statistics of the bitwidth distribution:
┍━━━━━━━━━━━━━━━━┯━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┯━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┑
│   Num bits (N) │ % all parameters (layers)   │ % ratio-defining parameters (layers)   │
┝━━━━━━━━━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┥
│              8 │ 100% (129 / 129)            │ 100% (129 / 129)                       │
┕━━━━━━━━━━━━━━━━┷━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┷━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┙
Output()

Let’s run the optimized pipeline:

%%skip not $to_quantize.value

compiled_cond_stage_model = core.compile_model(core.read_model(COND_STAGE_MODEL_INT8_OV_PATH), device.value)
compiled_encode_first_stage = core.compile_model(core.read_model(ENCODER_FIRST_STAGE_INT8_OV_PATH), device.value)
compiled_embedder = core.compile_model(core.read_model(EMBEDDER_INT8_OV_PATH), device.value)
compiled_model = core.compile_model(core.read_model(MODEL_INT8_OV_PATH), device.value)
compiled_decoder_first_stage = core.compile_model(core.read_model(DECODER_FIRST_STAGE_INT8_OV_PATH), device.value)
%%skip not $to_quantize.value

int8_model = download_model()
int8_model.first_stage_model.decode = types.MethodType(decode, int8_model.first_stage_model)
int8_model.embedder.model.visual.input_patchnorm = None  # fix error: visual model has no attribute 'input_patchnorm'

int8_model.cond_stage_model = CondStageModelWrapper(compiled_cond_stage_model)
int8_model.first_stage_model.encoder = EncoderFirstStageModelWrapper(compiled_encode_first_stage)
int8_model.embedder = EmbedderWrapper(compiled_embedder)
int8_model.model.diffusion_model = CModelWrapper(compiled_model, out_channels)
int8_model.first_stage_model.decoder = DecoderFirstStageModelWrapper(compiled_decoder_first_stage)
AE working on z of shape (1, 4, 32, 32) = 4096 dimensions.
>>> model checkpoint loaded.
%%skip not $to_quantize.value

image_path = "DynamiCrafter/prompts/256/art.png"
prompt = "man fishing in a boat at sunset"
seed = 234
image = Image.open(image_path)
image = np.asarray(image)

result_dir = "results_int8"
video_path = get_image(image, prompt, steps=20, seed=seed, model=int8_model, result_dir=result_dir)
Seed set to 234
start: man fishing in a boat at sunset 2024-12-10 01:17:34
Saved in man_fishing_in_a_boat_at_sunset.mp4. Time used: 98.80 seconds
%%skip not $to_quantize.value

from IPython.display import display, HTML

display(HTML(f"""
    <video alt="video" controls>
        <source src={video_path} type="video/mp4">
    </video>
"""))

Compare model file sizes#

%%skip not $to_quantize.value

fp32_model_paths = [COND_STAGE_MODEL_OV_PATH, DECODER_FIRST_STAGE_OV_PATH, ENCODER_FIRST_STAGE_OV_PATH, EMBEDDER_OV_PATH, MODEL_OV_PATH]
int8_model_paths = [COND_STAGE_MODEL_INT8_OV_PATH, DECODER_FIRST_STAGE_INT8_OV_PATH, ENCODER_FIRST_STAGE_INT8_OV_PATH, EMBEDDER_INT8_OV_PATH, MODEL_INT8_OV_PATH]

for fp32_path, int8_path in zip(fp32_model_paths, int8_model_paths):
    fp32_ir_model_size = fp32_path.with_suffix(".bin").stat().st_size
    int8_model_size = int8_path.with_suffix(".bin").stat().st_size
    print(f"{fp32_path.stem} compression rate: {fp32_ir_model_size / int8_model_size:.3f}")
cond_stage_model compression rate: 3.977
decoder_first_stage_ir compression rate: 3.987
encoder_first_stage_ir compression rate: 3.986
embedder_ir compression rate: 3.977
model_ir compression rate: 3.981

Compare inference time of the FP32 and INT8 models#

To measure the inference performance of the FP32 and INT8 models, we use the median inference time on a small validation subset.

NOTE: For the most accurate performance estimation, it is recommended to run benchmark_app in a terminal/command prompt after closing other applications.

%%skip not $to_quantize.value

import time


def calculate_inference_time(model, validation_size=3):
    calibration_dataset = datasets.load_dataset("jovianzm/Pexels-400k", split="train", streaming=True).take(validation_size)
    inference_time = []
    channels = model.model.diffusion_model.out_channels
    frames = model.temporal_length
    h, w = 256 // 8, 256 // 8
    noise_shape = [1, channels, frames, h, w]
    for batch in calibration_dataset:
        prompt = batch["title"]
        image_path = batch["thumbnail"]
        image = download_image(image_path)
        cond = process_input(model, prompt, image, transform, fs=3)

        start = time.perf_counter()
        _ = batch_ddim_sampling(model, cond, noise_shape, n_samples=1, ddim_steps=20, ddim_eta=1.0, cfg_scale=7.5)
        end = time.perf_counter()
        delta = end - start
        inference_time.append(delta)
    return np.median(inference_time)
%%skip not $to_quantize.value

fp_latency = calculate_inference_time(model)
print(f"FP32 latency: {fp_latency:.3f}")
int8_latency = calculate_inference_time(int8_model)
print(f"INT8 latency: {int8_latency:.3f}")
print(f"Performance speed up: {fp_latency / int8_latency:.3f}")
FP32 latency: 193.524
INT8 latency: 97.073
Performance speed up: 1.994

Interactive inference#

Please select below whether you would like to use the quantized models to launch the interactive demo.

from ipywidgets import widgets

quantized_models_present = int8_model is not None

use_quantized_models = widgets.Checkbox(
    value=quantized_models_present,
    description="Use quantized models",
    disabled=not quantized_models_present,
)

use_quantized_models
Checkbox(value=True, description='Use quantized models')
from functools import partial

demo_model = int8_model if use_quantized_models.value else model
get_image_fn = partial(get_image, model=demo_model)

if not Path("gradio_helper.py").exists():
    r = requests.get(
        url="https://raw.githubusercontent.com/openvinotoolkit/openvino_notebooks/latest/notebooks/dynamicrafter-animating-images/gradio_helper.py"
    )
    open("gradio_helper.py", "w").write(r.text)

from gradio_helper import make_demo

demo = make_demo(fn=get_image_fn)

try:
    demo.queue().launch(debug=False)
except Exception:
    demo.queue().launch(debug=False, share=True)
# if you are launching remotely, specify server_name and server_port
# demo.launch(server_name='your server name', server_port='server port in int')
# Read more in the docs: https://gradio.app/docs/
Running on local URL:  http://127.0.0.1:7860

To create a public link, set share=True in launch().