Virtual Try-On with CatVTON and OpenVINO#

This Jupyter notebook can be launched after a local installation only.

Virtual try-on methods based on diffusion models achieve realistic try-on effects, but they either replicate the backbone network as a ReferenceNet or rely on additional image encoders to process condition inputs, which results in high training and inference costs. The CatVTON authors question whether ReferenceNet and image encoders are necessary and rework the interaction between garment and person, proposing CatVTON, a simple and efficient virtual try-on diffusion model. It transfers in-shop or worn garments of arbitrary categories to target persons by simply concatenating the garment and person images along the spatial dimensions as inputs. The efficiency of the model is demonstrated in three aspects:

1. Lightweight network. Only the original diffusion modules are used, without additional network modules. The text encoder and the cross-attention layers for text injection in the backbone are removed, which reduces the parameter count by a further 167.02M.
2. Parameter-efficient training. The authors identified the try-on-relevant modules through experiments and achieved high-quality try-on effects by training only 49.57M parameters (∼5.51% of the backbone network’s parameters).
3. Simplified inference. CatVTON eliminates all unnecessary conditions and preprocessing steps, including pose estimation, human parsing, and text input. It requires only a garment reference, a target person image, and a mask for the virtual try-on process.

According to the authors, extensive experiments demonstrate that CatVTON achieves superior qualitative and quantitative results with fewer prerequisites and trainable parameters than baseline methods. Furthermore, CatVTON generalizes well to in-the-wild scenarios despite being trained on open-source datasets with only 73K samples.

Figure: teaser image from the CatVTON GitHub repository.

In this tutorial, we show how to convert and run this model using OpenVINO. An additional part demonstrates how to optimize the models with NNCF to speed up the pipeline.

Table of contents:

Prerequisites
Convert the model to OpenVINO IR
Compiling models
Optimize model using NNCF Post-Training Quantization API
Interactive inference

Installation Instructions#

This is a self-contained example that relies solely on its own code.

We recommend running the notebook in a virtual environment. You only need a Jupyter server to start. For details, please refer to the Installation Guide.

Prerequisites#

import platform

%pip install -q "openvino>=2024.4" "nncf>=2.13.0"
%pip install -q "torch>=2.1" "diffusers>=0.29.1" torchvision opencv_python --extra-index-url https://download.pytorch.org/whl/cpu
%pip install -q fvcore "pillow" "tqdm" "gradio>=4.36" "omegaconf==2.4.0.dev3" av pycocotools cloudpickle scipy accelerate "transformers>=4.27.3"

if platform.system() == "Darwin":
    %pip install -q "numpy<2.0.0"

import requests
from pathlib import Path

if not Path("notebook_utils.py").exists():
    r = requests.get(
        url="https://raw.githubusercontent.com/openvinotoolkit/openvino_notebooks/latest/utils/notebook_utils.py",
    )
    r.raise_for_status()  # fail early instead of saving an error page as a module
    Path("notebook_utils.py").write_text(r.text)

if not Path("cmd_helper.py").exists():
    r = requests.get(
        url="https://raw.githubusercontent.com/openvinotoolkit/openvino_notebooks/latest/utils/cmd_helper.py",
    )
    r.raise_for_status()  # fail early instead of saving an error page as a module
    Path("cmd_helper.py").write_text(r.text)

# Read more about telemetry collection at https://github.com/openvinotoolkit/openvino_notebooks?tab=readme-ov-file#-telemetry
from notebook_utils import collect_telemetry

collect_telemetry("catvton.ipynb")

from cmd_helper import clone_repo

clone_repo("https://github.com/Zheng-Chong/CatVTON.git", "3b795364a4d2f3b5adb365f39cdea376d20bc53c")

Convert the model to OpenVINO IR#

OpenVINO supports PyTorch models via conversion to OpenVINO Intermediate Representation (IR), using the OpenVINO model conversion API. The ov.convert_model function accepts the original PyTorch model instance and an example input for tracing, and returns an ov.Model representing this model in the OpenVINO framework. The converted model can be saved to disk with the ov.save_model function or loaded directly on a device with core.compile_model.
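The conversion flow described above can be sketched on a toy module. This is a minimal illustration, not the code from ov_catvton_helper.py: the helper names convert_save_compile and run_demo, the toy linear layer, and the file name toy_linear.xml are assumptions made for this example. Only ov.convert_model, ov.save_model, and core.compile_model come from the text.

```python
from pathlib import Path


def convert_save_compile(torch_model, example_input, xml_path, device="CPU"):
    """Trace a PyTorch module to OpenVINO IR, save it, and compile it."""
    # Imported lazily so the sketch can be read and imported without OpenVINO installed.
    import openvino as ov

    # Tracing uses the example input to record the model's operations.
    ov_model = ov.convert_model(torch_model, example_input=example_input)
    # Writes xml_path plus a .bin file with the weights next to it.
    ov.save_model(ov_model, xml_path)
    core = ov.Core()
    return core.compile_model(ov_model, device)


def run_demo(out_dir="."):
    """Convert a toy linear layer (a stand-in, not part of CatVTON)."""
    import torch

    toy = torch.nn.Linear(4, 2).eval()
    example = torch.randn(1, 4)
    xml_path = Path(out_dir) / "toy_linear.xml"  # hypothetical file name
    compiled = convert_save_compile(toy, example, xml_path)
    # A compiled model is callable; index 0 selects the first output as a NumPy array.
    result = compiled(example.numpy())[0]
    return xml_path, result.shape
```

Calling run_demo() in an environment with torch and openvino installed writes the IR files and runs one inference on CPU. The real pipeline models need model-specific example inputs, which the helper script takes care of.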

The ov_catvton_helper.py script contains helper functions for downloading and converting the models; check its contents if you are interested in the conversion details.

To download the checkpoints and load the models, call the helper function download_models. The convert_pipeline_models and convert_automasker_models functions then convert the pipeline and automasker models to OpenVINO format.

The original pipeline contains a VAE encoder, a VAE decoder, and a UNet.

Figure: CatVTON overview.

The automasker contains DensePose with the detectron2.GeneralizedRCNN model and SCHP (LIP and ATR versions).

from ov_catvton_helper import download_models, convert_pipeline_models, convert_automasker_models

pipeline, mask_processor, automasker = download_models()
vae_scaling_factor = pipeline.vae.config.scaling_factor
convert_pipeline_models(pipeline)
convert_automasker_models(automasker)

Compiling models#

Select a device from the dropdown list for running inference using OpenVINO.

import openvino as ov

from notebook_utils import device_widget

core = ov.Core()

device = device_widget()

device
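For reference, the devices that OpenVINO can see on the current machine can also be listed directly. The sketch below assumes that a dropdown such as the one above offers the detected devices plus the AUTO virtual device; list_inference_devices is a hypothetical name, and how device_widget builds its options is not shown in this notebook.

```python
def list_inference_devices(include_auto=True):
    """Return the device names OpenVINO detects, optionally followed by AUTO."""
    # Imported lazily so the sketch can be imported without OpenVINO installed.
    import openvino as ov

    devices = list(ov.Core().available_devices)  # e.g. CPU, GPU, NPU, depending on hardware
    if include_auto:
        devices.append("AUTO")  # lets OpenVINO choose the device at compile time
    return devices
```

Any returned name can be passed as the device argument when compiling a model.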

The get_compiled_pipeline and get_compiled_automasker functions defined in ov_catvton_helper.py provide a convenient way to get the pipeline and the automasker with compiled OpenVINO models while staying compatible with the original interface. They accept the original pipeline or automasker, the OpenVINO Core object, the inference device, and the paths to the converted models as arguments. Under the hood, callable wrapper classes are created for the compiled models, and instances of these wrappers are inserted into the pipeline so that the original pipeline code can interact with them. Note that all wrapper classes return torch.Tensors instead of np.arrays.
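A wrapper of this kind can be sketched in a few lines. The class below is a simplified illustration and not the implementation in ov_catvton_helper.py: the name TorchFriendlyCompiledModel is hypothetical, and it assumes a model whose first output is the only one needed.

```python
class TorchFriendlyCompiledModel:
    """Hypothetical wrapper: accepts torch tensors, returns a torch tensor."""

    def __init__(self, compiled_model):
        # Any callable that maps a list of NumPy arrays to indexable outputs,
        # for example an ov.CompiledModel.
        self.compiled_model = compiled_model

    def __call__(self, *tensors):
        # Imported lazily so the class can be defined without torch installed.
        import torch

        # OpenVINO works with NumPy arrays, so convert the inputs first.
        np_inputs = [t.detach().cpu().numpy() for t in tensors]
        first_output = self.compiled_model(np_inputs)[0]
        # Convert back so the surrounding PyTorch code keeps working unchanged.
        return torch.from_numpy(first_output)
```

Because the wrapper is callable like a torch module, an instance can be assigned in place of the original submodule, provided the calling code only relies on the call signature and the returned tensor.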

from ov_catvton_helper import (
    get_compiled_pipeline,
    get_compiled_automasker,
    VAE_ENCODER_PATH,
    VAE_DECODER_PATH,
    UNET_PATH,
    DENSEPOSE_PROCESSOR_PATH,
    SCHP_PROCESSOR_ATR,
    SCHP_PROCESSOR_LIP,
)

pipeline = get_compiled_pipeline(pipeline, core, device, VAE_ENCODER_PATH, VAE_DECODER_PATH, UNET_PATH, vae_scaling_factor)
automasker = get_compiled_automasker(automasker, core, device, DENSEPOSE_PROCESSOR_PATH, SCHP_PROCESSOR_ATR, SCHP_PROCESSOR_LIP)

Optimize model using NNCF Post-Training Quantization API#

NNCF provides a suite of advanced algorithms for optimizing neural network inference in OpenVINO with a minimal drop in accuracy. We will use 8-bit quantization in post-training mode (without a fine-tuning pipeline) for the UNet model, and 4-bit weight compression for the remaining models.
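To make the idea of 8-bit quantization concrete, the following sketch shows generic asymmetric (affine) quantization arithmetic on a short list of floats. It illustrates the principle only and is not NNCF's algorithm, which chooses ranges from calibration statistics and handles many more details; the function names are made up for this example.

```python
def quantize_uint8(values):
    """Map floats to integers in [0, 255] with a scale and a zero point."""
    # Extend the range so that 0.0 is exactly representable.
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    scale = (hi - lo) / 255
    if scale == 0.0:
        scale = 1.0  # all values are zero; any scale works
    zero_point = round(-lo / scale)
    quantized = [max(0, min(255, round(v / scale) + zero_point)) for v in values]
    return quantized, scale, zero_point


def dequantize(quantized, scale, zero_point):
    """Recover approximate floats from the 8-bit representation."""
    return [(q - zero_point) * scale for q in quantized]


values = [-1.0, 0.0, 0.5, 2.0]
quantized, scale, zero_point = quantize_uint8(values)
restored = dequantize(quantized, scale, zero_point)
```

Each restored value differs from the original by at most about half a quantization step (scale / 2), which is the rounding error that post-training quantization trades for smaller and faster integer arithmetic.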

NOTE: Quantization is a time- and memory-consuming operation. Running the quantization code below may take some time. You can disable it using the widget below:

from notebook_utils import quantization_widget

to_quantize = quantization_widget()

to_quantize

Let’s load the skip magic extension to skip quantization if to_quantize is not selected.

is_optimized_pipe_available = False

# Fetch skip_kernel_extension module
if not Path("skip_kernel_extension.py").exists():
    r = requests.get(
        url="https://raw.githubusercontent.com/openvinotoolkit/openvino_notebooks/latest/utils/skip_kernel_extension.py",
    )
    r.raise_for_status()  # fail early instead of saving an error page as a module
    Path("skip_kernel_extension.py").write_text(r.text)

%load_ext skip_kernel_extension

Run Post-Training Quantization#

The optimization process contains the following steps:

Create a Dataset for quantization.
Run nncf.quantize to obtain an optimized model.
Serialize an OpenVINO IR model, using the openvino.save_model function.

We use a couple of images from the original repository as calibration data.

%%skip not $to_quantize.value

from pathlib import Path
from catvton_quantization_helper import collect_calibration_data, UNET_INT8_PATH

dataset = [
    (
        Path("CatVTON/resource/demo/example/person/men/model_5.png"),
        Path("CatVTON/resource/demo/example/condition/upper/24083449_54173465_2048.jpg"),
    ),
    (
        Path("CatVTON/resource/demo/example/person/women/2-model_4.png"),
        Path("CatVTON/resource/demo/example/condition/overall/21744571_51588794_1000.jpg"),
    ),
]

if not UNET_INT8_PATH.exists():
    subset_size = 100
    calibration_data = collect_calibration_data(pipeline, automasker, mask_processor, dataset, subset_size)
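The collect_calibration_data helper lives in catvton_quantization_helper.py and is not shown here. One common way to gather calibration samples, assumed here rather than taken from the helper, is to wrap the model callable so that the inputs of each call are recorded while the pipeline runs on the example images. InputRecorder is a hypothetical name.

```python
class InputRecorder:
    """Hypothetical wrapper that stores a model's inputs while forwarding calls."""

    def __init__(self, model, storage, limit):
        self.model = model      # the callable being wrapped, e.g. a UNet wrapper
        self.storage = storage  # list that receives the recorded inputs
        self.limit = limit      # stop recording after this many samples

    def __call__(self, *args, **kwargs):
        if len(self.storage) < self.limit:
            self.storage.append((args, kwargs))
        # Always forward the call so the pipeline behaves as before.
        return self.model(*args, **kwargs)


# Toy usage: record at most 3 inputs of a stand-in "model".
samples = []
recorded_model = InputRecorder(lambda x: x + 1, samples, limit=3)
outputs = [recorded_model(i) for i in range(5)]
```

In a diffusion pipeline the UNet is called once per denoising step, so even a couple of image pairs can yield many recorded samples. A list collected this way could then be wrapped in nncf.Dataset, as the next cell does with calibration_data.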

%%skip not $to_quantize.value

import gc
import nncf
from ov_catvton_helper import UNET_PATH

# cleanup before quantization to free memory
del pipeline
del automasker
gc.collect()


if not UNET_INT8_PATH.exists():
    unet = core.read_model(UNET_PATH)
    quantized_model = nncf.quantize(
        model=unet,
        calibration_dataset=nncf.Dataset(calibration_data),
        subset_size=subset_size,
        model_type=nncf.ModelType.TRANSFORMER,
    )
    ov.save_model(quantized_model, UNET_INT8_PATH)
    del quantized_model
    gc.collect()

Run Weights Compression#

Quantizing the remaining components of the pipeline does not significantly improve inference performance but can lead to a substantial degradation of accuracy. Weight compression is applied to them instead, to reduce their footprint.
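The compress_models helper hides the details. As a rough sketch of what weight compression on a single IR file could look like with NNCF, the function below uses nncf.compress_weights with its default settings for the chosen mode. The function name compress_ir_weights and the choice of INT4_SYM are assumptions for illustration; the helper may use a different mode or extra parameters.

```python
def compress_ir_weights(src_xml, dst_xml, use_int4=True):
    """Read an IR file, compress its weights with NNCF, and save the result."""
    # Imported lazily so the sketch can be imported without the packages installed.
    import nncf
    import openvino as ov

    model = ov.Core().read_model(src_xml)
    if use_int4:
        mode = nncf.CompressWeightsMode.INT4_SYM   # 4-bit weights, assumed setting
    else:
        mode = nncf.CompressWeightsMode.INT8_ASYM  # 8-bit fallback
    # Only the weights are compressed; activations stay in floating point.
    compressed = nncf.compress_weights(model, mode=mode)
    ov.save_model(compressed, dst_xml)
    return dst_xml
```

Unlike nncf.quantize, this call needs no calibration dataset, which is why it is a cheap way to shrink the models that are not worth quantizing fully.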

%%skip not $to_quantize.value

from catvton_quantization_helper import compress_models

compress_models(core)

is_optimized_pipe_available = True

Compare model file sizes#

%%skip not $to_quantize.value
from catvton_quantization_helper import compare_models_size

compare_models_size()

vae_encoder compression rate: 2.011
vae_decoder compression rate: 2.007
unet compression rate: 1.995
densepose_processor compression rate: 2.019
schp_processor_atr compression rate: 1.993
schp_processor_lip compression rate: 1.993
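A compression rate like the ones printed above can be computed as the ratio of weight-file sizes. The sketch below assumes that the rate is the size of the original .bin file divided by the size of the optimized one; compare_models_size may compute it differently, and compression_rate is a hypothetical name.

```python
from pathlib import Path


def compression_rate(original_xml, optimized_xml):
    """Ratio of the weight-file sizes (.bin next to each .xml) of two IR models."""
    original_size = Path(original_xml).with_suffix(".bin").stat().st_size
    optimized_size = Path(optimized_xml).with_suffix(".bin").stat().st_size
    return original_size / optimized_size
```

Under this definition, a rate of about 2 means the optimized weights take roughly half the disk space of the original ones.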

Interactive inference#

Please select below whether you would like to use the quantized models to launch the interactive demo.

from ov_catvton_helper import get_pipeline_selection_option

use_quantized_models = get_pipeline_selection_option(is_optimized_pipe_available)

use_quantized_models

from gradio_helper import make_demo

from catvton_quantization_helper import (
    VAE_ENCODER_INT4_PATH,
    VAE_DECODER_INT4_PATH,
    DENSEPOSE_PROCESSOR_INT4_PATH,
    SCHP_PROCESSOR_ATR_INT4,
    SCHP_PROCESSOR_LIP_INT4,
    UNET_INT8_PATH,
)

pipeline, mask_processor, automasker = download_models()
if use_quantized_models.value:
    pipeline = get_compiled_pipeline(pipeline, core, device, VAE_ENCODER_INT4_PATH, VAE_DECODER_INT4_PATH, UNET_INT8_PATH, vae_scaling_factor)
    automasker = get_compiled_automasker(automasker, core, device, DENSEPOSE_PROCESSOR_INT4_PATH, SCHP_PROCESSOR_ATR_INT4, SCHP_PROCESSOR_LIP_INT4)
else:
    pipeline = get_compiled_pipeline(pipeline, core, device, VAE_ENCODER_PATH, VAE_DECODER_PATH, UNET_PATH, vae_scaling_factor)
    automasker = get_compiled_automasker(automasker, core, device, DENSEPOSE_PROCESSOR_PATH, SCHP_PROCESSOR_ATR, SCHP_PROCESSOR_LIP)

output_dir = "output"
demo = make_demo(pipeline, mask_processor, automasker, output_dir)
try:
    demo.launch(debug=True)
except Exception:
    demo.launch(debug=True, share=True)