Wav2Lip: Accurately Lip-syncing Videos and OpenVINO#

This Jupyter notebook can be launched after a local installation only.

Lip-sync technologies are widely used in digital human use cases, where they enhance the user experience in dialog scenarios.

Wav2Lip is an approach to generate accurate 2D lip-synced videos in the wild with only one video and an audio clip. Wav2Lip leverages an accurate lip-sync “expert” model and consecutive face frames for accurate, natural lip motion generation.

In this notebook, we show how to enable and optimize the Wav2Lip pipeline with OpenVINO. This is an adaptation of the blog article Enable 2D Lip Sync Wav2Lip Pipeline with OpenVINO Runtime.

Here is an overview of the Wav2Lip pipeline:

Table of contents:

Prerequisites
Convert the model to OpenVINO IR
Compiling models and prepare pipeline
Interactive inference

Installation Instructions#

This is a self-contained example that relies solely on its own code.

We recommend running the notebook in a virtual environment. You only need a Jupyter server to start. For details, please refer to the Installation Guide.
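To confirm that the kernel really runs inside a virtual environment before installing packages, a small check like the one below can help. This is a minimal sketch: `in_virtual_environment` is a hypothetical helper name, and the check covers environments created with `venv` or `virtualenv` only (a conda environment is not detected this way).

```python
import sys


def in_virtual_environment() -> bool:
    """Return True when the interpreter runs inside a venv/virtualenv environment."""
    # Inside a virtual environment sys.prefix points at the environment,
    # while sys.base_prefix still points at the base Python installation.
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


print("virtual environment active:", in_virtual_environment())
```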

Prerequisites#

import requests
from pathlib import Path


# Download the shared utility modules from the openvino_notebooks repository.
for util_file in ["notebook_utils.py", "pip_helper.py", "cmd_helper.py"]:
    r = requests.get(
        url=f"https://raw.githubusercontent.com/openvinotoolkit/openvino_notebooks/latest/utils/{util_file}",
    )
    r.raise_for_status()  # fail early rather than saving an HTTP error page as a module
    with open(util_file, "w") as f:
        f.write(r.text)

from pip_helper import pip_install

pip_install("-q", "openvino>=2024.4.0")
pip_install(
    "-q",
    "huggingface_hub",
    "torch>=2.1",
    "gradio>=4.19",
    "librosa==0.9.2",
    "opencv-contrib-python",
    "opencv-python",
    "tqdm",
    "numba",
    "numpy<2",
    "--extra-index-url",
    "https://download.pytorch.org/whl/cpu",
)

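# Download the notebook-specific helper modules unless they are already present.
helpers = ["gradio_helper.py", "ov_inference.py", "ov_wav2lip_helper.py"]
for helper_file in helpers:
    if not Path(helper_file).exists():
        r = requests.get(url=f"https://raw.githubusercontent.com/openvinotoolkit/openvino_notebooks/latest/notebooks/wav2lip/{helper_file}")
        r.raise_for_status()  # do not save an HTTP error page as a helper module
        with open(helper_file, "w") as f:
            f.write(r.text)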

from cmd_helper import clone_repo

clone_repo("https://github.com/Rudrabha/Wav2Lip.git")

PosixPath('Wav2Lip')
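The cell above downloads each helper file only when it is not already on disk. That "fetch if missing" pattern can be sketched on its own as follows. The function name `save_text_if_missing` and the `fetch` callback are hypothetical and not part of the notebook helpers; the callback stands in for a network call so the sketch runs offline.

```python
from pathlib import Path
from typing import Callable


def save_text_if_missing(path: Path, fetch: Callable[[], str]) -> bool:
    """Write fetch() to path unless the file already exists.

    Returns True when the file was written, False when it was already present.
    """
    if path.exists():
        return False  # keep the local copy and skip the download
    path.parent.mkdir(parents=True, exist_ok=True)
    # write_text opens, writes, and closes the file in one call.
    path.write_text(fetch(), encoding="utf-8")
    return True
```

With `requests`, the callback could be something like `lambda: requests.get(url).text`, so the request is only issued when the file is actually missing.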

Download example files.

from notebook_utils import download_file


download_file("https://github.com/sammysun0711/openvino_aigc_samples/blob/main/Wav2Lip/data_audio_sun_5s.wav?raw=true")
download_file("https://github.com/sammysun0711/openvino_aigc_samples/blob/main/Wav2Lip/data_video_sun_5s.mp4?raw=true")


PosixPath('/opt/home/k8sworker/ci-ai/cibuilds/jobs/ov-notebook/jobs/OVNotebookOps/builds/835/archive/.workspace/scm/ov-notebook/notebooks/wav2lip/data_video_sun_5s.mp4')
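Note that the returned path ends in data_video_sun_5s.mp4 even though the URL ends in ?raw=true. One way to derive such a file name from a URL is sketched below. This is an illustration only: `filename_from_url` is a hypothetical helper, and we do not claim that download_file in notebook_utils is implemented this way.

```python
from pathlib import PurePosixPath
from urllib.parse import unquote, urlparse


def filename_from_url(url: str) -> str:
    """Return the last path component of a URL, ignoring query and fragment."""
    path = urlparse(url).path  # drops "?raw=true" and any "#fragment"
    return PurePosixPath(unquote(path)).name


url = "https://github.com/sammysun0711/openvino_aigc_samples/blob/main/Wav2Lip/data_audio_sun_5s.wav?raw=true"
print(filename_from_url(url))  # data_audio_sun_5s.wav
```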

Convert the model to OpenVINO IR#

You don’t need to download checkpoints or load models manually; just call the helper function download_and_convert_models. It takes care of both steps and converts both models to the OpenVINO format.

from ov_wav2lip_helper import download_and_convert_models

OV_FACE_DETECTION_MODEL_PATH = Path("models/face_detection.xml")
OV_WAV2LIP_MODEL_PATH = Path("models/wav2lip.xml")

download_and_convert_models(OV_FACE_DETECTION_MODEL_PATH, OV_WAV2LIP_MODEL_PATH)

Convert Face Detection Model ...


/opt/home/k8sworker/ci-ai/cibuilds/jobs/ov-notebook/jobs/OVNotebookOps/builds/835/archive/.workspace/scm/ov-notebook/notebooks/wav2lip/ov_wav2lip_helper.py:43: FutureWarning: You are using torch.load with weights_only=False (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See pytorch/pytorch for more details). In a future release, the default value for weights_only will be flipped to True. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via torch.serialization.add_safe_globals. We recommend you start setting weights_only=True for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  model_weights = torch.load(path_to_detector)

Converted face detection OpenVINO model:  models/face_detection.xml
Convert Wav2Lip Model ...


Load checkpoint from: checkpoints/Wav2lip/wav2lip.pth

/opt/home/k8sworker/ci-ai/cibuilds/jobs/ov-notebook/jobs/OVNotebookOps/builds/835/archive/.workspace/scm/ov-notebook/notebooks/wav2lip/ov_wav2lip_helper.py:16: FutureWarning: You are using torch.load with weights_only=False (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See pytorch/pytorch for more details). In a future release, the default value for weights_only will be flipped to True. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via torch.serialization.add_safe_globals. We recommend you start setting weights_only=True for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  checkpoint = torch.load(checkpoint_path, map_location=lambda storage, loc: storage)

Converted face detection OpenVINO model:  models/wav2lip.xml
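An OpenVINO IR model is stored as a pair of files: an .xml file describing the topology and a .bin file with the weights, both sharing the same base name. Before rerunning the conversion, it can be useful to check whether both files already exist. The sketch below assumes that naming rule; `ir_files` and `missing_ir_files` are hypothetical helper names, not functions from ov_wav2lip_helper.

```python
from pathlib import Path
from typing import List, Tuple


def ir_files(xml_path: Path) -> Tuple[Path, Path]:
    """Return the (.xml, .bin) pair that makes up one OpenVINO IR model."""
    return xml_path, xml_path.with_suffix(".bin")


def missing_ir_files(xml_path: Path) -> List[Path]:
    """List the IR files of a model that are not on disk yet."""
    return [p for p in ir_files(xml_path) if not p.exists()]


for model_xml in (Path("models/face_detection.xml"), Path("models/wav2lip.xml")):
    print(model_xml, "missing:", [str(p) for p in missing_ir_files(model_xml)])
```

An empty list for both models suggests that the conversion output is complete and can be reused.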

Compiling models and prepare pipeline#

Select the device from the dropdown list to run inference with OpenVINO.

from notebook_utils import device_widget

device = device_widget()

device

Dropdown(description='Device:', index=1, options=('CPU', 'AUTO'), value='AUTO')
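The dropdown above lists the detected devices plus AUTO, with AUTO preselected. A rough sketch of how such an option list could be built is shown below. The helper names are hypothetical and this is not the implementation of device_widget; the only OpenVINO call used is the `available_devices` property of `openvino.Core`, and the CPU-only fallback is an assumption for machines where OpenVINO is not installed.

```python
from typing import List, Sequence


def device_options(available: Sequence[str], default: str = "AUTO") -> List[str]:
    """Build a dropdown option list: detected devices plus the AUTO plugin."""
    options = list(available)
    if default not in options:
        options.append(default)
    return options


def query_available_devices() -> List[str]:
    """Ask OpenVINO which devices it can see."""
    try:
        import openvino as ov
    except ImportError:
        return ["CPU"]  # assumption for illustration when OpenVINO is absent
    return list(ov.Core().available_devices)


print(device_options(["CPU"]))  # ['CPU', 'AUTO']
```

Calling `device_options(query_available_devices())` would produce the list for the current machine.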

ov_inference.py is an adaptation of the original pipeline, which has only a command-line interface. The ov_inference function runs inference through the Python API using the converted OpenVINO models.

import os

from ov_inference import ov_inference


# Create the output directory if it does not exist yet.
os.makedirs("results", exist_ok=True)

ov_inference(
    "data_video_sun_5s.mp4",
    "data_audio_sun_5s.wav",
    face_detection_path=OV_FACE_DETECTION_MODEL_PATH,
    wav2lip_path=OV_WAV2LIP_MODEL_PATH,
    inference_device=device.value,
    outfile="results/result_voice.mp4",
)

Reading video frames...
Number of frames available for inference: 125

/opt/home/k8sworker/ci-ai/cibuilds/jobs/ov-notebook/jobs/OVNotebookOps/builds/835/archive/.workspace/scm/ov-notebook/notebooks/wav2lip/Wav2Lip/audio.py:100: FutureWarning: Pass sr=16000, n_fft=800 as keyword args. From version 0.10 passing these as positional arguments will result in an error
  return librosa.filters.mel(hp.sample_rate, hp.n_fft, n_mels=hp.num_mels,

(80, 405)
Length of mel chunks: 123

0%|          | 0/1 [00:00<?, ?it/s]

face_detect_ov images[0].shape:  (768, 576, 3)

100%|██████████| 8/8 [00:20<00:00,  2.55s/it]

Model loaded

100%|██████████| 1/1 [00:22<00:00, 22.66s/it]
ffmpeg version 4.2.7-0ubuntu0.1 Copyright (c) 2000-2022 the FFmpeg developers
  built with gcc 9 (Ubuntu 9.4.0-1ubuntu1~20.04.1)
  configuration: --prefix=/usr --extra-version=0ubuntu0.1 --toolchain=hardened --libdir=/usr/lib/x86_64-linux-gnu --incdir=/usr/include/x86_64-linux-gnu --arch=amd64 --enable-gpl --disable-stripping --enable-avresample --disable-filter=resample --enable-avisynth --enable-gnutls --enable-ladspa --enable-libaom --enable-libass --enable-libbluray --enable-libbs2b --enable-libcaca --enable-libcdio --enable-libcodec2 --enable-libflite --enable-libfontconfig --enable-libfreetype --enable-libfribidi --enable-libgme --enable-libgsm --enable-libjack --enable-libmp3lame --enable-libmysofa --enable-libopenjpeg --enable-libopenmpt --enable-libopus --enable-libpulse --enable-librsvg --enable-librubberband --enable-libshine --enable-libsnappy --enable-libsoxr --enable-libspeex --enable-libssh --enable-libtheora --enable-libtwolame --enable-libvidstab --enable-libvorbis --enable-libvpx --enable-libwavpack --enable-libwebp --enable-libx265 --enable-libxml2 --enable-libxvid --enable-libzmq --enable-libzvbi --enable-lv2 --enable-omx --enable-openal --enable-opencl --enable-opengl --enable-sdl2 --enable-libdc1394 --enable-libdrm --enable-libiec61883 --enable-nvenc --enable-chromaprint --enable-frei0r --enable-libx264 --enable-shared
  libavutil      56. 31.100 / 56. 31.100
  libavcodec     58. 54.100 / 58. 54.100
  libavformat    58. 29.100 / 58. 29.100
  libavdevice    58.  8.100 / 58.  8.100
  libavfilter     7. 57.100 /  7. 57.100
  libavresample   4.  0.  0 /  4.  0.  0
  libswscale      5.  5.100 /  5.  5.100
  libswresample   3.  5.100 /  3.  5.100
  libpostproc    55.  5.100 / 55.  5.100
Guessed Channel Layout for Input Stream #0.0 : mono
Input #0, wav, from 'data_audio_sun_5s.wav':
  Metadata:
    encoder         : Lavf58.20.100
  Duration: 00:00:05.06, bitrate: 705 kb/s
    Stream #0:0: Audio: pcm_s16le ([1][0][0][0] / 0x0001), 44100 Hz, mono, s16, 705 kb/s
Input #1, avi, from 'Wav2Lip/temp/result.avi':
  Metadata:
    encoder         : Lavf59.27.100
  Duration: 00:00:04.92, start: 0.000000, bitrate: 1891 kb/s
    Stream #1:0: Video: mpeg4 (Simple Profile) (DIVX / 0x58564944), yuv420p, 576x768 [SAR 1:1 DAR 3:4], 1893 kb/s, 25 fps, 25 tbr, 25 tbn, 25 tbc
Stream mapping:
  Stream #1:0 -> #0:0 (mpeg4 (native) -> h264 (libx264))
  Stream #0:0 -> #0:1 (pcm_s16le (native) -> aac (native))
Press [q] to stop, [?] for help
[libx264 @ 0x55ec6513e840] -qscale is ignored, -crf is recommended.
[libx264 @ 0x55ec6513e840] using SAR=1/1
[libx264 @ 0x55ec6513e840] using cpu capabilities: MMX2 SSE2Fast SSSE3 SSE4.2 AVX FMA3 BMI2 AVX2 AVX512
[libx264 @ 0x55ec6513e840] profile High, level 3.1
[libx264 @ 0x55ec6513e840] 264 - core 155 r2917 0a84d98 - H.264/MPEG-4 AVC codec - Copyleft 2003-2018 - http://www.videolan.org/x264.html - options: cabac=1 ref=3 deblock=1:0:0 analyse=0x3:0x113 me=hex subme=7 psy=1 psy_rd=1.00:0.00 mixed_ref=1 me_range=16 chroma_me=1 trellis=1 8x8dct=1 cqm=0 deadzone=21,11 fast_pskip=1 chroma_qp_offset=-2 threads=24 lookahead_threads=4 sliced_threads=0 nr=0 decimate=1 interlaced=0 bluray_compat=0 constrained_intra=0 bframes=3 b_pyramid=2 b_adapt=1 b_bias=0 direct=1 weightb=1 open_gop=0 weightp=2 keyint=250 keyint_min=25 scenecut=40 intra_refresh=0 rc_lookahead=40 rc=crf mbtree=1 crf=23.0 qcomp=0.60 qpmin=0 qpmax=69 qpstep=4 ip_ratio=1.40 aq=1:1.00
Output #0, mp4, to 'results/result_voice.mp4':
  Metadata:
    encoder         : Lavf58.29.100
    Stream #0:0: Video: h264 (libx264) (avc1 / 0x31637661), yuv420p(progressive), 576x768 [SAR 1:1 DAR 3:4], q=-1--1, 25 fps, 12800 tbn, 25 tbc
    Metadata:
      encoder         : Lavc58.54.100 libx264
    Side data:
      cpb: bitrate max/min/avg: 0/0/0 buffer size: 0 vbv_delay: -1
    Stream #0:1: Audio: aac (LC) (mp4a / 0x6134706D), 44100 Hz, mono, fltp, 69 kb/s
    Metadata:
      encoder         : Lavc58.54.100 aac
frame=  123 fps=0.0 q=-1.0 Lsize=     621kB time=00:00:05.06 bitrate=1005.8kbits/s speed=10.6x
video:573kB audio:43kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: 0.827166%
[libx264 @ 0x55ec6513e840] frame I:1     Avg QP:22.24  size: 31028
[libx264 @ 0x55ec6513e840] frame P:75    Avg QP:22.01  size:  6954
[libx264 @ 0x55ec6513e840] frame B:47    Avg QP:25.58  size:   718
[libx264 @ 0x55ec6513e840] consecutive B-frames: 38.2% 27.6% 14.6% 19.5%
[libx264 @ 0x55ec6513e840] mb I  I16..4: 14.0% 83.9%  2.1%
[libx264 @ 0x55ec6513e840] mb P  I16..4:  1.3%  3.3%  0.1%  P16..4: 37.8%  8.2%  6.4%  0.0%  0.0%    skip:43.0%
[libx264 @ 0x55ec6513e840] mb B  I16..4:  0.2%  0.7%  0.0%  B16..8: 27.9%  0.4%  0.1%  direct: 0.2%  skip:70.6%  L0:43.9% L1:54.2% BI: 1.9%
[libx264 @ 0x55ec6513e840] 8x8 transform intra:73.3% inter:77.1%
[libx264 @ 0x55ec6513e840] coded y,uvDC,uvAC intra: 56.9% 72.4% 8.1% inter: 11.4% 13.0% 0.2%
[libx264 @ 0x55ec6513e840] i16 v,h,dc,p: 20% 23%  9% 48%
[libx264 @ 0x55ec6513e840] i8 v,h,dc,ddl,ddr,vr,hd,vl,hu: 25% 23% 36%  3%  3%  2%  2%  3%  3%
[libx264 @ 0x55ec6513e840] i4 v,h,dc,ddl,ddr,vr,hd,vl,hu: 39% 14% 14%  4%  6%  7%  4%  9%  3%
[libx264 @ 0x55ec6513e840] i8c dc,h,v,p: 42% 25% 29%  4%
[libx264 @ 0x55ec6513e840] Weighted P-Frames: Y:0.0% UV:0.0%
[libx264 @ 0x55ec6513e840] ref P L0: 74.2% 10.4% 11.1%  4.3%
[libx264 @ 0x55ec6513e840] ref B L0: 86.1% 11.2%  2.8%
[libx264 @ 0x55ec6513e840] ref B L1: 98.3%  1.7%
[libx264 @ 0x55ec6513e840] kb/s:953.36
[aac @ 0x55ec65140140] Qavg: 121.673

'results/result_voice.mp4'
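The numbers in the log above can be cross-checked with a little arithmetic. The sketch below mirrors the sliding-window chunking of the upstream Wav2Lip inference script as we understand it. The window of 16 mel frames, the rate of 80 mel frames per second (16000 Hz audio with a hop of 200 samples), and the batch sizes of 16 and 128 are assumptions taken from the upstream defaults, not values confirmed by this notebook. `count_mel_chunks` is a hypothetical helper name.

```python
import math


def count_mel_chunks(num_mel_frames: int, fps: float, mel_step_size: int = 16,
                     mel_frames_per_second: float = 80.0) -> int:
    """Count the sliding mel windows produced for a clip, one window per video frame."""
    step = mel_frames_per_second / fps  # mel frames advanced per video frame
    count = 0
    i = 0
    while True:
        start = int(i * step)
        count += 1
        if start + mel_step_size > num_mel_frames:
            # The last window is clamped to the end of the spectrogram, then the loop stops.
            break
        i += 1
    return count


# The log reports a mel spectrogram of shape (80, 405) and a 25 fps video.
chunks = count_mel_chunks(num_mel_frames=405, fps=25)
print(chunks)  # 123
print(math.ceil(chunks / 16), math.ceil(chunks / 128))  # 8 1
```

Under these assumptions the result agrees with "Length of mel chunks: 123", with the progress bars of 8 and 1 steps, and with the 123 frames that ffmpeg reports for the output video.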

To compare the result with the inputs, first play the original video and the audio clip used to drive the lip motion:

from IPython.display import Video, Audio

Video("data_video_sun_5s.mp4", embed=True)

Audio("data_audio_sun_5s.wav")

The generated video:

Video("results/result_voice.mp4", embed=True)

Interactive inference#

from gradio_helper import make_demo


demo = make_demo(fn=ov_inference)

try:
    demo.queue().launch(debug=False)
except Exception:
    demo.queue().launch(debug=False, share=True)
# If you are launching remotely, specify server_name and server_port:
# demo.launch(server_name='your server name', server_port='server port in int')
# Read more in the docs: https://gradio.app/docs/

Running on local URL:  http://127.0.0.1:7860

To create a public link, set share=True in launch().
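When launching remotely, the chosen server_port must not already be in use. If port 7860 is taken, one option is to ask the operating system for a free port first. The sketch below uses only the standard library; `find_free_port` is a hypothetical helper, not part of gradio_helper or Gradio.

```python
import socket


def find_free_port(host: str = "127.0.0.1") -> int:
    """Ask the OS for a TCP port that is currently unused on the given host."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))  # port 0 lets the OS pick any free port
        return sock.getsockname()[1]


print("free port:", find_free_port())
```

The returned integer could then be passed as server_port to launch(). Another process may claim the port between the check and the launch, so treat this as a convenience rather than a guarantee.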