Wav2Lip: Accurately Lip-syncing Videos and OpenVINO#
This Jupyter notebook can be launched after a local installation only.
Lip-sync technologies are widely used in digital human applications, where they improve the user experience in dialog scenarios.
Wav2Lip is an approach to generate accurate 2D lip-synced videos in the wild with only one video and an audio clip. Wav2Lip leverages an accurate lip-sync “expert” model and consecutive face frames for accurate, natural lip motion generation.
In this notebook, we show how to enable and optimize the Wav2Lip pipeline with OpenVINO. This is an adaptation of the blog article Enable 2D Lip Sync Wav2Lip Pipeline with OpenVINO Runtime.
Here is an overview of the Wav2Lip pipeline:
Installation Instructions#
This is a self-contained example that relies solely on its own code.
We recommend running the notebook in a virtual environment. You only need a Jupyter server to start. For details, please refer to Installation Guide.
Prerequisites#
import requests
from pathlib import Path
r = requests.get(
    url="https://raw.githubusercontent.com/openvinotoolkit/openvino_notebooks/latest/utils/notebook_utils.py",
)
open("notebook_utils.py", "w").write(r.text)

r = requests.get(
    url="https://raw.githubusercontent.com/openvinotoolkit/openvino_notebooks/latest/utils/pip_helper.py",
)
open("pip_helper.py", "w").write(r.text)
from pip_helper import pip_install
pip_install("-q", "openvino>=2024.4.0")
pip_install(
    "-q",
    "huggingface_hub",
    "torch>=2.1",
    "gradio>=4.19",
    "librosa==0.9.2",
    "opencv-contrib-python",
    "opencv-python",
    "tqdm",
    "numba",
    "numpy<2",
    "--extra-index-url",
    "https://download.pytorch.org/whl/cpu",
)
helpers = ["gradio_helper.py", "ov_inference.py", "ov_wav2lip_helper.py"]
for helper_file in helpers:
    if not Path(helper_file).exists():
        r = requests.get(url=f"https://raw.githubusercontent.com/openvinotoolkit/openvino_notebooks/latest/notebooks/wav2lip/{helper_file}")
        open(helper_file, "w").write(r.text)
import os
import sys
wav2lip_path = Path("Wav2Lip")
if not wav2lip_path.exists():
    exit_code = os.system("git clone https://github.com/Rudrabha/Wav2Lip")
    if exit_code != 0:
        raise Exception("Failed to clone the repository!")
sys.path.insert(0, str(wav2lip_path))
Cloning into 'Wav2Lip'...
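The download cells above write whatever the server returns straight to disk. A slightly more defensive variant is sketched below. It is not part of the notebook: the function name `download_text_file` and the injectable `fetch` parameter are hypothetical and exist only so the logic can be exercised without network access. By default it calls `requests.get` followed by `raise_for_status()`, so an HTTP error page is not saved as a Python module.

```python
from pathlib import Path


def _fetch_with_requests(url):
    """Default fetcher: download text over HTTP and fail loudly on HTTP errors."""
    import requests  # imported lazily so the sketch loads without requests installed

    response = requests.get(url, timeout=30)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses
    return response.text


def download_text_file(url, dest, fetch=_fetch_with_requests):
    """Save the text found at `url` to `dest` unless `dest` already exists."""
    dest = Path(dest)
    if dest.exists():
        return dest  # keep the existing copy, as the helper loop above does
    text = fetch(url)
    # The context manager closes the file handle deterministically.
    with open(dest, "w", encoding="utf-8") as f:
        f.write(text)
    return dest
```

With such a helper, each download becomes a single call, for example `download_text_file(url, "pip_helper.py")`, and a failed request stops the cell instead of producing a broken import later.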
Download example files.
from notebook_utils import download_file
download_file("https://github.com/sammysun0711/openvino_aigc_samples/blob/main/Wav2Lip/data_audio_sun_5s.wav?raw=true")
download_file("https://github.com/sammysun0711/openvino_aigc_samples/blob/main/Wav2Lip/data_video_sun_5s.mp4?raw=true")
data_audio_sun_5s.wav: 0%| | 0.00/436k [00:00<?, ?B/s]
data_video_sun_5s.mp4: 0%| | 0.00/916k [00:00<?, ?B/s]
PosixPath('/opt/home/k8sworker/ci-ai/cibuilds/jobs/ov-notebook/jobs/OVNotebookOps/builds/801/archive/.workspace/scm/ov-notebook/notebooks/wav2lip/data_video_sun_5s.mp4')
Convert the model to OpenVINO IR#
You don’t need to download checkpoints or load models manually. Just call the helper function download_and_convert_models, which takes care of both steps and converts both models to OpenVINO IR format.
from ov_wav2lip_helper import download_and_convert_models
OV_FACE_DETECTION_MODEL_PATH = Path("models/face_detection.xml")
OV_WAV2LIP_MODEL_PATH = Path("models/wav2lip.xml")
download_and_convert_models(OV_FACE_DETECTION_MODEL_PATH, OV_WAV2LIP_MODEL_PATH)
Convert Face Detection Model ...
s3fd-619a316812.pth: 0%| | 0.00/85.7M [00:00<?, ?B/s]
/opt/home/k8sworker/ci-ai/cibuilds/jobs/ov-notebook/jobs/OVNotebookOps/builds/801/archive/.workspace/scm/ov-notebook/notebooks/wav2lip/ov_wav2lip_helper.py:43: FutureWarning: You are using torch.load with weights_only=False (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See pytorch/pytorch for more details). In a future release, the default value for weights_only will be flipped to True. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via torch.serialization.add_safe_globals. We recommend you start setting weights_only=True for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. model_weights = torch.load(path_to_detector)
Converted face detection OpenVINO model: models/face_detection.xml
Convert Wav2Lip Model ...
wav2lip.pth: 0%| | 0.00/436M [00:00<?, ?B/s]
Load checkpoint from: checkpoints/Wav2lip/wav2lip.pth
/opt/home/k8sworker/ci-ai/cibuilds/jobs/ov-notebook/jobs/OVNotebookOps/builds/801/archive/.workspace/scm/ov-notebook/notebooks/wav2lip/ov_wav2lip_helper.py:16: FutureWarning: You are using torch.load with weights_only=False (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See pytorch/pytorch for more details). In a future release, the default value for weights_only will be flipped to True. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via torch.serialization.add_safe_globals. We recommend you start setting weights_only=True for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. checkpoint = torch.load(checkpoint_path, map_location=lambda storage, loc: storage)
Converted face detection OpenVINO model: models/wav2lip.xml
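For readers curious about what a conversion helper of this kind generally involves, here is a minimal sketch of the usual PyTorch-to-IR flow with `openvino.convert_model` and `openvino.save_model`. It is an illustration under assumptions, not the code of `download_and_convert_models`: the function name `convert_to_ir`, the `make_model` factory and the `example_input` argument are hypothetical, and the real helper may trace the models with different inputs.

```python
from pathlib import Path


def convert_to_ir(make_model, example_input, xml_path):
    """Convert a PyTorch module to OpenVINO IR unless the IR file already exists.

    `make_model` is a hypothetical zero-argument callable that returns a
    torch.nn.Module with its weights already loaded.
    """
    xml_path = Path(xml_path)
    if xml_path.exists():
        return xml_path  # reuse a previously converted model

    import openvino as ov  # lazy import: only needed when converting

    model = make_model()
    model.eval()  # switch off dropout/batch-norm updates before tracing
    ov_model = ov.convert_model(model, example_input=example_input)
    xml_path.parent.mkdir(parents=True, exist_ok=True)
    ov.save_model(ov_model, xml_path)  # writes the .xml and the matching .bin
    return xml_path
```

Skipping the conversion when the `.xml` file is already present keeps repeated notebook runs fast, since the checkpoints only need to be loaded once.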
Compile models and prepare the pipeline#
Select a device from the dropdown list for running inference with OpenVINO.
from notebook_utils import device_widget
device = device_widget()
device
Dropdown(description='Device:', index=1, options=('CPU', 'AUTO'), value='AUTO')
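The dropdown above offers the devices reported on this machine plus `AUTO`. The sketch below shows one way such an option list could be assembled outside a notebook. The function names `list_openvino_devices` and `device_options` are hypothetical, and the actual `device_widget` helper may be implemented differently. Only `openvino.Core().available_devices` is an OpenVINO Runtime API here.

```python
def list_openvino_devices():
    """Return the device names OpenVINO Runtime reports on this machine."""
    import openvino as ov  # lazy import so the pure helper below has no dependencies

    return ov.Core().available_devices


def device_options(available_devices, default="AUTO"):
    """Build the selectable options and the preselected value.

    "AUTO" is appended so that OpenVINO can choose a device by itself.
    """
    options = list(available_devices) + ["AUTO"]
    value = default if default in options else options[0]
    return options, value
```

For a machine that reports only a CPU, `device_options(["CPU"])` yields the options `["CPU", "AUTO"]` with `"AUTO"` preselected, which mirrors the widget output shown above. The chosen name can then be passed as the device when compiling a model.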
ov_inference.py is an adaptation of the original pipeline, which has only a command-line interface. ov_inference allows running inference through the Python API with the converted OpenVINO models.
from ov_inference import ov_inference
if not os.path.exists("results"):
    os.mkdir("results")

ov_inference(
    "data_video_sun_5s.mp4",
    "data_audio_sun_5s.wav",
    face_detection_path=OV_FACE_DETECTION_MODEL_PATH,
    wav2lip_path=OV_WAV2LIP_MODEL_PATH,
    inference_device=device.value,
    outfile="results/result_voice.mp4",
)
Reading video frames...
Number of frames available for inference: 125
/opt/home/k8sworker/ci-ai/cibuilds/jobs/ov-notebook/jobs/OVNotebookOps/builds/801/archive/.workspace/scm/ov-notebook/notebooks/wav2lip/Wav2Lip/audio.py:100: FutureWarning: Pass sr=16000, n_fft=800 as keyword args. From version 0.10 passing these as positional arguments will result in an error
return librosa.filters.mel(hp.sample_rate, hp.n_fft, n_mels=hp.num_mels,
(80, 405)
Length of mel chunks: 123
0%| | 0/1 [00:00<?, ?it/s]
face_detect_ov images[0].shape: (768, 576, 3)
0%| | 0/8 [00:00<?, ?it/s]
12%|█▎ | 1/8 [00:02<00:19, 2.77s/it]
25%|██▌ | 2/8 [00:05<00:16, 2.70s/it]
38%|███▊ | 3/8 [00:08<00:13, 2.68s/it]
50%|█████ | 4/8 [00:10<00:10, 2.66s/it]
62%|██████▎ | 5/8 [00:13<00:07, 2.65s/it]
75%|███████▌ | 6/8 [00:15<00:05, 2.64s/it]
88%|████████▊ | 7/8 [00:18<00:02, 2.64s/it]
100%|██████████| 8/8 [00:20<00:00, 2.56s/it]
Model loaded
100%|██████████| 1/1 [00:22<00:00, 22.96s/it]
ffmpeg version 4.2.7-0ubuntu0.1 Copyright (c) 2000-2022 the FFmpeg developers
built with gcc 9 (Ubuntu 9.4.0-1ubuntu1~20.04.1)
configuration: --prefix=/usr --extra-version=0ubuntu0.1 --toolchain=hardened --libdir=/usr/lib/x86_64-linux-gnu --incdir=/usr/include/x86_64-linux-gnu --arch=amd64 --enable-gpl --disable-stripping --enable-avresample --disable-filter=resample --enable-avisynth --enable-gnutls --enable-ladspa --enable-libaom --enable-libass --enable-libbluray --enable-libbs2b --enable-libcaca --enable-libcdio --enable-libcodec2 --enable-libflite --enable-libfontconfig --enable-libfreetype --enable-libfribidi --enable-libgme --enable-libgsm --enable-libjack --enable-libmp3lame --enable-libmysofa --enable-libopenjpeg --enable-libopenmpt --enable-libopus --enable-libpulse --enable-librsvg --enable-librubberband --enable-libshine --enable-libsnappy --enable-libsoxr --enable-libspeex --enable-libssh --enable-libtheora --enable-libtwolame --enable-libvidstab --enable-libvorbis --enable-libvpx --enable-libwavpack --enable-libwebp --enable-libx265 --enable-libxml2 --enable-libxvid --enable-libzmq --enable-libzvbi --enable-lv2 --enable-omx --enable-openal --enable-opencl --enable-opengl --enable-sdl2 --enable-libdc1394 --enable-libdrm --enable-libiec61883 --enable-nvenc --enable-chromaprint --enable-frei0r --enable-libx264 --enable-shared
libavutil 56. 31.100 / 56. 31.100
libavcodec 58. 54.100 / 58. 54.100
libavformat 58. 29.100 / 58. 29.100
libavdevice 58. 8.100 / 58. 8.100
libavfilter 7. 57.100 / 7. 57.100
libavresample 4. 0. 0 / 4. 0. 0
libswscale 5. 5.100 / 5. 5.100
libswresample 3. 5.100 / 3. 5.100
libpostproc 55. 5.100 / 55. 5.100
Guessed Channel Layout for Input Stream #0.0 : mono
Input #0, wav, from 'data_audio_sun_5s.wav':
Metadata:
encoder : Lavf58.20.100
Duration: 00:00:05.06, bitrate: 705 kb/s
Stream #0:0: Audio: pcm_s16le ([1][0][0][0] / 0x0001), 44100 Hz, mono, s16, 705 kb/s
Input #1, avi, from 'Wav2Lip/temp/result.avi':
Metadata:
encoder : Lavf59.27.100
Duration: 00:00:04.92, start: 0.000000, bitrate: 1891 kb/s
Stream #1:0: Video: mpeg4 (Simple Profile) (DIVX / 0x58564944), yuv420p, 576x768 [SAR 1:1 DAR 3:4], 1893 kb/s, 25 fps, 25 tbr, 25 tbn, 25 tbc
Stream mapping:
Stream #1:0 -> #0:0 (mpeg4 (native) -> h264 (libx264))
Stream #0:0 -> #0:1 (pcm_s16le (native) -> aac (native))
Press [q] to stop, [?] for help
[libx264 @ 0x56537dc66840] -qscale is ignored, -crf is recommended.
[libx264 @ 0x56537dc66840] using SAR=1/1
[libx264 @ 0x56537dc66840] using cpu capabilities: MMX2 SSE2Fast SSSE3 SSE4.2 AVX FMA3 BMI2 AVX2 AVX512
[libx264 @ 0x56537dc66840] profile High, level 3.1
[libx264 @ 0x56537dc66840] 264 - core 155 r2917 0a84d98 - H.264/MPEG-4 AVC codec - Copyleft 2003-2018 - http://www.videolan.org/x264.html - options: cabac=1 ref=3 deblock=1:0:0 analyse=0x3:0x113 me=hex subme=7 psy=1 psy_rd=1.00:0.00 mixed_ref=1 me_range=16 chroma_me=1 trellis=1 8x8dct=1 cqm=0 deadzone=21,11 fast_pskip=1 chroma_qp_offset=-2 threads=24 lookahead_threads=4 sliced_threads=0 nr=0 decimate=1 interlaced=0 bluray_compat=0 constrained_intra=0 bframes=3 b_pyramid=2 b_adapt=1 b_bias=0 direct=1 weightb=1 open_gop=0 weightp=2 keyint=250 keyint_min=25 scenecut=40 intra_refresh=0 rc_lookahead=40 rc=crf mbtree=1 crf=23.0 qcomp=0.60 qpmin=0 qpmax=69 qpstep=4 ip_ratio=1.40 aq=1:1.00
Output #0, mp4, to 'results/result_voice.mp4':
Metadata:
encoder : Lavf58.29.100
Stream #0:0: Video: h264 (libx264) (avc1 / 0x31637661), yuv420p(progressive), 576x768 [SAR 1:1 DAR 3:4], q=-1--1, 25 fps, 12800 tbn, 25 tbc
Metadata:
encoder : Lavc58.54.100 libx264
Side data:
cpb: bitrate max/min/avg: 0/0/0 buffer size: 0 vbv_delay: -1
Stream #0:1: Audio: aac (LC) (mp4a / 0x6134706D), 44100 Hz, mono, fltp, 69 kb/s
Metadata:
encoder : Lavc58.54.100 aac
frame= 123 fps=0.0 q=-1.0 Lsize= 621kB time=00:00:05.06 bitrate=1005.8kbits/s speed=10.8x
video:573kB audio:43kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: 0.827166%
[libx264 @ 0x56537dc66840] frame I:1 Avg QP:22.24 size: 31028
[libx264 @ 0x56537dc66840] frame P:75 Avg QP:22.01 size: 6954
[libx264 @ 0x56537dc66840] frame B:47 Avg QP:25.58 size: 718
[libx264 @ 0x56537dc66840] consecutive B-frames: 38.2% 27.6% 14.6% 19.5%
[libx264 @ 0x56537dc66840] mb I I16..4: 14.0% 83.9% 2.1%
[libx264 @ 0x56537dc66840] mb P I16..4: 1.3% 3.3% 0.1% P16..4: 37.8% 8.2% 6.4% 0.0% 0.0% skip:43.0%
[libx264 @ 0x56537dc66840] mb B I16..4: 0.2% 0.7% 0.0% B16..8: 27.9% 0.4% 0.1% direct: 0.2% skip:70.6% L0:43.9% L1:54.2% BI: 1.9%
[libx264 @ 0x56537dc66840] 8x8 transform intra:73.3% inter:77.1%
[libx264 @ 0x56537dc66840] coded y,uvDC,uvAC intra: 56.9% 72.4% 8.1% inter: 11.4% 13.0% 0.2%
[libx264 @ 0x56537dc66840] i16 v,h,dc,p: 20% 23% 9% 48%
[libx264 @ 0x56537dc66840] i8 v,h,dc,ddl,ddr,vr,hd,vl,hu: 25% 23% 36% 3% 3% 2% 2% 3% 3%
[libx264 @ 0x56537dc66840] i4 v,h,dc,ddl,ddr,vr,hd,vl,hu: 39% 14% 14% 4% 6% 7% 4% 9% 3%
[libx264 @ 0x56537dc66840] i8c dc,h,v,p: 42% 25% 29% 4%
[libx264 @ 0x56537dc66840] Weighted P-Frames: Y:0.0% UV:0.0%
[libx264 @ 0x56537dc66840] ref P L0: 74.2% 10.4% 11.1% 4.3%
[libx264 @ 0x56537dc66840] ref B L0: 86.1% 11.2% 2.8%
[libx264 @ 0x56537dc66840] ref B L1: 98.3% 1.7%
[libx264 @ 0x56537dc66840] kb/s:953.36
[aac @ 0x56537dc68140] Qavg: 121.673
'results/result_voice.mp4'
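The log above reports a mel spectrogram of shape (80, 405), 123 mel chunks, and 123 encoded frames for a 25 fps video with 125 frames. The sketch below reproduces that chunk count. It assumes the windowing used by the upstream Wav2Lip inference script: a window of 16 mel columns per video frame and 80 mel columns per second of audio (a 16000 Hz sample rate with an assumed hop size of 200). The function name `mel_chunk_slices` is hypothetical, and `ov_inference.py` may differ in details.

```python
def mel_chunk_slices(num_mel_frames, fps, mel_step_size=16, mel_frames_per_second=80.0):
    """Return (start, end) column ranges of the mel windows, one per video frame."""
    mel_idx_multiplier = mel_frames_per_second / fps  # mel columns per video frame
    slices = []
    i = 0
    while True:
        start = int(i * mel_idx_multiplier)
        if start + mel_step_size > num_mel_frames:
            # Last window: align it to the end of the spectrogram and stop.
            slices.append((num_mel_frames - mel_step_size, num_mel_frames))
            break
        slices.append((start, start + mel_step_size))
        i += 1
    return slices


chunks = mel_chunk_slices(num_mel_frames=405, fps=25)
print(len(chunks))  # 123, matching "Length of mel chunks" in the log
```

Each window is paired with one video frame, so under this assumption only the first 123 of the 125 available frames are used, which is consistent with the `frame= 123` line in the ffmpeg output.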
Here is an example to compare the original video and the generated video after the Wav2Lip pipeline:
from IPython.display import Video, Audio
Video("data_video_sun_5s.mp4", embed=True)