Audio compression with EnCodec and OpenVINO
This tutorial is also available as a Jupyter notebook that can be cloned directly from GitHub. See the installation guide for instructions to run this tutorial locally on Windows, Linux or macOS.
Compression is an important part of the Internet today because it enables people to easily share high-quality photos, listen to audio messages, stream their favorite shows, and so much more. Even with today’s state-of-the-art techniques, enjoying these rich multimedia experiences requires a high-speed Internet connection and plenty of storage space. AI helps to overcome these limitations: “Imagine listening to a friend’s audio message in an area with low connectivity and not having it stall or glitch.”
This tutorial shows how to use OpenVINO with the EnCodec algorithm for high-ratio audio compression. EnCodec is a real-time, high-fidelity audio codec that uses AI to compress audio files without a perceptible loss of quality. It was introduced in the paper High Fidelity Neural Audio Compression by Meta AI. The researchers claimed an approximately 10x compression rate compared with MP3 at 64 kbps without loss of quality, and made the approach work for CD-quality audio. More details can be found in the Meta AI blog and the original repo.
Install required dependencies:
%pip install -q -r requirements.txt
Note: you may need to restart the kernel to use updated packages.
Instantiate audio compression pipeline
Codecs, which act as encoders and decoders for streams of data, power most of the audio compression people currently use online. Commonly used codecs include MP3, Opus, and EVS. Classic codecs like these decompose the signal into different frequencies and encode them as efficiently as possible. Most classic codecs leverage knowledge of human hearing (psychoacoustics) but have a finite, handcrafted set of ways to encode and decode the file. EnCodec, a neural network that is trained end to end to reconstruct the input signal, was introduced as an attempt to overcome this limitation. It consists of three parts:
The encoder, which takes the uncompressed data in and transforms it into a higher dimensional and lower frame rate representation.
The quantizer, which compresses this representation to the target size. This compressed representation is what is stored on disk or will be sent through the network.
The decoder, which is the final step. It turns the compressed signal back into a waveform that is as similar as possible to the original. The key to lossy compression is to identify changes that will not be perceivable by humans, as perfect reconstruction is impossible at low bit rates.
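The three stages above can be illustrated with a deliberately simple toy example. This is only a sketch of the encoder, quantizer and decoder roles, not EnCodec itself: the real model uses learned neural networks, while the functions `encode`, `quantize` and `decode` below are hypothetical helpers that use frame averaging and a fixed 16-level codebook.

```python
import math

def encode(samples, hop):
    """Toy 'encoder': average every `hop` samples to lower the frame rate."""
    return [sum(samples[i:i + hop]) / hop
            for i in range(0, len(samples) - hop + 1, hop)]

def quantize(frames, codebook):
    """Toy 'quantizer': keep only the index of the nearest codebook entry."""
    return [min(range(len(codebook)), key=lambda k: abs(codebook[k] - f))
            for f in frames]

def decode(codes, codebook, hop):
    """Toy 'decoder': look up each code and hold its value for `hop` samples."""
    out = []
    for c in codes:
        out.extend([codebook[c]] * hop)
    return out

# One second of a 5 Hz sine wave at an assumed rate of 800 samples per second.
sample_rate = 800
samples = [math.sin(2 * math.pi * 5 * n / sample_rate) for n in range(sample_rate)]

hop = 8                                              # 8 input samples per frame
codebook = [-1.0 + 2.0 * k / 15 for k in range(16)]  # 16 levels = 4 bits per code

codes = quantize(encode(samples, hop), codebook)     # what would be stored or sent
restored = decode(codes, codebook, hop)

bits_stored = len(codes) * 4
max_error = max(abs(a - b) for a, b in zip(samples, restored))
print(len(samples), "samples ->", len(codes), "codes,", bits_stored, "bits")
print("max reconstruction error:", round(max_error, 3))
```

Only the integer codes need to be kept, which is why the quantizer determines the compressed size. The reconstruction is close to the input but not identical to it.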
The authors provide two multi-bandwidth models:
encodec_model_24khz - a causal model operating at 24 kHz on monophonic audio trained on a variety of audio data.
encodec_model_48khz - a non-causal model operating at 48 kHz on stereophonic audio trained on music-only data.
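The difference between a causal and a non-causal model can be shown with a generic signal-processing sketch. A causal operation computes each output from the current and past samples only, which is what makes streaming possible, while a non-causal one also looks at future samples. The moving-average filters below are hypothetical illustrations of the term and are not taken from the EnCodec architecture.

```python
def causal_average(signal, width):
    """Each output uses only the current sample and the ones before it."""
    out = []
    for t in range(len(signal)):
        window = signal[max(0, t - width + 1): t + 1]
        out.append(sum(window) / len(window))
    return out

def centered_average(signal, width):
    """Each output also uses `width // 2` samples that come after it."""
    half = width // 2
    out = []
    for t in range(len(signal)):
        window = signal[max(0, t - half): t + half + 1]
        out.append(sum(window) / len(window))
    return out

# A single impulse at index 3.
impulse = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]

causal = causal_average(impulse, 3)
centered = centered_average(impulse, 3)

# Index of the first output that reacts to the impulse.
first_causal = next(i for i, v in enumerate(causal) if v > 0)
first_centered = next(i for i, v in enumerate(centered) if v > 0)
print("causal reacts at index", first_causal)
print("centered reacts at index", first_centered)
```

The centered filter responds one step before the impulse arrives, so it cannot run until that future sample is available. The causal filter never responds early.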
In this tutorial, we will use encodec_model_24khz as an example, but the same steps also apply to the encodec_model_48khz model. To start working with the model, we instantiate it with EncodecModel.encodec_model_24khz() and select the required compression bandwidth from the available options: 1.5, 3, 6, 12 or 24 kbps for the 24 kHz model and 3, 6, 12 or 24 kbps for the 48 kHz model. We will use 6 kbps.
from encodec import EncodecModel, compress, decompress
from encodec.utils import convert_audio, save_audio
import torchaudio
import torch
import typing as tp

# Instantiate a pretrained EnCodec model
model = EncodecModel.encodec_model_24khz()
model.set_target_bandwidth(6.0)
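To get a feel for what a 6 kbps target means, it can be compared with uncompressed audio. The sketch below is back-of-the-envelope arithmetic only. It assumes the uncompressed reference is 16-bit mono PCM at 24 kHz (an assumption, not something stated by the tutorial), and the helper names are hypothetical. It also ignores container headers and any extra entropy coding, so real file sizes will differ somewhat.

```python
def pcm_bitrate_kbps(sample_rate_hz, bits_per_sample, channels):
    """Bitrate of uncompressed PCM audio in kilobits per second."""
    return sample_rate_hz * bits_per_sample * channels / 1000

def payload_size_bytes(bandwidth_kbps, duration_s):
    """Approximate payload size for a given bandwidth and duration."""
    return bandwidth_kbps * 1000 * duration_s / 8

raw_kbps = pcm_bitrate_kbps(24_000, 16, 1)  # assumed 16-bit mono reference
target_kbps = 6.0                           # bandwidth selected in the tutorial

ratio = raw_kbps / target_kbps
ten_seconds = payload_size_bytes(target_kbps, 10)

print("uncompressed:", raw_kbps, "kbps")
print("ratio versus raw PCM:", ratio)
print("approx. payload for 10 s:", ten_seconds, "bytes")
```

This ratio is measured against raw PCM, so it is not comparable with compression figures quoted relative to other codecs such as MP3.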
Explore EnCodec pipeline
Let us explore model capabilities on example audio:
import sys
import librosa
import matplotlib.pyplot as plt
import librosa.display
import IPython.display as ipd

sys.path.append("../utils")
from notebook_utils import download_file

test_data_url = "https://github.com/facebookresearch/encodec/raw/main/test_24k.wav"
sample_file = 'test_24k.wav'
download_file(test_data_url, sample_file)
audio, sr = librosa.load(sample_file)
plt.figure(figsize=(14, 5))
librosa.display.waveshow(audio, sr=sr)
ipd.Audio(sample_file)
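Because the 24 kHz model operates on 24 kHz monophonic audio, it can be useful to check a file's native sample rate and channel count before processing it. Note that librosa.load resamples to 22,050 Hz by default unless sr=None is passed, so the sr value it returns is not necessarily the rate stored in the file. The sketch below uses only the Python standard library. The helpers `describe_wav` and `write_test_tone` are hypothetical names introduced for illustration, and the example inspects a synthetic tone rather than the tutorial's sample so that it runs without a download.

```python
import math
import os
import struct
import tempfile
import wave

def describe_wav(path):
    """Read basic properties from the header of a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "sample_rate": rate,
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),
            "duration_s": frames / rate,
        }

def write_test_tone(path, sample_rate=24_000, seconds=0.5, freq=440.0):
    """Write a short 16-bit mono sine tone, used here as stand-in input."""
    n = int(sample_rate * seconds)
    pcm = b"".join(
        struct.pack("<h", int(32767 * 0.5 * math.sin(2 * math.pi * freq * i / sample_rate)))
        for i in range(n)
    )
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)      # mono
        wav.setsampwidth(2)      # 2 bytes = 16 bits per sample
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)

tone_path = os.path.join(tempfile.mkdtemp(), "tone_24k.wav")
write_test_tone(tone_path)
info = describe_wav(tone_path)
print(info)
```

The same helper could be called as describe_wav(sample_file) on the downloaded example, provided the file is uncompressed PCM, which is the only WAV encoding the standard wave module reads.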