Speaker diarization is the process of partitioning an audio stream
containing human speech into homogeneous segments according to the
identity of each speaker. It can enhance the readability of an automatic
speech transcription by structuring the audio stream into speaker turns
and, when used together with speaker recognition systems, by providing
the speaker’s true identity. It is used to answer the question “who
spoke when?”
With the increasing number of broadcasts, meeting recordings and voice
mail collected every year, speaker diarization has received much
attention from the speech community. It is an essential feature for a
speech recognition system, enriching the transcription with speaker
labels.
Speaker diarization increases transcript readability and helps you
better understand what a conversation is about. It can be used to
extract important points or action items from the conversation,
identify who said what, and determine how many speakers were present in
the audio.
This tutorial considers ways to build a speaker diarization pipeline
using pyannote.audio and OpenVINO. pyannote.audio is an open-source
toolkit written in Python for speaker diarization. Based on the PyTorch
deep learning framework, it provides a set of trainable end-to-end
neural building blocks that can be combined and jointly optimized to
build speaker diarization pipelines. You can find more information
about the pyannote pre-trained models in the model card, repo and
paper.
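Before running the tutorial, the required packages need to be installed. The exact package set and versions depend on your environment; as a rough sketch (the package names are the public PyPI ones, the version pin is an assumption), an installation cell could look like this:

# Install the dependencies used in this tutorial (a sketch; adjust versions to your environment)
%pip install -q "openvino>=2023.1.0" "pyannote.audio" onnx torch torchaudio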
Traditional Speaker Diarization systems can be generalized into a
five-step process:
1. Feature extraction: transform the raw waveform into audio features
   like a mel spectrogram.
2. Voice activity detection: identify the chunks in the audio where
   some voice activity is observed. As we are not interested in silence
   and noise, these irrelevant chunks are ignored.
3. Speaker change detection: identify the speaker change points in the
   conversation present in the audio.
4. Speech turn representation: encode each subchunk by creating feature
   representations.
5. Speech turn clustering: cluster the subchunks based on their vector
   representation. Different clustering algorithms may be applied
   depending on whether the cluster count (k) is known and on the
   embedding process of the previous step.
The final output is a set of clusters of subchunks from the audio
stream. Each cluster can be given an anonymous identifier
(speaker_a, ..) and then mapped back to the audio stream to create a
speaker-aware audio timeline.
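To make these steps more concrete, here is a minimal, illustrative sketch of the feature extraction and clustering stages. It is not the pyannote.audio implementation: the waveform and the per-segment speaker embeddings are random placeholders, and torchaudio and scikit-learn are assumed to be available.

import numpy as np
import torch
import torchaudio
from sklearn.cluster import AgglomerativeClustering

# 1. Feature extraction: raw waveform -> mel spectrogram (16 kHz mono assumed)
waveform = torch.randn(1, 16000 * 5)  # placeholder for a 5-second audio chunk
mel = torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_mels=80)(waveform)
print(mel.shape)  # (1, 80, num_frames)

# 4.-5. Speech turn representation and clustering: the embeddings here are random
# placeholders; in a real pipeline they come from a speaker embedding model
segment_embeddings = np.random.randn(10, 192)
labels = AgglomerativeClustering(n_clusters=2).fit_predict(segment_embeddings)
print(labels)  # anonymous speaker ids per subchunk, e.g. [0, 1, 0, ...]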
In the diagram below, you can see a typical speaker diarization pipeline:
From a simplified point of view, speaker diarization is a combination of
speaker segmentation and speaker clustering. The first aims at finding
speaker change points in an audio stream. The second aims at grouping
together speech segments based on speaker characteristics.
To instantiate the speaker diarization pipeline with the pyannote.audio
library, we should import the Pipeline class and use its from_pretrained
method, providing either a path to a directory with the pipeline
configuration or a model identifier from the HuggingFace
hub.
NOTE: This tutorial uses a non-official version of the model
philschmid/pyannote-speaker-diarization-endpoint, provided only
for demo purposes. The original model
(pyannote/speaker-diarization) requires you to accept the model
license before downloading or using its weights; visit the
pyannote/speaker-diarization model card
to read and accept the license before you proceed. To use this model,
you must be a registered user in the Hugging Face Hub. You will also
need an access token for the code below to run. For more information on
access tokens, please refer to this section of the
documentation.
You can log in to the HuggingFace Hub in the notebook environment using
the following code:
## login to huggingface hub to get access to pre-trained model
from huggingface_hub import notebook_login, whoami

try:
    whoami()
    print('Authorization token already provided')
except OSError:
    notebook_login()
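Once logged in, the pipeline can be instantiated as described above. The sketch below uses the demo model identifier mentioned in the note, and the resulting pipeline variable is the one referenced by the conversion code in the next section:

from pyannote.audio import Pipeline

# Load the pre-trained speaker diarization pipeline from the HuggingFace Hub
# (demo model from the note above; the official, gated alternative is
# pyannote/speaker-diarization)
pipeline = Pipeline.from_pretrained("philschmid/pyannote-speaker-diarization-endpoint")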
Convert model to OpenVINO Intermediate Representation format
For best results with OpenVINO, it is recommended to convert the model
to OpenVINO IR format. OpenVINO supports PyTorch via ONNX conversion. We
will use torch.onnx.export to export the ONNX model from PyTorch. We
need to provide an initialized model instance and example inputs for
shape inference. We will then use ov.convert_model to convert the ONNX
model. This function returns an OpenVINO model ready to be loaded on the
device and to start making predictions. We can save it to disk for later
use with ov.save_model.
from pathlib import Path
import torch
import openvino as ov

core = ov.Core()

ov_speaker_segmentation_path = Path("pyannote-segmentation.xml")

if not ov_speaker_segmentation_path.exists():
    onnx_path = ov_speaker_segmentation_path.with_suffix(".onnx")
    # Export the segmentation model from the pipeline to ONNX,
    # with dynamic batch size and waveform length
    torch.onnx.export(
        pipeline._segmentation.model,
        torch.zeros((1, 1, 80000)),
        onnx_path,
        input_names=["chunks"],
        output_names=["outputs"],
        dynamic_axes={"chunks": {0: "batch_size", 2: "wave_len"}},
    )
    # Convert the ONNX model to OpenVINO IR and save it to disk
    ov_speaker_segmentation = ov.convert_model(onnx_path)
    ov.save_model(ov_speaker_segmentation, str(ov_speaker_segmentation_path))
    print(f"Model successfully converted to IR and saved to {ov_speaker_segmentation_path}")
else:
    ov_speaker_segmentation = core.read_model(ov_speaker_segmentation_path)
    print(f"Model successfully loaded from {ov_speaker_segmentation_path}")
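As a quick sanity check (not part of the conversion itself), the converted model can be compiled and run on a zero-filled chunk of the same shape used for the export. The device name "CPU" below is an assumption; any available OpenVINO device works:

import numpy as np

# Compile the converted segmentation model and run it on a dummy input chunk
compiled_model = core.compile_model(ov_speaker_segmentation, "CPU")
dummy_chunk = np.zeros((1, 1, 80000), dtype=np.float32)
outputs = compiled_model(dummy_chunk)[0]
print(outputs.shape)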