Speaker diarization¶
This tutorial is also available as a Jupyter notebook that can be cloned directly from GitHub. See the installation guide for instructions to run this tutorial locally on Windows, Linux or macOS.
Speaker diarization is the process of partitioning an audio stream containing human speech into homogeneous segments according to the identity of each speaker. It can enhance the readability of an automatic speech transcription by structuring the audio stream into speaker turns and, when used together with speaker recognition systems, by providing the speaker’s true identity. It is used to answer the question “who spoke when?”

With the growing number of broadcasts, meeting recordings and voicemail messages collected every year, speaker diarization has received much attention from the speech community. It is an essential feature for a speech recognition system that enriches its transcriptions with speaker labels.
Diarization makes a transcript easier to read and the conversation easier to follow: it helps extract important points or action items, shows who said what, and reveals how many speakers took part in the recording.
This tutorial shows how to build a speaker diarization pipeline using pyannote.audio and OpenVINO. pyannote.audio is an open-source toolkit written in Python for speaker diarization. Based on the PyTorch deep learning framework, it provides a set of trainable end-to-end neural building blocks that can be combined and jointly optimized to build speaker diarization pipelines. You can find more information about pyannote pre-trained models in the model card, repo and paper.
Prerequisites¶
!pip install -q -r requirements.txt
Prepare pipeline¶
Traditional speaker diarization systems can be generalized into a five-step process:
Feature extraction: transform the raw waveform into audio features such as a mel spectrogram.
Voice activity detection: identify the chunks in the audio where some voice activity was observed. As we are not interested in silence and noise, we ignore those irrelevant chunks.
Speaker change detection: identify the speaker change points in the conversation present in the audio.
Speech turn representation: encode each subchunk by creating feature representations.
Speech turn clustering: cluster the subchunks based on their vector representation. Different clustering algorithms may be applied based on the availability of cluster count (k) and the embedding process of the previous step.
The final output is a set of clusters of subchunks from the audio stream. Each cluster can be given an anonymous identifier (speaker_a, ...) and then mapped back onto the audio stream to create a speaker-aware audio timeline.
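The last step, turning cluster assignments into a speaker-aware timeline, can be sketched in a few lines of plain Python. This is only an illustration, not part of pyannote.audio: the function name `build_timeline`, the chunk boundaries and the cluster labels are all hypothetical, and the sketch assumes at most 26 speakers.

```python
from string import ascii_lowercase


def build_timeline(chunks, labels):
    """Merge consecutive chunks that share a cluster label into speaker turns.

    chunks: list of (start, end) times in seconds, sorted by start time.
    labels: cluster index for each chunk (the output of a clustering step).
    """
    names = {}     # cluster index -> anonymous speaker name
    timeline = []  # list of [start, end, speaker]
    for (start, end), label in zip(chunks, labels):
        # Name clusters in order of first appearance: speaker_a, speaker_b, ...
        if label not in names:
            names[label] = f"speaker_{ascii_lowercase[len(names)]}"
        speaker = names[label]
        # Extend the previous turn when the same speaker continues without a gap.
        if timeline and timeline[-1][2] == speaker and timeline[-1][1] == start:
            timeline[-1][1] = end
        else:
            timeline.append([start, end, speaker])
    return [tuple(turn) for turn in timeline]


# Hypothetical chunk boundaries (seconds) and cluster labels
chunks = [(0.0, 1.5), (1.5, 3.0), (3.0, 4.0), (4.5, 6.0)]
labels = [1, 1, 0, 1]
timeline = build_timeline(chunks, labels)
for start, end, speaker in timeline:
    print(f"{start:4.1f}s - {end:4.1f}s  {speaker}")
```

The first two chunks are merged into one turn because they share a cluster and touch each other, while the last chunk starts a new turn for the same speaker after the other voice has spoken.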
The diagram below shows a typical speaker diarization pipeline:

[Figure: diarization_pipeline]
From a simplified point of view, speaker diarization is a combination of speaker segmentation and speaker clustering. The first aims at finding speaker change points in an audio stream. The second aims at grouping together speech segments based on speaker characteristics.
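To make the two stages concrete, here is a minimal sketch on toy data. It is not how pyannote.audio implements them: the 2-D frame embeddings, the distance thresholds and all function names are assumptions chosen for illustration, and the clustering is a simple greedy scheme that keeps the first segment of each cluster as its representative.

```python
import math


def find_change_points(embeddings, threshold):
    """Segmentation: frame indices where consecutive embeddings differ strongly."""
    return [
        i for i in range(1, len(embeddings))
        if math.dist(embeddings[i - 1], embeddings[i]) > threshold
    ]


def split_segments(n_frames, change_points):
    """Turn change points into (first_frame, end_frame_exclusive) segments."""
    bounds = [0, *change_points, n_frames]
    return list(zip(bounds[:-1], bounds[1:]))


def mean_vector(vectors):
    """Element-wise mean of equally sized vectors."""
    return [sum(values) / len(vectors) for values in zip(*vectors)]


def cluster_segments(segment_vectors, threshold):
    """Clustering: assign each segment to the nearest cluster representative,
    opening a new cluster when no representative is within the threshold."""
    representatives, labels = [], []
    for vector in segment_vectors:
        distances = [math.dist(vector, rep) for rep in representatives]
        if distances and min(distances) <= threshold:
            labels.append(distances.index(min(distances)))
        else:
            representatives.append(vector)
            labels.append(len(representatives) - 1)
    return labels


# Hypothetical 2-D embeddings, one per audio frame
frame_embeddings = [
    (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),  # first voice
    (1.0, 1.1), (1.1, 1.0),              # second voice
    (0.0, 0.0), (0.1, 0.0),              # first voice again
]
change_points = find_change_points(frame_embeddings, threshold=0.5)
segments = split_segments(len(frame_embeddings), change_points)
segment_vectors = [mean_vector(frame_embeddings[a:b]) for a, b in segments]
speaker_labels = cluster_segments(segment_vectors, threshold=0.5)
print(segments, speaker_labels)
```

Segmentation alone finds three segments here; only the clustering step recognizes that the first and third segments belong to the same voice.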
To instantiate a speaker diarization pipeline with the pyannote.audio library, import the Pipeline class and call its from_pretrained method, passing either a path to a directory with the pipeline configuration or a model identifier from the Hugging Face Hub.
Note
This tutorial uses an unofficial version of the model, philschmid/pyannote-speaker-diarization-endpoint, provided for demo purposes only. The original model (pyannote/speaker-diarization) requires you to accept the model license before downloading or using its weights; visit the pyannote/speaker-diarization page to read and accept the license before you proceed. To use this model, you must be a registered user of the 🤗 Hugging Face Hub, and you will need an access token for the code below to run. For more information on access tokens, refer to this section of the documentation.
You can log in to the Hugging Face Hub from the notebook environment using the following code:
# Log in to the Hugging Face Hub to get access to the pre-trained model
from huggingface_hub import notebook_login, whoami

try:
    whoami()
    print('Authorization token already provided')
except OSError:
    notebook_login()
from pyannote.audio import Pipeline
pipeline = Pipeline.from_pretrained("philschmid/pyannote-speaker-diarization-endpoint")
Load test audio file¶
import sys
# Make the shared notebook helpers importable
sys.path.append("../utils")
from notebook_utils import download_file

test_data_url = "https://github.com/pyannote/pyannote-audio/raw/develop/tutorials/assets/sample.wav"
sample_file = 'sample.wav'
download_file(test_data_url, sample_file)
# Describe the file as a mapping with an identifier ('uri') and the audio path ('audio')
AUDIO_FILE = {'uri': sample_file.replace('.wav', ''), 'audio': sample_file}
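The AUDIO_FILE mapping above pairs an identifier (`uri`) with the path to the audio (`audio`). As a small sketch, the same mapping can be derived from any path with `pathlib`, which avoids hard-coding the `.wav` extension. The helper name `make_audio_file` is hypothetical and not part of pyannote.audio or the notebook utilities.

```python
from pathlib import Path


def make_audio_file(path):
    """Build a {'uri', 'audio'} mapping like AUDIO_FILE from a file path."""
    path = Path(path)
    # Path.stem drops only the final extension, whatever it is
    return {"uri": path.stem, "audio": str(path)}


audio_file = make_audio_file("sample.wav")
print(audio_file)
```

For `sample.wav` this yields the same `uri` and `audio` values as the hand-written mapping.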
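import librosa
import librosa.display
import matplotlib.pyplot as plt
import IPython.display as ipd

# librosa.load resamples to 22050 Hz mono by default
audio, sr = librosa.load(sample_file)
plt.figure(figsize=(14, 5))
# Plot the waveform, then render an audio player for listening
librosa.display.waveshow(audio, sr=sr)
ipd.Audio(sample_file)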