Merge branch 'release/3.3.0'
hbredin committed Jun 14, 2024
2 parents 70a8507 + d260ba0 commit adaf770
Showing 16 changed files with 2,438 additions and 47 deletions.
33 changes: 31 additions & 2 deletions CHANGELOG.md
@@ -1,5 +1,33 @@
# Changelog

## Version 3.3.0 (2024-06-14)

### TL;DR

`pyannote.audio` does [speech separation](https://hf.co/pyannote/speech-separation-ami-1.0): multi-speaker audio in, one audio channel per speaker out!

```bash
pip install pyannote.audio[separation]==3.3.0
```
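
A minimal usage sketch (hedged: it assumes access to the gated `pyannote/speech-separation-ami-1.0` checkpoint, a Hugging Face access token placeholder, and a 16 kHz output sample rate; the model card is the authoritative reference):

```python
# sketch only: checkpoint name, token placeholder, and 16 kHz rate are assumptions
from pyannote.audio import Pipeline
import scipy.io.wavfile

pipeline = Pipeline.from_pretrained(
    "pyannote/speech-separation-ami-1.0",
    use_auth_token="YOUR_HF_TOKEN",  # placeholder
)

# multi-speaker audio in, one channel per speaker out
diarization, sources = pipeline("audio.wav")

for s, speaker in enumerate(diarization.labels()):
    scipy.io.wavfile.write(f"{speaker}.wav", 16000, sources.data[:, s])
```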

### New features

- feat(task): add `PixIT` joint speaker diarization and speech separation task (with [@joonaskalda](https://github.com/joonaskalda/))
- feat(model): add `ToTaToNet` joint speaker diarization and speech separation model (with [@joonaskalda](https://github.com/joonaskalda/))
- feat(pipeline): add `SpeechSeparation` pipeline (with [@joonaskalda](https://github.com/joonaskalda/))
- feat(io): add option to select torchaudio `backend`

### Fixes

- fix(task): fix wrong train/development split when training with (some) meta-protocols ([#1709](https://github.com/pyannote/pyannote-audio/issues/1709))
- fix(task): fix metadata preparation with missing validation subset ([@clement-pages](https://github.com/clement-pages/))

### Improvements

- improve(io): when available, default to using `soundfile` backend
- improve(pipeline): do not extract embeddings when `max_speakers` is set to 1
- improve(pipeline): optimize memory usage of most pipelines ([#1713](https://github.com/pyannote/pyannote-audio/pull/1713) by [@benniekiss](https://github.com/benniekiss/))

## Version 3.2.0 (2024-05-08)

### New features
@@ -18,6 +46,7 @@
- fix(task): fix estimation of training set size (with [@FrenchKrab](https://github.com/FrenchKrab))
- fix(hook): fix `torch.Tensor` support in `ArtifactHook`
- fix(doc): fix typo in `Powerset` docstring (with [@lukasstorck](https://github.com/lukasstorck))
- fix(doc): remove mention of unsupported `numpy.ndarray` waveform (with [@Purfview](https://github.com/Purfview))

### Improvements

@@ -26,12 +55,12 @@
- improve(io): switch to `torchaudio >= 2.2.0`
- improve(doc): update tutorials (with [@clement-pages](https://github.com/clement-pages/))

## Breaking changes
### Breaking changes

- BREAKING(model): get rid of `Model.example_output` in favor of `num_frames` method, `receptive_field` property, and `dimension` property
- BREAKING(task): custom tasks need to be updated (see "Add your own task" tutorial)

## Community contributions
### Community contributions

- community: add tutorial for offline use of `pyannote/speaker-diarization-3.1` (by [@simonottenhauskenbun](https://github.com/simonottenhauskenbun))

6 changes: 3 additions & 3 deletions README.md
@@ -1,5 +1,5 @@
Using `pyannote.audio` open-source toolkit in production?
Make the most of it thanks to our [consulting services](https://herve.niderb.fr/consulting.html).
Using `pyannote.audio` open-source toolkit in production?
Consider switching to [pyannoteAI](https://www.pyannote.ai) for better and faster options.

# `pyannote.audio` speaker diarization toolkit

@@ -79,7 +79,7 @@ for turn, _, speaker in diarization.itertracks(yield_label=True):
Out of the box, `pyannote.audio` speaker diarization [pipeline](https://hf.co/pyannote/speaker-diarization-3.1) v3.1 is expected to be much better (and faster) than v2.x.
Those numbers are diarization error rates (in %):

| Benchmark | [v2.1](https://hf.co/pyannote/speaker-diarization-2.1) | [v3.1](https://hf.co/pyannote/speaker-diarization-3.1) | [Premium](https://forms.office.com/e/GdqwVgkZ5C) |
| Benchmark | [v2.1](https://hf.co/pyannote/speaker-diarization-2.1) | [v3.1](https://hf.co/pyannote/speaker-diarization-3.1) | [pyannoteAI](https://www.pyannote.ai) |
| --------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------ | ------------------------------------------------------ | ------------------------------------------------ |
| [AISHELL-4](https://arxiv.org/abs/2104.03603) | 14.1 | 12.2 | 11.9 |
| [AliMeeting](https://www.openslr.org/119/) (channel 1) | 27.4 | 24.4 | 22.5 |
9 changes: 4 additions & 5 deletions pyannote/audio/core/inference.py
@@ -559,9 +559,6 @@ def aggregate(
step=frames.step,
)

masks = 1 - np.isnan(scores)
scores.data = np.nan_to_num(scores.data, copy=True, nan=0.0)

# Hamming window used for overlap-add aggregation
hamming_window = (
np.hamming(num_frames_per_chunk).reshape(-1, 1)
@@ -613,11 +610,13 @@ def aggregate(
)

# loop on the scores of sliding chunks
for (chunk, score), (_, mask) in zip(scores, masks):
for chunk, score in scores:
# chunk ~ Segment
# score ~ (num_frames_per_chunk, num_classes)-shaped np.ndarray
# mask ~ (num_frames_per_chunk, num_classes)-shaped np.ndarray

mask = 1 - np.isnan(score)
np.nan_to_num(score, copy=False, nan=0.0)

start_frame = frames.closest_frame(chunk.start + 0.5 * frames.duration)

aggregated_output[start_frame : start_frame + num_frames_per_chunk] += (
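For context, a simplified sketch (not the actual pyannote code) of why moving the NaN handling inside the loop helps: the mask is built one chunk at a time instead of materializing a masked copy of every chunk up front, so peak memory stays proportional to a single chunk.

```python
import numpy as np

def overlap_add(chunk_scores, start_frames, num_total_frames, num_classes, window):
    """Aggregate per-chunk scores with a Hamming window, ignoring NaN frames.

    window: (num_frames_per_chunk, 1) array, e.g. np.hamming(n).reshape(-1, 1)
    """
    aggregated = np.zeros((num_total_frames, num_classes))
    overlap = np.zeros((num_total_frames, num_classes))
    for score, start in zip(chunk_scores, start_frames):
        # per-chunk mask and in-place NaN replacement
        # (previously computed for all chunks at once)
        mask = 1 - np.isnan(score)
        np.nan_to_num(score, copy=False, nan=0.0)
        num_frames = score.shape[0]
        aggregated[start : start + num_frames] += score * mask * window
        overlap[start : start + num_frames] += mask * window
    # normalize by the accumulated (masked) window weights
    return aggregated / np.maximum(overlap, 1e-12)
```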
54 changes: 44 additions & 10 deletions pyannote/audio/core/io.py
@@ -48,21 +48,41 @@
- a "IOBase" instance with "read" and "seek" support: open("audio.wav", "rb")
- a "Mapping" with any of the above as "audio" key: {"audio": ...}
- a "Mapping" with both "waveform" and "sample_rate" key:
{"waveform": (channel, time) numpy.ndarray or torch.Tensor, "sample_rate": 44100}
{"waveform": (channel, time) torch.Tensor, "sample_rate": 44100}
For last two options, an additional "channel" key can be provided as a zero-indexed
integer to load a specific channel: {"audio": "stereo.wav", "channel": 0}
"""


def get_torchaudio_info(file: AudioFile):
def get_torchaudio_info(
file: AudioFile, backend: str = None
) -> torchaudio.AudioMetaData:
"""Protocol preprocessor used to cache output of torchaudio.info
This is useful to speed future random access to this file, e.g.
in dataloaders using Audio.crop a lot....
Parameters
----------
file : AudioFile
backend : str
torchaudio backend to use. Defaults to 'soundfile' if available,
or the first available backend.
Returns
-------
info : torchaudio.AudioMetaData
Audio file metadata
"""

info = torchaudio.info(file["audio"])
if not backend:
backends = (
torchaudio.list_audio_backends()
) # e.g ['ffmpeg', 'soundfile', 'sox']
backend = "soundfile" if "soundfile" in backends else backends[0]

info = torchaudio.info(file["audio"], backend=backend)

# rewind if needed
if isinstance(file["audio"], IOBase):
@@ -82,6 +102,9 @@ class Audio:
In case of multi-channel audio, convert to single-channel audio
using one of the following strategies: select one channel at
'random' or 'downmix' by averaging all channels.
backend : str
torchaudio backend to use. Defaults to 'soundfile' if available,
or the first available backend.
Usage
-----
@@ -126,7 +149,7 @@ def validate_file(file: AudioFile) -> Mapping:
-------
validated_file : Mapping
{"audio": str, "uri": str, ...}
{"waveform": array or tensor, "sample_rate": int, "uri": str, ...}
{"waveform": tensor, "sample_rate": int, "uri": str, ...}
{"audio": file, "uri": "stream"} if `file` is an IOBase instance
Raises
@@ -148,7 +171,7 @@ def validate_file(file: AudioFile) -> Mapping:
raise ValueError(AudioFileDocString)

if "waveform" in file:
waveform: Union[np.ndarray, Tensor] = file["waveform"]
waveform: Tensor = file["waveform"]
if len(waveform.shape) != 2 or waveform.shape[0] > waveform.shape[1]:
raise ValueError(
"'waveform' must be provided as a (channel, time) torch Tensor."
@@ -179,11 +202,19 @@ def validate_file(file: AudioFile) -> Mapping:

return file

def __init__(self, sample_rate=None, mono=None):
def __init__(self, sample_rate: int = None, mono=None, backend: str = None):
super().__init__()
self.sample_rate = sample_rate
self.mono = mono

if not backend:
backends = (
torchaudio.list_audio_backends()
) # e.g ['ffmpeg', 'soundfile', 'sox']
backend = "soundfile" if "soundfile" in backends else backends[0]

self.backend = backend

def downmix_and_resample(self, waveform: Tensor, sample_rate: int) -> Tensor:
"""Downmix and resample
@@ -244,7 +275,7 @@ def get_duration(self, file: AudioFile) -> float:
if "torchaudio.info" in file:
info = file["torchaudio.info"]
else:
info = get_torchaudio_info(file)
info = get_torchaudio_info(file, backend=self.backend)

frames = info.num_frames
sample_rate = info.sample_rate
@@ -291,7 +322,7 @@ def __call__(self, file: AudioFile) -> Tuple[Tensor, int]:
sample_rate = file["sample_rate"]

elif "audio" in file:
waveform, sample_rate = torchaudio.load(file["audio"])
waveform, sample_rate = torchaudio.load(file["audio"], backend=self.backend)

# rewind if needed
if isinstance(file["audio"], IOBase):
@@ -349,7 +380,7 @@ def crop(
sample_rate = info.sample_rate

else:
info = get_torchaudio_info(file)
info = get_torchaudio_info(file, backend=self.backend)
frames = info.num_frames
sample_rate = info.sample_rate

@@ -401,7 +432,10 @@ def crop(
else:
try:
data, _ = torchaudio.load(
file["audio"], frame_offset=start_frame, num_frames=num_frames
file["audio"],
frame_offset=start_frame,
num_frames=num_frames,
backend=self.backend,
)
# rewind if needed
if isinstance(file["audio"], IOBase):
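A hedged usage sketch of the new `backend` option (import path and constructor arguments as they appear in this diff; the fallback mirrors the logic above: `'soundfile'` when installed, otherwise the first backend reported by torchaudio):

```python
import torchaudio
from pyannote.audio.core.io import Audio

# backends available in this environment, e.g. ['ffmpeg', 'soundfile', 'sox']
print(torchaudio.list_audio_backends())

# pin a specific torchaudio backend...
audio = Audio(sample_rate=16000, mono="downmix", backend="soundfile")

# ...or leave it unset: 'soundfile' if available, otherwise the first listed backend
audio_default = Audio(sample_rate=16000, mono="downmix")

waveform, sample_rate = audio("audio.wav")
```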
14 changes: 9 additions & 5 deletions pyannote/audio/core/task.py
@@ -362,12 +362,13 @@ def prepare_data(self):

if self.has_validation:
files_iter = itertools.chain(
self.protocol.train(), self.protocol.development()
zip(itertools.repeat("train"), self.protocol.train()),
zip(itertools.repeat("development"), self.protocol.development()),
)
else:
files_iter = self.protocol.train()
files_iter = zip(itertools.repeat("train"), self.protocol.train())

for file_id, file in enumerate(files_iter):
for file_id, (subset, file) in enumerate(files_iter):
# gather metadata and update metadata_unique_values so that each metadatum
# (e.g. source database or label) is represented by an integer.
metadatum = dict()
Expand All @@ -378,7 +379,8 @@ def prepare_data(self):
metadatum["database"] = metadata_unique_values["database"].index(
file["database"]
)
metadatum["subset"] = Subsets.index(file["subset"])

metadatum["subset"] = Subsets.index(subset)

# keep track of label scope (file, database, or global)
metadatum["scope"] = Scopes.index(file["scope"])
@@ -593,7 +595,9 @@ def prepare_data(self):
prepared_data["metadata-labels"] = np.array(unique_labels, dtype=np.str_)
unique_labels.clear()

self.prepare_validation(prepared_data)
if self.has_validation:
self.prepare_validation(prepared_data)

self.post_prepare_data(prepared_data)

# save prepared data on the disk
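In isolation, the fix pairs each file with the subset it was iterated from, rather than reading the subset back from `file["subset"]` (which some meta-protocols report incorrectly). A hypothetical minimal illustration of the same pattern, assuming `protocol` is a pyannote.database protocol:

```python
import itertools

def iter_files_with_subset(protocol, has_validation: bool):
    """Yield (subset, file) pairs so the subset never has to be read back from the file."""
    if has_validation:
        return itertools.chain(
            zip(itertools.repeat("train"), protocol.train()),
            zip(itertools.repeat("development"), protocol.development()),
        )
    return zip(itertools.repeat("train"), protocol.train())

# for file_id, (subset, file) in enumerate(iter_files_with_subset(protocol, True)):
#     ...  # subset is always the iterated subset, even for meta-protocols
```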
(diffs for the remaining changed files not shown)
