Merge branch 'release/3.3.0'
hbredin committed Jun 14, 2024
2 parents 70a8507 + d260ba0 commit adaf770
Showing 16 changed files with 2,438 additions and 47 deletions.
33 changes: 31 additions & 2 deletions CHANGELOG.md
@@ -1,5 +1,33 @@
# Changelog

## Version 3.3.0 (2024-06-14)

### TL;DR

`pyannote.audio` does [speech separation](https://hf.co/pyannote/speech-separation-ami-1.0): multi-speaker audio in, one audio channel per speaker out!

```bash
pip install pyannote.audio[separation]==3.3.0
```
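
A minimal usage sketch (hedged: it assumes access to the gated `pyannote/speech-separation-ami-1.0` checkpoint, a Hugging Face access token placeholder, and a 16 kHz output sample rate; the model card is the authoritative reference):

```python
# sketch only: checkpoint name, token placeholder, and 16 kHz rate are assumptions
from pyannote.audio import Pipeline
import scipy.io.wavfile

pipeline = Pipeline.from_pretrained(
    "pyannote/speech-separation-ami-1.0",
    use_auth_token="YOUR_HF_TOKEN",  # placeholder
)

# multi-speaker audio in, one channel per speaker out
diarization, sources = pipeline("audio.wav")

for s, speaker in enumerate(diarization.labels()):
    scipy.io.wavfile.write(f"{speaker}.wav", 16000, sources.data[:, s])
```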

### New features

- feat(task): add `PixIT` joint speaker diarization and speech separation task (with [@joonaskalda](https://github.com/joonaskalda/))
- feat(model): add `ToTaToNet` joint speaker diarization and speech separation model (with [@joonaskalda](https://github.com/joonaskalda/))
- feat(pipeline): add `SpeechSeparation` pipeline (with [@joonaskalda](https://github.com/joonaskalda/))
- feat(io): add option to select torchaudio `backend`

### Fixes

- fix(task): fix wrong train/development split when training with (some) meta-protocols ([#1709](https://github.com/pyannote/pyannote-audio/issues/1709))
- fix(task): fix metadata preparation with missing validation subset ([@clement-pages](https://github.com/clement-pages/))

### Improvements

- improve(io): when available, default to using `soundfile` backend
- improve(pipeline): do not extract embeddings when `max_speakers` is set to 1
- improve(pipeline): optimize memory usage of most pipelines ([#1713](https://github.com/pyannote/pyannote-audio/pull/1713) by [@benniekiss](https://github.com/benniekiss/))

## Version 3.2.0 (2024-05-08)

### New features
@@ -18,6 +46,7 @@
- fix(task): fix estimation of training set size (with [@FrenchKrab](https://github.com/FrenchKrab))
- fix(hook): fix `torch.Tensor` support in `ArtifactHook`
- fix(doc): fix typo in `Powerset` docstring (with [@lukasstorck](https://github.com/lukasstorck))
- fix(doc): remove mention of unsupported `numpy.ndarray` waveform (with [@Purfview](https://github.com/Purfview))

### Improvements

@@ -26,12 +55,12 @@
- improve(io): switch to `torchaudio >= 2.2.0`
- improve(doc): update tutorials (with [@clement-pages](https://github.com/clement-pages/))

## Breaking changes
### Breaking changes

- BREAKING(model): get rid of `Model.example_output` in favor of `num_frames` method, `receptive_field` property, and `dimension` property
- BREAKING(task): custom tasks need to be updated (see "Add your own task" tutorial)

## Community contributions
### Community contributions

- community: add tutorial for offline use of `pyannote/speaker-diarization-3.1` (by [@simonottenhauskenbun](https://github.com/simonottenhauskenbun))

6 changes: 3 additions & 3 deletions README.md
@@ -1,5 +1,5 @@
Using `pyannote.audio` open-source toolkit in production?
Make the most of it thanks to our [consulting services](https://herve.niderb.fr/consulting.html).
Using `pyannote.audio` open-source toolkit in production?
Consider switching to [pyannoteAI](https://www.pyannote.ai) for better and faster options.

# `pyannote.audio` speaker diarization toolkit

@@ -79,7 +79,7 @@ for turn, _, speaker in diarization.itertracks(yield_label=True):
Out of the box, `pyannote.audio` speaker diarization [pipeline](https://hf.co/pyannote/speaker-diarization-3.1) v3.1 is expected to be much better (and faster) than v2.x.
Those numbers are diarization error rates (in %):

| Benchmark | [v2.1](https://hf.co/pyannote/speaker-diarization-2.1) | [v3.1](https://hf.co/pyannote/speaker-diarization-3.1) | [Premium](https://forms.office.com/e/GdqwVgkZ5C) |
| Benchmark | [v2.1](https://hf.co/pyannote/speaker-diarization-2.1) | [v3.1](https://hf.co/pyannote/speaker-diarization-3.1) | [pyannoteAI](https://www.pyannote.ai) |
| --------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------ | ------------------------------------------------------ | ------------------------------------------------ |
| [AISHELL-4](https://arxiv.org/abs/2104.03603) | 14.1 | 12.2 | 11.9 |
| [AliMeeting](https://www.openslr.org/119/) (channel 1) | 27.4 | 24.4 | 22.5 |
9 changes: 4 additions & 5 deletions pyannote/audio/core/inference.py
@@ -559,9 +559,6 @@ def aggregate(
step=frames.step,
)

masks = 1 - np.isnan(scores)
scores.data = np.nan_to_num(scores.data, copy=True, nan=0.0)

# Hamming window used for overlap-add aggregation
hamming_window = (
np.hamming(num_frames_per_chunk).reshape(-1, 1)
@@ -613,11 +610,13 @@ def aggregate(
)

# loop on the scores of sliding chunks
for (chunk, score), (_, mask) in zip(scores, masks):
for chunk, score in scores:
# chunk ~ Segment
# score ~ (num_frames_per_chunk, num_classes)-shaped np.ndarray
# mask ~ (num_frames_per_chunk, num_classes)-shaped np.ndarray

mask = 1 - np.isnan(score)
np.nan_to_num(score, copy=False, nan=0.0)

start_frame = frames.closest_frame(chunk.start + 0.5 * frames.duration)

aggregated_output[start_frame : start_frame + num_frames_per_chunk] += (
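For context, a simplified sketch (not the actual pyannote code) of why moving the NaN handling inside the loop helps: the mask is built one chunk at a time instead of materializing a masked copy of every chunk up front, so peak memory stays proportional to a single chunk.

```python
import numpy as np

def overlap_add(chunk_scores, start_frames, num_total_frames, num_classes, window):
    """Aggregate per-chunk scores with a Hamming window, ignoring NaN frames.

    window: (num_frames_per_chunk, 1) array, e.g. np.hamming(n).reshape(-1, 1)
    """
    aggregated = np.zeros((num_total_frames, num_classes))
    overlap = np.zeros((num_total_frames, num_classes))
    for score, start in zip(chunk_scores, start_frames):
        # per-chunk mask and in-place NaN replacement
        # (previously computed for all chunks at once)
        mask = 1 - np.isnan(score)
        np.nan_to_num(score, copy=False, nan=0.0)
        num_frames = score.shape[0]
        aggregated[start : start + num_frames] += score * mask * window
        overlap[start : start + num_frames] += mask * window
    # normalize by the accumulated (masked) window weights
    return aggregated / np.maximum(overlap, 1e-12)
```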
54 changes: 44 additions & 10 deletions pyannote/audio/core/io.py
@@ -48,21 +48,41 @@
- a "IOBase" instance with "read" and "seek" support: open("audio.wav", "rb")
- a "Mapping" with any of the above as "audio" key: {"audio": ...}
- a "Mapping" with both "waveform" and "sample_rate" key:
{"waveform": (channel, time) numpy.ndarray or torch.Tensor, "sample_rate": 44100}
{"waveform": (channel, time) torch.Tensor, "sample_rate": 44100}
For last two options, an additional "channel" key can be provided as a zero-indexed
integer to load a specific channel: {"audio": "stereo.wav", "channel": 0}
"""


def get_torchaudio_info(file: AudioFile):
def get_torchaudio_info(
file: AudioFile, backend: str = None
) -> torchaudio.AudioMetaData:
"""Protocol preprocessor used to cache output of torchaudio.info
This is useful to speed future random access to this file, e.g.
in dataloaders using Audio.crop a lot....
Parameters
----------
file : AudioFile
backend : str
torchaudio backend to use. Defaults to 'soundfile' if available,
or the first available backend.
Returns
-------
info : torchaudio.AudioMetaData
Audio file metadata
"""

info = torchaudio.info(file["audio"])
if not backend:
backends = (
torchaudio.list_audio_backends()
) # e.g ['ffmpeg', 'soundfile', 'sox']
backend = "soundfile" if "soundfile" in backends else backends[0]

info = torchaudio.info(file["audio"], backend=backend)

# rewind if needed
if isinstance(file["audio"], IOBase):
@@ -82,6 +102,9 @@ class Audio:
In case of multi-channel audio, convert to single-channel audio
using one of the following strategies: select one channel at
'random' or 'downmix' by averaging all channels.
backend : str
torchaudio backend to use. Defaults to 'soundfile' if available,
or the first available backend.
Usage
-----
@@ -126,7 +149,7 @@ def validate_file(file: AudioFile) -> Mapping:
-------
validated_file : Mapping
{"audio": str, "uri": str, ...}
{"waveform": array or tensor, "sample_rate": int, "uri": str, ...}
{"waveform": tensor, "sample_rate": int, "uri": str, ...}
{"audio": file, "uri": "stream"} if `file` is an IOBase instance
Raises
@@ -148,7 +171,7 @@ def validate_file(file: AudioFile) -> Mapping:
raise ValueError(AudioFileDocString)

if "waveform" in file:
waveform: Union[np.ndarray, Tensor] = file["waveform"]
waveform: Tensor = file["waveform"]
if len(waveform.shape) != 2 or waveform.shape[0] > waveform.shape[1]:
raise ValueError(
"'waveform' must be provided as a (channel, time) torch Tensor."
@@ -179,11 +202,19 @@ def validate_file(file: AudioFile) -> Mapping:

return file

def __init__(self, sample_rate=None, mono=None):
def __init__(self, sample_rate: int = None, mono=None, backend: str = None):
super().__init__()
self.sample_rate = sample_rate
self.mono = mono

if not backend:
backends = (
torchaudio.list_audio_backends()
) # e.g ['ffmpeg', 'soundfile', 'sox']
backend = "soundfile" if "soundfile" in backends else backends[0]

self.backend = backend

def downmix_and_resample(self, waveform: Tensor, sample_rate: int) -> Tensor:
"""Downmix and resample
@@ -244,7 +275,7 @@ def get_duration(self, file: AudioFile) -> float:
if "torchaudio.info" in file:
info = file["torchaudio.info"]
else:
info = get_torchaudio_info(file)
info = get_torchaudio_info(file, backend=self.backend)

frames = info.num_frames
sample_rate = info.sample_rate
@@ -291,7 +322,7 @@ def __call__(self, file: AudioFile) -> Tuple[Tensor, int]:
sample_rate = file["sample_rate"]

elif "audio" in file:
waveform, sample_rate = torchaudio.load(file["audio"])
waveform, sample_rate = torchaudio.load(file["audio"], backend=self.backend)

# rewind if needed
if isinstance(file["audio"], IOBase):
@@ -349,7 +380,7 @@ def crop(
sample_rate = info.sample_rate

else:
info = get_torchaudio_info(file)
info = get_torchaudio_info(file, backend=self.backend)
frames = info.num_frames
sample_rate = info.sample_rate

@@ -401,7 +432,10 @@ def crop(
else:
try:
data, _ = torchaudio.load(
file["audio"], frame_offset=start_frame, num_frames=num_frames
file["audio"],
frame_offset=start_frame,
num_frames=num_frames,
backend=self.backend,
)
# rewind if needed
if isinstance(file["audio"], IOBase):
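A hedged usage sketch of the new `backend` option (import path and constructor arguments as they appear in this diff; the fallback mirrors the logic above: `'soundfile'` when installed, otherwise the first backend reported by torchaudio):

```python
import torchaudio
from pyannote.audio.core.io import Audio

# backends available in this environment, e.g. ['ffmpeg', 'soundfile', 'sox']
print(torchaudio.list_audio_backends())

# pin a specific torchaudio backend...
audio = Audio(sample_rate=16000, mono="downmix", backend="soundfile")

# ...or leave it unset: 'soundfile' if available, otherwise the first listed backend
audio_default = Audio(sample_rate=16000, mono="downmix")

waveform, sample_rate = audio("audio.wav")
```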
14 changes: 9 additions & 5 deletions pyannote/audio/core/task.py
@@ -362,12 +362,13 @@ def prepare_data(self):

if self.has_validation:
files_iter = itertools.chain(
self.protocol.train(), self.protocol.development()
zip(itertools.repeat("train"), self.protocol.train()),
zip(itertools.repeat("development"), self.protocol.development()),
)
else:
files_iter = self.protocol.train()
files_iter = zip(itertools.repeat("train"), self.protocol.train())

for file_id, file in enumerate(files_iter):
for file_id, (subset, file) in enumerate(files_iter):
# gather metadata and update metadata_unique_values so that each metadatum
# (e.g. source database or label) is represented by an integer.
metadatum = dict()
Expand All @@ -378,7 +379,8 @@ def prepare_data(self):
metadatum["database"] = metadata_unique_values["database"].index(
file["database"]
)
metadatum["subset"] = Subsets.index(file["subset"])

metadatum["subset"] = Subsets.index(subset)

# keep track of label scope (file, database, or global)
metadatum["scope"] = Scopes.index(file["scope"])
@@ -593,7 +595,9 @@ def prepare_data(self):
prepared_data["metadata-labels"] = np.array(unique_labels, dtype=np.str_)
unique_labels.clear()

self.prepare_validation(prepared_data)
if self.has_validation:
self.prepare_validation(prepared_data)

self.post_prepare_data(prepared_data)

# save prepared data on the disk
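In isolation, the fix pairs each file with the subset it was iterated from, rather than reading the subset back from `file["subset"]` (which some meta-protocols report incorrectly). A hypothetical minimal illustration of the same pattern, assuming `protocol` is a pyannote.database protocol:

```python
import itertools

def iter_files_with_subset(protocol, has_validation: bool):
    """Yield (subset, file) pairs so the subset never has to be read back from the file."""
    if has_validation:
        return itertools.chain(
            zip(itertools.repeat("train"), protocol.train()),
            zip(itertools.repeat("development"), protocol.development()),
        )
    return zip(itertools.repeat("train"), protocol.train())

# for file_id, (subset, file) in enumerate(iter_files_with_subset(protocol, True)):
#     ...  # subset is always the iterated subset, even for meta-protocols
```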
(diffs for the remaining changed files not shown)
