
Add Myst children speech recipe #2997

Open

Usanter wants to merge 5 commits into speechbrain:develop from Usanter:Myst_children_speech

Conversation

@Usanter Usanter commented Nov 17, 2025

What does this PR do?

This pull request adds support for the MyST children's speech corpus (link) to SpeechBrain as a new, fully reproducible recipe. The main contributions include:

  • Data preparation pipeline for Myst children's speech, including automated data cleaning using zero-shot Whisper to identify and discard misaligned or low-quality transcripts.
  • Scripts for fine-tuning Whisper on the Myst children's speech dataset, either through standard fine-tuning (Encoder or Encoder+Decoder) or using parameter-efficient fine-tuning (LoRA).
  • Training utilities for a Transformer ASR model from scratch.
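The first bullet above describes filtering by comparing zero-shot Whisper hypotheses against the reference transcripts and discarding utterances whose WER is too high. As a rough illustration of that idea (a minimal sketch with hypothetical names — the recipe's actual filtering code, thresholds, and WER implementation may differ), the core logic is a word-level edit distance plus a threshold:

```python
# Hypothetical sketch of WER-based transcript filtering: transcribe each
# utterance with zero-shot Whisper (not shown), compare the hypothesis to
# the reference transcript, and drop utterances whose WER exceeds a
# threshold. WER here is a plain word-level Levenshtein distance.

def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,       # deletion
                           dp[i][j - 1] + 1,       # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)


def filter_utterances(triples, max_wer=0.5):
    """Keep IDs of (utt_id, reference, hypothesis) triples with WER <= max_wer."""
    return [utt_id for utt_id, ref, hyp in triples
            if word_error_rate(ref, hyp) <= max_wer]


utts = [
    ("u1", "the cat sat on the mat", "the cat sat on the mat"),  # WER 0.0
    ("u2", "we measured the plant growth", "we growth"),         # WER 0.6
]
print(filter_utterances(utts))  # → ['u1']
```

The threshold (`max_wer=0.5` here) is an assumption for illustration; a real pipeline would tune it per split.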
Before submitting
  • Did you read the contributor guideline?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you list all the breaking changes introduced by this pull request?
  • Does your code adhere to project-specific code style and conventions?

PR review

Reviewer checklist
  • [ ] Is this pull request ready for review? (if not, please submit in draft mode)
  • [ ] Check that all items from Before submitting are resolved
  • [ ] Make sure the title is self-explanatory and the description concisely explains the PR
  • [ ] Add labels and milestones (and optionally projects) to the PR so it can be classified
  • [ ] Confirm that the changes adhere to compatibility requirements (e.g., Python version, platform)
  • [ ] Review the self-review checklist to ensure the code is ready for review

@Usanter Usanter changed the title Myst children speech Add Myst children speech recipe Nov 17, 2025
@pplantinga pplantinga added the recipes Changes to recipes only (add/edit) label Nov 19, 2025
@Adel-Moumen Adel-Moumen self-requested a review November 20, 2025 10:31
@Adel-Moumen Adel-Moumen (Collaborator) left a comment

Hi!

Thanks a lot for this new recipe! Much appreciated. I left a few comments throughout the PR. The only major problem I can see is how you've done the data prep: it relies on a lot of external tools that could be replaced by SB or HF. And could you please fix the CI? Thanks!

Once this is done, could you please add your recipe in the tests/recipes/ folder? Thanks.

Adel

Comment on lines 29 to 62
```python
# Optional dependencies used by LibriSpeech prep
try:
    from speechbrain.dataio.dataio import (
        load_pkl,
        merge_csvs,
        read_audio_info,
        save_pkl,
    )
    from speechbrain.utils.data_utils import get_all_files
    from speechbrain.utils.logger import get_logger
    from speechbrain.utils.parallel import parallel_map
except Exception:  # graceful fallback if SpeechBrain is not present
    load_pkl = None
    merge_csvs = None
    read_audio_info = None
    save_pkl = None
    parallel_map = None

    def get_all_files(folder, match_and=None, match_or=None, exclude_or=None):
        out = []
        for root, _dirs, files in os.walk(folder):
            for fn in files:
                path = os.path.join(root, fn)
                if match_or and not any(path.endswith(m) for m in match_or):
                    continue
                if match_and and not all(m in path for m in match_and):
                    continue
                if exclude_or and any(ex in path for ex in exclude_or):
                    continue
                out.append(path)
        return out

    class _Logger:
        def info(self, *a, **k):
            print(*a)

        def warning(self, *a, **k):
            print(*a)

    def get_logger(name):
        return _Logger()
```
Collaborator

I don't understand this code. You should have the assumption that SpeechBrain is already installed on the user system and that get_all_files etc are all accessible functions.

Author

I added this to the script to be compatible with other toolkits. However, I acknowledge that it may not be entirely logical, since this is a SpeechBrain recipe. I will change that!

Comment on lines 69 to 72
```python
try:
    from whisper_normalizer.english import EnglishTextNormalizer  # type: ignore
except Exception:
    EnglishTextNormalizer = None
```
Collaborator
you can probably pass the normalizer directly to the main function, since the normalizer is accessible from Whisper's tokenizer object
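The suggestion above is essentially dependency injection: rather than importing `whisper_normalizer` inside the prep script, the prepare function accepts a normalizer callable (e.g., the one exposed by Whisper's tokenizer). A minimal sketch, with hypothetical names and a trivial lowercase/whitespace fallback standing in for the real normalizer:

```python
# Hypothetical sketch: the prep function takes a normalizer callable as an
# argument instead of importing a specific normalizer library itself.

def default_normalizer(text: str) -> str:
    # Trivial fallback: lowercase and collapse whitespace.
    return " ".join(text.lower().split())


def prepare_transcript(text: str, normalizer=None) -> str:
    """Normalize a transcript with the injected normalizer."""
    normalizer = normalizer or default_normalizer
    return normalizer(text)


print(prepare_transcript("  The CAT sat "))             # → 'the cat sat'
print(prepare_transcript("OK!", normalizer=str.lower))  # → 'ok!'
```

In the real recipe, the caller would pass the normalizer obtained from the Whisper tokenizer object instead of `str.lower`.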

Comment on lines 74 to 97
```python
# ASR + WER (optional)
try:
    from faster_whisper import WhisperModel  # type: ignore
    _HAS_FASTER = True
except Exception:
    _HAS_FASTER = False
    WhisperModel = None  # type: ignore

try:
    import whisper  # type: ignore
    _HAS_OPENAI = True
except Exception:
    _HAS_OPENAI = False

try:
    from jiwer import wer  # type: ignore
    _HAS_JIWER = True
except Exception:
    _HAS_JIWER = False

# librosa for duration (fallback if read_audio_info absent)
try:
    import librosa
except Exception:
    librosa = None
```

Collaborator
I am not a huge fan of using external libs when we already have an interface for Whisper (and for WER as well!).

Comment on lines 179 to 185
```python
if librosa is None:
    raise RuntimeError(
        "Neither speechbrain.read_audio_info nor librosa is available to compute duration."
    )
try:
    return float(librosa.get_duration(path=path))
except Exception:
    y, sr = librosa.load(path, sr=16000, mono=True)
    return float(len(y) / sr)
```
Collaborator
please do not use librosa; use our own audio loading functions instead

Comment on lines +49 to +52
2025-11-13 | large-v3 | Decoder | train_hf_whisper.yaml | No | 8.36% | [Save](https://cloud.inesc-id.pt/s/eknR4y73RHKSB7F) |
2025-11-13 | medium.en | Decoder | train_hf_whisper.yaml | No | 8.50% | [Save](https://cloud.inesc-id.pt/s/oJeyJCM7R2tGmPG) |
2025-11-13 | medium.en | Encoder + Decoder | train_hf_whisper.yaml | No | 8.75% | [Save](https://cloud.inesc-id.pt/s/px3KWAditRo7wHH) |
2025-11-13 | medium.en | LoRA (r=16) in Decoder | train_whisper_lora.yaml | No | 9.38% | [Save](https://cloud.inesc-id.pt/s/6YrRKPjNpKdMgoW) |
Collaborator
May I ask how competitive your results are? By the way, thanks for uploading the models. I will transfer them to Dropbox so that we can host them ourselves.

Author
SOTA for MyST is approximately 8-9% WER, and it varies significantly with the data preparation and filtering methods, so direct comparison is challenging. This was my motivation for this PR: to provide a standardised data preparation method that facilitates comparison among works.

```yaml
# A directory containing lm.ckpt and tokenizer.ckpt may also be specified
# instead, e.g. if you want to use your own LM / tokenizer.
# Here we use the LibriSpeech LM, as MyST is also English.
pretrained_lm_tokenizer_path: speechbrain/asr-transformer-transformerlm-librispeech
```
Collaborator
quick question: how different are MyST utterances vs LibriSpeech?

Author
MyST is spontaneous speech produced by American English children. The key differences between MyST and LibriSpeech can be summarised as follows:

  • Speaker demographics and voice characteristics (acoustic variations)
  • Task type: MyST is spontaneous speech, while LibriSpeech is read speech
  • Utterance length: overall, MyST utterances are shorter than LibriSpeech utterances.

I agree that training a children’s LM would be more beneficial than using Librispeech.

Comment on lines 2 to 15
"""
Myst data preparation (SpeechBrain-style) — silence detection removed.

This mirrors the API of `librispeech_prepare.py` while adding zero-shot
WER filtering across all splits (train/valid/test).

Outputs CSVs with columns:
ID,duration,wav,spk_id,wrd

Expected layout per split directory:
<data_folder>/<split>/**/<audio>.(wav|flac|mp3|m4a|ogg)
with required sidecar transcripts: <audio>.trn

Authors: Thomas Rolland 2025
Collaborator
I think the header could be improved in terms of readability

```python
)
test_data = test_data.filtered_sorted(sort_key="duration")

datasets = [train_data, valid_data, test_data]  # + [i for k, i in test_datasets.items()]
```
Collaborator
remove the comment pls

@Usanter Usanter (Author) commented Dec 16, 2025

Hi @Adel-Moumen, I’ve updated most of the files based on your review. I’m not entirely sure whether the two files (the filtered version and the unfiltered one) are handled in the way you intended. Please let me know if you notice anything incorrect or if further adjustments are needed.

@Usanter Usanter requested a review from Adel-Moumen January 14, 2026 17:58
@Adel-Moumen (Collaborator)
Hi @Usanter, I will get back to you during the weekend about this PR!

@Adel-Moumen Adel-Moumen (Collaborator) left a comment

Hey! Thanks a lot for this wonderful PR.

Do you think it would be possible to merge myst_prepare_no_filtering with the filtering version? I suppose this could be just an "if" condition, i.e., if the input arg to the prepare function enables filtering, then run the ASR pass and filter out samples. In SpeechBrain, we try to avoid redundancy when it does not cost much in terms of understanding/coherency, and I think in this case the two can be merged. This also means you could merge the YAMLs together and keep a single flag, e.g. wer_filtering, turned on or off.
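The merge the reviewer proposes boils down to one prepare function whose flag gates the expensive ASR-and-filter pass. A minimal sketch of that control flow (hypothetical names; the real recipe's signature, manifest format, and WER computation would differ, and a stub stands in for the Whisper pass):

```python
# Hypothetical sketch of a single prepare function with a wer_filtering
# flag, instead of two separate prep scripts.

def transcribe_stub(utt):
    # Placeholder for the zero-shot Whisper pass; here it just echoes
    # the reference so the example is self-contained.
    return utt["wrd"]


def prepare_myst(utterances, wer_filtering=False, max_wer=0.5):
    """Return the list of utterance IDs kept for the manifest."""
    if not wer_filtering:
        # No-filtering path: keep everything, skip the ASR pass entirely.
        return [u["ID"] for u in utterances]
    kept = []
    for u in utterances:
        hyp = transcribe_stub(u)
        # Crude WER proxy: fraction of reference words missing from hyp.
        ref = u["wrd"].split()
        missing = sum(1 for w in ref if w not in hyp.split())
        if missing / max(len(ref), 1) <= max_wer:
            kept.append(u["ID"])
    return kept


data = [{"ID": "u1", "wrd": "hello world"}, {"ID": "u2", "wrd": "bye"}]
print(prepare_myst(data))                      # → ['u1', 'u2']
print(prepare_myst(data, wer_filtering=True))  # → ['u1', 'u2']
```

With this shape, the two YAMLs collapse into one that exposes a single `wer_filtering` boolean, as suggested.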


Labels

recipes Changes to recipes only (add/edit)


3 participants