Add Myst children's speech recipe #2997
Updated WER for Transformer model in README.
Hi!
Thanks a lot for this new recipe! Much appreciated. I left a few comments throughout the PR. I think the only major problem I can see is in the data preparation: it relies on a lot of external tools that could be replaced by SpeechBrain or HF. And could you please fix the CI? Thanks!
Once this is done, could you please add your recipe to the tests/recipes/ folder? Thanks.
Adel
recipes/Myst/myst_prepare.py
Outdated
```python
# Optional dependencies used by LibriSpeech prep
try:
    from speechbrain.dataio.dataio import (
        load_pkl,
        merge_csvs,
        read_audio_info,
        save_pkl,
    )
    from speechbrain.utils.data_utils import get_all_files
    from speechbrain.utils.logger import get_logger
    from speechbrain.utils.parallel import parallel_map
except Exception:  # graceful fallback if SpeechBrain is not present
    load_pkl = None
    merge_csvs = None
    read_audio_info = None
    save_pkl = None
    parallel_map = None

    def get_all_files(folder, match_and=None, match_or=None, exclude_or=None):
        out = []
        for root, _dirs, files in os.walk(folder):
            for fn in files:
                path = os.path.join(root, fn)
                if match_or and not any(path.endswith(m) for m in match_or):
                    continue
                if match_and and not all(m in path for m in match_and):
                    continue
                if exclude_or and any(ex in path for ex in exclude_or):
                    continue
                out.append(path)
        return out

    class _Logger:
        def info(self, *a, **k):
            print(*a)

        def warning(self, *a, **k):
            print(*a)

    def get_logger(name):
        return _Logger()
```
I don't understand this code. You should have the assumption that SpeechBrain is already installed on the user system and that get_all_files etc are all accessible functions.
I added this to the script to be compatible with other toolkits. However, I acknowledge that it may not be entirely logical, as it is a SpeechBrain recipe. I will change that!
recipes/Myst/myst_prepare.py
Outdated
```python
try:
    from whisper_normalizer.english import EnglishTextNormalizer  # type: ignore
except Exception:
    EnglishTextNormalizer = None
```
You can probably pass the normalizer directly into the main prepare function, as the normalizer is accessible within the tokenizer object of Whisper.
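The suggestion above could look roughly like this; `normalize_transcripts` and its `normalizer` parameter are illustrative names, not the actual recipe API:

```python
from typing import Callable, Iterable, List, Optional

def normalize_transcripts(
    lines: Iterable[str],
    normalizer: Optional[Callable[[str], str]] = None,
) -> List[str]:
    """Apply a caller-supplied text normalizer to each transcript line.

    Falls back to a no-op when no normalizer is provided, so the prepare
    script carries no hard dependency on any normalization package.
    """
    normalize = normalizer if normalizer is not None else (lambda s: s)
    return [normalize(line) for line in lines]
```

On the caller side, the normalizer would come from the Whisper tokenizer object rather than from the standalone `whisper_normalizer` package; the exact attribute exposing it is left as an assumption here.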
recipes/Myst/myst_prepare.py
Outdated
```python
# ASR + WER (optional)
try:
    from faster_whisper import WhisperModel  # type: ignore
    _HAS_FASTER = True
except Exception:
    _HAS_FASTER = False
    WhisperModel = None  # type: ignore

try:
    import whisper  # type: ignore
    _HAS_OPENAI = True
except Exception:
    _HAS_OPENAI = False

try:
    from jiwer import wer  # type: ignore
    _HAS_JIWER = True
except Exception:
    _HAS_JIWER = False

# librosa for duration (fallback if read_audio_info absent)
try:
    import librosa
except Exception:
    librosa = None
```
I am not a huge fan of using external libs while we do have an interface for Whisper (and the WER as well!).
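For reference, the WER used for the zero-shot filtering is the word-level Levenshtein distance normalized by reference length. A minimal stdlib-only sketch of the metric itself (an illustration, not SpeechBrain's or jiwer's implementation):

```python
def word_error_rate(ref: str, hyp: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    r, h = ref.split(), hyp.split()
    # d[i][j] = edit distance between the first i ref words and first j hyp words
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i  # i deletions
    for j in range(len(h) + 1):
        d[0][j] = j  # j insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / max(len(r), 1)
```

Swapping jiwer for the toolkit's own WER utility should therefore change nothing numerically, only the dependency footprint.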
recipes/Myst/myst_prepare.py
Outdated
```python
if librosa is None:
    raise RuntimeError(
        "Neither speechbrain.read_audio_info nor librosa is available "
        "to compute duration."
    )
try:
    return float(librosa.get_duration(path=path))
except Exception:
    y, sr = librosa.load(path, sr=16000, mono=True)
    return float(len(y) / sr)
```
Please do not use librosa; use our own audio loading functions instead.
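A sketch of a librosa-free duration helper, assuming the toolkit's metadata reader (e.g. SpeechBrain's `read_audio_info`) returns an object exposing `num_frames` and `sample_rate`, as torchaudio-style metadata does; those attribute names are assumptions here:

```python
from types import SimpleNamespace

def duration_from_info(info) -> float:
    """Seconds of audio computed from header metadata only,
    with no waveform decode (and hence no librosa) needed."""
    return float(info.num_frames) / float(info.sample_rate)

# Stand-in for the object a metadata reader would return:
meta = SimpleNamespace(num_frames=32000, sample_rate=16000)
```

In the recipe this would replace both the `librosa.get_duration` call and the full-decode fallback.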
| 2025-11-13 | large-v3 | Decoder | train_hf_whisper.yaml | No | 8.36% | [Save](https://cloud.inesc-id.pt/s/eknR4y73RHKSB7F) |
| 2025-11-13 | medium.en | Decoder | train_hf_whisper.yaml | No | 8.50% | [Save](https://cloud.inesc-id.pt/s/oJeyJCM7R2tGmPG) |
| 2025-11-13 | medium.en | Encoder + Decoder | train_hf_whisper.yaml | No | 8.75% | [Save](https://cloud.inesc-id.pt/s/px3KWAditRo7wHH) |
| 2025-11-13 | medium.en | LoRA (r=16) in Decoder | train_whisper_lora.yaml | No | 9.38% | [Save](https://cloud.inesc-id.pt/s/6YrRKPjNpKdMgoW) |
May I ask how competitive your results are? By the way, thanks for uploading the models. I will transfer them to Dropbox so that we can host them ourselves.
SOTA for MyST is approximately 8-9% WER, but it varies significantly depending on the data preparation and filtering methods, so direct comparison is challenging. This was my motivation for this PR: to provide a standardised data preparation method that facilitates comparison among works.
```yaml
# directory containing the lm.ckpt and tokenizer.ckpt may also be specified
# instead, e.g. if you want to use your own LM / tokenizer.
# Here we used the LibriSpeech LM as MyST is also English
pretrained_lm_tokenizer_path: speechbrain/asr-transformer-transformerlm-librispeech
```
Quick question: how different are MyST utterances from LibriSpeech ones?
MyST is spontaneous speech produced by American English children. The key differences between MyST and LibriSpeech can be summarised as follows:
- Speaker demographics and voice characteristics (acoustic variation)
- Task type: MyST is spontaneous speech, while LibriSpeech is read speech
- Utterance length: overall, MyST utterances are shorter than LibriSpeech utterances

I agree that training a children's LM would be more beneficial than using the LibriSpeech one.
| """ | ||
| Myst data preparation (SpeechBrain-style) — silence detection removed. | ||
|
|
||
| This mirrors the API of `librispeech_prepare.py` while adding zero-shot | ||
| WER filtering across all splits (train/valid/test). | ||
|
|
||
| Outputs CSVs with columns: | ||
| ID,duration,wav,spk_id,wrd | ||
|
|
||
| Expected layout per split directory: | ||
| <data_folder>/<split>/**/<audio>.(wav|flac|mp3|m4a|ogg) | ||
| with required sidecar transcripts: <audio>.trn | ||
|
|
||
| Authors: Thomas Rolland 2025 |
I think the header could be improved in terms of readability
recipes/Myst/train_with_whisper.py
Outdated
```python
)
test_data = test_data.filtered_sorted(sort_key="duration")

datasets = [train_data, valid_data, test_data]  # + [i for k, i in test_datasets.items()]
```
remove the comment pls
Hi @Adel-Moumen, I’ve updated most of the files based on your review. I’m not entirely sure whether the two files (the filtered version and the unfiltered one) are handled in the way you intended. Please let me know if you notice anything incorrect or if further adjustments are needed.

Hi @Usanter, I will get back to you during the weekend about this PR!
Hey! Thanks a lot for this wonderful PR.
Do you think it would be possible to merge myst_prepare_no_filtering with the filtering one? I suppose this could be just an "if" condition, i.e. if the input arg to the prepare function enables filtering, then run the ASR pass and filter out samples. In SpeechBrain, we try to avoid redundancy when it does not cost much in terms of understanding/coherency, and I think in this case both could be merged. This also means you could merge the YAMLs together and keep a single flag, e.g. wer_filtering turned on or off.
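A rough sketch of the merged prepare path described above; `prepare_myst`, `wer_filtering`, `wer_threshold`, and the injected `wer_fn` are illustrative names, not the recipe's actual signature:

```python
from typing import Callable, Dict, List, Optional

def prepare_myst(
    samples: List[Dict],
    wer_filtering: bool = False,
    wer_threshold: float = 0.5,
    wer_fn: Optional[Callable[[str, str], float]] = None,
) -> List[Dict]:
    """Single prepare path: when wer_filtering is on, drop samples whose
    zero-shot WER (reference transcript vs. ASR hypothesis) exceeds the
    threshold; otherwise keep everything, with no second script needed."""
    if not wer_filtering:
        return list(samples)
    return [s for s in samples if wer_fn(s["wrd"], s["hyp"]) <= wer_threshold]
```

The two YAMLs would then collapse into one, differing only in the value of the `wer_filtering` flag.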
What does this PR do?
This pull request adds support for the MyST children's speech corpus (link) to SpeechBrain as a new, fully reproducible recipe. The main contributions include:
Before submitting
PR review
Reviewer checklist