Add Myst children's speech recipe #2997
Updated WER for Transformer model in README.
Hi!
Thanks a lot for this new recipe! Much appreciated. I left a few comments throughout the PR. I think the only major problem I can see is in the data preparation: it relies on a lot of external tools that could be replaced by SpeechBrain or HF. And could you please fix the CI? Thanks!
Once this is done, could you please add your recipe to the tests/recipes/ folder? Thanks.
Adel
recipes/Myst/myst_prepare.py
Outdated
```python
# Optional dependencies used by LibriSpeech prep
try:
    from speechbrain.dataio.dataio import (
        load_pkl,
        merge_csvs,
        read_audio_info,
        save_pkl,
    )
    from speechbrain.utils.data_utils import get_all_files
    from speechbrain.utils.logger import get_logger
    from speechbrain.utils.parallel import parallel_map
except Exception:  # graceful fallback if SpeechBrain is not present
    load_pkl = None
    merge_csvs = None
    read_audio_info = None
    save_pkl = None
    parallel_map = None

    def get_all_files(folder, match_and=None, match_or=None, exclude_or=None):
        out = []
        for root, _dirs, files in os.walk(folder):
            for fn in files:
                path = os.path.join(root, fn)
                if match_or and not any(path.endswith(m) for m in match_or):
                    continue
                if match_and and not all(m in path for m in match_and):
                    continue
                if exclude_or and any(ex in path for ex in exclude_or):
                    continue
                out.append(path)
        return out

    class _Logger:
        def info(self, *a, **k):
            print(*a)

        def warning(self, *a, **k):
            print(*a)

    def get_logger(name):
        return _Logger()
```
I don't understand this code. You should have the assumption that SpeechBrain is already installed on the user system and that get_all_files etc are all accessible functions.
I added this to the script to be compatible with other toolkits. However, I acknowledge that it may not be entirely logical, as it is a SpeechBrain recipe. I will change that!
recipes/Myst/myst_prepare.py
Outdated
```python
try:
    from whisper_normalizer.english import EnglishTextNormalizer  # type: ignore
except Exception:
    EnglishTextNormalizer = None
```
You can probably pass the normalizer directly into the main prepare function, as the normalizer is accessible within the tokenizer object of Whisper.
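The suggestion above could look roughly like this; `normalize_transcripts` and its `normalizer` parameter are illustrative names, not the actual recipe API:

```python
from typing import Callable, Iterable, List, Optional

def normalize_transcripts(
    lines: Iterable[str],
    normalizer: Optional[Callable[[str], str]] = None,
) -> List[str]:
    """Apply a caller-supplied text normalizer to each transcript line.

    Falls back to a no-op when no normalizer is provided, so the prepare
    script carries no hard dependency on any normalization package.
    """
    normalize = normalizer if normalizer is not None else (lambda s: s)
    return [normalize(line) for line in lines]
```

On the caller side, the normalizer would come from the Whisper tokenizer object rather than from the standalone `whisper_normalizer` package; the exact attribute exposing it is left as an assumption here.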
recipes/Myst/myst_prepare.py
Outdated
```python
# ASR + WER (optional)
try:
    from faster_whisper import WhisperModel  # type: ignore
    _HAS_FASTER = True
except Exception:
    _HAS_FASTER = False
    WhisperModel = None  # type: ignore

try:
    import whisper  # type: ignore
    _HAS_OPENAI = True
except Exception:
    _HAS_OPENAI = False

try:
    from jiwer import wer  # type: ignore
    _HAS_JIWER = True
except Exception:
    _HAS_JIWER = False

# librosa for duration (fallback if read_audio_info absent)
try:
    import librosa
except Exception:
    librosa = None
```
I am not a huge fan of using external libs while we do have an interface for Whisper (and the WER as well!).
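For reference, the WER used for the zero-shot filtering is the word-level Levenshtein distance normalized by reference length. A minimal stdlib-only sketch of the metric itself (an illustration, not SpeechBrain's or jiwer's implementation):

```python
def word_error_rate(ref: str, hyp: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    r, h = ref.split(), hyp.split()
    # d[i][j] = edit distance between the first i ref words and first j hyp words
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i  # i deletions
    for j in range(len(h) + 1):
        d[0][j] = j  # j insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / max(len(r), 1)
```

Swapping jiwer for the toolkit's own WER utility should therefore change nothing numerically, only the dependency footprint.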
recipes/Myst/myst_prepare.py
Outdated
```python
if librosa is None:
    raise RuntimeError(
        "Neither speechbrain.read_audio_info nor librosa is available "
        "to compute duration."
    )
try:
    return float(librosa.get_duration(path=path))
except Exception:
    y, sr = librosa.load(path, sr=16000, mono=True)
    return float(len(y) / sr)
```
Please do not use librosa; use our own audio loading functions instead.
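A sketch of a librosa-free duration helper, assuming the toolkit's metadata reader (e.g. SpeechBrain's `read_audio_info`) returns an object exposing `num_frames` and `sample_rate`, as torchaudio-style metadata does; those attribute names are assumptions here:

```python
from types import SimpleNamespace

def duration_from_info(info) -> float:
    """Seconds of audio computed from header metadata only,
    with no waveform decode (and hence no librosa) needed."""
    return float(info.num_frames) / float(info.sample_rate)

# Stand-in for the object a metadata reader would return:
meta = SimpleNamespace(num_frames=32000, sample_rate=16000)
```

In the recipe this would replace both the `librosa.get_duration` call and the full-decode fallback.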
| 2025-11-13 | large-v3 | Decoder | train_hf_whisper.yaml | No | 8.36% | [Save](https://cloud.inesc-id.pt/s/eknR4y73RHKSB7F) |
| 2025-11-13 | medium.en | Decoder | train_hf_whisper.yaml | No | 8.50% | [Save](https://cloud.inesc-id.pt/s/oJeyJCM7R2tGmPG) |
| 2025-11-13 | medium.en | Encoder + Decoder | train_hf_whisper.yaml | No | 8.75% | [Save](https://cloud.inesc-id.pt/s/px3KWAditRo7wHH) |
| 2025-11-13 | medium.en | LoRA (r=16) in Decoder | train_whisper_lora.yaml | No | 9.38% | [Save](https://cloud.inesc-id.pt/s/6YrRKPjNpKdMgoW) |
May I ask how competitive your results are? By the way, thanks for uploading the models. I will transfer them to Dropbox so that we can host them ourselves.
SOTA for MyST is approximately 8-9% WER, but it varies significantly depending on the data preparation and filtering methods, so direct comparison is challenging. This was my motivation for this PR: to provide a standardised data preparation method that facilitates comparison among works.
```yaml
# directory containing the lm.ckpt and tokenizer.ckpt may also be specified
# instead, e.g. if you want to use your own LM / tokenizer.
# Here we used the LibriSpeech LM as MyST is also English
pretrained_lm_tokenizer_path: speechbrain/asr-transformer-transformerlm-librispeech
```
Quick question: how different are MyST utterances from LibriSpeech ones?
MyST is spontaneous speech produced by American English children. The key differences between MyST and LibriSpeech can be summarised as follows:
- Speaker demographics and voice characteristics (acoustic variation)
- Task type: MyST is spontaneous speech, while LibriSpeech is read speech
- Utterance length: overall, MyST utterances are shorter than LibriSpeech utterances

I agree that training a children's LM would be more beneficial than using the LibriSpeech one.
| """ | ||
| Myst data preparation (SpeechBrain-style) — silence detection removed. | ||
|
|
||
| This mirrors the API of `librispeech_prepare.py` while adding zero-shot | ||
| WER filtering across all splits (train/valid/test). | ||
|
|
||
| Outputs CSVs with columns: | ||
| ID,duration,wav,spk_id,wrd | ||
|
|
||
| Expected layout per split directory: | ||
| <data_folder>/<split>/**/<audio>.(wav|flac|mp3|m4a|ogg) | ||
| with required sidecar transcripts: <audio>.trn | ||
|
|
||
| Authors: Thomas Rolland 2025 |
I think the header could be improved in terms of readability
recipes/Myst/train_with_whisper.py
Outdated
```python
)
test_data = test_data.filtered_sorted(sort_key="duration")

datasets = [train_data, valid_data, test_data]  # + [i for k, i in test_datasets.items()]
```
remove the comment pls
Hi @Adel-Moumen, I’ve updated most of the files based on your review. I’m not entirely sure whether the two files (the filtered version and the unfiltered one) are handled in the way you intended. Please let me know if you notice anything incorrect or if further adjustments are needed.

Hi @Usanter, I will get back to you during the weekend about this PR!
Hey! Thanks a lot for this wonderful PR.
Do you think it would be possible to merge myst_prepare_no_filtering with the filtering one? I suppose this could be just an "if" condition, i.e. if the input arg to the prepare function enables filtering, then run the ASR pass and filter out samples. In SpeechBrain, we try to avoid redundancy when it does not cost much in terms of understanding/coherency, and I think in this case both could be merged. This also means you could merge the YAMLs together and keep a single flag, e.g. wer_filtering turned on or off.
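A rough sketch of the merged prepare path described above; `prepare_myst`, `wer_filtering`, `wer_threshold`, and the injected `wer_fn` are illustrative names, not the recipe's actual signature:

```python
from typing import Callable, Dict, List, Optional

def prepare_myst(
    samples: List[Dict],
    wer_filtering: bool = False,
    wer_threshold: float = 0.5,
    wer_fn: Optional[Callable[[str, str], float]] = None,
) -> List[Dict]:
    """Single prepare path: when wer_filtering is on, drop samples whose
    zero-shot WER (reference transcript vs. ASR hypothesis) exceeds the
    threshold; otherwise keep everything, with no second script needed."""
    if not wer_filtering:
        return list(samples)
    return [s for s in samples if wer_fn(s["wrd"], s["hyp"]) <= wer_threshold]
```

The two YAMLs would then collapse into one, differing only in the value of the `wer_filtering` flag.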
What does this PR do?
This pull request adds support for the MyST children's speech corpus (link) to SpeechBrain as a new, fully reproducible recipe. The main contributions include:
Before submitting
PR review
Reviewer checklist