QuartzNet 15x5 CTC Bambara

stt-bm-quartznet15x5 is a fine-tuned version of NVIDIA’s stt_fr_quartznet15x5 optimized for Bambara ASR. The model does not produce punctuation or capitalization; it uses a character-level encoding scheme and transcribes text using the standard character set provided in the training set of the bam-asr-all dataset.

The model was fine-tuned using NVIDIA NeMo and trained with CTC (Connectionist Temporal Classification) loss.

🚨 Important Note

This model, along with its associated resources, is part of an ongoing research effort; improvements and refinements are expected in future versions. Users should be aware that:

  • The model may not generalize well across all speaking conditions and dialects.
  • Community feedback is welcome, and contributions are encouraged to refine the model further.

NVIDIA NeMo: Training

To fine-tune or use the model, install NVIDIA NeMo. We recommend installing it after setting up the latest PyTorch version.

pip install nemo_toolkit['asr']

How to Use This Model

Load Model with NeMo

import nemo.collections.asr as nemo_asr
asr_model = nemo_asr.models.EncDecCTCModel.from_pretrained(model_name="RobotsMali/stt-bm-quartznet15x5")

Transcribe Audio

# Assuming you have a test audio file named sample_audio.wav
asr_model.transcribe(['sample_audio.wav'])

Input

This model accepts 16 kHz mono-channel audio (wav files) as input.
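If your recordings are not already in this format, you can convert them before transcription. Below is a minimal sketch using librosa and soundfile (illustrative tools, not dependencies of this model; the file names are placeholders).

# Convert an arbitrary recording to 16 kHz mono WAV before transcription
# (librosa and soundfile are illustrative choices: pip install librosa soundfile)
import librosa
import soundfile as sf

audio, sr = librosa.load("input_recording.mp3", sr=16000, mono=True)  # resample and downmix to mono
sf.write("sample_audio.wav", audio, sr)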

Output

This model provides transcribed speech as a string for a given speech sample.
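For example, the text can be read back from the list returned by transcribe(). Depending on the NeMo version, the elements are plain strings or Hypothesis objects with a .text attribute; the sketch below assumes strings.

results = asr_model.transcribe(['sample_audio.wav'])
print(results[0])  # the Bambara transcription as a string (use results[0].text on newer NeMo versions)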

Model Architecture

QuartzNet is a convolutional architecture, which consists of 1D time-channel separable convolutions optimized for speech recognition. More information on QuartzNet can be found here: QuartzNet Model.
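As an illustration of that building block, here is a minimal PyTorch sketch of a 1D time-channel separable convolution (a depthwise convolution over time followed by a pointwise 1×1 convolution across channels); it is not the NeMo implementation, just the idea.

import torch
import torch.nn as nn

class TimeChannelSeparableConv1d(nn.Module):
    """Illustrative 1D time-channel separable convolution block (not the NeMo code)."""
    def __init__(self, in_channels, out_channels, kernel_size):
        super().__init__()
        # Depthwise convolution: each channel is convolved over time independently
        self.depthwise = nn.Conv1d(in_channels, in_channels, kernel_size,
                                   padding=kernel_size // 2, groups=in_channels)
        # Pointwise convolution: 1x1 convolution mixes information across channels
        self.pointwise = nn.Conv1d(in_channels, out_channels, kernel_size=1)
        self.bn = nn.BatchNorm1d(out_channels)
        self.relu = nn.ReLU()

    def forward(self, x):  # x: (batch, channels, time)
        return self.relu(self.bn(self.pointwise(self.depthwise(x))))

# Example: 64 mel-feature channels in, 256 channels out, kernel size 33
block = TimeChannelSeparableConv1d(64, 256, kernel_size=33)
features = torch.randn(1, 64, 400)  # (batch, features, time frames)
out = block(features)               # -> (1, 256, 400)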

Training

The NeMo toolkit was used to fine-tune this model for 25,939 steps starting from the stt_fr_quartznet15x5 checkpoint. The model was trained with this base config. The full training configurations, scripts, and experiment logs are available here:

🔗 Bambara-ASR Experiments
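As a rough sketch of how such a fine-tuning run can be set up with NeMo (the exact configuration lives in the linked repository; the vocabulary, manifest paths, and trainer settings below are placeholders):

import pytorch_lightning as pl
from omegaconf import OmegaConf
import nemo.collections.asr as nemo_asr

# Start from NVIDIA's French QuartzNet checkpoint
asr_model = nemo_asr.models.EncDecCTCModel.from_pretrained(model_name="stt_fr_quartznet15x5")

# Replace the decoder vocabulary with the Bambara character set
# (placeholder list; the released model uses a 45-character vocabulary)
asr_model.change_vocabulary(new_vocabulary=[" ", "a", "b", "c", "d", "e"])

# Point the model at NeMo-style JSON manifests (placeholder paths)
train_ds = OmegaConf.create({
    "manifest_filepath": "train_manifest.json",
    "sample_rate": 16000,
    "labels": list(asr_model.decoder.vocabulary),
    "batch_size": 32,
    "shuffle": True,
})
asr_model.setup_training_data(train_data_config=train_ds)

valid_ds = OmegaConf.create({
    "manifest_filepath": "valid_manifest.json",
    "sample_rate": 16000,
    "labels": list(asr_model.decoder.vocabulary),
    "batch_size": 32,
    "shuffle": False,
})
asr_model.setup_validation_data(val_data_config=valid_ds)

# Fine-tune with a PyTorch Lightning trainer (settings are illustrative)
trainer = pl.Trainer(devices=1, accelerator="gpu", max_steps=25939)
trainer.fit(asr_model)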

Dataset

This model was fine-tuned on the bam-asr-all dataset, which consists of 37 hours of transcribed Bambara speech data. The dataset is primarily derived from the Jeli-ASR dataset (~87%).
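When preparing such data for NeMo, each utterance is described by one line of a JSON manifest; the paths, durations, and transcripts below are made-up examples:

{"audio_filepath": "audios/utt_0001.wav", "duration": 3.2, "text": "i ni ce"}
{"audio_filepath": "audios/utt_0002.wav", "duration": 5.7, "text": "aw ni baara"}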

Performance

The performance of Automatic Speech Recognition models is measured using Word Error Rate (WER%).

| Version | Tokenizer | Vocabulary Size | WER (%) on bam-asr-all (test set) |
|---------|----------------|-----------------|-----------------------------------|
| V2 | Character-wise | 45 | 46.5 |

These are greedy WER numbers without external LM.
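As a quick sanity check, WER can be reproduced on your own samples with NeMo's word_error_rate helper (a minimal sketch; the audio file and reference transcript are placeholders, and the helper's module path may differ in recent NeMo releases):

from nemo.collections.asr.metrics.wer import word_error_rate

# Placeholder audio files and reference transcripts
audio_files = ["sample_audio.wav"]
references = ["i ni ce"]

# transcribe() returns plain strings on older NeMo versions; on newer ones you may
# need to take .text from each returned Hypothesis object
hypotheses = asr_model.transcribe(audio_files)
wer = word_error_rate(hypotheses=hypotheses, references=references)
print(f"WER: {wer * 100:.1f}%")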

License

This model is released under the CC-BY-4.0 license. By using this model, you agree to the terms of the license.


More details are available in the Experimental Technical Report: 📄 Draft Technical Report - Weights & Biases.

Feel free to open a discussion on Hugging Face or file an issue on GitHub if you have contributions or feedback.

