QuartzNet 15x5 CTC Bambara
stt-bm-quartznet15x5 is a fine-tuned version of NVIDIA's stt_fr_quartznet15x5, optimized for Bambara ASR. The model does not produce punctuation or capitalization; it uses a character-level encoding scheme and transcribes text in the standard character set provided in the training set of the bam-asr-all dataset.
The model was fine-tuned with NVIDIA NeMo and trained with CTC (Connectionist Temporal Classification) loss.
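For intuition, here is a minimal, self-contained sketch of how a CTC loss is computed over character logits in PyTorch. The shapes and the 45-symbol vocabulary are illustrative only; this is not the actual NeMo training code:

```python
import torch
import torch.nn as nn

# Hypothetical shapes: 1 utterance, 50 encoder time steps, 45-character
# vocabulary (index 0 reserved for the CTC blank token).
log_probs = torch.randn(50, 1, 45).log_softmax(dim=-1)  # (T, N, C)
targets = torch.randint(1, 45, (1, 20))                 # character indices, no blanks
input_lengths = torch.tensor([50])
target_lengths = torch.tensor([20])

ctc = nn.CTCLoss(blank=0)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
print(loss.item())
```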
🚨 Important Note
This model, along with its associated resources, is part of an ongoing research effort; improvements and refinements are expected in future versions. Users should be aware that:
- The model may not generalize very well across all speaking conditions and dialects.
- Community feedback is welcome, and contributions are encouraged to refine the model further.
NVIDIA NeMo: Training
To fine-tune or use the model, install NVIDIA NeMo. We recommend installing it after setting up the latest PyTorch version.
```bash
pip install nemo_toolkit['asr']
```
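To confirm the installation, you can check that NeMo and its ASR collection import cleanly:

```python
import nemo
import nemo.collections.asr as nemo_asr

print(nemo.__version__)  # any recent release should work
```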
How to Use This Model
Load Model with NeMo
```python
import nemo.collections.asr as nemo_asr

asr_model = nemo_asr.models.EncDecCTCModel.from_pretrained(
    model_name="RobotsMali/stt-bm-quartznet15x5"
)
```
Transcribe Audio
```python
# Assuming you have a test audio file named sample_audio.wav
asr_model.transcribe(['sample_audio.wav'])
```
Input
This model accepts 16 kHz mono-channel audio (wav files) as input.
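If your recordings are in a different format or sample rate, one way to convert them is with the third-party librosa and soundfile packages (not part of this release; filenames are hypothetical):

```python
import librosa
import soundfile as sf

# Resample any input recording to the expected 16 kHz mono WAV format
# before transcription.
audio, sr = librosa.load("raw_recording.mp3", sr=16000, mono=True)
sf.write("sample_audio.wav", audio, 16000)
```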
Output
This model provides transcribed speech as a string for a given speech sample.
Model Architecture
QuartzNet is a convolutional architecture, which consists of 1D time-channel separable convolutions optimized for speech recognition. More information on QuartzNet can be found here: QuartzNet Model.
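As a rough illustration of that building block, here is a sketch of a time-channel separable 1D convolution in PyTorch: a depthwise convolution that operates only along time, followed by a pointwise 1x1 convolution that mixes channels. This is a simplified stand-in, not the NeMo implementation:

```python
import torch
import torch.nn as nn

class TimeChannelSeparableConv(nn.Module):
    """Depthwise (per-channel, time-only) convolution followed by a
    pointwise 1x1 convolution that mixes channels."""
    def __init__(self, channels: int, kernel_size: int):
        super().__init__()
        self.depthwise = nn.Conv1d(
            channels, channels, kernel_size,
            padding=kernel_size // 2, groups=channels,
        )
        self.pointwise = nn.Conv1d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time)
        return self.pointwise(self.depthwise(x))

# Example: 256 feature channels over 100 time frames
block = TimeChannelSeparableConv(channels=256, kernel_size=33)
out = block(torch.randn(4, 256, 100))
print(out.shape)  # torch.Size([4, 256, 100])
```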
Training
The NeMo toolkit was used to fine-tune this model for 25,939 steps starting from the stt_fr_quartznet15x5 model, using this base config. The full training configurations, scripts, and experiment logs are available here:
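As an orientation for fine-tuning (not the actual training script; manifest paths and hyperparameters are hypothetical), a NeMo run can be set up roughly like this:

```python
import pytorch_lightning as pl
import nemo.collections.asr as nemo_asr

# Start from the released checkpoint (or from NVIDIA's stt_fr_quartznet15x5).
asr_model = nemo_asr.models.EncDecCTCModel.from_pretrained(
    model_name="RobotsMali/stt-bm-quartznet15x5"
)

# Hypothetical NeMo-style manifest: one JSON object per line with
# "audio_filepath", "duration", and "text" fields.
train_config = {
    "manifest_filepath": "train_manifest.json",
    "sample_rate": 16000,
    "labels": asr_model.decoder.vocabulary,  # keep the model's character set
    "batch_size": 16,
    "shuffle": True,
}
asr_model.setup_training_data(train_data_config=train_config)

# Illustrative trainer settings; the released model was trained for 25,939 steps.
trainer = pl.Trainer(max_steps=25939, accelerator="gpu", devices=1)
trainer.fit(asr_model)
```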
Dataset
This model was fine-tuned on the bam-asr-all dataset, which consists of 37 hours of transcribed Bambara speech. The dataset is primarily derived from the Jeli-ASR dataset (~87%).
Performance
The performance of Automatic Speech Recognition models is measured using Word Error Rate (WER%).
| Version | Tokenizer | Vocabulary Size | WER (%) on bam-asr-all (test set) |
|---|---|---|---|
| V2 | Character-wise | 45 | 46.5 |
These are greedy-decoding WER numbers, without an external language model.
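For context, WER is the number of word-level substitutions, deletions, and insertions divided by the number of reference words. To score your own transcripts, the third-party jiwer package (an assumption, not part of this release) computes it directly; the Bambara sentence below is only an example:

```python
from jiwer import wer

# Hypothetical reference/hypothesis pair
reference = "ne bɛ taa sugu la"
hypothesis = "ne bɛ taa sugu"
print(f"WER: {wer(reference, hypothesis) * 100:.1f}%")  # 1 deletion / 5 words -> 20.0%
```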
License
This model is released under the CC-BY-4.0 license. By using this model, you agree to the terms of the license.
More details are available in the Experimental Technical Report: 📝 Draft Technical Report - Weights & Biases.
Feel free to open a discussion on Hugging Face or file an issue on GitHub if you would like to contribute.