QuartzNet 15x5 CTC Bambara
stt-bm-quartznet15x5 is a fine-tuned version of NVIDIA's stt_fr_quartznet15x5, optimized for Bambara ASR. The model does not produce punctuation or capitalization; it uses a character-level encoding scheme and transcribes text in the standard character set provided in the training set of the bam-asr-all dataset.
The model was fine-tuned with NVIDIA NeMo and trained with CTC (Connectionist Temporal Classification) loss.
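For intuition, here is a minimal, self-contained sketch of how a CTC loss is computed over character logits in PyTorch. The shapes and the 45-symbol vocabulary are illustrative only; this is not the actual NeMo training code:

```python
import torch
import torch.nn as nn

# Hypothetical shapes: 1 utterance, 50 encoder time steps, 45-character
# vocabulary (index 0 reserved for the CTC blank token).
log_probs = torch.randn(50, 1, 45).log_softmax(dim=-1)  # (T, N, C)
targets = torch.randint(1, 45, (1, 20))                 # character indices, no blanks
input_lengths = torch.tensor([50])
target_lengths = torch.tensor([20])

ctc = nn.CTCLoss(blank=0)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
print(loss.item())
```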
🚨 Important Note
This model, along with its associated resources, is part of an ongoing research effort; improvements and refinements are expected in future versions. Users should be aware that:
- The model may not generalize very well across all speaking conditions and dialects.
- Community feedback is welcome, and contributions are encouraged to refine the model further.
NVIDIA NeMo: Training
To fine-tune or use the model, install NVIDIA NeMo. We recommend installing it after setting up the latest PyTorch version.
```bash
pip install nemo_toolkit['asr']
```
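To confirm the installation, you can check that NeMo and its ASR collection import cleanly:

```python
import nemo
import nemo.collections.asr as nemo_asr

print(nemo.__version__)  # any recent release should work
```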
How to Use This Model
Load Model with NeMo
```python
import nemo.collections.asr as nemo_asr

asr_model = nemo_asr.models.EncDecCTCModel.from_pretrained(
    model_name="RobotsMali/stt-bm-quartznet15x5"
)
```
Transcribe Audio
```python
# Assuming you have a test audio file named sample_audio.wav
asr_model.transcribe(['sample_audio.wav'])
```
Input
This model accepts 16 kHz mono-channel audio (wav files) as input.
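If your recordings are in a different format or sample rate, one way to convert them is with the third-party librosa and soundfile packages (not part of this release; filenames are hypothetical):

```python
import librosa
import soundfile as sf

# Resample any input recording to the expected 16 kHz mono WAV format
# before transcription.
audio, sr = librosa.load("raw_recording.mp3", sr=16000, mono=True)
sf.write("sample_audio.wav", audio, 16000)
```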
Output
This model provides transcribed speech as a string for a given speech sample.
Model Architecture
QuartzNet is a convolutional architecture, which consists of 1D time-channel separable convolutions optimized for speech recognition. More information on QuartzNet can be found here: QuartzNet Model.
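As a rough illustration of that building block, here is a sketch of a time-channel separable 1D convolution in PyTorch: a depthwise convolution that operates only along time, followed by a pointwise 1x1 convolution that mixes channels. This is a simplified stand-in, not the NeMo implementation:

```python
import torch
import torch.nn as nn

class TimeChannelSeparableConv(nn.Module):
    """Depthwise (per-channel, time-only) convolution followed by a
    pointwise 1x1 convolution that mixes channels."""
    def __init__(self, channels: int, kernel_size: int):
        super().__init__()
        self.depthwise = nn.Conv1d(
            channels, channels, kernel_size,
            padding=kernel_size // 2, groups=channels,
        )
        self.pointwise = nn.Conv1d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time)
        return self.pointwise(self.depthwise(x))

# Example: 256 feature channels over 100 time frames
block = TimeChannelSeparableConv(channels=256, kernel_size=33)
out = block(torch.randn(4, 256, 100))
print(out.shape)  # torch.Size([4, 256, 100])
```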
Training
The NeMo toolkit was used to fine-tune this model for 25,939 steps starting from the stt_fr_quartznet15x5 model, using this base config. The full training configurations, scripts, and experiment logs are available here:
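As an orientation for fine-tuning (not the actual training script; manifest paths and hyperparameters are hypothetical), a NeMo run can be set up roughly like this:

```python
import pytorch_lightning as pl
import nemo.collections.asr as nemo_asr

# Start from the released checkpoint (or from NVIDIA's stt_fr_quartznet15x5).
asr_model = nemo_asr.models.EncDecCTCModel.from_pretrained(
    model_name="RobotsMali/stt-bm-quartznet15x5"
)

# Hypothetical NeMo-style manifest: one JSON object per line with
# "audio_filepath", "duration", and "text" fields.
train_config = {
    "manifest_filepath": "train_manifest.json",
    "sample_rate": 16000,
    "labels": asr_model.decoder.vocabulary,  # keep the model's character set
    "batch_size": 16,
    "shuffle": True,
}
asr_model.setup_training_data(train_data_config=train_config)

# Illustrative trainer settings; the released model was trained for 25,939 steps.
trainer = pl.Trainer(max_steps=25939, accelerator="gpu", devices=1)
trainer.fit(asr_model)
```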
Dataset
This model was fine-tuned on the bam-asr-all dataset, which consists of 37 hours of transcribed Bambara speech. The dataset is primarily derived from the Jeli-ASR dataset (~87%).
Performance
The performance of Automatic Speech Recognition models is measured using Word Error Rate (WER%).
| Version | Tokenizer | Vocabulary Size | WER (%) on bam-asr-all (test set) |
|---|---|---|---|
| V2 | Character-wise | 45 | 46.5 |
These are greedy-decoding WER numbers, without an external language model.
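For context, WER is the number of word-level substitutions, deletions, and insertions divided by the number of reference words. To score your own transcripts, the third-party jiwer package (an assumption, not part of this release) computes it directly; the Bambara sentence below is only an example:

```python
from jiwer import wer

# Hypothetical reference/hypothesis pair
reference = "ne bɛ taa sugu la"
hypothesis = "ne bɛ taa sugu"
print(f"WER: {wer(reference, hypothesis) * 100:.1f}%")  # 1 deletion / 5 words -> 20.0%
```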
License
This model is released under the CC-BY-4.0 license. By using this model, you agree to the terms of the license.
More details are available in the Experimental Technical Report: 📝 Draft Technical Report - Weights & Biases.
Feel free to open a discussion on Hugging Face or file an issue on GitHub if you would like to contribute.