# speecht5_tts-wolof-v0.2
This model is a fine-tuned version of speecht5_tts-wolof that enhances Text-to-Speech (TTS) synthesis for both Wolof and French. It is based on Microsoft's SpeechT5 and incorporates a custom tokenizer and additional fine-tuning to improve performance across these two languages.
## Model Description
This model builds upon the SpeechT5 architecture, which unifies speech recognition and synthesis in a single encoder-decoder framework. Fine-tuning modified the original Wolof model so that it generates natural speech in both Wolof and French. The overall structure is unchanged, but the model learns a more robust alignment between text inputs and synthesized speech, improving pronunciation and fluency in both languages.
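The checkpoint ships with the custom tokenizer mentioned above, so you can inspect how Wolof or French text is tokenized before synthesis. This is a minimal sketch; the example sentence is illustrative.

```python
from transformers import SpeechT5Processor

# Load the processor bundled with this checkpoint (it wraps the custom tokenizer).
processor = SpeechT5Processor.from_pretrained("bilalfaye/speecht5_tts-wolof-v0.2")

# Inspect how a Wolof sentence is split into tokens before synthesis.
print(processor.tokenizer.tokenize("ñu ne ñoom ñooy nattukaay satélite yi"))
```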
## Installation

To install the necessary dependencies, run:

```bash
pip install transformers datasets torch
```
## Model Loading and Speech Generation
```python
import torch
from transformers import SpeechT5ForTextToSpeech, SpeechT5Processor, SpeechT5HifiGan
from datasets import load_dataset
from IPython.display import Audio, display


def load_speech_model(checkpoint="bilalfaye/speecht5_tts-wolof-v0.2",
                      vocoder_checkpoint="microsoft/speecht5_hifigan"):
    """Load the SpeechT5 model, processor, and HiFi-GAN vocoder for text-to-speech."""
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    processor = SpeechT5Processor.from_pretrained(checkpoint)
    model = SpeechT5ForTextToSpeech.from_pretrained(checkpoint).to(device)
    vocoder = SpeechT5HifiGan.from_pretrained(vocoder_checkpoint).to(device)
    return processor, model, vocoder, device


# Load the model
processor, model, vocoder, device = load_speech_model()

# Load speaker embeddings (pretrained x-vectors from the CMU Arctic dataset)
embeddings_dataset = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")
speaker_embedding = torch.tensor(embeddings_dataset[7306]["xvector"]).unsqueeze(0)


def generate_speech_from_text(text, speaker_embedding=speaker_embedding,
                              processor=processor, model=model, vocoder=vocoder):
    """Generate speech from input text using SpeechT5 and the HiFi-GAN vocoder."""
    inputs = processor(
        text=text,
        return_tensors="pt",
        padding=True,
        truncation=True,
        max_length=model.config.max_text_positions,
    )
    inputs = {key: value.to(model.device) for key, value in inputs.items()}

    speech = model.generate(
        inputs["input_ids"],
        speaker_embeddings=speaker_embedding.to(model.device),
        vocoder=vocoder,
        num_beams=7,
        temperature=0.6,
        no_repeat_ngram_size=3,
        repetition_penalty=1.5,
    )

    speech = speech.detach().cpu().numpy()
    display(Audio(speech, rate=16000))


# Example usage: French
text = "Bonjour, bienvenue dans le modèle de synthèse vocale Wolof et Français."
generate_speech_from_text(text)

# Example usage: Wolof
text = "ñu ne ñoom ñooy nattukaay satélite yi"
generate_speech_from_text(text)
```
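Outside a notebook you may want to save the waveform to a file rather than play it inline. The sketch below reuses the objects loaded above and assumes the `soundfile` package (`pip install soundfile`); the text and output path are illustrative.

```python
import soundfile as sf

# Tokenize the text and move tensors to the model's device.
inputs = processor(text="Bonjour et bienvenue.", return_tensors="pt")
inputs = {key: value.to(model.device) for key, value in inputs.items()}

# Generate the waveform and write the 16 kHz mono signal to disk instead of displaying it.
speech = model.generate(
    inputs["input_ids"],
    speaker_embeddings=speaker_embedding.to(model.device),
    vocoder=vocoder,
)
sf.write("output.wav", speech.detach().cpu().numpy(), samplerate=16000)
```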
## Intended Uses & Limitations
### Intended Uses
- Multilingual TTS: Converts Wolof and French text into natural-sounding speech.
- Voice Assistants & Speech Interfaces: Can be used for audio-based applications supporting both languages.
- Linguistic Research: Facilitates speech synthesis research in low-resource languages.
### Limitations
- Data Dependency: The quality of synthesized speech depends on the dataset used for fine-tuning.
- Pronunciation Variations: Some complex or uncommon words may be mispronounced.
- Limited Speaker Variety: The model was trained with a single speaker embedding and may not generalize well to different voice profiles (see the sketch after this list for trying other embeddings).
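To experiment with other voice profiles, you can pass a different x-vector from the CMU Arctic dataset loaded above into `generate_speech_from_text`. The sketch below assumes the `embeddings_dataset` and helper function defined in the usage section; the index 1234 is arbitrary, and voices other than the training speaker may sound less natural.

```python
# Pick a different x-vector from the CMU Arctic validation split (index 1234 is arbitrary).
alt_speaker_embedding = torch.tensor(embeddings_dataset[1234]["xvector"]).unsqueeze(0)

# Synthesize with the alternative voice profile; quality may vary for unseen voices.
generate_speech_from_text(
    "Bonjour, ceci est un test avec une autre voix.",
    speaker_embedding=alt_speaker_embedding,
)
```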
## Training and Evaluation Data
The model was fine-tuned on an extended dataset containing text in both Wolof and French, ensuring improved synthesis capabilities across these two languages.
## Training Procedure
### Training Hyperparameters
| Hyperparameter | Value |
|---|---|
| Learning Rate | 1e-05 |
| Training Batch Size | 8 |
| Evaluation Batch Size | 2 |
| Gradient Accumulation Steps | 8 |
| Total Train Batch Size | 64 |
| Optimizer | Adam (β1=0.9, β2=0.999, ϵ=1e-08) |
| Learning Rate Scheduler | Linear |
| Warmup Steps | 500 |
| Training Steps | 25,500 |
| Mixed Precision Training | AMP (Automatic Mixed Precision) |
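For reference, the hyperparameters above map onto a standard Hugging Face `Seq2SeqTrainingArguments` configuration roughly as follows. This is a reconstruction from the table, not the exact training script; the `output_dir` and any settings not listed in the table are illustrative.

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="speecht5_tts-wolof-v0.2",  # illustrative
    per_device_train_batch_size=8,
    per_device_eval_batch_size=2,
    gradient_accumulation_steps=8,          # 8 x 8 = 64 effective train batch size
    learning_rate=1e-5,
    lr_scheduler_type="linear",
    warmup_steps=500,
    max_steps=25_500,
    fp16=True,                              # AMP mixed precision
)
```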
### Training Results
| Training Loss | Epoch | Step | Validation Loss |
|---|---|---|---|
| 0.5372 | 0.9995 | 954 | 0.4398 |
| 0.4646 | 2.0 | 1909 | 0.4214 |
| 0.4505 | 2.9995 | 2863 | 0.4163 |
| 0.4443 | 4.0 | 3818 | 0.4109 |
| 0.4403 | 4.9995 | 4772 | 0.4080 |
| 0.4368 | 6.0 | 5727 | 0.4057 |
| 0.4343 | 6.9995 | 6681 | 0.4034 |
| 0.4315 | 8.0 | 7636 | 0.4018 |
| 0.4311 | 8.9995 | 8590 | 0.4015 |
| 0.4273 | 10.0 | 9545 | 0.4017 |
| 0.4282 | 10.9995 | 10499 | 0.3990 |
| 0.4249 | 12.0 | 11454 | 0.3986 |
| 0.4242 | 12.9995 | 12408 | 0.3973 |
| 0.4225 | 14.0 | 13363 | 0.3966 |
| 0.4217 | 14.9995 | 14317 | 0.3951 |
| 0.4208 | 16.0 | 15272 | 0.3950 |
| 0.4200 | 16.9995 | 16226 | 0.3950 |
| 0.4202 | 18.0 | 17181 | 0.3952 |
| 0.4200 | 18.9995 | 18135 | 0.3943 |
| 0.4183 | 20.0 | 19090 | 0.3962 |
| 0.4175 | 20.9995 | 20044 | 0.3937 |
| 0.4161 | 22.0 | 20999 | 0.3940 |
| 0.4193 | 22.9995 | 21953 | 0.3932 |
| 0.4177 | 24.0 | 22908 | 0.3939 |
| 0.4166 | 24.9995 | 23862 | 0.3936 |
| 0.4156 | 26.0 | 24817 | 0.3938 |
## Framework Versions
- Transformers: 4.41.2
- PyTorch: 2.4.0+cu121
- Datasets: 3.2.0
- Tokenizers: 0.19.1
## Author
- Bilal FAYE
This model makes TTS more accessible for Wolof and French, offering a valuable resource for multilingual voice applications. 🚀