speecht5_tts-wolof-v0.2

This model is a fine-tuned version of speecht5_tts-wolof that enhances Text-to-Speech (TTS) synthesis for both Wolof and French. It is based on Microsoft's SpeechT5 and incorporates a custom tokenizer and additional fine-tuning to improve performance across these two languages.

Model Description

This model builds upon the SpeechT5 architecture, which unifies speech recognition and synthesis. The fine-tuning process introduced modifications to the original Wolof model, enabling it to generate natural speech in both Wolof and French. The model maintains the same general structure but learns a more robust alignment between textual inputs and speech synthesis, improving pronunciation and fluency in both languages.

Installation Instructions for Users

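To install the necessary dependencies, run the following command. The example below also uses IPython's audio widget, which is already available in Jupyter and Colab notebooks: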

pip install transformers datasets torch

Model Loading and Speech Generation Code

import torch
from transformers import SpeechT5ForTextToSpeech, SpeechT5Processor, SpeechT5HifiGan
from datasets import load_dataset
from IPython.display import Audio, display

def load_speech_model(checkpoint="bilalfaye/speecht5_tts-wolof-v0.2", vocoder_checkpoint="microsoft/speecht5_hifigan"):
    """ Load the SpeechT5 model, processor, and vocoder for text-to-speech. """
    
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    
    processor = SpeechT5Processor.from_pretrained(checkpoint)
    model = SpeechT5ForTextToSpeech.from_pretrained(checkpoint).to(device)
    vocoder = SpeechT5HifiGan.from_pretrained(vocoder_checkpoint).to(device)

    return processor, model, vocoder, device

# Load the model
processor, model, vocoder, device = load_speech_model()

# Load a speaker embedding (an x-vector computed on the CMU ARCTIC dataset)
embeddings_dataset = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")
speaker_embedding = torch.tensor(embeddings_dataset[7306]["xvector"]).unsqueeze(0)

def generate_speech_from_text(text, speaker_embedding=speaker_embedding, processor=processor, model=model, vocoder=vocoder):
    """ Generates speech from input text using SpeechT5 and the HiFi-GAN vocoder, then plays it in the notebook. """

    inputs = processor(text=text, return_tensors="pt", padding=True, truncation=True, max_length=model.config.max_text_positions)
    inputs = {key: value.to(model.device) for key, value in inputs.items()}

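    # SpeechT5 decodes a spectrogram frame by frame rather than by beam search or
    # sampling, so the text-generation arguments below (num_beams, temperature,
    # no_repeat_ngram_size, repetition_penalty) are likely accepted but ignored.
    speech = model.generate(
        inputs["input_ids"],
        speaker_embeddings=speaker_embedding.to(model.device),
        vocoder=vocoder,
        num_beams=7,
        temperature=0.6,
        no_repeat_ngram_size=3,
        repetition_penalty=1.5,
    )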

    speech = speech.detach().cpu().numpy()
    display(Audio(speech, rate=16000))

# Example usage: French
text = "Bonjour, bienvenue dans le modèle de synthèse vocale Wolof et Français."
generate_speech_from_text(text)

# Example usage: Wolof
text = "ñu ne ñoom ñooy nattukaay satélite yi"
generate_speech_from_text(text)
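
Outside a notebook, `display(Audio(...))` produces no audible output, so it can be useful to write the waveform to disk instead. The following is a minimal sketch using only the Python standard library. The helper name `save_wav` and the output path are hypothetical, and it assumes the vocoder returns a mono waveform of floats in [-1, 1] sampled at 16 kHz, as in the example above.

```python
import math
import os
import struct
import tempfile
import wave

def save_wav(waveform, path, sample_rate=16000):
    """Write a 1-D sequence of floats in [-1, 1] as a 16-bit PCM mono WAV file.

    Hypothetical helper, not part of the model repository. A 1-D NumPy array
    (for example `speech.detach().cpu().numpy()`) can be passed directly.
    """
    # Clip before scaling so that out-of-range samples cannot overflow int16.
    pcm = [int(max(-1.0, min(1.0, float(x))) * 32767) for x in waveform]
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)            # mono
        wav_file.setsampwidth(2)            # 2 bytes per sample = 16-bit
        wav_file.setframerate(sample_rate)
        # WAV stores samples as little-endian signed 16-bit integers.
        wav_file.writeframes(struct.pack("<%dh" % len(pcm), *pcm))
    return len(pcm) / sample_rate           # duration in seconds

# Stand-in signal: one second of a 440 Hz tone instead of real model output.
tone = [0.5 * math.sin(2 * math.pi * 440 * n / 16000) for n in range(16000)]
out_path = os.path.join(tempfile.gettempdir(), "speecht5_example.wav")
duration = save_wav(tone, out_path)
```

To use it with the model, `generate_speech_from_text` would need to return the `speech` array (or call the helper itself) rather than only displaying it.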

Intended Uses & Limitations

Intended Uses

Multilingual TTS: Converts Wolof and French text into natural-sounding speech.
Voice Assistants & Speech Interfaces: Can be used for audio-based applications supporting both languages.
Linguistic Research: Facilitates speech synthesis research in low-resource languages.

Limitations

Data Dependency: The quality of synthesized speech depends on the dataset used for fine-tuning.
Pronunciation Variations: Some complex or uncommon words may be mispronounced.
Limited Speaker Variety: The model was trained on a single speaker embedding and may not generalize well to different voice profiles.

Training and Evaluation Data

The model was fine-tuned on an extended dataset containing both Wolof and French text, with the aim of improving synthesis quality in both languages.

Training Procedure

Training Hyperparameters

Hyperparameter	Value
Learning Rate	1e-05
Training Batch Size	8
Evaluation Batch Size	2
Gradient Accumulation Steps	8
Total Train Batch Size	64
Optimizer	Adam (β1=0.9, β2=0.999, ϵ=1e-08)
Learning Rate Scheduler	Linear
Warmup Steps	500
Training Steps	25,500
Mixed Precision Training	AMP (Automatic Mixed Precision)
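
The relationship between these values can be checked with a few lines of arithmetic. The sketch below assumes training ran on a single device (so that 8 × 8 = 64) and that "Linear" means linear warmup followed by linear decay to zero, which is how the linear schedule in Transformers behaves. The function names are illustrative and not taken from the training script.

```python
def effective_batch_size(per_device_batch, grad_accum_steps, num_devices=1):
    """Number of examples contributing to each optimizer update."""
    return per_device_batch * grad_accum_steps * num_devices

def linear_warmup_lr(step, base_lr=1e-5, warmup_steps=500, total_steps=25_500):
    """Learning rate at a given optimizer step under linear warmup and decay."""
    if step < warmup_steps:
        # Ramp up linearly from 0 to base_lr during warmup.
        return base_lr * step / warmup_steps
    # Decay linearly from base_lr to 0 over the remaining steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / (total_steps - warmup_steps)

total_batch = effective_batch_size(8, 8)   # 64, matching the table
lr_mid_warmup = linear_warmup_lr(250)      # halfway through warmup
lr_peak = linear_warmup_lr(500)            # end of warmup, equal to base_lr
lr_halfway = linear_warmup_lr(13_000)      # midpoint of the decay phase
lr_end = linear_warmup_lr(25_500)          # final step
```

Under these assumptions the learning rate peaks at 1e-05 after step 500 and reaches zero at step 25,500.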

Training Results

Training Loss	Epoch	Step	Validation Loss
0.5372	0.9995	954	0.4398
0.4646	2.0	1909	0.4214
0.4505	2.9995	2863	0.4163
0.4443	4.0	3818	0.4109
0.4403	4.9995	4772	0.4080
0.4368	6.0	5727	0.4057
0.4343	6.9995	6681	0.4034
0.4315	8.0	7636	0.4018
0.4311	8.9995	8590	0.4015
0.4273	10.0	9545	0.4017
0.4282	10.9995	10499	0.3990
0.4249	12.0	11454	0.3986
0.4242	12.9995	12408	0.3973
0.4225	14.0	13363	0.3966
0.4217	14.9995	14317	0.3951
0.4208	16.0	15272	0.3950
0.4200	16.9995	16226	0.3950
0.4202	18.0	17181	0.3952
0.4200	18.9995	18135	0.3943
0.4183	20.0	19090	0.3962
0.4175	20.9995	20044	0.3937
0.4161	22.0	20999	0.3940
0.4193	22.9995	21953	0.3932
0.4177	24.0	22908	0.3939
0.4166	24.9995	23862	0.3936
0.4156	26.0	24817	0.3938
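
To read the table at a glance, the short sketch below picks out the evaluation point with the lowest validation loss. The rows are copied from the table above; the variable names are illustrative, and nothing here says which checkpoint was actually published.

```python
# (epoch, step, validation_loss) rows copied from the results table.
rows = [
    (0.9995, 954, 0.4398), (2.0, 1909, 0.4214), (2.9995, 2863, 0.4163),
    (4.0, 3818, 0.4109), (4.9995, 4772, 0.4080), (6.0, 5727, 0.4057),
    (6.9995, 6681, 0.4034), (8.0, 7636, 0.4018), (8.9995, 8590, 0.4015),
    (10.0, 9545, 0.4017), (10.9995, 10499, 0.3990), (12.0, 11454, 0.3986),
    (12.9995, 12408, 0.3973), (14.0, 13363, 0.3966), (14.9995, 14317, 0.3951),
    (16.0, 15272, 0.3950), (16.9995, 16226, 0.3950), (18.0, 17181, 0.3952),
    (18.9995, 18135, 0.3943), (20.0, 19090, 0.3962), (20.9995, 20044, 0.3937),
    (22.0, 20999, 0.3940), (22.9995, 21953, 0.3932), (24.0, 22908, 0.3939),
    (24.9995, 23862, 0.3936), (26.0, 24817, 0.3938),
]

# Evaluation point with the lowest validation loss.
best_epoch, best_step, best_loss = min(rows, key=lambda row: row[2])

# Total drop in validation loss from the first evaluation to the best one.
improvement = rows[0][2] - best_loss

print(f"best validation loss {best_loss:.4f} at step {best_step} (epoch {best_epoch})")
print(f"improvement since first evaluation: {improvement:.4f}")
```

The minimum is 0.3932 at step 21953. After roughly epoch 15 the validation loss changes only in the third decimal place, which suggests that training had largely plateaued.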

Framework Versions

Transformers: 4.41.2
PyTorch: 2.4.0+cu121
Datasets: 3.2.0
Tokenizers: 0.19.1
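
When reproducing results, it can help to compare the local environment with the versions listed above. The sketch below uses the standard-library `importlib.metadata` module. The helper name `installed_versions` is hypothetical, and the distribution names are assumed to be the usual PyPI names, with PyTorch published as `torch`.

```python
from importlib import metadata

# Versions reported in this model card, keyed by PyPI distribution name.
EXPECTED = {
    "transformers": "4.41.2",
    "torch": "2.4.0+cu121",
    "datasets": "3.2.0",
    "tokenizers": "0.19.1",
}

def installed_versions(packages):
    """Return {package: installed version, or None if not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found

for name, installed in installed_versions(EXPECTED).items():
    status = "matches" if installed == EXPECTED[name] else "differs"
    print(f"{name}: installed={installed}, model card={EXPECTED[name]} ({status})")
```

A mismatch does not necessarily mean the model will fail to load; it only flags a difference worth keeping in mind if outputs diverge.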

Author

Bilal FAYE

This model contributes to enhancing TTS accessibility for Wolof and French, making it a valuable resource for multilingual voice applications. 🚀
