speecht5_tts-wolof-v0.2

This model is a fine-tuned version of speecht5_tts-wolof that enhances Text-to-Speech (TTS) synthesis for both Wolof and French. It is based on Microsoft's SpeechT5 and incorporates a custom tokenizer and additional fine-tuning to improve performance across these two languages.

Model Description

This model builds upon the SpeechT5 architecture, which unifies speech recognition and synthesis. The fine-tuning process introduced modifications to the original Wolof model, enabling it to generate natural speech in both Wolof and French. The model maintains the same general structure but learns a more robust alignment between textual inputs and speech synthesis, improving pronunciation and fluency in both languages.


Installation Instructions for Users

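To install the necessary dependencies, run the following command. The example code below also imports IPython.display for audio playback, so it is meant to be run in a Jupyter or Colab notebook: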

pip install transformers datasets torch

Model Loading and Speech Generation Code

import torch
from transformers import SpeechT5ForTextToSpeech, SpeechT5Processor, SpeechT5HifiGan
from datasets import load_dataset
from IPython.display import Audio, display

def load_speech_model(checkpoint="bilalfaye/speecht5_tts-wolof-v0.2", vocoder_checkpoint="microsoft/speecht5_hifigan"):
    """ Load the SpeechT5 model, processor, and vocoder for text-to-speech. """
    
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    
    processor = SpeechT5Processor.from_pretrained(checkpoint)
    model = SpeechT5ForTextToSpeech.from_pretrained(checkpoint).to(device)
    vocoder = SpeechT5HifiGan.from_pretrained(vocoder_checkpoint).to(device)

    return processor, model, vocoder, device

# Load the model
processor, model, vocoder, device = load_speech_model()

# Load an x-vector speaker embedding (precomputed from the CMU ARCTIC dataset)
embeddings_dataset = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")
speaker_embedding = torch.tensor(embeddings_dataset[7306]["xvector"]).unsqueeze(0)

def generate_speech_from_text(text, speaker_embedding=speaker_embedding, processor=processor, model=model, vocoder=vocoder):
    """ Generate speech from text with SpeechT5 and the HiFi-GAN vocoder, play it, and return the waveform. """

    # Tokenize the text; input longer than the model's maximum text length is truncated
    inputs = processor(text=text, return_tensors="pt", padding=True, truncation=True, max_length=model.config.max_text_positions)
    inputs = {key: value.to(model.device) for key, value in inputs.items()}

    # NOTE: num_beams, temperature, no_repeat_ngram_size and repetition_penalty are
    # text-generation decoding options. SpeechT5 predicts spectrogram frames rather
    # than tokens, so they may have no effect on the synthesized audio.
    speech = model.generate(
        inputs["input_ids"],
        speaker_embeddings=speaker_embedding.to(model.device),
        vocoder=vocoder,
        num_beams=7,
        temperature=0.6,
        no_repeat_ngram_size=3,
        repetition_penalty=1.5,
    )

    # Move the waveform to the CPU and play it at 16 kHz
    speech = speech.detach().cpu().numpy()
    display(Audio(speech, rate=16000))
    return speech

# Example usage: French
text = "Bonjour, bienvenue dans le modèle de synthèse vocale Wolof et Français."
speech_fr = generate_speech_from_text(text)

# Example usage: Wolof
text = "ñu ne ñoom ñooy nattukaay satélite yi"
speech_wo = generate_speech_from_text(text)
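
The example above truncates any input longer than model.config.max_text_positions tokens, so the end of a long passage would not be spoken. One possible workaround, sketched here under assumptions, is to split the text at sentence boundaries before synthesis. The helper name split_text_into_chunks and the max_chars budget are hypothetical and not part of the model or of Transformers; max_chars is a character count used as a rough stand-in for the token limit.

```python
import re

def split_text_into_chunks(text, max_chars=200):
    """Split text at sentence boundaries into chunks of at most max_chars characters.

    max_chars is a hypothetical character budget, not the model's token limit.
    """
    # Split after sentence-ending punctuation that is followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?…])\s+", text) if s.strip()]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars or not current:
            # Keep growing the current chunk. A single sentence longer than
            # max_chars is kept whole rather than cut mid-sentence.
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each chunk could then be synthesized separately and the resulting waveforms concatenated. A sentence that alone exceeds the budget is still subject to the tokenizer's truncation.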

Intended Uses & Limitations

Intended Uses

  • Multilingual TTS: Converts Wolof and French text into natural-sounding speech.
  • Voice Assistants & Speech Interfaces: Can be used for audio-based applications supporting both languages.
  • Linguistic Research: Facilitates speech synthesis research in low-resource languages.
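
For applications that run outside a notebook, such as a voice interface, the synthesized audio usually needs to be written to a file instead of being displayed. The following is a minimal sketch using only the Python standard library. It assumes the waveform is a mono sequence of floats in [-1, 1] sampled at 16 kHz, as in the example code above. The function name save_waveform_as_wav is hypothetical.

```python
import array
import wave

def save_waveform_as_wav(waveform, path, sample_rate=16000):
    """Write a mono float waveform (values in [-1, 1]) to a 16-bit PCM WAV file.

    Returns the number of samples written.
    """
    # Clip each sample to [-1, 1], then scale to the signed 16-bit range.
    pcm = array.array(
        "h", (int(max(-1.0, min(1.0, float(s))) * 32767) for s in waveform)
    )
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)            # mono
        wav_file.setsampwidth(2)            # 2 bytes = 16 bits per sample
        wav_file.setframerate(sample_rate)  # 16 kHz by default
        wav_file.writeframes(pcm.tobytes())
    return len(pcm)
```

A NumPy array such as the vocoder output can be passed directly, for example save_waveform_as_wav(speech, "output.wav"), where speech is the waveform and "output.wav" is a placeholder path.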

Limitations

  • Data Dependency: The quality of synthesized speech depends on the dataset used for fine-tuning.
  • Pronunciation Variations: Some complex or uncommon words may be mispronounced.
  • Limited Speaker Variety: The model was trained on a single speaker embedding and may not generalize well to different voice profiles.

Training and Evaluation Data

The model was fine-tuned on an extended dataset containing text in both Wolof and French, ensuring improved synthesis capabilities across these two languages.


Training Procedure

Training Hyperparameters

Hyperparameter                 Value
Learning Rate                  1e-05
Training Batch Size            8
Evaluation Batch Size          2
Gradient Accumulation Steps    8
Total Train Batch Size         64
Optimizer                      Adam (β1=0.9, β2=0.999, ε=1e-08)
Learning Rate Scheduler        Linear
Warmup Steps                   500
Training Steps                 25,500
Mixed Precision Training       AMP (Automatic Mixed Precision)
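
To make the schedule concrete, the sketch below computes the learning rate at a given optimizer step from the values listed above (peak 1e-05, 500 warmup steps, 25,500 training steps). It assumes that "Linear" means linear warmup from zero followed by linear decay to zero, which is the usual behavior of the linear scheduler in Transformers. The model card does not state the decay shape explicitly, and the function name linear_schedule_lr is hypothetical.

```python
def linear_schedule_lr(step, peak_lr=1e-5, warmup_steps=500, total_steps=25_500):
    """Learning rate at optimizer step `step`: linear warmup, then linear decay.

    Assumes warmup starts from 0 and decay ends at 0 at total_steps.
    """
    if step < warmup_steps:
        # Warmup phase: ramp linearly from 0 up to peak_lr.
        return peak_lr * step / warmup_steps
    # Decay phase: ramp linearly from peak_lr down to 0.
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / (total_steps - warmup_steps)

for step in (0, 250, 500, 13_000, 25_500):
    print(step, linear_schedule_lr(step))
```

Under these assumptions the rate peaks at step 500 and falls to half the peak at step 13,000, midway through the decay phase.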

Training Results

Training Loss    Epoch      Step     Validation Loss
0.5372           0.9995     954      0.4398
0.4646           2.0        1909     0.4214
0.4505           2.9995     2863     0.4163
0.4443           4.0        3818     0.4109
0.4403           4.9995     4772     0.4080
0.4368           6.0        5727     0.4057
0.4343           6.9995     6681     0.4034
0.4315           8.0        7636     0.4018
0.4311           8.9995     8590     0.4015
0.4273           10.0       9545     0.4017
0.4282           10.9995    10499    0.3990
0.4249           12.0       11454    0.3986
0.4242           12.9995    12408    0.3973
0.4225           14.0       13363    0.3966
0.4217           14.9995    14317    0.3951
0.4208           16.0       15272    0.3950
0.4200           16.9995    16226    0.3950
0.4202           18.0       17181    0.3952
0.4200           18.9995    18135    0.3943
0.4183           20.0       19090    0.3962
0.4175           20.9995    20044    0.3937
0.4161           22.0       20999    0.3940
0.4193           22.9995    21953    0.3932
0.4177           24.0       22908    0.3939
0.4166           24.9995    23862    0.3936
0.4156           26.0       24817    0.3938
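
A few quantities can be derived from the table above and the hyperparameters. The sketch below finds the logged evaluation with the lowest validation loss and estimates the number of training examples per epoch. It assumes that "Step" counts optimizer updates and that each update consumes the full total train batch of 64 examples (8 per batch × 8 accumulation steps). The per-epoch example count is an estimate, not a figure stated in the model card.

```python
# (epoch, step, validation_loss) rows copied from the results table.
rows = [
    (0.9995, 954, 0.4398), (2.0, 1909, 0.4214), (2.9995, 2863, 0.4163),
    (4.0, 3818, 0.4109), (4.9995, 4772, 0.4080), (6.0, 5727, 0.4057),
    (6.9995, 6681, 0.4034), (8.0, 7636, 0.4018), (8.9995, 8590, 0.4015),
    (10.0, 9545, 0.4017), (10.9995, 10499, 0.3990), (12.0, 11454, 0.3986),
    (12.9995, 12408, 0.3973), (14.0, 13363, 0.3966), (14.9995, 14317, 0.3951),
    (16.0, 15272, 0.3950), (16.9995, 16226, 0.3950), (18.0, 17181, 0.3952),
    (18.9995, 18135, 0.3943), (20.0, 19090, 0.3962), (20.9995, 20044, 0.3937),
    (22.0, 20999, 0.3940), (22.9995, 21953, 0.3932), (24.0, 22908, 0.3939),
    (24.9995, 23862, 0.3936), (26.0, 24817, 0.3938),
]

# Logged evaluation with the lowest validation loss.
best_epoch, best_step, best_loss = min(rows, key=lambda r: r[2])

# Epoch 2.0 is reached at step 1909, so one epoch is about 954.5 optimizer steps.
steps_per_epoch = 1909 / 2.0
# Assumption: every optimizer step consumes the total train batch of 64 examples.
examples_per_epoch = steps_per_epoch * 64
# Number of epochs covered by the configured 25,500 training steps.
planned_epochs = 25_500 / steps_per_epoch

print(f"best validation loss {best_loss} at step {best_step} (epoch {best_epoch})")
print(f"about {examples_per_epoch:.0f} examples per epoch, {planned_epochs:.1f} epochs planned")
```

The validation loss changes by less than 0.002 over the last ten logged evaluations, with the minimum of 0.3932 at step 21953.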

Framework Versions

  • Transformers: 4.41.2
  • PyTorch: 2.4.0+cu121
  • Datasets: 3.2.0
  • Tokenizers: 0.19.1

Author

  • Bilal FAYE

This model contributes to enhancing TTS accessibility for Wolof and French, making it a valuable resource for multilingual voice applications. 🚀
