Specific speaker example not working

#1
by Delfshkrimm - opened

Hello! Thank you for your training :)
I have an issue with the provided example for prepending a specific voice to the target_prompt.

Using the example you provide, it seems a random voice from the original model training is still used instead of the voice in the wav file.
I've checked the Whisper transcription and it is fine, and I've also tried a manual transcription.
My wav file is mono and 16 kHz, so there shouldn't be a problem with it.
Moreover, when using the default example from Llasa-1B-Multilingual by the original model authors, it works fine.

Actually, I don't see in your example how the original waveform is used to generate speaker tokens; only the transcription seems to be passed to the model.
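
(For context, a minimal sketch of the step that appears to be missing, reusing the XCodec2 encode_code call and the <|s_{k}|> token format from the example in this repo; file name and variable names are illustrative:)

import torch
import torchaudio
from xcodec2.modeling_xcodec2 import XCodec2Model

codec = XCodec2Model.from_pretrained("HKUST-Audio/xcodec2").cuda()

wav, sr = torchaudio.load("male.wav")                  # reference speaker audio
wav = wav.mean(dim=0, keepdim=True)                    # downmix to mono
wav = torchaudio.transforms.Resample(sr, 16000)(wav)   # XCodec2 works on 16 kHz audio
with torch.no_grad():
    codes = codec.encode_code(input_waveform=wav)      # [1, 1, T] tensor of code indices
speech_prefix = "".join(f"<|s_{c}|>" for c in codes[0, 0, :].tolist())
# speech_prefix is what should follow "<|SPEECH_GENERATION_START|>" in the assistant turn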

MultiLlasa org

You are right. I missed adding the audio data to the xcodec2 model. Will fix it shortly!

Awesome! Thank you. Does adding the audio data to xcodec2 reduce inference time (compared to the second example from the original Llasa-1B-Multilingual, where the speaker tokens are generated and then removed from the output)?

MultiLlasa org
import torch
import torchaudio
import tempfile
import soundfile as sf
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

# Input your reference audio and, optionally, its transcription
sample_audio_path = "male.wav"
sample_audio_text = None  # Set to None to transcribe with Whisper
# Input the target text here
target_text = "Und apropos Spannungen und Unfälle, in Stuttgart gibt es auch einige Schlagzeilen. Die Polizei sucht Zeugen, nachdem in der Stadt mehrere Autoscheiben eingeschlagen wurden. Und gestern kam es im Stuttgarter Osten zu einer Verfolgungsjagd mit einer jungen BMW-Fahrerin, die vor einer Polizeistreife geflüchtet ist."
output_filename = "no_speaker_example.wav"

#### Do not edit below ####
llasa_model_name = "MultiLlasa/Llasa-1B-Multilingual-German"
tokenizer = AutoTokenizer.from_pretrained(llasa_model_name)
model = AutoModelForCausalLM.from_pretrained(llasa_model_name)
model.to("cuda")

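# XCodec2 maps between waveforms and the discrete speech codes the LLM predicts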
from xcodec2.modeling_xcodec2 import XCodec2Model
codec_model_path = "HKUST-Audio/xcodec2"
Codec_model = XCodec2Model.from_pretrained(codec_model_path)
Codec_model.cuda()

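# Whisper pipeline, used to transcribe the reference audio when no transcription is given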
whisper_turbo_pipe = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3-turbo",
    torch_dtype=torch.float16,
    device="cuda",
)

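# Convert numeric speech codes into the model's <|s_k|> token strings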
def ids_to_speech_tokens(speech_ids):
    speech_tokens_str = []
    for speech_id in speech_ids:
        speech_tokens_str.append(f"<|s_{speech_id}|>")
    return speech_tokens_str

# Inverse of ids_to_speech_tokens: parse "<|s_123|>" strings back to integer ids
def extract_speech_ids(speech_tokens_str):
    speech_ids = []
    for token_str in speech_tokens_str:
        if token_str.startswith("<|s_") and token_str.endswith("|>"):
            speech_ids.append(int(token_str[4:-2]))
        else:
            print(f"Unexpected token: {token_str}")
    return speech_ids

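# Load the reference audio that provides the target speaker's voice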
waveform, sample_rate = torchaudio.load(sample_audio_path)

max_secs = 15
if len(waveform[0]) / sample_rate > max_secs:
    print(f"Warning: Trimming audio to first {max_secs} seconds.")
    waveform = waveform[:, : sample_rate * max_secs]
    waveform = torch.nn.functional.pad(waveform, (0, int(sample_rate * 0.5)), "constant", 0)

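# Downmix multi-channel audio to mono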
if waveform.size(0) > 1:
    waveform = torch.mean(waveform, dim=0, keepdim=True)

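# Resample the reference audio to the 16 kHz expected by XCodec2 and Whisper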
prompt_wav = torchaudio.transforms.Resample(orig_freq=sample_rate, new_freq=16000)(waveform)

if sample_audio_text is None:
    print("Transcribing audio...")
    # Pass the resampled audio: the ASR pipeline assumes raw arrays are already at 16 kHz
    transcription = whisper_turbo_pipe(prompt_wav[0].numpy())["text"].strip()
else:
    transcription = sample_audio_text

print("Transcription:", transcription)

if len(target_text) == 0:
    raise ValueError("Target text must be provided!")
elif len(target_text) > 500:
    print("Text is too long; trimming to first 500 characters.")
    target_text = target_text[:500]

input_text = transcription + " " + target_text

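# Encode the reference audio into speech codes and build the token prefix for the prompt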
with torch.no_grad():
    vq_code_prompt = Codec_model.encode_code(input_waveform=prompt_wav)
    vq_code_prompt = vq_code_prompt[0, 0, :]
    speech_ids_prefix = ids_to_speech_tokens(vq_code_prompt)

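    # Combine the reference transcription and the target text inside the text-understanding tags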
    formatted_text = f"<|TEXT_UNDERSTANDING_START|>{input_text}<|TEXT_UNDERSTANDING_END|>"

    chat = [
        {"role": "user", "content": "Convert the text to speech:" + formatted_text},
        {"role": "assistant", "content": "<|SPEECH_GENERATION_START|>" + "".join(speech_ids_prefix)},
    ]

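    # Tokenize the chat; continue_final_message=True keeps generation in the assistant turn after the speech prefix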
    input_ids = tokenizer.apply_chat_template(chat, tokenize=True, return_tensors="pt", continue_final_message=True)
    input_ids = input_ids.to("cuda")
    speech_end_id = tokenizer.convert_tokens_to_ids("<|SPEECH_GENERATION_END|>")

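    # Generate speech tokens until <|SPEECH_GENERATION_END|> is emitted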
    outputs = model.generate(
        input_ids,
        max_length=2048, 
        eos_token_id=speech_end_id,
        do_sample=True,
        top_p=1,
        temperature=0.8,
        min_new_tokens=4,  # Ensure the model does not stop generation immediately
    )

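    # Keep the prompt speech tokens plus the newly generated ones; drop the trailing <|SPEECH_GENERATION_END|>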
    generated_ids = outputs[0][input_ids.shape[1] - len(speech_ids_prefix) : -1]

    speech_tokens = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
    speech_tokens = extract_speech_ids(speech_tokens)
    speech_tokens = torch.tensor(speech_tokens).cuda().unsqueeze(0).unsqueeze(0)

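    # Decode the speech tokens back to a waveform and cut off the reconstructed reference audio at the start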
    gen_wav = Codec_model.decode_code(speech_tokens)
    gen_wav = gen_wav[:, :, prompt_wav.shape[1] :]
    sf.write(output_filename, gen_wav[0, 0, :].cpu().numpy(), 16000)

This should fix it; the README is also adjusted. Generation times should not be affected much: generation takes ~10 s on an RTX 3090, though model loading takes some time.

SebastianBodza changed discussion status to closed
