Some Observations on Pacing and Intonation
I reworked some code that had been using Llasa-3B to use OuteTTS instead. On the positive side, OuteTTS requires only a fraction of the VRAM that Llasa-3B and xcodec2 do, is slightly faster, and--most importantly--produces much more reliable results (Llasa-3B fails about 30% of the time). On the negative side, the voice cloning and intonation do not match the quality of Llasa-3B when it succeeds.
The biggest issue I've had with OuteTTS is the pacing. It feels a bit rushed and monotone and seems to mostly ignore breaks like commas and periods.
That said, good work. The implementation was straightforward, and I hope it continues to be developed. So many TTS projects seem to die, and very few run reliably and offer voice cloning with this small a footprint.
Hey,
> The biggest issue I've had with OuteTTS is the pacing. It feels a bit rushed and monotone and seems to mostly ignore breaks like commas and periods.
Yes, v0.2 did not support punctuation. v0.3 (https://huggingface.co/OuteAI/OuteTTS-0.3-1B) addresses this issue by adding punctuation support: "The following punctuation marks are supported: '.', '!', '?', ',', '"', '„', '¡', '¿', '…', '...', '。', '!', '?', ',', '؟'. These are converted into special tokens, for instance, . is transformed into <|period|>."
I would suggest trying this model instead.
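To illustrate what "converted into special tokens" means, here is a rough sketch of that kind of substitution. It is not the actual OuteTTS preprocessing code: only the `.` to `<|period|>` pair comes from the quote above, while the other token names, the `PUNCTUATION_TOKENS` table, and the `mark_punctuation` helper are made up for illustration.

```python
import re

# Only '.' -> <|period|> is taken from the quoted description.
# Every other token name below is a hypothetical placeholder.
PUNCTUATION_TOKENS = {
    "...": "<|ellipsis|>",          # hypothetical name
    ".": "<|period|>",              # from the quoted description
    ",": "<|comma|>",               # hypothetical name
    "!": "<|exclamation_mark|>",    # hypothetical name
    "?": "<|question_mark|>",       # hypothetical name
}

# Try longer marks first so "..." is matched before a single ".".
_PATTERN = re.compile(
    "|".join(
        re.escape(mark)
        for mark in sorted(PUNCTUATION_TOKENS, key=len, reverse=True)
    )
)


def mark_punctuation(text: str) -> str:
    """Replace each supported punctuation mark with its special token."""
    return _PATTERN.sub(lambda m: PUNCTUATION_TOKENS[m.group(0)], text)


print(mark_punctuation("Hi, there."))
```

Because the marks become explicit tokens in the prompt, the model has something concrete to learn pauses from, which is presumably why pacing around commas and periods should improve compared to v0.2.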
> On the negative side, the voice cloning and intonation do not match the quality of Llasa-3B when it succeeds.
I am working on improving voice cloning for a future release, which will use a different audio reconstruction model along with other changes.