Yanis L PRO

Pendrokar

AI & ML interests

STT/STS/TTS: you know, something that is solvable

Recent Activity

updated a dataset 7 minutes ago
Pendrokar/TTS_Arena
updated a dataset 32 minutes ago
Pendrokar/TTS_Arena
updated a dataset about 1 hour ago
Pendrokar/TTS_Arena

Organizations

xVASynth TTS, Hugging Face Discord Community

Pendrokar's activity

replied to Keltezaa's post 3 days ago

Did you not notice that each ZeroGPU Space has a 12-core (or was it 24-core?) server-type CPU? That is more powerful than what you get with a CPU-Upgrade Space. And you get 10 for $9!!! A bargain!

reacted to hexgrad's post with 👍 3 days ago
I wrote an article about G2P: https://hf.co/blog/hexgrad/g2p

G2P is an underrated piece of small TTS models, like offensive linemen who do a bunch of work and get no credit.

Instead of relying on explicit G2P, larger speech models implicitly learn this task by eating many thousands of hours of audio data. They often use a 500M+ parameter LLM at the front to predict latent audio tokens over a learned codebook, then decode these tokens into audio.

Kokoro instead relies on G2P preprocessing, is 82M parameters, and thus needs less audio to learn. Because of this, we can cherrypick high fidelity audio for training data, and deliver solid speech for those voices. In turn, this excellent audio quality & lack of background noise helps explain why Kokoro is very competitive in single-voice TTS Arenas.
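
To make the G2P (grapheme-to-phoneme) step concrete, here is a toy sketch of a lexicon lookup that turns text into phoneme symbols before any acoustic model sees it. This is not Kokoro's actual G2P pipeline: `TOY_LEXICON` and `toy_g2p` are hypothetical names, and the two entries use ARPAbet-style symbols purely for illustration.

```python
# Toy grapheme-to-phoneme lookup (illustrative only, not Kokoro's G2P).
# TOY_LEXICON is a hypothetical two-word lexicon with ARPAbet-style symbols.
TOY_LEXICON = {
    "hello": ["HH", "AH0", "L", "OW1"],
    "world": ["W", "ER1", "L", "D"],
}


def toy_g2p(text: str) -> list[str]:
    """Map each word to phonemes; unknown words become a single <unk> token."""
    phonemes: list[str] = []
    for word in text.lower().split():
        word = word.strip(".,!?")  # drop simple leading/trailing punctuation
        phonemes.extend(TOY_LEXICON.get(word, ["<unk>"]))
    return phonemes


print(toy_g2p("Hello, world!"))
```

A real G2P front end also has to handle out-of-vocabulary words, homographs, and numbers, which is where most of the unglamorous work goes.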
reacted to Xenova's post with 🔥 3 days ago
We did it. Kokoro TTS (v1.0) can now run 100% locally in your browser w/ WebGPU acceleration. Real-time text-to-speech without a server. ⚡️

Generate 10 seconds of speech in ~1 second for $0.

What will you build? 🔥
webml-community/kokoro-webgpu

The most difficult part was getting the model running in the first place, but the next steps are simple:
✂️ Implement sentence splitting, allowing for streamed responses
🌍 Multilingual support (only phonemization left)

Who wants to help?
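
The throughput quoted in the post above (10 seconds of speech in about 1 second) can be expressed as a real-time factor. The sketch below only restates that arithmetic; `real_time_factor` is a hypothetical helper, and the inputs are the rounded numbers from the post, not a benchmark.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Return synthesis time divided by audio duration (below 1.0 is faster than real time)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


# Rounded figures from the post: ~1 s of compute for 10 s of audio.
rtf = real_time_factor(synthesis_seconds=1.0, audio_seconds=10.0)
speedup = 1.0 / rtf
print(f"RTF = {rtf:.2f}, about {speedup:.0f}x faster than real time")
```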
replied to their post 9 days ago

Which model out of the 8 models listed on my post?

posted an update 9 days ago
TTS: Added Kokoro v1, Parler Large, LlaSa 3B & MARS 6 TTS models to the Arena.
Pendrokar/TTS-Spaces-Arena

I had also added MaskGCT, GPT-SoVITS & OuteTTS a month ago. The OuteTTS devs did say that it is too early for it to be added to TTS Arenas.

MARS 5 does have a Space with open-weights models, but inference is way too slow (2 minutes+).
reacted to fdaudens's post with 🔥 10 days ago
🎯 Kokoro TTS just hit v1.0! 🚀

Small but mighty: 82M parameters, runs locally, speaks multiple languages. The best part? It's Apache 2.0 licensed!
This could unlock so many possibilities ✨

Check it out: hexgrad/Kokoro-82M
reacted to hexgrad's post with 🔥 13 days ago
reacted to StephenGenusa's post with 👀 23 days ago
I have a Pro account and I am logged in. I duplicated a Space because of the error "You have exceeded your GPU quota". I am showing 0 GPU use, yet I am still unable to use it: "You have exceeded your GPU quota (60s requested vs. 44s left). Create a free account to get more daily usage quota." And "Expert Support" is a pitch for consulting.
replied to their post 27 days ago

After 4000 votes, F5 TTS fell near the bottom of the leaderboard. I have extracted a sample from Emilia. Let us see if that changes anything.

reacted to dylanebert's post with 🤗 about 1 month ago
🟦 New Image-to-3D model from Stability AI

stabilityai/stable-point-aware-3d

here's how it looks, with TRELLIS for comparison
reacted to hexgrad's post with 🔥 about 1 month ago
📣 Looking for labeled, high-quality synthetic audio/TTS data 📣 Have you been or are you currently calling API endpoints from OpenAI, ElevenLabs, etc? Do you have labeled audio data sitting around gathering dust? Let's talk! Join https://discord.gg/QuGxSWBfQy or comment down below.

If your data exceeds quantity & quality thresholds and is approved into the next hexgrad/Kokoro-82M training mix, and you permissively DM me the data under an effective Apache license, then I will DM back the corresponding voicepacks for YOUR data if/when the next Apache-licensed Kokoro base model drops.

What does this mean? If you've been calling closed-source TTS or audio API endpoints to:
- Build voice agents
- Make long-form audio, like audiobooks or podcasts
- Handle customer support, etc
Then YOU can contribute to the training mix and get useful artifacts in return. ❤️

More details at hexgrad/Kokoro-82M#21
reacted to hexgrad's post with 🔥 about 2 months ago
Merry Christmas! 🎄 Open sourced a small TTS model at hexgrad/Kokoro-82M
replied to hexgrad's post 2 months ago

The original Arena's threshold is at 700 votes. But I am sure Kokoro will hold the position. The voice quality actually sounds close to ElevenLabs.

But StyleTTS is usually not very emotional, so it will fail where Edge TTS does: on phrases where the voice has to be sad or angry. For example, Parler Expresso was overly jolly.

reacted to hexgrad's post with 🔥 2 months ago
self.brag(): Kokoro finally got 300 votes in Pendrokar/TTS-Spaces-Arena after @Pendrokar was kind enough to add it 3 weeks ago.
Discounting the small sample size of votes, I think it is safe to say that hexgrad/Kokoro-TTS is currently a top 3 model among the contenders in that Arena. This is notable because:
- At 82M params, Kokoro is one of the smaller models in the Arena
- MeloTTS has 52M params
- F5 TTS has 330M params
- XTTSv2 has 467M params
reacted to hexgrad's post with 👍 3 months ago
@Respair just dropped Tsukasa: frontier TTS in Japanese Respair/Tsukasa_Speech
It's expressive, punches way above its weight class, and supports voice cloning. Go check it out! 🚀
(Unmute the audio sample below after hitting play)
replied to their post 3 months ago

True, a sample from the original dataset would probably be best. My attempt to fetch one from the Emilia dataset was unsuccessful, as the HF dataset viewer can only show the German samples. Emilia's homepage does give an ASMR-y example prompt.

replied to their post 3 months ago

True about the narration-style sample, but that still did not stop XTTS from surpassing F5. Both use the same sample.

posted an update 3 months ago
TTS: Sorry, I just cannot get the hype behind F5 TTS. It has now gathered a thousand votes in the TTS Arena fork and **has remained in the #8 spot** against the _mostly_ open TTS adversaries.

The voice sample used is the same as XTTS. F5 has so far been unstable, being unemotional/monotone/depressed and mispronouncing words (_awestruck_).

If you have suggestions please give feedback in the following thread:
mrfakename/E2-F5-TTS#32
reacted to hexgrad's post with 🔥 3 months ago
posted an update 3 months ago
Added @amphion's MaskGCT and @hexgrad's fine-tuned StyleTTS model, named Kokoro, to the forked TTS Arena Space. If the preliminary results hold, these two may end up in the top 5 of all TTS models. 🤞🍀

Pendrokar/TTS-Spaces-Arena
Svngoku/maskgct-audio-lab
hexgrad/Kokoro-TTS

I chose @Svngoku's forked HF Space over amphion's because of the latter's excessive ZeroGPU duration demand: 300s!

amphion/maskgct

Had to remove @mrfakename's MetaVoice-1B Space from the available models, as that Space has been down for quite some time. 🤕

mrfakename/MetaVoice-1B-v0.1

I'm close to syncing the code to the original Arena's code structure. Then I'd like to use ASR in order to validate and create synthetic public datasets from the generated samples. And then make the Arena multilingual, which will surely attract quite the crowd!
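
The ASR validation idea above could be sketched roughly as follows: transcribe each generated sample, compare the transcript with the prompt text, and keep the sample only if the word error rate (WER) is low. This is a minimal sketch under stated assumptions, not the Arena's code. `word_error_rate`, `keep_sample`, and the 0.1 threshold are hypothetical, and the transcript is assumed to come from whichever ASR model ends up being used.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    # Normalization here is only lowercasing and whitespace splitting;
    # punctuation is left as-is.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the first i-1 reference words
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def keep_sample(prompt: str, transcript: str, max_wer: float = 0.1) -> bool:
    """Hypothetical filter: accept a generated sample if its ASR transcript is close to the prompt."""
    return word_error_rate(prompt, transcript) <= max_wer


# One substituted word out of four gives a WER of 0.25, so the sample is rejected.
print(word_error_rate("the crowd was awestruck", "the crowd was oarstruck"))
print(keep_sample("the crowd was awestruck", "the crowd was oarstruck"))
```

A filter like this would also catch mispronunciations only when the ASR model transcribes them differently, so the threshold would need tuning against manually checked samples.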