Made a Webapp for Kokoro (fixed missing first chunk when downloading)
Hey there,
I made a simple web interface + backend using Kokoro. It's still in the very early stages (I just started Friday evening), but maybe somebody wants to try it and give me some feedback and ideas.
Next things I wanna add are:
- .wav file storage for the backend so audio doesn't always have to be regenerated
- bookmarks
Also things like the progress bar and volume bar still have to be improved and I gotta have the frontend tell the backend to stop generating when switching files etc. :P
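For the .wav storage idea, here is a minimal sketch of one way it could work: name each cached file after a hash of the text and the voice, and only synthesize on a miss. Everything here is an assumption for illustration, not the project's code: `cache_key`, `get_or_generate`, and the `synthesize` callable (standing in for whatever the backend uses to produce WAV bytes) are hypothetical names.

```python
import hashlib
from pathlib import Path
from typing import Callable


def cache_key(text: str, voice: str) -> str:
    """Derive a stable file name from the inputs that determine the audio."""
    digest = hashlib.sha256(f"{voice}\n{text}".encode("utf-8")).hexdigest()
    return f"{digest}.wav"


def get_or_generate(
    text: str,
    voice: str,
    cache_dir: Path,
    synthesize: Callable[[str, str], bytes],  # hypothetical TTS call returning WAV bytes
) -> bytes:
    """Return cached WAV bytes if present, otherwise synthesize and store them."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / cache_key(text, voice)
    if path.exists():
        return path.read_bytes()
    wav_bytes = synthesize(text, voice)
    path.write_bytes(wav_bytes)
    return wav_bytes
```

If other settings change the audio (speed, for example), they would need to go into the hash as well, otherwise stale files get served.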
Update: So I improved the UI and State Management a bit.
With long playback, the switch from chunk playback to a consolidated buffer (for the progress bar and to enable .wav downloads) is still a bit janky. If someone has got an elegant solution for that, I'd be grateful.
Ah, and pressing reset stops backend audio generation now, which makes this almost usable :P
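One option for the download side, sketched on the backend rather than in the browser: keep every chunk, including the first, in a list and write them into a single WAV in one pass. This does not fix the playback handover itself. It assumes the chunks are raw 16-bit mono PCM at 24 kHz; both the format and the name `chunks_to_wav` are assumptions.

```python
import io
import wave


def chunks_to_wav(chunks: list[bytes], sample_rate: int = 24000) -> bytes:
    """Join raw 16-bit mono PCM chunks into one in-memory WAV file."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)        # mono (assumed)
        wav.setsampwidth(2)        # 2 bytes per sample = 16-bit PCM (assumed)
        wav.setframerate(sample_rate)
        for chunk in chunks:       # order matters: first chunk goes in first
            wav.writeframes(chunk)
    return buffer.getvalue()
```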
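A common pattern for this kind of stop is a cancel flag that the generation loop checks between chunks. The sketch below is not the actual backend; `generate_chunks` and the `synthesize` callable are hypothetical, and it assumes generation happens sentence by sentence.

```python
import threading
from typing import Callable, Iterable, Iterator


def generate_chunks(
    sentences: Iterable[str],
    synthesize: Callable[[str], bytes],  # hypothetical per-sentence TTS call
    cancel: threading.Event,
) -> Iterator[bytes]:
    """Yield audio chunk by chunk, stopping as soon as the cancel flag is set."""
    for sentence in sentences:
        if cancel.is_set():
            return  # reset was pressed: skip the remaining sentences
        yield synthesize(sentence)
```

The reset endpoint would then just call `cancel.set()`. A chunk that is already being synthesized still finishes, so the stop takes effect at the next chunk boundary.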
Added docker-compose setup. Should work on Linux, Win and Mac, both ARM and x86
If you are looking for ideas and features, check other projects like https://github.com/NeuralFalconYT/Kokoro-82M-WebUI and https://github.com/NeuralFalconYT/kokoro_v1.
I am currently using https://github.com/NeuralFalconYT/kokoro_v1, but I am having issues with it not recognizing that I installed the Japanese and Chinese languages with pip. It also seems to be missing the Hindi files somehow (errors get thrown, at least on my PC).
But what really got me interested is that the first link (and a few other tutorials on YouTube) shows a way to mix voices.
If you manage to implement actual voice mixing for new/all voices and perhaps add an option to 'save' the result as a custom voice and give it a name, so that you can load the interface and just select that voice from a menu, that would be amazing.
Additionally, since I have not actually looked into the code behind Kokoro: does it use phonemes when asking for audio? Do you think providing Polish words in the form of phonemes would work to generate Polish audio? Just curious how it works behind the scenes.
Anyway, I will probably check it out soon, and if you implement voice mixing that would be awesome.
If you want a more ambitious feature, add a way to interpret text by speaker. You provide a tab with 'Multi Speakers': Speaker_01 - TEXT..., then Speaker_02 - TEXT..., etc., and then it combines those calls into a single or multi-part wav for playback in your interface.
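As a rough sketch of what mixing plus saving could look like: blend the voices as a weighted average and store the result under a name. This assumes each voice can be treated as a fixed-length list of floats, which is an assumption about the voice format and not something checked against Kokoro's code. `mix_voices`, `save_custom_voice`, and `load_custom_voices` are hypothetical names.

```python
import json
from pathlib import Path


def mix_voices(voices: dict[str, list[float]], weights: dict[str, float]) -> list[float]:
    """Blend the named voices as a weighted average (weights are normalized)."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    vectors = [voices[name] for name in weights]
    if len({len(v) for v in vectors}) != 1:
        raise ValueError("all voices must have the same length")
    mixed = [0.0] * len(vectors[0])
    for name, weight in weights.items():
        for i, value in enumerate(voices[name]):
            mixed[i] += value * weight / total
    return mixed


def save_custom_voice(name: str, vector: list[float], path: Path) -> None:
    """Store the mixed voice under a user-chosen name in a JSON file."""
    saved = json.loads(path.read_text()) if path.exists() else {}
    saved[name] = vector
    path.write_text(json.dumps(saved))


def load_custom_voices(path: Path) -> dict[str, list[float]]:
    """Load all saved custom voices, e.g. to fill a voice menu."""
    return json.loads(path.read_text()) if path.exists() else {}
```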
Another feature I would love to see is automatic generation from files in directory.
You add a browser/text box for file location (directory). Press Generate and it will read contents of the file, generate it, then go to next file, generate it etc. That - that would make me basically instantly switch to your interface as I am so tired of copy pasting text into interface while working with large number of files and text.
You could also add a custom box to specify the output directory for those files. I had to modify my Whisper Web UI to add such a feature... a hack of a job, but it works somehow. It is a much better solution than going to the output folder and sorting data in a default folder when the app can output directly to the desired location.
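The request above could be sketched roughly like this: walk the input directory, synthesize each text file in turn, and write the result to a chosen output directory. The `synthesize` callable is a hypothetical stand-in for the real TTS call, and limiting the scan to `*.txt` files is an assumption.

```python
from pathlib import Path
from typing import Callable


def generate_from_directory(
    input_dir: Path,
    output_dir: Path,
    synthesize: Callable[[str], bytes],  # hypothetical: text in, WAV bytes out
) -> list[Path]:
    """Generate one .wav per .txt file, in sorted order, into output_dir."""
    output_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for text_file in sorted(input_dir.glob("*.txt")):
        text = text_file.read_text(encoding="utf-8")
        target = output_dir / (text_file.stem + ".wav")
        target.write_bytes(synthesize(text))
        written.append(target)
    return written
```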
Haha, about the phonemes, I do not know. I am not one of the creators of the model. Pretty sure it uses them, though I do not know if your suggestion would work. I only thought about them in the context of having my backend API track which word is spoken at what time in each chunk and sending "word timestamps" back to the frontend to highlight the word that is currently being spoken.
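Without phoneme-level timing from the model, a crude stand-in is to spread each chunk's duration over its words in proportion to word length. This is a heuristic swapped in for illustration, not how the backend works, and `estimate_word_timestamps` is a hypothetical name.

```python
def estimate_word_timestamps(text: str, duration: float) -> list[tuple[str, float, float]]:
    """Approximate (word, start, end) in seconds, weighting words by character count."""
    words = text.split()
    total_chars = sum(len(word) for word in words)
    if total_chars == 0:
        return []
    timestamps = []
    start = 0.0
    for word in words:
        end = start + duration * len(word) / total_chars
        timestamps.append((word, start, end))
        start = end
    return timestamps
```

The estimate ignores pauses and punctuation, so highlighting would drift on longer chunks; real alignment data would be needed for anything accurate.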
The Multi Speaker thing would be pretty easy to implement with the backend I already got and I really like the Idea so I will probably give that a go the next couple days. For direct text entry I would probably abstract it in a way where you select <speaker 1 voice>, <speaker 2 voice>, enter a single block of text and wrap it in something like:
+++ 1
Text Block
+++
+++ 2
Next segment
+++
+++ 2
Guy is talking much
+++
+++ 1
Last Segment
+++
That format would also let you directly upload markdown / txt files written this way, and it keeps things a lot less cluttered when working with potentially more than 2 voices.
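A minimal sketch of how the `+++` format above could be parsed into (speaker, text) segments. The rules are assumptions read off the example: `+++ N` opens a block for speaker N, a bare `+++` closes it, and text outside blocks is ignored. `parse_speaker_blocks` is a hypothetical name.

```python
def parse_speaker_blocks(markup: str) -> list[tuple[int, str]]:
    """Split '+++ N ... +++' markup into ordered (speaker number, text) segments."""
    segments: list[tuple[int, str]] = []
    speaker = None
    lines: list[str] = []
    for raw in markup.splitlines():
        line = raw.strip()
        if line.startswith("+++"):
            label = line[3:].strip()
            if label:  # opening fence, e.g. "+++ 2"
                if speaker is not None:
                    raise ValueError("block opened before the previous one was closed")
                speaker = int(label)
                lines = []
            else:  # closing fence
                if speaker is None:
                    raise ValueError("closing +++ without an open block")
                segments.append((speaker, "\n".join(lines).strip()))
                speaker = None
        elif speaker is not None:
            lines.append(raw)
    if speaker is not None:
        raise ValueError("last block was never closed")
    return segments
```

Each segment can then be sent to the backend with the voice selected for that speaker number, and the resulting audio joined in order.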
Thanks for the Inspiration! Voice Mixing will come too, just gotta think about how I'll implement it UI wise since I do not like cluttered shit
Well, you do not have to copy-paste with mine: upload your files and click on them. If you make a profile, they persist until you delete them.
Anyway, a sequential read from a directory sounds nice. I probably will not add that to a web UI though. I am planning on making native macOS and Windows apps after this and releasing them in the app stores for like 2€. The Vue webapp will stay open source, and I wanna finish that and have it nice and polished before anything else.
I think I'd prefer a "read stuff in sequence" function though, where you can select any of the files you have uploaded, order them in a specific way, maybe save that order as a preset, and play them. This would also fit a web app, I think.
The main challenges, it seems, are keeping it stupid simple and nice looking while having all the functionality I want, and my lack of practice with Vue and JavaScript. This is my first time doing anything with JavaScript and UI.
Just added multiple file upload. So you can select 5 files at once, it will upload, you can put their contents in the text box by clicking them and they will persist in the profile till deleted. Not exactly what you wanted but better than before xD
Added Multi Speaker Mode. I still gotta do some audio normalization so the voices are the same loudness, but that should just be a couple lines of code.
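One simple way to do that normalization is to scale each speaker's samples to a common RMS level. This is a sketch under assumptions: samples are floats in the range -1.0 to 1.0, the target level of 0.1 is arbitrary, and `rms` and `normalize_rms` are hypothetical names.

```python
import math


def rms(samples: list[float]) -> float:
    """Root-mean-square level of a block of samples."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def normalize_rms(samples: list[float], target_rms: float = 0.1) -> list[float]:
    """Scale samples so their RMS matches target_rms, clipping to [-1.0, 1.0]."""
    current = rms(samples)
    if current == 0.0:
        return list(samples)  # silence: nothing to scale
    gain = target_rms / current
    return [max(-1.0, min(1.0, s * gain)) for s in samples]
```

RMS is only a rough proxy for perceived loudness, so voices with very different timbre may still sound uneven after this.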
Next thing will be voice mixing so it is possible to make them sound a bit more similar in tone hopefully.