I posted a poll on Twitter, and others have expressed interest in me adopting the convention of including the author's name in the model path when I upload.
It has a couple of advantages. First and foremost, of course, is making it clear who uploaded the original model (did Qwen upload Qwen2.6? Or did someone fine-tune Qwen2.5 and name it 2.6 for fun?)
The second is that it avoids collisions: if multiple people upload a model with the same name and I try to quant both, the repo names would normally collide and I'd be unable to upload both
I'll be implementing the change next week; there are just two final details I'm unsure about:
First, should the files also inherit the author's name?
Second, what to do in the case that the author name + model name pushes us past the character limit?
Haven't yet decided how to handle either case, so feedback is welcome, but also just providing this as a "heads up"
After some heated discussion 🔥, we clarify our intent re. storage limits on the Hub
TL;DR:
- public storage is free, and (unless blatant abuse) unlimited. We do ask that you consider upgrading to PRO and/or Enterprise Hub if possible
- private storage is paid above a significant free tier (1 TB if you have a paid account, 100 GB otherwise)
We optimize our infrastructure continuously to scale our storage for the coming years of growth in machine learning, to the benefit of the community 🔥
Before you panic, there's a new "preferred" method, which is online (I prefer the term on-the-fly) repacking: if you download Q4_0 and your setup can benefit from repacking the weights into interleaved rows (what Q4_0_4_4 was doing), it will do that automatically and give you similar performance (minor losses, I think, due to using intrinsics instead of assembly, but intrinsics are more maintainable)
So if you update your llama.cpp past that point, you won't be able to run Q4_0_4_4 (unless they add backwards compatibility back in), but Q4_0 should run at the same speeds (though it may currently be bugged on some platforms)
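To give a rough picture of what that repacking means, here's a minimal NumPy sketch of interleaving rows in groups of four so a SIMD kernel can stream blocks from four rows at once. This is my own illustration of the idea, not llama.cpp's actual code, and the group/block sizes are just the ones Q4_0_4_4 is described as using:

```python
import numpy as np

def repack_rows(weights: np.ndarray, group: int = 4, block: int = 32) -> np.ndarray:
    """Interleave `group` rows block-by-block (Q4_0 quantizes in 32-element blocks)."""
    rows, cols = weights.shape
    assert rows % group == 0 and cols % block == 0
    w = weights.reshape(rows // group, group, cols // block, block)
    # put corresponding blocks of the grouped rows next to each other in memory
    w = w.transpose(0, 2, 1, 3)
    return np.ascontiguousarray(w)

packed = repack_rows(np.arange(4 * 64, dtype=np.float32).reshape(4, 64))
print(packed.shape)  # (1, 2, 4, 32): 1 row group, 2 blocks per row, 4 rows interleaved
```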
As such, I'll stop making those quants soon, probably by the end of this week unless something changes, but you should be safe to download Q4_0 quants and use those!
IQ4_NL also supports repacking, though not in as many shapes yet, but it should get a respectable speedup on ARM chips; the PR for that can be found here: https://github.com/ggerganov/llama.cpp/pull/10541
Remember, these are not meant for Apple silicon since those use the GPU and don't benefit from the repacking of weights
Recently slaren over on llama.cpp refactored the model loader - in a way that's super awesome and very powerful - but with it came a break in support for "split tensor MoE models", which applies to older Mixtral models
You may have seen my upload of one such older Mixtral model, jondurbin/bagel-dpo-8x7b-v0.2, and with the newest changes it seems to run without issue
If you happen to run into issues with any other old mixtral models, drop a link here and I'll try to remake them with the new changes so that we can continue enjoying them :)
INTELLECT-1 is the first collaboratively trained 10-billion-parameter language model, trained from scratch on 1 trillion tokens of English text and code.
LLaVA-o1 - a VLM capable of spontaneous, systematic reasoning, similar to GPT-o1; the 11B model outperforms Gemini-1.5-Pro, GPT-4o-mini, and Llama-3.2-90B-Vision (Xkev/Llama-3.2V-11B-cot)
Jina CLIP v2 by Jina AI - a general-purpose multilingual and multimodal (text & image) embedding model, 900M params, 512 x 512 image resolution, Matryoshka representations (1024 down to 64 dims; see the small truncation sketch after this list) (jinaai/jina-clip-v2)
Athene v2 Chat & Agent by NexusFlow - SoTA general LLM fine-tuned from Qwen 2.5 72B, excels at chat + function calling / JSON / agents (Nexusflow/athene-v2-6735b85e505981a794fb02cc)
Orca AgentInstruct by Microsoft - 1 million instruct pairs covering text editing, creative writing, coding, reading comprehension, etc., permissively licensed (microsoft/orca-agentinstruct-1M-v1)
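As a tiny illustration of what Matryoshka representations buy you (my own sketch, not the jina-clip-v2 API), you can truncate a 1024-dim embedding to its leading 64 dims and re-normalize, trading a little accuracy for much cheaper storage and search:

```python
import numpy as np

# Stand-in for a 1024-dim jina-clip-v2 embedding; a random unit vector is used
# here purely for illustration, the real model returns embeddings via its own API.
full = np.random.randn(1024).astype(np.float32)
full /= np.linalg.norm(full)

small = full[:64]                # keep only the leading "Matryoshka" dimensions
small /= np.linalg.norm(small)   # re-normalize so cosine similarity still behaves
print(full.shape, small.shape)   # (1024,) (64,)
```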
Smol TTS models are here! OuteTTS-0.1-350M - zero-shot voice cloning, built on the LLaMA architecture, CC-BY license! 🔥
> Pure language modeling approach to TTS
> Zero-shot voice cloning
> LLaMA architecture w/ audio tokens (WavTokenizer)
> BONUS: Works on-device w/ llama.cpp ⚡
Three-step approach to TTS:
> Audio tokenization using WavTokenizer (75 tokens per second)
> CTC forced alignment for word-to-audio token mapping
> Structured prompt creation w/ transcription, duration, audio tokens
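For a rough sense of step 3, here's a hypothetical sketch of assembling such a prompt. The actual OuteTTS prompt format and special tokens almost certainly differ; the word dictionaries, durations, and token markers below are all made up for illustration:

```python
# Hypothetical prompt assembly: each word carries its duration (from CTC forced
# alignment) and the WavTokenizer ids aligned to that word. Token names are invented.
words = [
    {"word": "hello", "duration": 0.42, "audio_tokens": [101, 7, 88]},
    {"word": "world", "duration": 0.55, "audio_tokens": [23, 455, 9, 12]},
]

def build_prompt(words: list) -> str:
    parts = []
    for w in words:
        audio = "".join(f"<|a{t}|>" for t in w["audio_tokens"])
        parts.append(f"{w['word']}<|dur:{w['duration']:.2f}|>{audio}")
    return "<|speech_start|>" + " ".join(parts) + "<|speech_end|>"

print(build_prompt(words))
```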
The model is extremely impressive for 350M parameters! Kudos to the OuteAI team on such a brilliant feat - I'd love to see this applied to larger data and smarter backbones like SmolLM 🤗
> Trained on 1.3 trillion tokens (Dolma 1.7) across 16 nodes, each with 4 MI250 GPUs
> Three checkpoints:
- AMD OLMo 1B: Pre-trained model
- AMD OLMo 1B SFT: Supervised fine-tuned on the Tulu V2, OpenHermes-2.5, WebInstructSub, and Code-Feedback datasets
- AMD OLMo 1B SFT DPO: Aligned with human preferences using Direct Preference Optimization (DPO) on the UltraFeedback dataset
Key Insights:
> Pre-trained with less than half the tokens of OLMo-1B
> Post-training steps include two-phase SFT and DPO alignment
> Data for SFT:
- Phase 1: Tulu V2
- Phase 2: OpenHermes-2.5, WebInstructSub, and Code-Feedback
> Model checkpoints on the Hub & integrated with Transformers ⚡️
Congratulations & kudos to AMD on a brilliant smol model release! 🤗
What a great day for Open Science! @AIatMeta released models, datasets, and code for many of its research artefacts! 🔥
1. Meta Segment Anything Model 2.1: An updated checkpoint with improved results on visually similar objects, small objects and occlusion handling. A new developer suite will be added to make it easier for developers to build with SAM 2.
In regards to the latest mistral model and GGUFs for it:
Yes, they may be subpar and may require changes to llama.cpp to support the interleaved sliding window
Yes, I got excited when a conversion worked and released them ASAP
That said, generation seems to work right now and appears to mimic the output from Spaces that are running the original model
I have appended -TEST to the model names in an attempt to indicate that they are not final or perfect, but if people still feel misled and that it's not the right thing to do, please post (civilly) your thoughts below; I will strongly consider pulling the conversions if that's what people think is best. After all, that's what I'm here for, in service to you all!
What a great milestone to celebrate! The huggingface_hub library is slowly becoming a cornerstone of the Python ML ecosystem when it comes to interacting with the @huggingface Hub. It wouldn't be where it is without the hundreds of community contributions and all the feedback! Whether you are loading a model, sharing a dataset, running remote inference, or starting jobs on our infra, you are for sure using it! And this is only the beginning, so give a star if you wanna follow the project: https://github.com/huggingface/huggingface_hub
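For anyone who hasn't used it directly, here's about the smallest possible example of what the library handles for you; the repo and filename are chosen purely for illustration:

```python
from huggingface_hub import hf_hub_download

# Download a single file from a Hub repo into the local cache and get its path.
path = hf_hub_download(repo_id="gpt2", filename="config.json")
print(path)
```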
Less than two days ago, Kyutai Labs open sourced Moshi, a ~7.6B-parameter on-device speech-to-speech foundation model, and Mimi, a SoTA streaming speech codec! 🔥
The release includes:
1. Moshiko & Moshika - Moshi fine-tuned on synthetic data (CC-BY license) (kyutai/moshi-v01-release-66eaeaf3302bef6bd9ad7acd)
2. Mimi - streaming audio codec, processes 24 kHz audio down to a 12.5 Hz representation with a bandwidth of 1.1 kbps (CC-BY license) (kyutai/mimi)
3. Model checkpoints & inference codebase written in Rust (Candle), PyTorch & MLX (Apache license) (https://github.com/kyutai-labs/moshi)
How does Moshi work?
1. Moshi processes two audio streams: one for itself and one for the user, with the user's stream coming from audio input and Moshi's stream generated by the model.
2. Along with these audio streams, Moshi predicts text tokens for its speech, enhancing its generation quality.
3. The model uses a small Depth Transformer for codebook dependencies and a large 7B parameter Temporal Transformer for temporal dependencies.
4. The theoretical latency is 160ms, with a practical latency of around 200ms on an L4 GPU.
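To make the two-stream idea concrete, here's a toy PyTorch sketch of one generation step. It is my own simplified illustration only; the layer types, sizes, codebook count, and the GRU standing in for the small Depth Transformer are all assumptions, not Kyutai's implementation:

```python
import torch
import torch.nn as nn

class MoshiSketch(nn.Module):
    """Toy dual-stream model: user audio codes + Moshi's own audio codes + text tokens
    pass through a 'temporal' transformer, then a small 'depth' module predicts the
    codebooks of the next audio frame. All sizes are arbitrary placeholders."""
    def __init__(self, d_model=256, n_codebooks=8, vocab=2048):
        super().__init__()
        self.n_codebooks = n_codebooks
        self.embed = nn.Embedding(vocab, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=2)  # temporal dependencies
        self.depth = nn.GRU(d_model, d_model, batch_first=True)     # codebook dependencies (stand-in)
        self.head = nn.Linear(d_model, vocab)

    def forward(self, user_codes, moshi_codes, text_tokens):
        # sum the embeddings of both audio streams and the text stream per frame
        h = (self.embed(user_codes).sum(2)
             + self.embed(moshi_codes).sum(2)
             + self.embed(text_tokens))
        h = self.temporal(h)                             # (batch, frames, d_model)
        ctx = h[:, -1:].repeat(1, self.n_codebooks, 1)   # condition on the last frame
        out, _ = self.depth(ctx)
        return self.head(out)                            # logits for each next-frame codebook

model = MoshiSketch()
codes = torch.randint(0, 2048, (1, 10, 8))   # (batch, frames, codebooks)
text = torch.randint(0, 2048, (1, 10))       # one text token per frame
print(model(codes, codes, text).shape)       # torch.Size([1, 8, 2048])
```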
Model size & inference:
Moshiko/ka are 7.69B param models
- bf16: ~16 GB VRAM
- 8-bit: ~8 GB VRAM
- 4-bit: ~4 GB VRAM
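Those figures line up with simple back-of-the-envelope math for the weights alone (activations and KV cache add more on top):

```python
# Rough weight-memory estimate from the parameter count quoted above.
params = 7.69e9
for name, bytes_per_param in [("bf16", 2), ("8-bit", 1), ("4-bit", 0.5)]:
    print(f"{name}: ~{params * bytes_per_param / 1e9:.1f} GB for weights")
# bf16: ~15.4 GB, 8-bit: ~7.7 GB, 4-bit: ~3.8 GB
```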
You can run inference via Candle 🦀, PyTorch, and MLX - based on your hardware.
The Kyutai team, @adefossez, @lmz, and co. are cracked AF; they're bringing some serious firepower to the open source/science AI scene. Looking forward to what's next!