Elie Bakouch

eliebak

AI & ML interests

Training LLMs @ 🤗

Organizations

Hugging Face, HuggingFaceBR4, Hugging Face H4, Blog-explorers, Hugging Face TB Research, huggingPartyParis, Nanotron Research, MLX Community, Hugging Face SMOL, HuggingFaceFW, HuggingFaceFW-Dev, LLHF, llmc, SLLHF, Argilla Warehouse, nltpt, smol-explorers, Open Science, Hugging Face Science, open/ acc, Open R1

eliebak's activity

reacted to lewtun's post with πŸš€πŸ”₯ 17 days ago
We are reproducing the full DeepSeek R1 data and training pipeline so everybody can use their recipe. Instead of doing it in secret, we can do it together in the open!

πŸ§ͺ Step 1: replicate the R1-Distill models by distilling a high-quality reasoning corpus from DeepSeek-R1.

🧠 Step 2: replicate the pure RL pipeline that DeepSeek used to create R1-Zero. This will involve curating new, large-scale datasets for math, reasoning, and code.

πŸ”₯ Step 3: show we can go from base model -> SFT -> RL via multi-stage training.

Follow along: https://github.com/huggingface/open-r1
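
As a rough illustration of Step 1, here is a minimal sketch of distilling reasoning traces from a teacher model into an SFT dataset. This is not the Open R1 implementation; the teacher checkpoint, prompts, and output path are placeholder choices.

# Minimal sketch of the Step 1 idea: sample reasoning traces from a teacher model
# and collect them into an SFT dataset. Not the Open R1 pipeline; the teacher
# checkpoint, prompts, and output path are placeholders.
from datasets import Dataset
from transformers import pipeline

teacher = pipeline(
    "text-generation",
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",  # small stand-in teacher for illustration
)

prompts = [
    "Solve step by step: what is 17 * 23?",
    "Prove that the sum of two even numbers is even.",
]

records = []
for prompt in prompts:
    # Greedy decoding keeps the traces deterministic for this toy example.
    out = teacher(prompt, max_new_tokens=512, do_sample=False, return_full_text=False)
    records.append({"prompt": prompt, "completion": out[0]["generated_text"]})

# The collected traces become the supervised fine-tuning corpus.
distilled = Dataset.from_list(records)
distilled.save_to_disk("r1-distill-sketch")
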
reacted to Kseniase's post with πŸ”₯ about 1 month ago
10 Free Comprehensive Datasets for Supervised Fine-Tuning

The quality, size, and relevance of datasets directly impact the effectiveness of fine-tuning and models' real-world applications. Among the numerous datasets for different tasks, it can be challenging to choose the most comprehensive one that best suits your purposes.

So today, we invite you to explore the top 10 free datasets for natural language processing and math (a quick loading sketch follows the list):

1. fka/awesome-chatgpt-prompts offers a huge variety of prompts that can be used with ChatGPT. Over 700 models were trained on this dataset.

2. HuggingFaceFW/fineweb from Hugging Face includes 15T tokens of cleaned and deduplicated English web data. It's suitable for LLM training, benchmarking, and model validation.

3. HuggingFaceFW/fineweb-2 is another version of FineWeb, with high-quality pretraining data covering over 1,000 languages.

4. O1-OPEN/OpenO1-SFT with Chinese and English data can be used for Chain-of-Thought activation.

5. yahma/alpaca-cleaned is a curated version of the original Alpaca Dataset released by Stanford.

6. lmsys/lmsys-chat-1m contains 1 million real-world conversations with 25 state-of-the-art LLMs and offers diverse use cases, like content moderation, safety benchmarks, and training instruction-following models.

7. allenai/dolma from Allen AI includes 3T tokens from a diverse mix of web content, academic publications, code, books, and encyclopedic materials.

Math datasets:

1. HuggingFaceTB/finemath consists of educational math content and has two versions: 34B tokens and 54B tokens.

2. amphora/QwQ-LongCoT-130K for training O1-like LLMs.

3. openai/gsm8k for training multi-step reasoning.
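
All of these load directly with the datasets library; here is a quick sketch for two of them (the split and config names below should be double-checked against each dataset card before use):

# Quick sketch: loading a couple of the datasets above with the `datasets` library.
# Split/config names should be verified against each dataset card.
from datasets import load_dataset

alpaca = load_dataset("yahma/alpaca-cleaned", split="train")
gsm8k = load_dataset("openai/gsm8k", "main", split="train")  # gsm8k also has a "socratic" config

print(alpaca[0]["instruction"])
print(gsm8k[0]["question"])
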
reacted to anton-l's post with πŸ”₯ about 2 months ago
Introducing πŸ“π…π’π§πžπŒπšπ­π‘: the best public math pre-training dataset with 50B+ tokens!
HuggingFaceTB/finemath

Math remains challenging for LLMs, and by training on FineMath we see considerable gains over other math datasets, especially on GSM8K and MATH.

We build the dataset by:
πŸ› οΈ carefully extracting math data from Common Crawl;
πŸ”Ž iteratively filtering and recalling high quality math pages using a classifier trained on synthetic annotations to identify math reasoning and deduction.
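
As a rough illustration of that second step, here is a minimal sketch of classifier-based filtering. The classifier id, label name, and threshold are hypothetical placeholders, not the actual FineMath classifier.

# Minimal sketch of classifier-based quality filtering (not the FineMath pipeline).
# "your-org/math-quality-classifier", the "math" label, and the 0.9 threshold are hypothetical.
from transformers import pipeline

scorer = pipeline("text-classification", model="your-org/math-quality-classifier")

pages = [
    "We prove the claim by induction on n ...",
    "Buy cheap widgets online today!",
]

kept = []
for page in pages:
    pred = scorer(page, truncation=True)[0]  # e.g. {"label": "math", "score": 0.97}
    if pred["label"] == "math" and pred["score"] > 0.9:
        kept.append(page)
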

We conducted a series of ablations comparing the performance of Llama-3.2-3B-Base after continued pre-training on FineMath and observed notable gains compared to the baseline model and other public math datasets.

We hope this helps advance the performance of LLMs on math and reasoning! πŸš€
We’re also releasing all the ablation models as well as the evaluation code.

HuggingFaceTB/finemath-6763fb8f71b6439b653482c2
reacted to cfahlgren1's post with ❀️ 3 months ago
You can clean and format datasets entirely in the browser with a few lines of SQL.

In this post, I replicate the process @mlabonne used to clean the new microsoft/orca-agentinstruct-1M-v1 dataset.

The cleaning process consists of:
- Joining the separate splits together and adding a split column
- Converting string messages into a list of structs
- Removing empty system prompts

https://huggingface.co/blog/cfahlgren1/the-beginners-guide-to-cleaning-a-dataset

Here's his new cleaned dataset: mlabonne/orca-agentinstruct-1M-v1-cleaned
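
If you prefer to run the same kind of SQL locally, here is a minimal DuckDB sketch of the first and last steps. The parquet file names, the messages column, and the assumption that the system prompt is the first message are placeholders, not the actual orca-agentinstruct schema.

# Minimal local DuckDB sketch of the cleaning steps described above.
# File names, the `messages` column, and the system-prompt position are assumptions.
import duckdb

con = duckdb.connect()

# Join two example splits together and add a split column.
con.sql("""
    CREATE TABLE combined AS
    SELECT *, 'creative_content' AS split FROM 'creative_content.parquet'
    UNION ALL
    SELECT *, 'text_modification' AS split FROM 'text_modification.parquet'
""")

# Drop rows whose first (system) message is empty, assuming `messages` is a JSON string.
con.sql("""
    SELECT * FROM combined
    WHERE json_extract_string(messages, '$[0].content') <> ''
""").write_parquet("orca-agentinstruct-cleaned.parquet")
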
reacted to charlesdedampierre's post with πŸ”₯ 5 months ago
Please check out the Open Source AI Network: we mapped the top 500 HF users based on their followers' profiles.

The map can be found here: bunkalab/mapping_the_OS_community
reacted to rwightman's post with πŸ”₯πŸš€ 6 months ago
The latest timm validation & test set results are now viewable in a leaderboard space: timm/leaderboard

As of yesterday, I updated all of the results for the ImageNet, ImageNet-ReaL, ImageNet-V2, ImageNet-R, ImageNet-A, and Sketch sets. The csv files can be found in the GH repo https://github.com/huggingface/pytorch-image-models/tree/main/results
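
For anyone who wants to poke at the raw numbers, the per-dataset csv files load directly with pandas; a quick sketch (the exact file and column names are assumptions to verify against the results folder):

# Quick sketch of loading one of the timm results csv files with pandas.
# File and column names should be verified against the results/ folder in the repo.
import pandas as pd

url = "https://raw.githubusercontent.com/huggingface/pytorch-image-models/main/results/results-imagenet.csv"
df = pd.read_csv(url)
print(df.sort_values("top1", ascending=False)[["model", "top1", "top5"]].head(10))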

Unfortunately, the latest benchmark csv files are not yet up to date; there are some gaps between the dataset results and the throughput/FLOP numbers that impact the plots.

h/t to @MohamedRashad for making the first timm leaderboard.
posted an update 8 months ago
Wow, impressive 340B model by NVIDIA with a nice permissive license! 🚀 The technical report is full of insights and seems to use a learning rate schedule other than cosine, probably a variant of WSD. Hope to get more info on that! 👀

nvidia/nemotron-4-340b-666b7ebaf1b3867caf2f1911
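
For readers unfamiliar with WSD (warmup-stable-decay): the learning rate is warmed up, held constant for most of training, and only decayed near the end. A generic sketch of such a schedule (an illustration only, not the schedule from the Nemotron-4 report):

# Generic warmup-stable-decay (WSD) learning rate schedule, for illustration only;
# this is not the exact schedule used for Nemotron-4 340B.
def wsd_lr(step, max_steps, peak_lr, warmup_steps, decay_steps, min_lr=0.0):
    if step < warmup_steps:
        # Linear warmup from 0 to the peak learning rate.
        return peak_lr * step / warmup_steps
    if step < max_steps - decay_steps:
        # Stable phase: hold the peak learning rate.
        return peak_lr
    # Decay phase: linear ramp from peak_lr down to min_lr over the final decay_steps.
    progress = (step - (max_steps - decay_steps)) / decay_steps
    return peak_lr + (min_lr - peak_lr) * progress

# Example: 10k steps total, 500 warmup steps, decay over the last 1k steps.
schedule = [wsd_lr(s, 10_000, 3e-4, 500, 1_000) for s in range(10_000)]
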
reacted to dvilasuero's post with πŸ”₯πŸš€β€οΈπŸ€— 8 months ago
Today is a huge day in Argilla’s history. We couldn’t be more excited to share this with the community: we’re joining Hugging Face!

We’re embracing a larger mission, becoming part of a brilliant and kind team and a shared vision about the future of AI.

Over the past year, we've been collaborating with Hugging Face on countless projects: being a launch partner for Docker Spaces, empowering the community to clean Alpaca translations into Spanish and other languages, launching argilla/notus-7b-v1 building on Zephyr's learnings, running the Data is Better Together initiative with hundreds of community contributors, and releasing argilla/OpenHermesPreferences, one of the largest open preference tuning datasets.

After more than 2,000 Slack messages and over 60 people collaborating for over a year, it already felt like we were part of the same team, pushing in the same direction. After a week of the smoothest transition you can imagine, we’re now the same team.

To those of you who’ve been following us, this won’t be a huge surprise, but it will be a big deal in the coming months. This acquisition means we’ll double down on empowering the community to build and collaborate on high quality datasets, we’ll bring full support for multimodal datasets, and we’ll be in a better place to collaborate with the Open Source AI community. For enterprises, this means that the Enterprise Hub will unlock highly requested features like single sign-on and integration with Inference Endpoints.

As a founder, I am proud of the Argilla team. We're now part of something bigger and a larger team but with the same values, culture, and goals. Grateful to have shared this journey with my beloved co-founders Paco and AmΓ©lie.

Finally, huge thanks to the Chief Llama Officer @osanseviero for sparking this and being such a great partner during the acquisition process.

Would love to answer any questions you have so feel free to add them below!
reacted to kargaranamir's post with πŸ‘ 8 months ago
Introducing GlotCC: a new 2TB corpus based on an early 2024 CommonCrawl snapshot with data for 1000+ languages.

πŸ€— corpus v1: cis-lmu/GlotCC-V1
🐱 pipeline v3: https://github.com/cisnlp/GlotCC

More details? Stay tuned for our upcoming paper.
More data? In the next version, we plan to include additional snapshots of CommonCrawl.

Limitation: Due to the lower frequency of low-resource languages compared to others, there are sometimes only a few sentences available for very low-resource languages. However, the data volume for English in this version stands at 750GB, and the top 200 languages still have a strong presence in our data (see the attached plot; the axis is labeled every 20 languages, so the 10th label corresponds to the 200th language).
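
To give it a quick try, the corpus can be streamed per language with the datasets library; a small sketch (the config naming below, language code plus script, is an assumption to verify on the dataset card):

# Small sketch of streaming one language from GlotCC.
# The "eng-Latn" config name is an assumption; check the dataset card for the exact naming.
from datasets import load_dataset

glotcc_eng = load_dataset("cis-lmu/GlotCC-V1", "eng-Latn", split="train", streaming=True)
print(next(iter(glotcc_eng)))
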
reacted to fdaudens's post with πŸ‘€ 10 months ago
How do Microsoft and Alphabet (Google) results compare?

Microsoft Reports Rising Revenues as A.I. Investments Bear Fruit
- 17% jump in revenue and a 20% increase in profit for the first three months of the year.
- Revenue was $61.9 billion, up from $52.9 billion a year earlier.
- Profit hit $21.9 billion, up from $18.3 billion.
- More than a fifth of that growth came from its generative A.I. services.
https://www.nytimes.com/2024/04/25/technology/microsoft-earnings.html

Alphabet’s Revenue Jumps 15% to $80.5 Billion
- $80.5 billion in quarterly sales, up 15% from a year earlier. Profit climbed 36% to $23.7 billion.
- For the first time, it announced a dividend, of 20 cents per share.
- It spent $12 billion on capital expenditures in the first quarter, soaring 91% from a year earlier.
https://www.nytimes.com/2024/04/25/technology/alphabet-earnings.html

Meta’s Open Source Llama 3 Is Already Nipping at OpenAI’s Heels - Wired
- "if open source models prove competitive, developers and entrepreneurs may decide to stop paying to access the latest model from OpenAI or Google and use Llama 3 or one of the other increasingly powerful open source models that are popping up."
- "Open models appear to be dropping at an impressive clip."
https://www.wired.com/story/metas-open-source-llama-3-nipping-at-openais-heels/
reacted to trisfromgoogle's post with πŸš€πŸ”₯ 10 months ago
Very excited to share the first two official Gemma variants from Google! Today at Google Cloud Next, we announced cutting-edge models for code and research!

First, google/codegemma-release-66152ac7b683e2667abdee11 - a new set of code-focused Gemma models at 2B and 7B, in both pretrained and instruction-tuned variants. These exhibit outstanding performance on academic benchmarks and (in my experience) real-life usage. Read more in the excellent Hugging Face blog: https://huggingface.co/blog/codegemma

Second, google/recurrentgemma-release-66152cbdd2d6619cb1665b7a, which is based on the outstanding Google DeepMind research on Griffin: https://arxiv.org/abs/2402.19427. RecurrentGemma is a research variant that enables higher throughput and vastly improved memory usage. We are excited about new architectures, especially in the lightweight Gemma sizes, where innovations like RecurrentGemma can scale modern AI to many more use cases.

For details on the launches of these models, check out our launch blog -- and please do not hesitate to send us feedback. We are excited to see what you build with CodeGemma and RecurrentGemma!

Huge thanks to the Hugging Face team for helping ensure that these models work flawlessly in the Hugging Face ecosystem at launch!
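
To get a feel for CodeGemma, here is a minimal completion sketch with transformers (the model id and prompt are just examples; the checkpoints are gated, so accept the license on the Hub first):

# Minimal sketch of code completion with CodeGemma via transformers.
# Model id and prompt are examples; access to the checkpoint may require accepting the license.
from transformers import pipeline

generator = pipeline("text-generation", model="google/codegemma-2b")
out = generator("def fibonacci(n):", max_new_tokens=64, do_sample=False)
print(out[0]["generated_text"])
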
reacted to manu's post with ❀️ about 1 year ago
These past months, I've been busy baking a special sort of Croissant 🥐 with an awesome team!

πŸ₯ CroissantLLM is a truly bilingual language model trained on 3 trillion tokens of French and English data. In its size category (<2B), it is the best model in French, but it also rivals the best monolingual English models !

πŸ’Ύ To train it, we collected, filtered and cleaned huge quantities of permissively licensed French data, across various domains (legal, administrative, cultural, scientific), and different text modalities (speech transcriptions, movie subtitles, encyclopedias, forums, webpages)...

βš–οΈ Assessing LLM performance is not easy, especially outside of English, and to this end we crafted a novel evaluation benchmark, FrenchBench, aiming to assess reasoning, factual knowledge, and linguistic capabilities of models in French !

🔎 The best current LLMs are hidden behind a shroud of mystery, trained with undisclosed training data mixes or strategies. We go the opposite way, releasing all of the project's artefacts (model checkpoints, data, training details, evaluation benchmarks...). We obtain 81% of the Stanford FMTI transparency criteria, far ahead of even most open initiatives!

🧪 Beyond a powerful industrial resource, our transparent initiative is a stepping stone for many scientific questions! How does teaching a model two languages instead of one split its monolingual ability? Does training on so much French help the model integrate French-centric knowledge and cultural biases? How does the model memorize the training data?

Many more things to say, for those interested, I recommend checking out:

πŸ—žοΈ The blogpost: https://huggingface.co/blog/manu/croissant-llm-blog
πŸ“– The 45 page report with lots of gems: https://arxiv.org/abs/2402.00786
πŸ€– Models, Data, Demo: https://huggingface.co/croissantllm
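
And for a very quick taste, the base checkpoint runs through the standard transformers API; a small sketch (the model id is assumed from the croissantllm org, and the generation settings are arbitrary):

# Small sketch of generating French text with CroissantLLM.
# Model id assumed from the croissantllm org; generation settings are arbitrary.
from transformers import pipeline

croissant = pipeline("text-generation", model="croissantllm/CroissantLLMBase")
out = croissant("La recette d'un bon croissant commence par", max_new_tokens=60, do_sample=True, temperature=0.7)
print(out[0]["generated_text"])
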
reacted to dvilasuero's post with ❀️ about 1 year ago
πŸ”₯ Less is more for DPO, high quality matters!

πŸ“’ Dropping our first open dataset and LLM of the year:

πŸ’ΎMeet distilabel Orca Pairs DPO, an improved version of the now famous dataset from Intel:

argilla/distilabel-intel-orca-dpo-pairs


πŸ›οΈ And a new OpenHermes fine-tune outperforming baselines with 54% less DPO pairs:

https://huggingface.co/argilla/distilabeled-Hermes-2.5-Mistral-7B

You can use this new dataset for your DPO tuning, just like this:


from datasets import load_dataset

# Instead of this:
# dataset = load_dataset("Intel/orca_dpo_pairs", split="train")

# use this:
dataset = load_dataset("argilla/distilabel-intel-orca-dpo-pairs", split="train")

# Keep only pairs that are not ties, have a highly rated chosen response,
# and are not contaminated with GSM8K train examples.
dataset = dataset.filter(
    lambda r:
        r["status"] != "tie" and
        r["chosen_score"] >= 8 and
        not r["in_gsm8k_train"]
)

This will reduce the size of the original by 54% while giving you better quality preferences!

What should we build next?



reacted to akhaliq's post with πŸ‘ about 1 year ago
Self-Rewarding Language Models

paper page: Self-Rewarding Language Models (2401.10020)

Fine-tuning Llama 2 70B on three iterations of our approach yields a model that outperforms many existing systems on the AlpacaEval 2.0 leaderboard, including Claude 2, Gemini Pro, and GPT-4 0613.