Elie Bakouch

eliebak

AI & ML interests

Training LLMs @ 🤗

Organizations

Hugging Face, HuggingFaceBR4, Hugging Face H4, Blog-explorers, Hugging Face TB Research, huggingPartyParis, Nanotron Research, MLX Community, Hugging Face SMOL, HuggingFaceFW, HuggingFaceFW-Dev, LLHF, llmc, SLLHF, Argilla Warehouse, nltpt, smol-explorers, Open Science, Hugging Face Science, open/ acc, Open R1

eliebak's activity

reacted to lewtun's post with πŸš€πŸ”₯ 17 days ago
We are reproducing the full DeepSeek R1 data and training pipeline so everybody can use their recipe. Instead of doing it in secret, we can do it together in the open!

πŸ§ͺ Step 1: replicate the R1-Distill models by distilling a high-quality reasoning corpus from DeepSeek-R1.

🧠 Step 2: replicate the pure RL pipeline that DeepSeek used to create R1-Zero. This will involve curating new, large-scale datasets for math, reasoning, and code.

πŸ”₯ Step 3: show we can go from base model -> SFT -> RL via multi-stage training.

Follow along: https://github.com/huggingface/open-r1
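
As a rough illustration of Step 1, here is a minimal sketch of distilling reasoning traces from a teacher model into an SFT dataset. This is not the Open R1 implementation; the teacher checkpoint, prompts, and output path are placeholder choices.

# Minimal sketch of the Step 1 idea: sample reasoning traces from a teacher model
# and collect them into an SFT dataset. Not the Open R1 pipeline; the teacher
# checkpoint, prompts, and output path are placeholders.
from datasets import Dataset
from transformers import pipeline

teacher = pipeline(
    "text-generation",
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",  # small stand-in teacher for illustration
)

prompts = [
    "Solve step by step: what is 17 * 23?",
    "Prove that the sum of two even numbers is even.",
]

records = []
for prompt in prompts:
    # Greedy decoding keeps the traces deterministic for this toy example.
    out = teacher(prompt, max_new_tokens=512, do_sample=False, return_full_text=False)
    records.append({"prompt": prompt, "completion": out[0]["generated_text"]})

# The collected traces become the supervised fine-tuning corpus.
distilled = Dataset.from_list(records)
distilled.save_to_disk("r1-distill-sketch")
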
reacted to Kseniase's post with πŸ”₯ about 1 month ago
10 Free Comprehensive Datasets for Supervised Fine-Tuning

The quality, size, and relevance of datasets directly impact the effectiveness of fine-tuning and models' real-world applications. Among the numerous datasets for different tasks, it can be challenging to choose the most comprehensive one that best suits your purposes.

So today, we invite you to explore the top 10 free datasets for natural language processing and math (a quick loading sketch follows the list):

1. fka/awesome-chatgpt-prompts offers a huge variety of prompts that can be used with ChatGPT. Over 700 models were trained on this dataset.

2. HuggingFaceFW/fineweb from Hugging Face includes 15T tokens of cleaned and deduplicated English web data. It's suitable for LLM training, benchmarking, and model validation.

3. HuggingFaceFW/fineweb-2 is another version of FineWeb, with high-quality pretraining data covering over 1,000 languages.

4. O1-OPEN/OpenO1-SFT with Chinese and English data can be used for Chain-of-Thought activation.

5. yahma/alpaca-cleaned is a curated version of the original Alpaca Dataset released by Stanford.

6. lmsys/lmsys-chat-1m contains 1 million real-world conversations with 25 state-of-the-art LLMs and offers diverse use cases, like content moderation, safety benchmarks, and training instruction-following models.

7. allenai/dolma from Allen AI includes 3T tokens from a diverse mix of web content, academic publications, code, books, and encyclopedic materials.

Math datasets:

1. HuggingFaceTB/finemath consists of educational math content and has two versions: 34B tokens and 54B tokens.

2. amphora/QwQ-LongCoT-130K for training O1-like LLMs.

3. openai/gsm8k for training multi-step reasoning.
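
All of these load directly with the datasets library; here is a quick sketch for two of them (the split and config names below should be double-checked against each dataset card before use):

# Quick sketch: loading a couple of the datasets above with the `datasets` library.
# Split/config names should be verified against each dataset card.
from datasets import load_dataset

alpaca = load_dataset("yahma/alpaca-cleaned", split="train")
gsm8k = load_dataset("openai/gsm8k", "main", split="train")  # gsm8k also has a "socratic" config

print(alpaca[0]["instruction"])
print(gsm8k[0]["question"])
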
reacted to anton-l's post with πŸ”₯ about 2 months ago
Introducing πŸ“π…π’π§πžπŒπšπ­π‘: the best public math pre-training dataset with 50B+ tokens!
HuggingFaceTB/finemath

Math remains challenging for LLMs, and by training on FineMath we see considerable gains over other math datasets, especially on GSM8K and MATH.

We build the dataset by:
πŸ› οΈ carefully extracting math data from Common Crawl;
πŸ”Ž iteratively filtering and recalling high quality math pages using a classifier trained on synthetic annotations to identify math reasoning and deduction.
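
As a rough illustration of that second step, here is a minimal sketch of classifier-based filtering. The classifier id, label name, and threshold are hypothetical placeholders, not the actual FineMath classifier.

# Minimal sketch of classifier-based quality filtering (not the FineMath pipeline).
# "your-org/math-quality-classifier", the "math" label, and the 0.9 threshold are hypothetical.
from transformers import pipeline

scorer = pipeline("text-classification", model="your-org/math-quality-classifier")

pages = [
    "We prove the claim by induction on n ...",
    "Buy cheap widgets online today!",
]

kept = []
for page in pages:
    pred = scorer(page, truncation=True)[0]  # e.g. {"label": "math", "score": 0.97}
    if pred["label"] == "math" and pred["score"] > 0.9:
        kept.append(page)
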

We conducted a series of ablations comparing the performance of Llama-3.2-3B-Base after continued pre-training on FineMath and observed notable gains compared to the baseline model and other public math datasets.

We hope this helps advance the performance of LLMs on math and reasoning! πŸš€
We’re also releasing all the ablation models as well as the evaluation code.

HuggingFaceTB/finemath-6763fb8f71b6439b653482c2
reacted to cfahlgren1's post with ❀️ 3 months ago
You can clean and format datasets entirely in the browser with a few lines of SQL.

In this post, I replicate the process @mlabonne used to clean the new microsoft/orca-agentinstruct-1M-v1 dataset.

The cleaning process consists of:
- Joining the separate splits together and adding a split column
- Converting string messages into a list of structs
- Removing empty system prompts

https://huggingface.co/blog/cfahlgren1/the-beginners-guide-to-cleaning-a-dataset

Here's his new cleaned dataset: mlabonne/orca-agentinstruct-1M-v1-cleaned
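
If you prefer to run the same kind of SQL locally, here is a minimal DuckDB sketch of the first and last steps. The parquet file names, the messages column, and the assumption that the system prompt is the first message are placeholders, not the actual orca-agentinstruct schema.

# Minimal local DuckDB sketch of the cleaning steps described above.
# File names, the `messages` column, and the system-prompt position are assumptions.
import duckdb

con = duckdb.connect()

# Join two example splits together and add a split column.
con.sql("""
    CREATE TABLE combined AS
    SELECT *, 'creative_content' AS split FROM 'creative_content.parquet'
    UNION ALL
    SELECT *, 'text_modification' AS split FROM 'text_modification.parquet'
""")

# Drop rows whose first (system) message is empty, assuming `messages` is a JSON string.
con.sql("""
    SELECT * FROM combined
    WHERE json_extract_string(messages, '$[0].content') <> ''
""").write_parquet("orca-agentinstruct-cleaned.parquet")
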
reacted to charlesdedampierre's post with πŸ”₯ 5 months ago
Please check out the Open Source AI Network: we mapped the top 500 HF users based on their followers' profiles.

The map can be found here: bunkalab/mapping_the_OS_community
reacted to rwightman's post with πŸ”₯πŸš€ 6 months ago
The latest timm validation & test set results are now viewable in a leaderboard space: timm/leaderboard

As of yesterday, I updated all of the results for the ImageNet, ImageNet-ReaL, ImageNet-V2, ImageNet-R, ImageNet-A, and Sketch sets. The csv files can be found in the GH repo https://github.com/huggingface/pytorch-image-models/tree/main/results
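
For anyone who wants to poke at the raw numbers, the per-dataset csv files load directly with pandas; a quick sketch (the exact file and column names are assumptions to verify against the results folder):

# Quick sketch of loading one of the timm results csv files with pandas.
# File and column names should be verified against the results/ folder in the repo.
import pandas as pd

url = "https://raw.githubusercontent.com/huggingface/pytorch-image-models/main/results/results-imagenet.csv"
df = pd.read_csv(url)
print(df.sort_values("top1", ascending=False)[["model", "top1", "top5"]].head(10))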

Unfortunately, the latest benchmark csv files are not yet up to date; there are some gaps between the dataset results and the throughput/FLOP numbers that impact the plots.

h/t to @MohamedRashad for making the first timm leaderboard.
posted an update 8 months ago
Wow, impressive 340B model by NVIDIA with a nice permissive license! 🚀 The technical report is full of insights and seems to use a learning rate schedule other than cosine, probably a variant of WSD. Hope to get more info on that! 👀

nvidia/nemotron-4-340b-666b7ebaf1b3867caf2f1911
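
For readers unfamiliar with WSD (warmup-stable-decay): the learning rate is warmed up, held constant for most of training, and only decayed near the end. A generic sketch of such a schedule (an illustration only, not the schedule from the Nemotron-4 report):

# Generic warmup-stable-decay (WSD) learning rate schedule, for illustration only;
# this is not the exact schedule used for Nemotron-4 340B.
def wsd_lr(step, max_steps, peak_lr, warmup_steps, decay_steps, min_lr=0.0):
    if step < warmup_steps:
        # Linear warmup from 0 to the peak learning rate.
        return peak_lr * step / warmup_steps
    if step < max_steps - decay_steps:
        # Stable phase: hold the peak learning rate.
        return peak_lr
    # Decay phase: linear ramp from peak_lr down to min_lr over the final decay_steps.
    progress = (step - (max_steps - decay_steps)) / decay_steps
    return peak_lr + (min_lr - peak_lr) * progress

# Example: 10k steps total, 500 warmup steps, decay over the last 1k steps.
schedule = [wsd_lr(s, 10_000, 3e-4, 500, 1_000) for s in range(10_000)]
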
reacted to dvilasuero's post with πŸ”₯πŸš€β€οΈπŸ€— 8 months ago
Today is a huge day in Argilla’s history. We couldn’t be more excited to share this with the community: we’re joining Hugging Face!

We’re embracing a larger mission, becoming part of a brilliant and kind team and a shared vision about the future of AI.

Over the past year, we've been collaborating with Hugging Face on countless projects: being a launch partner for Docker Spaces, empowering the community to clean Alpaca translations into Spanish and other languages, launching argilla/notus-7b-v1 building on Zephyr's learnings, running the Data is Better Together initiative with hundreds of community contributors, and releasing argilla/OpenHermesPreferences, one of the largest open preference tuning datasets.

After more than 2,000 Slack messages and over 60 people collaborating for over a year, it already felt like we were part of the same team, pushing in the same direction. After a week of the smoothest transition you can imagine, we’re now the same team.

To those of you who’ve been following us, this won’t be a huge surprise, but it will be a big deal in the coming months. This acquisition means we’ll double down on empowering the community to build and collaborate on high quality datasets, we’ll bring full support for multimodal datasets, and we’ll be in a better place to collaborate with the Open Source AI community. For enterprises, this means that the Enterprise Hub will unlock highly requested features like single sign-on and integration with Inference Endpoints.

As a founder, I am proud of the Argilla team. We're now part of something bigger and a larger team but with the same values, culture, and goals. Grateful to have shared this journey with my beloved co-founders Paco and AmΓ©lie.

Finally, huge thanks to the Chief Llama Officer @osanseviero for sparking this and being such a great partner during the acquisition process.

Would love to answer any questions you have so feel free to add them below!
reacted to kargaranamir's post with πŸ‘ 8 months ago
Introducing GlotCC: a new 2TB corpus based on an early 2024 CommonCrawl snapshot with data for 1000+ languages.

πŸ€— corpus v1: cis-lmu/GlotCC-V1
🐱 pipeline v3: https://github.com/cisnlp/GlotCC

More details? Stay tuned for our upcoming paper.
More data? In the next version, we plan to include additional snapshots of CommonCrawl.

Limitation: Due to the lower frequency of low-resource languages compared to others, there are sometimes only a few sentences available for very low-resource languages. However, the data volume for English in this version stands at 750GB, and the top 200 languages still have a strong presence in our data (see the attached plot; the axis is labeled every 20 languages, so the 10th label corresponds to the 200th language).
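
To give it a quick try, the corpus can be streamed per language with the datasets library; a small sketch (the config naming below, language code plus script, is an assumption to verify on the dataset card):

# Small sketch of streaming one language from GlotCC.
# The "eng-Latn" config name is an assumption; check the dataset card for the exact naming.
from datasets import load_dataset

glotcc_eng = load_dataset("cis-lmu/GlotCC-V1", "eng-Latn", split="train", streaming=True)
print(next(iter(glotcc_eng)))
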
reacted to fdaudens's post with πŸ‘€ 10 months ago
How do Microsoft and Alphabet (Google) results compare?

Microsoft Reports Rising Revenues as A.I. Investments Bear Fruit
- 17% jump in revenue and a 20% increase in profit for the first three months of the year.
- Revenue was $61.9 billion, up from $52.9 billion a year earlier.
- Profit hit $21.9 billion, up from $18.3 billion.
- More than a fifth of that growth came from its generative A.I. services.
https://www.nytimes.com/2024/04/25/technology/microsoft-earnings.html

Alphabet’s Revenue Jumps 15% to $80.5 Billion
- $80.5 billion in quarterly sales, up 15% from a year earlier. Profit climbed 36% to $23.7 billion.
- For the first time, it announced a dividend, of 20 cents per share.
- It spent $12 billion on capital expenditures in the first quarter, soaring 91% from a year earlier.
https://www.nytimes.com/2024/04/25/technology/alphabet-earnings.html

Meta’s Open Source Llama 3 Is Already Nipping at OpenAI’s Heels - Wired
- "if open source models prove competitive, developers and entrepreneurs may decide to stop paying to access the latest model from OpenAI or Google and use Llama 3 or one of the other increasingly powerful open source models that are popping up."
- "Open models appear to be dropping at an impressive clip."
https://www.wired.com/story/metas-open-source-llama-3-nipping-at-openais-heels/
reacted to trisfromgoogle's post with πŸš€πŸ”₯ 10 months ago
Very excited to share the first two official Gemma variants from Google! Today at Google Cloud Next, we announced cutting-edge models for code and research!

First, google/codegemma-release-66152ac7b683e2667abdee11 - a new set of code-focused Gemma models at 2B and 7B, in both pretrained and instruction-tuned variants. These exhibit outstanding performance on academic benchmarks and (in my experience) real-life usage. Read more in the excellent Hugging Face blog: https://huggingface.co/blog/codegemma

Second, google/recurrentgemma-release-66152cbdd2d6619cb1665b7a, which is based on the outstanding Google DeepMind research on Griffin: https://arxiv.org/abs/2402.19427. RecurrentGemma is a research variant that enables higher throughput and vastly improved memory usage. We are excited about new architectures, especially in the lightweight Gemma sizes, where innovations like RecurrentGemma can scale modern AI to many more use cases.

For details on the launches of these models, check out our launch blog -- and please do not hesitate to send us feedback. We are excited to see what you build with CodeGemma and RecurrentGemma!

Huge thanks to the Hugging Face team for helping ensure that these models work flawlessly in the Hugging Face ecosystem at launch!
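
To get a feel for CodeGemma, here is a minimal completion sketch with transformers (the model id and prompt are just examples; the checkpoints are gated, so accept the license on the Hub first):

# Minimal sketch of code completion with CodeGemma via transformers.
# Model id and prompt are examples; access to the checkpoint may require accepting the license.
from transformers import pipeline

generator = pipeline("text-generation", model="google/codegemma-2b")
out = generator("def fibonacci(n):", max_new_tokens=64, do_sample=False)
print(out[0]["generated_text"])
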
reacted to manu's post with ❀️ about 1 year ago
These past months, I've been busy baking a special sort of Croissant 🥐 with an awesome team!

πŸ₯ CroissantLLM is a truly bilingual language model trained on 3 trillion tokens of French and English data. In its size category (<2B), it is the best model in French, but it also rivals the best monolingual English models !

πŸ’Ύ To train it, we collected, filtered and cleaned huge quantities of permissively licensed French data, across various domains (legal, administrative, cultural, scientific), and different text modalities (speech transcriptions, movie subtitles, encyclopedias, forums, webpages)...

βš–οΈ Assessing LLM performance is not easy, especially outside of English, and to this end we crafted a novel evaluation benchmark, FrenchBench, aiming to assess reasoning, factual knowledge, and linguistic capabilities of models in French !

🔎 The best current LLMs are hidden behind a shroud of mystery, trained with undisclosed training data mixes or strategies. We go the opposite way, releasing all of the project's artefacts (model checkpoints, data, training details, evaluation benchmarks...). We obtain 81% of the Stanford FMTI transparency criteria, far ahead of even most open initiatives!

🧪 Beyond a powerful industrial resource, our transparent initiative is a stepping stone for many scientific questions! How does teaching a model two languages instead of one split its monolingual ability? Does training on so much French help the model integrate French-centric knowledge and cultural biases? How does the model memorize the training data?

Many more things to say, for those interested, I recommend checking out:

πŸ—žοΈ The blogpost: https://huggingface.co/blog/manu/croissant-llm-blog
πŸ“– The 45 page report with lots of gems: https://arxiv.org/abs/2402.00786
πŸ€– Models, Data, Demo: https://huggingface.co/croissantllm
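
And for a very quick taste, the base checkpoint runs through the standard transformers API; a small sketch (the model id is assumed from the croissantllm org, and the generation settings are arbitrary):

# Small sketch of generating French text with CroissantLLM.
# Model id assumed from the croissantllm org; generation settings are arbitrary.
from transformers import pipeline

croissant = pipeline("text-generation", model="croissantllm/CroissantLLMBase")
out = croissant("La recette d'un bon croissant commence par", max_new_tokens=60, do_sample=True, temperature=0.7)
print(out[0]["generated_text"])
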
reacted to dvilasuero's post with ❀️ about 1 year ago
πŸ”₯ Less is more for DPO, high quality matters!

πŸ“’ Dropping our first open dataset and LLM of the year:

πŸ’ΎMeet distilabel Orca Pairs DPO, an improved version of the now famous dataset from Intel:

argilla/distilabel-intel-orca-dpo-pairs


πŸ›οΈ And a new OpenHermes fine-tune outperforming baselines with 54% less DPO pairs:

https://huggingface.co/argilla/distilabeled-Hermes-2.5-Mistral-7B

You can use this new dataset for your DPO tuning, just like this:


from datasets import load_dataset

# Instead of this:
# dataset = load_dataset("Intel/orca_dpo_pairs", split="train")

# use this:
dataset = load_dataset("argilla/distilabel-intel-orca-dpo-pairs", split="train")

# Keep only pairs that are not ties, have a highly rated chosen response,
# and are not contaminated with GSM8K train examples.
dataset = dataset.filter(
    lambda r:
        r["status"] != "tie" and
        r["chosen_score"] >= 8 and
        not r["in_gsm8k_train"]
)

This will reduce the size of the original by 54% while giving you better quality preferences!

What should we build next?



reacted to akhaliq's post with πŸ‘ about 1 year ago
Self-Rewarding Language Models

paper page: Self-Rewarding Language Models (2401.10020)

Fine-tuning Llama 2 70B on three iterations of our approach yields a model that outperforms many existing systems on the AlpacaEval 2.0 leaderboard, including Claude 2, Gemini Pro, and GPT-4 0613.