Data Is Better Together

community

Activity Feed

AI & ML interests

Building better datasets together

Recent Activity

davanstrien updated a dataset about 9 hours ago

data-is-better-together/fineweb-c

davanstrien updated a dataset about 12 hours ago

data-is-better-together/fineweb-c-progress

guipenedo updated a dataset 1 day ago

data-is-better-together/fineweb-c

data-is-better-together's activity

davanstrien

updated a dataset about 9 hours ago

data-is-better-together/fineweb-c

Viewer • Updated about 9 hours ago • 62.1k • 1.86k • 39

burtenshaw

posted an update about 11 hours ago

Post

1472

The Hugging Face agents course is finally out!

👉 https://huggingface.co/agents-course

This first unit of the course sets you up with all the fundamentals to become a pro in agents.

- What's an AI Agent?
- What are LLMs?
- Messages and Special Tokens
- Understanding AI Agents through the Thought-Action-Observation Cycle
- Thought, Internal Reasoning and the Re-Act Approach
- Actions, Enabling the Agent to Engage with Its Environment
- Observe, Integrating Feedback to Reflect and Adapt

davanstrien

updated a dataset about 12 hours ago

data-is-better-together/fineweb-c-progress

Viewer • Updated about 12 hours ago • 793 • 334 • 3

guipenedo

updated a dataset 1 day ago

data-is-better-together/fineweb-c

Viewer • Updated about 9 hours ago • 62.1k • 1.86k • 39

davidberenstein1957

posted an update 1 day ago

Post

1009

Fine-tune DeepSeek-R1 with a Synthetic Reasoning Dataset

Blog: https://huggingface.co/blog/sdiazlor/fine-tune-deepseek-with-a-synthetic-reasoning-data

burtenshaw

posted an update 4 days ago

Post

3024

SmolLM2 paper is out! 😊

😍 Why do I love it? Because it facilitates teaching and learning!

Over the past few months I've engaged with (no joke) thousands of students working with SmolLM.

- People have run inference on, fine-tuned, aligned, and evaluated this smol model.
- People have used their own machines and free tools like Colab, Kaggle, and Spaces.
- People tackled use cases in their job, for fun, in their own language, and with their friends.

Upvote the paper SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model (2502.02737)

1 reply

plaguss

authored a paper 5 days ago

SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model

Paper • 2502.02737 • Published 7 days ago • 153

guipenedo

authored a paper 5 days ago

SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model

Paper • 2502.02737 • Published 7 days ago • 153

gabrielmbmb

authored a paper 5 days ago

SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model

Paper • 2502.02737 • Published 7 days ago • 153

burtenshaw

authored a paper 5 days ago

SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model

Paper • 2502.02737 • Published 7 days ago • 153

davidberenstein1957

posted an update 5 days ago

Post

1869

Agentic RAG: Applied, visual, and step-by-step! 🐾

Get familiar with the agents and tools, not the bells and whistles!

Retrieve - Augment and now GENERATE.

Part 3: https://huggingface.co/blog/davidberenstein1957/ai-blueprint-agentic-rag-part-3-generate

davidberenstein1957

posted an update 6 days ago

Post

2696

Anyone can create free hosted tools for their AI agents! 🔥

Agentic RAG stack, part 2: augment
Reranking the retrieval results improves the relevance of the returned content without adding much latency.

Part 2: https://huggingface.co/blog/davidberenstein1957/ai-blueprint-agentic-rag-part-2-augment
Code: https://github.com/huggingface/ai-blueprint
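
To make the reranking idea concrete, here is a minimal sketch in plain Python. It is not the code from the blog post or the ai-blueprint repository: the function names, the toy word-overlap scorer, and the sample documents are all assumptions made for illustration. In practice the scorer would likely be a dedicated reranking model. The point it shows is that the more careful scoring only runs over the small candidate set that retrieval already returned, which is why the extra cost stays modest.

```python
def overlap_score(query, doc):
    """Toy relevance scorer (hypothetical): fraction of query words found in the doc."""
    query_words = set(query.lower().split())
    doc_words = set(doc.lower().split())
    if not query_words:
        return 0.0
    return len(query_words & doc_words) / len(query_words)


def rerank(query, candidates, score, top_n=2):
    """Rescore the retrieved candidates and return the top_n, best first."""
    ranked = sorted(candidates, key=lambda doc: score(query, doc), reverse=True)
    return ranked[:top_n]


# Candidates as they might come back from a first-stage retriever (invented data).
candidates = [
    "DuckDB reads Parquet files",
    "Reranking improves retrieval results",
    "Video models generate effects",
]

best = rerank("reranking retrieval results", candidates, overlap_score, top_n=2)
print(best[0])
```

Passing the scorer in as an argument keeps the sketch independent of any particular model, so a stronger scorer could be swapped in without changing `rerank`.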

davidberenstein1957

posted an update 7 days ago

Post

1867

Creating an agentic RAG stack on the Hugging Face Hub - part 1 - retrieval (1/5).

🚀 Web apps and microservices included!

Chunk, embed and index documents at a huge scale without overhead.

Blog: https://huggingface.co/blog/davidberenstein1957/ai-blueprint-agentic-rag-part-1-retrieve

sayakpaul

posted an update 12 days ago

Post

1845

We have been cooking a couple of fine-tuning runs on CogVideoX with finetrainers, smol datasets, and LoRA to generate cool video effects like crushing, dissolving, etc.

We are also releasing a utility for extracting a LoRA from a fully fine-tuned checkpoint. I know that kind of stuff has existed forever, but the quality on video models was nothing short of spectacular. Below are some links:

* Models and datasets: https://huggingface.co/finetrainers
* finetrainers: https://github.com/a-r-r-o-w/finetrainers
* LoRA extraction: https://github.com/huggingface/diffusers/blob/main/scripts/extract_lora_from_model.py

1 reply

davidberenstein1957

posted an update 12 days ago

Post

1554

TL;DR: Parquet is awesome, DuckDB too!

Datasets on the Hugging Face Hub rely on Parquet files. We can interact with these files using DuckDB as a fast in-memory database system. One of DuckDB’s features is vector similarity search, which can be used with or without an index.

Blog: https://huggingface.co/learn/cookbook/vector_search_with_hub_as_backend
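
As a rough illustration of what similarity search without an index amounts to, the sketch below scores every stored vector against the query and keeps the best matches. It is written in plain Python rather than DuckDB SQL, and the embeddings, texts, and function names are invented for the example. It is not the cookbook's code. An index would avoid scanning every row, at the cost of building and maintaining it.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query, rows, k=2):
    """Exhaustive search: score every row against the query, return the k best."""
    scored = [(cosine_similarity(query, embedding), text) for text, embedding in rows]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy (text, embedding) rows standing in for a dataset column (invented data).
ROWS = [
    ("parquet files", [1.0, 0.0, 0.0]),
    ("duckdb queries", [0.9, 0.1, 0.0]),
    ("video generation", [0.0, 0.0, 1.0]),
]

results = top_k([1.0, 0.0, 0.0], ROWS, k=2)
for similarity, text in results:
    print(f"{similarity:.3f}  {text}")
```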

davanstrien

posted an update 13 days ago

Post

1773

Why choose between strong LLM reasoning and efficient models?

Use DeepSeek to generate high-quality training data, then distil that knowledge into ModernBERT (answerdotai/ModernBERT-base) for fast, efficient classification.

Blog post: https://danielvanstrien.xyz/posts/2025/deepseek/distil-deepseek-modernbert.html

davanstrien

posted an update 14 days ago

Post

1865

Updated the ColPali Query Generator Space (davanstrien/ColPali-Query-Generator) to use Qwen/Qwen2.5-VL-7B-Instruct.

Given an input image, it generates several queries along with explanations to justify them. This approach can generate synthetic data for fine-tuning ColPali models.

sayakpaul

posted an update 15 days ago

Post

1912

We have authored a post to go over the state of video generation in the Diffusers ecosystem 🧨

We cover the supported models, the optimization knobs our users can turn, fine-tuning, and more 🔥

5-6 GB for HunyuanVideo, the sky is the limit 🌌 🤗
https://huggingface.co/blog/video_gen

davanstrien

posted an update 15 days ago

Post

2005

🌍 Big step for multilingual AI data!

The Hugging Face community has rated educational content in languages spoken by 1.6 billion people! New additions:
• Japanese
• Italian
• Old High German

Learn more and contribute: https://huggingface.co/blog/davanstrien/fineweb2-community

These ratings can help enhance training data for major world languages.

1 reply

davidberenstein1957

posted an update 15 days ago

Post

1754

Let's uncover the post-training dataset from DeepSeek-R1 with Magpie!

Pass the pre-query tokens <｜begin▁of▁sentence｜>User: and let the model generate the rest.

We can get realistic examples!

Gist: https://gist.github.com/davidberenstein1957/3f20046ce57395a6aba13f8b4e956b59
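
The idea can be sketched in a few lines of Python. This is not the code from the gist: `generate` is a hypothetical callable that wraps whatever model you run, `fake_generate` is a stand-in so the example executes without one, and the "Assistant: " marker used in the second step is an assumption rather than the model's confirmed chat template. The first call lets the model fill in the empty user turn, and the second feeds that instruction back to collect a response.

```python
from typing import Callable, Dict

# Pre-query tokens as given in the post, with a trailing space after "User:".
PRE_QUERY = "<｜begin▁of▁sentence｜>User: "


def magpie_sample(generate: Callable[[str], str], pre_query: str = PRE_QUERY) -> Dict[str, str]:
    """Build one instruction/response pair from the pre-query prompt alone."""
    # Step 1: the model completes the empty user turn, yielding an instruction.
    instruction = generate(pre_query).strip()
    # Step 2: feed the instruction back to get a response.
    # The "Assistant: " marker is an assumption made for this sketch.
    prompt = f"{pre_query}{instruction}\n\nAssistant: "
    response = generate(prompt).strip()
    return {"instruction": instruction, "response": response}


def fake_generate(prompt: str) -> str:
    """Stand-in for a real model call (hypothetical) so the sketch runs offline."""
    if prompt.endswith("Assistant: "):
        return " 4 "
    return " What is 2 + 2? "


sample = magpie_sample(fake_generate)
print(sample)
```

Replacing `fake_generate` with a function that calls a real model is the only change needed to collect actual examples.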

6 replies

Team members 15