Hugging Face
rasgaard
7 followers · 36 following
AI & ML interests
None yet
Recent Activity
upvoted an article about 9 hours ago
From Llasa to Llasagna 🍕: Finetuning LLaSA to generates Italian speech and other languages
liked a model 21 days ago
hexgrad/Kokoro-82M
reacted to davanstrien's post with 🤗 29 days ago
Introducing scandi-fine-web-cleaner https://huggingface.co/davanstrien/scandi-fine-web-cleaner, the first model trained on FineWeb-C community annotations!

FineWeb2 is a massive multilingual dataset for pre-training language models. Like any web-scale dataset, it contains low-quality content. How can we improve it?

Over the past months, an amazing community of 400+ annotators has been labelling content quality (using Argilla) across 23 languages through the FineWeb-C initiative. Today, I'm happy to share the first classifier trained on this data.

🔍 What we've built:
- A lightweight classifier that efficiently removes low-quality content
- 90%+ precision demonstrated on Danish & Swedish
- Can process the 43M+ documents in Danish FineWeb2 with minimal compute

🌍 Why this matters:
The approach can be reproduced for any of the 23 languages in FineWeb-C (https://huggingface.co/datasets/data-is-better-together/fineweb-c). We can improve training data quality at scale without massive compute resources by starting with community annotations and training small, efficient classifiers.

Want to build a classifier for your language? Check out the full blog post with code examples and implementation details: https://danielvanstrien.xyz/posts/2025/FineWeb-c/scandinavian-content-filtering-fineweb.html
Organizations
Papers
1
arxiv:2305.17154
models
12
rasgaard/luke-base-newsgroups-finetuned • Text Classification • Updated Feb 28, 2024 • 119
rasgaard/luke-base-newsgroups-probe • Text Classification • Updated Feb 28, 2024 • 140
rasgaard/squeezebert-newsgroups-finetuned • Text Classification • Updated Feb 28, 2024 • 178
rasgaard/squeezebert-newsgroups-probe • Text Classification • Updated Feb 28, 2024 • 167
rasgaard/distilbert-newsgroups-finetuned • Text Classification • Updated Feb 28, 2024 • 198
rasgaard/distilbert-newsgroups-probe • Text Classification • Updated Feb 28, 2024 • 219
rasgaard/bert-newsgroups-finetuned • Text Classification • Updated Feb 28, 2024 • 162
rasgaard/bert-newsgroups-probe • Text Classification • Updated Feb 28, 2024 • 179
rasgaard/roberta-newsgroups-finetuned • Text Classification • Updated Feb 28, 2024 • 225
rasgaard/roberta-newsgroups-probe • Text Classification • Updated Feb 28, 2024 • 213
datasets
3
rasgaard/mmi-bendr-preprocessed • Viewer • Updated Feb 19, 2024 • 4.41k • 87
rasgaard/20_newsgroups • Viewer • Updated Sep 13, 2023 • 18.8k • 283
rasgaard/FTRACE-Synth • Viewer • Updated Feb 20, 2023 • 3.2M • 48