Guilherme Penedo
guipenedo
Recent Activity
upvoted an article about 1 hour ago
Finding Moroccan Arabic (Darija) in Fineweb 2
updated a dataset 1 day ago
data-is-better-together/fineweb-c
upvoted an article 1 day ago
Open R1: Update #2
Organizations
guipenedo's activity
Update 2025/2025-01-22-Torstar.md
#4 opened 11 days ago by guipenedo
New update returns a 500 server error using the datasets-server API
6
#18 opened about 2 months ago by jonna32
Synthetic Data Generator
1
#5 opened about 1 month ago by kishorekashyap
Cannot load with datasets
3
#4 opened about 1 month ago by mbanon
A lot of load errors after new update
14
#19 opened about 1 month ago by yzhangcs
Add "date" column to "default" subset
#20 opened about 1 month ago by lhoestq
Simple exact deduplication removes 2/3 of data.
4
#49 opened 6 months ago by egor-pakhomov
Torrent?
3
#4 opened 10 months ago by emilss
Any plan to train models on larger subset of dataset?
1
#8 opened 10 months ago by mrfakename
Are copyrighted works included in this dataset?
4
#9 opened 10 months ago by umm-maybe
Reprocessing for a new language
14
#12 opened 10 months ago by pere
Training configs for data ablation study
2
#14 opened 10 months ago by jimmyhbx
tiny-fineweb
3
#19 opened 10 months ago by 3thn
Unsafe files
1
#25 opened 9 months ago by alielfilali01
"Reproducing GPT-2 (124M) in llm.c in 90 minutes for $20" using fineweb by Karpathy
#28 opened 9 months ago by clem
Regarding the newly updated indexes (written as deduplication issues)
5
#29 opened 8 months ago by kimcando
Language subset
3
#33 opened 8 months ago by talmor
How to compute the aggregate score?
1
#35 opened 8 months ago by mornmirror
Why do you apply "All filters except the (very destructive) terminal_punct"?
3
#36 opened 8 months ago by bpwl0121