Reasoning Datasets Collection Distilled synthetic Reasoning datasets • 7 items • Updated 9 days ago • 50
Article Mastering Long Contexts in LLMs with KVPress By nvidia and 1 other • 19 days ago • 62
Article Explore, Curate and Vector Search Any Hugging Face Dataset with Nomic Atlas By MaxNomic and 4 others • 19 days ago • 30
Article Exploring Synthetic Data Generation with DataDreamer By asoria • 21 days ago • 6
Towards Best Practices for Open Datasets for LLM Training Paper • 2501.08365 • Published 28 days ago • 54
high-quality Chinese training datasets Collection a suite of high-quality Chinese datasets, used for pretraining, fine-tuning or preference alignment. And the models trained on these datasets. • 12 items • Updated 25 days ago • 10
Article Synthetic Data Generation with FastData and Hugging Face By asoria • Jan 7 • 14
Reasoning Datasets Collection Reasoning datasets that are trending 🔥 • 10 items • Updated Jan 3 • 24
Article Finding Moroccan Arabic (Darija) in Fineweb 2 By omarkamali and 3 others • Dec 8, 2024 • 21
Article Bridging the Gap Between Physical Numerical Simulations and Machine Learning: Introducing The Well By rubenohana • Dec 2, 2024 • 17
OLMo 2 Collection Artifacts for the second set of OLMo models. • 22 items • Updated about 17 hours ago • 81
Marqo-Ecommerce-Embeddings Collection State-of-the-art embedding models fine-tuned for the ecommerce domain. +67% increase in evaluation metrics vs ViT-B-16-SigLIP. • 10 items • Updated Nov 14, 2024 • 17
NLI Eval Datasets Collection A curated collection of NLI evaluation datasets. Each dataset is exactly as originally proposed • 19 items • Updated Nov 12, 2024 • 3
BhasaAnuvaad Collection A Speech Translation Dataset for 13 Indian Languages • 11 items • Updated 26 days ago • 14