From-scratch pretraining on English-only data: no synthetic data, no code, 3 epochs over 1 GB of data for a ~135M-parameter model. This is a test network using [Differential Transformer (Attention)](https://arxiv.org/abs/2410.05258). Aside from changes to the attention, namely 16 heads instead of 9 and differential attention, this is the same setup as https://huggingface.co/HuggingFaceTB/SmolLM2-135M-Instruct. A minimal sketch of the differential attention mechanism is included at the end of this card.

# Scripts

- `inference.py` runs the model with some test prompts.
- `test_train.py` runs with the exact configuration used to train this model and serves as the reproduction script. Data is assumed to be in JSONL format, one `{"text": "example text"}` object per line.

# Notes

The model appears to be very competent: it learned significantly faster than the GQA control and reached a slightly lower minimum loss. Runtime at this scale is roughly on par with the GQA/MHA control.

# Training Metrics

## Dataset Information

- Training data per epoch: 1 GB
- Total tokens trained: 48,261,120
- No synthetic data

## Training Results

- Final Train Loss: 2.8485
- Final Train Perplexity: 17.15

![image/png](https://cdn-uploads.huggingface.co/production/uploads/637f3b03932a61b89aefbf5c/2BCEqanzJUNh8uKZjp_cj.png)
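
# Differential Attention (sketch)

For readers unfamiliar with the linked paper, the sketch below illustrates the core idea of differential attention: two softmax attention maps are computed from split query/key projections and their difference (scaled by a learnable λ) is used to weight the values. This is **not** the code used to train this model; head layout, the λ re-parameterisation details, and the omission of RoPE and per-head normalization are simplifications assumed for illustration.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


class DifferentialAttention(nn.Module):
    """Minimal illustrative sketch of differential attention.

    Simplified from the Diff Transformer paper; not the training code of this repo.
    """

    def __init__(self, d_model: int, n_heads: int, lambda_init: float = 0.8):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        # Q and K are projected at twice the width, then split into the
        # two attention maps whose difference is taken.
        self.q_proj = nn.Linear(d_model, 2 * d_model, bias=False)
        self.k_proj = nn.Linear(d_model, 2 * d_model, bias=False)
        self.v_proj = nn.Linear(d_model, d_model, bias=False)
        self.out_proj = nn.Linear(d_model, d_model, bias=False)
        # Learnable lambda, re-parameterised roughly as in the paper:
        # lambda = exp(lq1 . lk1) - exp(lq2 . lk2) + lambda_init
        self.lambda_init = lambda_init
        self.lambda_q1 = nn.Parameter(torch.randn(self.d_head) * 0.1)
        self.lambda_k1 = nn.Parameter(torch.randn(self.d_head) * 0.1)
        self.lambda_q2 = nn.Parameter(torch.randn(self.d_head) * 0.1)
        self.lambda_k2 = nn.Parameter(torch.randn(self.d_head) * 0.1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        # (batch, heads, 2, time, d_head): index 0/1 along dim 2 selects
        # the two sub-queries / sub-keys per head.
        q = self.q_proj(x).view(b, t, self.n_heads, 2, self.d_head).permute(0, 2, 3, 1, 4)
        k = self.k_proj(x).view(b, t, self.n_heads, 2, self.d_head).permute(0, 2, 3, 1, 4)
        v = self.v_proj(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)

        scale = 1.0 / math.sqrt(self.d_head)
        causal = torch.triu(torch.ones(t, t, dtype=torch.bool, device=x.device), 1)

        scores = torch.matmul(q, k.transpose(-1, -2)) * scale       # (b, h, 2, t, t)
        scores = scores.masked_fill(causal, float("-inf"))
        attn = F.softmax(scores, dim=-1)

        lam = (torch.exp(self.lambda_q1 @ self.lambda_k1)
               - torch.exp(self.lambda_q2 @ self.lambda_k2)
               + self.lambda_init)
        # The differential map: subtract the second softmax map from the first.
        diff_attn = attn[:, :, 0] - lam * attn[:, :, 1]             # (b, h, t, t)

        out = torch.matmul(diff_attn, v)                            # (b, h, t, d_head)
        out = out.transpose(1, 2).reshape(b, t, self.n_heads * self.d_head)
        return self.out_proj(out)


if __name__ == "__main__":
    # Quick shape check on random input (dimensions here are arbitrary).
    layer = DifferentialAttention(d_model=576, n_heads=16)
    y = layer(torch.randn(2, 32, 576))
    print(y.shape)  # torch.Size([2, 32, 576])
```

The subtraction of the two attention maps is intended to cancel common-mode "attention noise", which the paper argues sharpens the attention placed on relevant context; the observation above that this model learned faster than the GQA control is consistent with that motivation, though this sketch makes no claim about the exact implementation used here.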