Pretrained from scratch on English-only text, with no synthetic data and no code: 3 epochs over 1 GB of data for the ~125M-parameter model.

Test network using Tensor Product Attention (TPA). Apart from the changes to the attention, namely TPA itself and 16 heads instead of 9, this is the same setup as https://huggingface.co/HuggingFaceTB/SmolLM2-135M-Instruct

Scripts:

  • inference.py runs the model on some test prompts
  • test_train.py is the reproduction script; it runs with the exact configuration used to train this model. Data is assumed to be in JSONL format, one JSON object per line with a "text" field, e.g. {"text": "example text"}
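
For reference, here is a minimal sketch of reading data in the JSONL layout described above. It is not taken from test_train.py; the function name `iter_texts` and the sample records are hypothetical, and the only assumption carried over from this card is that each line is a JSON object with a "text" field.

```python
import io
import json


def iter_texts(lines):
    """Yield the "text" field from each non-empty JSONL line.

    `lines` can be any iterable of strings, e.g. an open file handle.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        yield record["text"]


# Hypothetical in-memory sample standing in for a real .jsonl file.
sample = io.StringIO('{"text": "example text"}\n{"text": "another example"}\n')
texts = list(iter_texts(sample))
print(texts)
```

With a real file, the same function can be used as `iter_texts(open(path, encoding="utf-8"))`.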

Notes:

One of the main reported benefits of TPA is at inference time, which is not really being leveraged here, although you can probably fit a larger batch size than with traditional MHA/GQA. TPA did save about 5% on parameters, and that saving should grow considerably as the network size increases. Run time is very similar to MHA/GQA at this scale.
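
To give a rough feel for where the parameter saving comes from, the sketch below counts only the Q/K/V projection weights per layer for GQA and for a TPA-style factorization. Everything numeric here is an assumption for illustration, not this model's verified configuration: the hidden size (576) and the GQA layout (9 query heads, 3 KV heads, head dim 64) are taken to be those of SmolLM2-135M, the TPA head dim (36) is simply 576 / 16, and the ranks (6, 2, 2) are hypothetical. The TPA formula assumes each of Q, K, V is built from rank-R factors of size `n_heads` and `head_dim`, each produced by a bias-free linear layer.

```python
def gqa_qkv_params(d_model, n_heads, n_kv_heads, head_dim):
    """Q/K/V projection weights for grouped-query attention (no biases)."""
    q = d_model * n_heads * head_dim
    kv = 2 * d_model * n_kv_heads * head_dim
    return q + kv


def tpa_qkv_params(d_model, n_heads, head_dim, rank_q, rank_k, rank_v):
    """Q/K/V factor weights for a TPA-style factorization (no biases).

    Each rank-1 component needs one head-side factor (n_heads values)
    and one feature-side factor (head_dim values) per token.
    """
    total_rank = rank_q + rank_k + rank_v
    return d_model * total_rank * (n_heads + head_dim)


# Assumed, illustrative settings (see the note above the code).
gqa = gqa_qkv_params(d_model=576, n_heads=9, n_kv_heads=3, head_dim=64)
tpa = tpa_qkv_params(d_model=576, n_heads=16, head_dim=36,
                     rank_q=6, rank_k=2, rank_v=2)
print(f"GQA QKV params per layer: {gqa:,}")
print(f"TPA QKV params per layer: {tpa:,}")
```

The output projection, MLP, and embeddings are left out because they are the same size in both setups under these assumptions.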

Training Metrics

Dataset Information

  • Training data per epoch: 1 GB
  • Total tokens trained: 48,261,120
  • No synthetic data

Training Results

  • Final Train Loss: 3.0421
  • Final Train Perplexity: 20.95
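
The two numbers above are related by perplexity = exp(loss), assuming the loss is the mean per-token cross-entropy in nats. A quick check:

```python
import math

final_train_loss = 3.0421  # value reported above
perplexity = math.exp(final_train_loss)
print(f"{perplexity:.2f}")
```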
