Overview

This is the repo for intermediate checkpoints of my upcoming MicroLlama V2 model, a 500-million-parameter model based on Llama 3.2. The checkpoints are pretrained entirely from scratch on SlimPajama-627B. This project is still a work in progress, and I have only trained on 5B tokens so far. I will keep the training process running until I run out of funds.

Some reasons for using these checkpoints:

  • You can use them as a starting point to train your own small language model.
  • More interestingly, you can probe the learning process of these models to understand how an LLM learns to mimic human language.

How to use these checkpoints

These checkpoints are compatible with litgpt with slight modifications (see section below).

To load them into Hugging Face Transformers, you will first need to convert the litgpt pretraining checkpoint into a litgpt inference-only checkpoint (no code modification is required):

# Install litgpt
pip install 'litgpt[all]'

# litgpt pretrain checkpoint to inference checkpoint 
litgpt convert_pretrained_checkpoint <LOCAL_PATH_TO_CHECKPOINT_FROM_THIS_REPO> \
  --output_dir <LOCAL_PATH_TO_INFERENCE_CHECKPOINT>

# litgpt inference checkpoint to HF checkpoints
litgpt convert_from_litgpt <LOCAL_PATH_TO_INFERENCE_CHECKPOINT> <LOCAL_OUTPUT_PATH_TO_CONVERTED_HF_CHECKPOINT>

Reference:

  1. litgpt pretrain checkpoint to inference checkpoint https://github.com/Lightning-AI/litgpt/blob/main/tutorials/pretrain_tinyllama.md#export-checkpoints
  2. litgpt inference checkpoint to HF checkpoints https://github.com/Lightning-AI/litgpt/blob/main/tutorials/convert_lit_models.md

Caveat: for some reason the auto-generated config.json for the model in the checkpoint is incorrect; replace it with https://huggingface.co/keeeeenw/MicroLlama2-checkpoints/blob/main/config.json to resolve any inference or evaluation errors.
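Once the conversion is done and the config.json above is in place, the converted checkpoint can be loaded with Hugging Face Transformers. The following is a minimal sketch based on the litgpt conversion tutorial linked above: the model.pth filename follows that tutorial, the directory path reuses the placeholder from the commands above, and the prompt is just an example.

import torch
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

# Output directory of `litgpt convert_from_litgpt`, with the repo's config.json copied in.
ckpt_dir = "<LOCAL_OUTPUT_PATH_TO_CONVERTED_HF_CHECKPOINT>"

# convert_from_litgpt stores the weights as a plain PyTorch state dict.
state_dict = torch.load(f"{ckpt_dir}/model.pth", map_location="cpu")

# Build the model from the (replaced) config.json and load the converted weights.
config = AutoConfig.from_pretrained(ckpt_dir)
model = AutoModelForCausalLM.from_config(config)
model.load_state_dict(state_dict)
model.eval()

# The checkpoints were trained with the Llama 3.2 tokenizer.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")
inputs = tokenizer("Once upon a time", return_tensors="pt")
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))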

Advanced usage - pretraining with litgpt

For folks who are familiar with litgpt, you can add the following entry to your copy of litgpt's config.py to use these checkpoints and continue training the model.

    # based on Llama-3.2-1B
    dict(
        name="micro-llama-300M-v2",
        hf_config=dict(org="keeeeenw", name="MicroLlamaV2"),
        block_size=131072,  # Stable choice for Llama model training
        # The larger vocabulary accounts for the parameter increase from ~300M to ~500M.
        # Note that we cannot change this number because the Llama 3
        # tokenizer is hardcoded to this vocab size.
        vocab_size=128000,
        padded_vocab_size=128256,
        n_layer=12,
        n_embd=1024,
        n_head=16,
        n_query_groups=4,
        rotary_percentage=1.0,
        parallel_residual=False,
        bias=False,
        norm_class_name="RMSNorm",
        mlp_class_name="LLaMAMLP",
        intermediate_size=5632,
        rope_base=500000,  # Scaling for long sequence support
        # RoPE adjustments for block size of 131072
        rope_adjustments=dict(
            factor=16.0,  # Matches block_size=131072
            low_freq_factor=1.0,
            high_freq_factor=4.0,
            original_max_seq_len=8192  # Base context length before RoPE scaling (8192 x 16 = 131072)
        )
    ),
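To make the comment about the vocabulary size concrete, here is a rough parameter-count estimate derived from the config above. It assumes separate (untied) input embedding and LM head matrices and ignores the small RMSNorm weights, so the numbers are back-of-the-envelope rather than exact:

# Rough parameter count for the micro-llama-300M-v2 config above.
n_layer, n_embd, n_head, n_query_groups = 12, 1024, 16, 4
intermediate_size, padded_vocab_size = 5632, 128256

head_dim = n_embd // n_head              # 64
kv_dim = n_query_groups * head_dim       # 256, grouped-query attention

attn = 2 * n_embd * n_embd + 2 * n_embd * kv_dim   # Q and output projections + K and V
mlp = 3 * n_embd * intermediate_size                # gate, up, and down projections
per_layer = attn + mlp                              # ~19.9M per transformer block

embedding = padded_vocab_size * n_embd              # input embedding table
lm_head = padded_vocab_size * n_embd                # assumes an untied output head

total = n_layer * per_layer + embedding + lm_head
print(f"~{total / 1e6:.0f}M parameters")            # ~502M with the Llama 3 vocabulary

# The same transformer body with the 32,000-token Llama 2 vocabulary
# lands around 300M, which is where the "300M" in the model name comes from.
print(f"~{(n_layer * per_layer + 2 * 32000 * n_embd) / 1e6:.0f}M parameters")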

You will need to preprocess your data with the meta-llama/Llama-3.2-1B tokenizer, similar to prepare-the-tinyllama-1t-token-dataset, which uses the Llama 2 tokenizer.

Assuming you have litgpt installed already:

git clone https://huggingface.co/datasets/cerebras/SlimPajama-627B data

litgpt download meta-llama/Llama-3.2-1B \
   --access_token your_hf_token \
   --tokenizer_only true

python litgpt/data/prepare_slimpajama.py \
  --input_dir data/slimpajama-raw/train \
  --output_dir data/slimpajama/train \
  --tokenizer_path checkpoints/meta-llama/Llama-3.2-1B

python litgpt/data/prepare_slimpajama.py \
  --input_dir data/slimpajama-raw/validation \
  --output_dir data/slimpajama/val \
  --tokenizer_path checkpoints/meta-llama/Llama-3.2-1B
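Before kicking off the long preprocessing run, it can be worth checking that the downloaded tokenizer matches the vocab_size and padded_vocab_size values in the config above. The snippet below assumes the litgpt download command placed tokenizer.json and tokenizer_config.json under checkpoints/meta-llama/Llama-3.2-1B; the 128,000 / 128,256 split is my reading of the Llama 3 tokenizer, and other tokenizer versions may report the numbers slightly differently.

# Sanity check the tokenizer against vocab_size / padded_vocab_size in the model config.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("checkpoints/meta-llama/Llama-3.2-1B")

print(tok.vocab_size)   # base vocabulary, expected to match vocab_size=128000
print(len(tok))         # including special tokens, expected to match padded_vocab_size=128256

# Round-trip a sample to make sure encode/decode behaves before preprocessing terabytes of text.
ids = tok("SlimPajama is a deduplicated pretraining corpus.")["input_ids"]
print(ids[:8], tok.decode(ids))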

Please note that this data processing runs on CPU only and will take a long time if you don't have a CPU with 96+ cores. I tried to share the converted data as an HF dataset, but HF does not support having too many files in the same directory. I will figure out how to distribute the converted dataset later.

Finally, you can use my config to start training: https://huggingface.co/keeeeenw/MicroLlama2-checkpoints/blob/main/microllama_v2.yaml

Note: the config has 300M in the model name, but the model is actually 500M parameters due to the vocab size increase from Llama 2 to Llama 3:

litgpt pretrain \
  --config microllama_v2.yaml \
  --resume <LOCAL_PATH_TO_CHECKPOINT_FROM_THIS_REPO>

IMPORTANT NOTE: I have had various issues when moving from server to server to resume training from checkpoints, specifically when I switched from Lightning AI Studio to a private server. For example, Lightning AI Studio may look for your preprocessed data under /root/.lightning/chunks/ if you store the preprocessed data on S3 and let Lightning AI Studio stream it during training. When I moved to a private server, litgpt looked for the same data under /cache/chunks/.

If you run into any issues with resuming training, just convert the checkpoint to an inference checkpoint, which you can then load as the initial checkpoint:

litgpt convert_pretrained_checkpoint <LOCAL_PATH_TO_CHECKPOINT_FROM_THIS_REPO> \
  --output_dir <LOCAL_PATH_TO_INFERENCE_CHECKPOINT>

litgpt pretrain \
  --config microllama_v2.yaml \
  --initial_checkpoint_dir <LOCAL_PATH_TO_INFERENCE_CHECKPOINT>

You will lose the index into the training dataset as well as other hyperparameters such as the learning rate, but this allows you to restart your pretraining quickly.

Evaluation results

Note that this does not represent the final performance of the model and should only serve as a reference for my training progress.

checkpoint: step-00088000

|    Tasks    |Version|Filter|n-shot| Metric |Value |   |Stderr|
|-------------|------:|------|-----:|--------|-----:|---|-----:|
|piqa         |      1|none  |     0|acc     |0.6202|±  |0.0113|
|             |       |none  |     0|acc_norm|0.6213|±  |0.0113|
|boolq        |      2|none  |     0|acc     |0.5875|±  |0.0086|
|arc_challenge|      1|none  |     0|acc     |0.1980|±  |0.0116|
|             |       |none  |     0|acc_norm|0.2201|±  |0.0121|
|arc_easy     |      1|none  |     0|acc     |0.4373|±  |0.0102|
|             |       |none  |     0|acc_norm|0.3935|±  |0.0100|
|winogrande   |      1|none  |     0|acc     |0.5004|±  |0.0141|
|openbookqa   |      1|none  |     0|acc     |0.1760|±  |0.0170|
|             |       |none  |     0|acc_norm|0.2680|±  |0.0198|
|hellaswag    |      1|none  |     0|acc     |0.2893|±  |0.0045|
|             |       |none  |     0|acc_norm|0.3125|±  |0.0046|

You can use the following script to reproduce the results (assuming you have litgpt installed):

MODEL_NAME="step-00088000"
MODEL_OUTPUT_ROOT="MicroLlamaV2-VastAI-Checkpoints/out/pretrain/micro-llama-v2"
MODEL_OUTPUT_REL="${MODEL_OUTPUT_ROOT}/${MODEL_NAME}"

# HuggingFace
huggingface-cli download keeeeenw/MicroLlama2-checkpoints ${MODEL_NAME}/lit_model.pth --local-dir checkpoints/${MODEL_OUTPUT_ROOT}/
huggingface-cli download keeeeenw/MicroLlama2-checkpoints ${MODEL_NAME}/generation_config.json --local-dir checkpoints/${MODEL_OUTPUT_ROOT}/
huggingface-cli download keeeeenw/MicroLlama2-checkpoints ${MODEL_NAME}/hyperparameters.yaml --local-dir checkpoints/${MODEL_OUTPUT_ROOT}/
huggingface-cli download keeeeenw/MicroLlama2-checkpoints ${MODEL_NAME}/model_config.yaml --local-dir checkpoints/${MODEL_OUTPUT_ROOT}/
huggingface-cli download keeeeenw/MicroLlama2-checkpoints ${MODEL_NAME}/tokenizer.json --local-dir checkpoints/${MODEL_OUTPUT_ROOT}/
huggingface-cli download keeeeenw/MicroLlama2-checkpoints ${MODEL_NAME}/tokenizer_config.json --local-dir checkpoints/${MODEL_OUTPUT_ROOT}/

# Copy config, see "caveat" below
cp -r <local_path>/config.json checkpoints/${MODEL_OUTPUT_REL}/

# AWS
# aws s3 cp s3://microllama-v2/checkpoints/out/pretrain/micro-llama-v2/${MODEL_NAME} checkpoints/${MODEL_OUTPUT_REL} --recursive

litgpt evaluate \
  ${MODEL_OUTPUT_REL} \
  --tasks "hellaswag,openbookqa,winogrande,arc_easy,arc_challenge,boolq,piqa" \
  --device cuda:0 \
  --batch_size 16

Caveat: for some reason the auto-generated config.json for the model in the checkpoint is incorrect; replace it with https://huggingface.co/keeeeenw/MicroLlama2-checkpoints/blob/main/config.json to resolve the evaluation error.
