This is an FP8 Dynamic quant created with llmcompressor v0.4.0.

You can refer to the llmcompressor CPU offloading example, but for quantizing on an 8-GPU H100 node we used the following setup to avoid OOM errors:
```python
from accelerate import infer_auto_device_map, init_empty_weights
from transformers import AutoConfig, AutoModelForCausalLM

model_name = "allenai/Llama-3.1-Tulu-3-405B"

# Instantiate the model on the meta device (no weights allocated) just to
# compute a device map for the real load
config = AutoConfig.from_pretrained(model_name)
with init_empty_weights():
    model = AutoModelForCausalLM.from_config(config)

# Cap each of the 8 GPUs at 60GiB and let the remainder offload to CPU RAM
max_memory = {
    0: "60GiB",
    1: "60GiB",
    2: "60GiB",
    3: "60GiB",
    4: "60GiB",
    5: "60GiB",
    6: "60GiB",
    7: "60GiB",
    "cpu": "1500GiB",
}

# Keep each decoder layer intact on a single device
device_map = infer_auto_device_map(
    model,
    max_memory=max_memory,
    no_split_module_classes=["LlamaDecoderLayer"],
)
```
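With the device map computed, the real weights can be loaded against it and the FP8 Dynamic recipe applied in one shot. Below is a minimal sketch that mirrors the upstream llmcompressor FP8_DYNAMIC example rather than our exact script; `SAVE_DIR` is a placeholder output path:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor.transformers import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

# Load the real weights, sharded across GPUs/CPU per the inferred device map
model = AutoModelForCausalLM.from_pretrained(
    model_name, device_map=device_map, torch_dtype="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# FP8 Dynamic: FP8 weights with dynamic per-token activation scales;
# no calibration data is needed, so oneshot() runs without a dataset
recipe = QuantizationModifier(
    targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"]
)
oneshot(model=model, recipe=recipe)

SAVE_DIR = "Llama-3.1-Tulu-3-405B-FP8-Dynamic"  # placeholder
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```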
Original model here: https://huggingface.co/allenai/Llama-3.1-Tulu-3-405B
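FP8 Dynamic checkpoints produced by llmcompressor are intended to be served with vLLM, which reads the compressed-tensors config directly. A minimal sketch; the `tensor_parallel_size=8` is an assumption for a single 8-GPU node:

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="shisa-ai/Llama-3.1-Tulu-3-405B-FP8-Dynamic",
    tensor_parallel_size=8,  # assumption: one 8-GPU node
)
params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Write a haiku about quantization."], params)
print(outputs[0].outputs[0].text)
```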
Model lineage for shisa-ai/Llama-3.1-Tulu-3-405B-FP8-Dynamic:

- Base model: meta-llama/Llama-3.1-405B
- Finetuned: allenai/Llama-3.1-Tulu-3-405B-SFT
- Finetuned: allenai/Llama-3.1-Tulu-3-405B-DPO
- Finetuned: allenai/Llama-3.1-Tulu-3-405B