This is an llmcompressor v0.4.0 FP8 Dynamic quant.

You can refer to llmcompressor's CPU offloading example, but for quantizing on an 8×H100 node we used the following setup to avoid OOM errors:

```python
from accelerate import infer_auto_device_map, init_empty_weights
from transformers import AutoConfig, AutoModelForCausalLM

model_name = "allenai/Llama-3.1-Tulu-3-405B"

# Build the model skeleton on the meta device so no weight memory is allocated
config = AutoConfig.from_pretrained(model_name)
with init_empty_weights():
    model = AutoModelForCausalLM.from_config(config)

# Cap each of the 8 GPUs at 60GiB and let the remainder spill to CPU RAM
max_memory = {
    0: "60GiB",
    1: "60GiB",
    2: "60GiB",
    3: "60GiB",
    4: "60GiB",
    5: "60GiB",
    6: "60GiB",
    7: "60GiB",
    "cpu": "1500GiB",
}

# Never split a single decoder layer across devices
device_map = infer_auto_device_map(
    model,
    max_memory=max_memory,
    no_split_module_classes=["LlamaDecoderLayer"],
)
```
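
From there, the quantization follows llmcompressor's standard FP8-Dynamic flow. Below is a minimal sketch based on the library's documented `FP8_DYNAMIC` example, not a verbatim copy of the script we ran; the save directory is illustrative. Since FP8 Dynamic computes activation scales at runtime, no calibration dataset is needed.

```python
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot
from transformers import AutoModelForCausalLM

# Load the real weights onto the devices chosen by infer_auto_device_map above
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map=device_map,
    torch_dtype="auto",
)

# FP8 weights with dynamic per-token activation scales; lm_head stays unquantized
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=["lm_head"],
)

# No calibration data is passed for the dynamic scheme
oneshot(model=model, recipe=recipe)

model.save_pretrained("Llama-3.1-Tulu-3-405B-FP8-Dynamic", save_compressed=True)
```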

Original model here: https://huggingface.co/allenai/Llama-3.1-Tulu-3-405B
