This is an FP8 Dynamic quant created with llmcompressor v0.4.0.

You can refer to the llmcompressor CPU offloading example, but for quantizing on an 8-GPU H100 node we used the following setup to avoid OOM errors:
```python
from accelerate import infer_auto_device_map, init_empty_weights
from transformers import AutoConfig, AutoModelForCausalLM

model_name = "allenai/Llama-3.1-Tulu-3-405B"

# Instantiate the model on the meta device (no weights allocated) just to
# compute a device map for the real load
config = AutoConfig.from_pretrained(model_name)
with init_empty_weights():
    model = AutoModelForCausalLM.from_config(config)

# Cap each of the 8 GPUs at 60GiB and let the remainder offload to CPU RAM
max_memory = {
    0: "60GiB",
    1: "60GiB",
    2: "60GiB",
    3: "60GiB",
    4: "60GiB",
    5: "60GiB",
    6: "60GiB",
    7: "60GiB",
    "cpu": "1500GiB",
}

# Keep each decoder layer intact on a single device
device_map = infer_auto_device_map(
    model,
    max_memory=max_memory,
    no_split_module_classes=["LlamaDecoderLayer"],
)
```
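With the device map computed, the real weights can be loaded against it and the FP8 Dynamic recipe applied in one shot. Below is a minimal sketch that mirrors the upstream llmcompressor FP8_DYNAMIC example rather than our exact script; `SAVE_DIR` is a placeholder output path:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor.transformers import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

# Load the real weights, sharded across GPUs/CPU per the inferred device map
model = AutoModelForCausalLM.from_pretrained(
    model_name, device_map=device_map, torch_dtype="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# FP8 Dynamic: FP8 weights with dynamic per-token activation scales;
# no calibration data is needed, so oneshot() runs without a dataset
recipe = QuantizationModifier(
    targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"]
)
oneshot(model=model, recipe=recipe)

SAVE_DIR = "Llama-3.1-Tulu-3-405B-FP8-Dynamic"  # placeholder
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```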
Original model here: https://huggingface.co/allenai/Llama-3.1-Tulu-3-405B
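FP8 Dynamic checkpoints produced by llmcompressor are intended to be served with vLLM, which reads the compressed-tensors config directly. A minimal sketch; the `tensor_parallel_size=8` is an assumption for a single 8-GPU node:

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="shisa-ai/Llama-3.1-Tulu-3-405B-FP8-Dynamic",
    tensor_parallel_size=8,  # assumption: one 8-GPU node
)
params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Write a haiku about quantization."], params)
print(outputs[0].outputs[0].text)
```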
Model lineage for shisa-ai/Llama-3.1-Tulu-3-405B-FP8-Dynamic:

- Base model: meta-llama/Llama-3.1-405B
- Finetuned: allenai/Llama-3.1-Tulu-3-405B-SFT
- Finetuned: allenai/Llama-3.1-Tulu-3-405B-DPO
- Finetuned: allenai/Llama-3.1-Tulu-3-405B