Test result

#1
by krustik - opened

The best-quality Q8 GGUF of this model uses 539 GB of RAM, which is unusually high compared to its size on disk (GGUFs usually need roughly the file size plus ~10%).
It was a mistake to base this product on Meta's Llama 3 405B, which is a really bad model in my tests: it is lobotomized by censorship to the point of uselessness (on requests to do reasoning, and on many similar topics, it just returns a standard refusal), and it is incredibly slow. For a long time we had nothing to compare its speed against, but after the large DeepSeek 671B, which is fast even on CPU-only setups, it is clear how badly Meta's Llama 3 405B was made.
As a result of building on Llama 405B, this Tulu 3 model is incredibly slow and sluggish: about 10x slower than DeepSeek V3 or R1 671B Q6 on the same hardware (0.70 tokens/sec for DeepSeek R1 Q6 vs. 0.07 tokens/sec for Tulu 3/Llama Q8).
Quality is also disappointing: even at Q8, in my tests of repairing ChucK code and creating a Mozart-style piece, it failed and produced broken code. DeepSeek V2.5 Q8, which is only a 235B model, was able to repair the code; it would have been a better basis for Tulu, as would DeepSeek V3/R1, which at Q6 uses 567 GB of RAM.

Testing was done on a 20-core Xeon in a CPU-only setup, on an enterprise-grade Gigabyte motherboard with 12 RAM slots, using the oobabooga LLM launcher (text-generation-webui 2.4) and the Q8 GGUF model by Bartowski.
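
For reference, here is a minimal sketch of how a tokens/sec figure like the ones above could be reproduced on a CPU-only box with llama-cpp-python. The model filename, context size, and prompt are illustrative assumptions, not the exact setup from this post; only the thread count and CPU-only layout mirror what is described here.

```python
# Hypothetical reproduction sketch: time a short CPU-only generation with
# llama-cpp-python and report tokens/sec. Model path and prompt are assumptions.
import time
from llama_cpp import Llama

llm = Llama(
    model_path="tulu-3-405b-q8_0.gguf",  # assumed local path to the Q8 GGUF
    n_ctx=4096,                           # context window for the test
    n_threads=20,                         # match the 20-core Xeon from this post
    n_gpu_layers=0,                       # CPU-only, as in the original test
)

prompt = "Fix the bug in the following ChucK program: ..."  # illustrative prompt
start = time.time()
out = llm(prompt, max_tokens=128)
elapsed = time.time() - start

generated = out["usage"]["completion_tokens"]
print(f"{generated} tokens in {elapsed:.1f}s -> {generated / elapsed:.2f} tokens/sec")
```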
