Jim Lai

grimjim

AI & ML interests

Experimenting primarily with 7B-12B parameter text completion models. Not all models are intended for direct use; some are aimed at research and/or educational purposes.

Recent Activity

new activity 4 days ago

grimjim/DeepSauerHuatuoSkywork-R1-o1-Llama-3.1-8B: Adding Evaluation Results

posted an update 5 days ago


updated a collection 6 days ago

Highlighted work


Organizations

Posts 20

Post


I've made yet another merge of reasoning models with incremental gains on the current Open LLM leaderboard.
open-llm-leaderboard/open_llm_leaderboard

Merging the DeepSeek R1 distillation to Llama 3.1 8B into a prior best merge (at 10% task arithmetic weight, using the Llama 3.1 8B base model rather than the instruct model as the base) resulted in a slightly lower IFEval score but higher scores on every other benchmark except MMLU-PRO, which dropped only marginally. MATH Lvl5 and GPQA improved noticeably.
grimjim/DeepSauerHuatuoSkywork-R1-o1-Llama-3.1-8B
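As a rough illustration of what a task arithmetic weight means, here is a minimal NumPy sketch of the standard formulation: each fine-tuned model contributes its difference from a shared base model (its "task vector"), scaled by a weight, and the scaled differences are added back onto the base. The helper name `task_arithmetic_merge`, the toy one-tensor "models", and the weight of 1.0 given to the prior merge are assumptions for illustration only; the post specifies just the 10% weight for the R1 distillation and the use of the Llama 3.1 8B base model as the base.

```python
import numpy as np


def task_arithmetic_merge(base, finetuned_models, weights):
    """Hypothetical helper: merged = base + sum_i w_i * (finetuned_i - base).

    `base` and each entry of `finetuned_models` are dicts mapping parameter
    names to NumPy arrays of identical shapes.
    """
    merged = {}
    for name, base_param in base.items():
        # Accumulate the weighted task vectors for this parameter.
        delta = np.zeros_like(base_param, dtype=np.float64)
        for model, weight in zip(finetuned_models, weights):
            delta += weight * (model[name] - base_param)
        merged[name] = base_param + delta
    return merged


# Toy stand-ins for real checkpoints (values are made up for illustration).
base = {"layer.weight": np.array([1.0, 2.0])}          # plays the role of Llama 3.1 8B base
prior_merge = {"layer.weight": np.array([1.5, 2.0])}   # assumed weight 1.0
r1_distill = {"layer.weight": np.array([1.0, 4.0])}    # 10% weight, as in the post

merged = task_arithmetic_merge(base, [prior_merge, r1_distill], [1.0, 0.1])
print(merged["layer.weight"])
```

Choosing the base model rather than the instruct model as the reference changes what each task vector contains: measured against the base, a model's difference also carries its instruction tuning.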

This merge is my best Llama 3.1 8B result to date. The R1 distillation itself scored quite poorly, so this appears to be another case of unexpected output formatting (reflected in IFEval) hurting evaluation results and obscuring a model's strength.

This model can also be used to generate roleplay completions. In informal testing, its bias toward problem-solving subtly affects narration.

Post


A recent merge has provided another interesting result on the current Open LLM leaderboard.
open-llm-leaderboard/open_llm_leaderboard

Combining an o1 reasoning merge with VAGOsolutions' Llama-3.1 SauerkrautLM 8B Instruct model resulted in a lower IFEval score but higher scores on every other benchmark. This is my best Llama 3.1 8B merge result to date.
grimjim/SauerHuatuoSkywork-o1-Llama-3.1-8B
The results suggest that defects in output formatting and/or output parsing may be limiting the benchmark performance of various o1 models.
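To make the parsing hypothesis concrete, here is a toy example, not the Open LLM Leaderboard's actual scoring code: a hypothetical strict parser that expects a response to open with `Answer: <letter>` scores a reasoning-style output as a miss, while a hypothetical lenient parser that takes the last such match recovers the answer. The regex, the function names, and the `<think>` tags are assumptions chosen for illustration.

```python
import re
from typing import Optional

# Hypothetical answer pattern: "Answer:" followed by a multiple-choice letter.
ANSWER = re.compile(r"Answer:\s*([A-D])\b")


def strict_extract(output: str) -> Optional[str]:
    """Accept only responses that begin with the answer line."""
    m = ANSWER.match(output.strip())
    return m.group(1) if m else None


def lenient_extract(output: str) -> Optional[str]:
    """Take the last answer line found anywhere in the response."""
    matches = ANSWER.findall(output)
    return matches[-1] if matches else None


# A reasoning-style response: deliberation first, final answer last.
reasoning_output = (
    "<think>B contradicts the premise, while C is consistent with it.</think>\n"
    "Answer: C"
)

print(strict_extract(reasoning_output))   # None
print(lenient_extract(reasoning_output))  # C
```

Under the strict parser the response is scored as wrong even though the final answer is present, which is the kind of formatting penalty the post suspects.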


datasets 3

grimjim/empatheticdialogues

Updated 29 days ago • 62

grimjim/PAlign-PAPI-personality_prompt.json-cleaned

Updated Sep 21, 2024 • 300 • 50

grimjim/adversarial-10-alpaca

Updated Aug 16, 2024 • 10 • 42 • 1

Collections 5

grimjim/SauerHuatuoSkywork-o1-Llama-3.1-8B

grimjim/SauerHuatuoSkywork-o1-Llama-3.1-8B-GGUF

grimjim/DeepSauerHuatuoSkywork-R1-o1-Llama-3.1-8B

grimjim/HuatuoSkywork-o1-Llama-3.1-8B

grimjim/kuno-kunoichi-v1-DPO-v2-SLERP-7B

grimjim/kukulemon-7B

grimjim/kukulemon-spiked-9B

grimjim/kukulemon-32K-7B

models 127

grimjim/DeepSauerHuatuoSkywork-R1-o1-Llama-3.1-8B

grimjim/SauerHuatuoSkywork-o1-Llama-3.1-8B-GGUF

grimjim/SauerHuatuoSkywork-o1-Llama-3.1-8B

grimjim/BadApple-o1-Llama-3.1-8B

grimjim/Magnolia-v4-12B

grimjim/HuatuoSkywork-o1-Llama-3.1-8B

grimjim/Magnolia-v4-Gemma2-8k-9B

grimjim/Llama3.1-SuperNovaLite-HuatuoSkywork-o1-8B

grimjim/lemon07r_Gemma-2-Ataraxy-v4c-9B_fixed

grimjim/Magnolia-v3-Gemma2-8k-9B
