Commit d0cdfbe (verified) by vwxyzjn
1 Parent(s): 4edbc67

Update README.md

  1. README.md +24 -1
README.md CHANGED
@@ -28,7 +28,7 @@ Tülu3 is designed for state-of-the-art performance on a diversity of tasks in a
 
  - **Training Repository:** https://github.com/allenai/open-instruct
  - **Eval Repository:** https://github.com/allenai/olmes
- - **Paper:** https://allenai.org/papers/tulu-3-report.pdf (arXiv soon)
+ - **Paper:** https://arxiv.org/abs/2411.15124
  - **Demo:** https://playground.allenai.org/
 
  ### Model Family
@@ -41,6 +41,14 @@ Tülu3 is designed for state-of-the-art performance on a diversity of tasks in a
  | **Final Models (RLVR)** | [allenai/Llama-3.1-Tulu-3-8B](https://huggingface.co/allenai/Llama-3.1-Tulu-3-8B) | [allenai/Llama-3.1-Tulu-3-70B](https://huggingface.co/allenai/Llama-3.1-Tulu-3-70B) |
  | **Reward Model (RM)**| [allenai/Llama-3.1-Tulu-3-8B-RM](https://huggingface.co/allenai/Llama-3.1-Tulu-3-8B-RM) | (Same as 8B) |
 
+ | **Stage** | **Llama 3.1 405B** |
+ |-----------|-------------------|
+ | **Base Model** | [meta-llama/llama-3.1-405B](https://huggingface.co/meta-llama/llama-3.1-405B) |
+ | **SFT** | [allenai/llama-3.1-Tulu-3-405B-SFT](https://huggingface.co/allenai/llama-3.1-Tulu-3-405B-SFT) |
+ | **Final Model (DPO)** | [allenai/llama-3.1-Tulu-3-405B](https://huggingface.co/allenai/llama-3.1-Tulu-3-405B) |
+ | **Reward Model (RM)**| (Same as 8B)
+
+
  ## Using the model
 
  ### Loading with HuggingFace
@@ -108,6 +116,21 @@ See the Falcon 180B model card for an example of this.
  | **AlpacaEval 2 (LC % win)** | 26.3 | 49.6 | 49.8 | 33.4 | 47.7 | 28.4 | **66.1** |
  | **Safety (6 task avg.)** | **94.4** | 89.0 | 88.3 | 76.5 | 87.0 | 57.9 | 69.0 |
 
+ | Benchmark (eval) | Tülu 3 405B SFT | Tülu 3 405B DPO | Tülu 3 405B | Llama 3.1 405B Instruct | Nous Hermes 3 405B | Deepseek V3 | GPT 4o (11-24) |
+ |-----------------|----------------|----------------|-------------|------------------------|-------------------|-------------|----------------|
+ | **Avg w/o Safety** | 76.3 | 79.0 | 80.0 | 78.1 | 74.4 | 79.0 | **80.5** |
+ | **Avg w/ Safety** | 77.5 | 79.6 | 80.7 | 79.0 | 73.5 | 75.9 | **81.6** |
+ | **MMLU (5 shot, CoT)** | 84.4 | 86.6 | 87.0 | **88.0** | 84.9 | 82.1 | 87.9 |
+ | **PopQA (3 shot)** | **55.7** | 55.4 | 55.5 | 52.9 | 54.2 | 44.9 | 53.6 |
+ | **BigBenchHard (0 shot, CoT)** | 88.0 | 88.8 | 88.6 | 87.1 | 87.7 | **89.5** | 83.3 |
+ | **MATH (4 shot, Flex)** | 63.4 | 59.9 | 67.3 | 66.6 | 58.4 | **72.5** | 68.8 |
+ | **GSM8K (8 shot, CoT)** | 93.6 | 94.2 | **95.5** | 95.4 | 92.7 | 94.1 | 91.7 |
+ | **HumanEval (pass@10)** | 95.7 | **97.2** | 95.9 | 95.9 | 92.3 | 94.6 | 97.0 |
+ | **HumanEval+ (pass@10)** | 93.3 | **93.9** | 92.9 | 90.3 | 86.9 | 91.6 | 92.7 |
+ | **IFEval (prompt loose)** | 82.4 | 85.0 | 86.0 | **88.4** | 81.9 | 88.0 | 84.8 |
+ | **AlpacaEval 2 (LC % win)** | 30.4 | 49.8 | 51.4 | 38.5 | 30.2 | 53.5 | **65.0** |
+ | **Safety (6 task avg.)** | 87.7 | 85.5 | 86.7 | 86.8 | 65.8 | 72.2 | **90.9** |
+
 
  ## Hyperparamters
 
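
The diff does not say how the two average rows in the added 405B benchmark table are computed. As a rough check, the sketch below assumes that "Avg w/o Safety" is the unweighted mean of the nine non-safety benchmarks and that "Avg w/ Safety" adds the safety score as a tenth equally weighted entry. The `scores`, `safety`, and `averages` names are hypothetical; the numbers are copied from the Tülu 3 405B SFT and Tülu 3 405B columns.

```python
# Scores copied from the added table, in row order:
# MMLU, PopQA, BigBenchHard, MATH, GSM8K, HumanEval, HumanEval+, IFEval, AlpacaEval 2
scores = {
    "Tülu 3 405B SFT": [84.4, 55.7, 88.0, 63.4, 93.6, 95.7, 93.3, 82.4, 30.4],
    "Tülu 3 405B": [87.0, 55.5, 88.6, 67.3, 95.5, 95.9, 92.9, 86.0, 51.4],
}
# "Safety (6 task avg.)" row for the same two columns
safety = {"Tülu 3 405B SFT": 87.7, "Tülu 3 405B": 86.7}


def averages(model: str) -> tuple[float, float]:
    """Return (avg without safety, avg with safety), rounded to one decimal.

    Assumption: both averages are unweighted means; the source does not
    confirm the weighting.
    """
    task_scores = scores[model]
    without_safety = sum(task_scores) / len(task_scores)
    with_safety = (sum(task_scores) + safety[model]) / (len(task_scores) + 1)
    return round(without_safety, 1), round(with_safety, 1)


for name in scores:
    print(name, averages(name))
```

Under that assumption the computed values agree with the table's "Avg w/o Safety" and "Avg w/ Safety" entries for both columns.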