Update README.md
README.md
CHANGED
@@ -28,7 +28,7 @@ Tülu3 is designed for state-of-the-art performance on a diversity of tasks in a
 - **Training Repository:** https://github.com/allenai/open-instruct
 - **Eval Repository:** https://github.com/allenai/olmes
-- **Paper:** https://
 - **Demo:** https://playground.allenai.org/

 ### Model Family
@@ -41,6 +41,14 @@ Tülu3 is designed for state-of-the-art performance on a diversity of tasks in a
 | **Final Models (RLVR)** | [allenai/Llama-3.1-Tulu-3-8B](https://huggingface.co/allenai/Llama-3.1-Tulu-3-8B) | [allenai/Llama-3.1-Tulu-3-70B](https://huggingface.co/allenai/Llama-3.1-Tulu-3-70B) |
 | **Reward Model (RM)**| [allenai/Llama-3.1-Tulu-3-8B-RM](https://huggingface.co/allenai/Llama-3.1-Tulu-3-8B-RM) | (Same as 8B) |

 ## Using the model

 ### Loading with HuggingFace
@@ -108,6 +116,21 @@ See the Falcon 180B model card for an example of this.
 | **AlpacaEval 2 (LC % win)** | 26.3 | 49.6 | 49.8 | 33.4 | 47.7 | 28.4 | **66.1** |
 | **Safety (6 task avg.)** | **94.4** | 89.0 | 88.3 | 76.5 | 87.0 | 57.9 | 69.0 |


 ## Hyperparameters
@@ -28,7 +28,7 @@
 - **Training Repository:** https://github.com/allenai/open-instruct
 - **Eval Repository:** https://github.com/allenai/olmes
+- **Paper:** https://arxiv.org/abs/2411.15124
 - **Demo:** https://playground.allenai.org/

 ### Model Family
@@ -41,6 +41,14 @@
 | **Final Models (RLVR)** | [allenai/Llama-3.1-Tulu-3-8B](https://huggingface.co/allenai/Llama-3.1-Tulu-3-8B) | [allenai/Llama-3.1-Tulu-3-70B](https://huggingface.co/allenai/Llama-3.1-Tulu-3-70B) |
 | **Reward Model (RM)**| [allenai/Llama-3.1-Tulu-3-8B-RM](https://huggingface.co/allenai/Llama-3.1-Tulu-3-8B-RM) | (Same as 8B) |

+| **Stage** | **Llama 3.1 405B** |
+|-----------|-------------------|
+| **Base Model** | [meta-llama/llama-3.1-405B](https://huggingface.co/meta-llama/llama-3.1-405B) |
+| **SFT** | [allenai/llama-3.1-Tulu-3-405B-SFT](https://huggingface.co/allenai/llama-3.1-Tulu-3-405B-SFT) |
+| **Final Model (DPO)** | [allenai/llama-3.1-Tulu-3-405B](https://huggingface.co/allenai/llama-3.1-Tulu-3-405B) |
+| **Reward Model (RM)**| (Same as 8B) |
+
+
 ## Using the model

 ### Loading with HuggingFace
@@ -108,6 +116,21 @@
 | **AlpacaEval 2 (LC % win)** | 26.3 | 49.6 | 49.8 | 33.4 | 47.7 | 28.4 | **66.1** |
 | **Safety (6 task avg.)** | **94.4** | 89.0 | 88.3 | 76.5 | 87.0 | 57.9 | 69.0 |

+| Benchmark (eval) | Tülu 3 405B SFT | Tülu 3 405B DPO | Tülu 3 405B | Llama 3.1 405B Instruct | Nous Hermes 3 405B | Deepseek V3 | GPT 4o (11-24) |
+|-----------------|----------------|----------------|-------------|------------------------|-------------------|-------------|----------------|
+| **Avg w/o Safety** | 76.3 | 79.0 | 80.0 | 78.1 | 74.4 | 79.0 | **80.5** |
+| **Avg w/ Safety** | 77.5 | 79.6 | 80.7 | 79.0 | 73.5 | 75.9 | **81.6** |
+| **MMLU (5 shot, CoT)** | 84.4 | 86.6 | 87.0 | **88.0** | 84.9 | 82.1 | 87.9 |
+| **PopQA (3 shot)** | **55.7** | 55.4 | 55.5 | 52.9 | 54.2 | 44.9 | 53.6 |
+| **BigBenchHard (0 shot, CoT)** | 88.0 | 88.8 | 88.6 | 87.1 | 87.7 | **89.5** | 83.3 |
+| **MATH (4 shot, Flex)** | 63.4 | 59.9 | 67.3 | 66.6 | 58.4 | **72.5** | 68.8 |
+| **GSM8K (8 shot, CoT)** | 93.6 | 94.2 | **95.5** | 95.4 | 92.7 | 94.1 | 91.7 |
+| **HumanEval (pass@10)** | 95.7 | **97.2** | 95.9 | 95.9 | 92.3 | 94.6 | 97.0 |
+| **HumanEval+ (pass@10)** | 93.3 | **93.9** | 92.9 | 90.3 | 86.9 | 91.6 | 92.7 |
+| **IFEval (prompt loose)** | 82.4 | 85.0 | 86.0 | **88.4** | 81.9 | 88.0 | 84.8 |
+| **AlpacaEval 2 (LC % win)** | 30.4 | 49.8 | 51.4 | 38.5 | 30.2 | 53.5 | **65.0** |
+| **Safety (6 task avg.)** | 87.7 | 85.5 | 86.7 | 86.8 | 65.8 | 72.2 | **90.9** |
+

 ## Hyperparameters
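Review note on the 405B benchmark table added in this change: the two "Avg" rows can be sanity-checked against the per-benchmark rows. The sketch below assumes, without confirmation from the README, that "Avg w/o Safety" is the unweighted mean of the nine task rows (MMLU through AlpacaEval 2) and that "Avg w/ Safety" adds the safety row as a tenth equally weighted entry. The variable and function names are illustrative only. Under that assumption every reported average is reproduced except the Deepseek V3 "Avg w/ Safety" cell, where the simple mean of the displayed rows comes out to 78.3 rather than the reported 75.9. That value may have been computed from different inputs.

```python
# Scores copied from the 405B table in this diff, one list per row,
# ordered by the table's model columns.
MODELS = [
    "Tülu 3 405B SFT", "Tülu 3 405B DPO", "Tülu 3 405B",
    "Llama 3.1 405B Instruct", "Nous Hermes 3 405B", "Deepseek V3",
    "GPT 4o (11-24)",
]
TASK_ROWS = [
    [84.4, 86.6, 87.0, 88.0, 84.9, 82.1, 87.9],  # MMLU
    [55.7, 55.4, 55.5, 52.9, 54.2, 44.9, 53.6],  # PopQA
    [88.0, 88.8, 88.6, 87.1, 87.7, 89.5, 83.3],  # BigBenchHard
    [63.4, 59.9, 67.3, 66.6, 58.4, 72.5, 68.8],  # MATH
    [93.6, 94.2, 95.5, 95.4, 92.7, 94.1, 91.7],  # GSM8K
    [95.7, 97.2, 95.9, 95.9, 92.3, 94.6, 97.0],  # HumanEval
    [93.3, 93.9, 92.9, 90.3, 86.9, 91.6, 92.7],  # HumanEval+
    [82.4, 85.0, 86.0, 88.4, 81.9, 88.0, 84.8],  # IFEval
    [30.4, 49.8, 51.4, 38.5, 30.2, 53.5, 65.0],  # AlpacaEval 2
]
SAFETY = [87.7, 85.5, 86.7, 86.8, 65.8, 72.2, 90.9]
REPORTED_WITHOUT = [76.3, 79.0, 80.0, 78.1, 74.4, 79.0, 80.5]
REPORTED_WITH = [77.5, 79.6, 80.7, 79.0, 73.5, 75.9, 81.6]


def mean_1dp(values):
    """Unweighted mean rounded to one decimal place (assumed averaging rule)."""
    return round(sum(values) / len(values), 1)


computed = {}
mismatches = []
for i, model in enumerate(MODELS):
    tasks = [row[i] for row in TASK_ROWS]
    without = mean_1dp(tasks)
    with_safety = mean_1dp(tasks + [SAFETY[i]])
    computed[model] = (without, with_safety)
    # Flag any column whose recomputed averages differ from the table.
    if (abs(without - REPORTED_WITHOUT[i]) > 1e-9
            or abs(with_safety - REPORTED_WITH[i]) > 1e-9):
        mismatches.append(model)

for model, (without, with_safety) in computed.items():
    print(f"{model}: w/o safety {without}, w/ safety {with_safety}")
print("Columns differing from the reported averages:", mismatches)
```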