---
license: mit
datasets:
- allenai/c4
language:
- de
library_name: transformers
pipeline_tag: fill-mask
---

# BERTchen-v0.1

A [MosaicBERT](https://huggingface.co/mosaicml/mosaic-bert-base) model efficiently pretrained on German [C4](https://huggingface.co/datasets/allenai/c4).
Paper and code to follow soon.

## Model description

BERTchen follows the MosaicBERT architecture (introduced in [this paper](https://arxiv.org/abs/2312.17482)) and uses [FlashAttention 2](https://arxiv.org/abs/2307.08691). It was pretrained for 4 hours on a single A100 40GB GPU.

Only the masked language modeling objective is used; this makes the [CLS] token redundant, so it is excluded from the tokenizer. The pretraining data is a random subset of the German C4 dataset (introduced in [this paper](https://arxiv.org/abs/1910.10683)).

The tokenizer is taken from other work on efficient German pretraining ([paper](https://openreview.net/forum?id=VYfJaHeVod), [code](https://github.com/konstantinjdobler/tight-budget-llm-adaptation)).
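
## Usage

As a quick sanity check, the model can be used with the `fill-mask` pipeline. This is a minimal sketch, assuming the repository ID shown below (swap in this checkpoint's actual ID), that the custom MosaicBERT modeling code is loaded via `trust_remote_code=True`, and that the mask token is `[MASK]`:

```python
from transformers import pipeline

# Repository ID is illustrative; replace it with this checkpoint's actual ID.
# MosaicBERT ships custom modeling code, hence trust_remote_code=True.
unmasker = pipeline(
    "fill-mask",
    model="frederic-sadrieh/BERTchen-v0.1",
    trust_remote_code=True,
)

# Assumes the tokenizer's mask token is [MASK]; check tokenizer.mask_token if unsure.
for prediction in unmasker("Die Hauptstadt von Deutschland ist [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))
```
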
## Training procedure
BERTchen was pretrained using the MosaicBERT hyperparameters (which can be found in the [paper](https://arxiv.org/abs/2312.17482) and [here](https://github.com/mosaicml/examples/blob/main/examples/benchmarks/bert/yamls/main/mosaic-bert-base-uncased.yaml)). We changed the training goal to 2500 to better reflect the number of steps achievable by the model in the constrained time. In addition, we used a batch size of 1024 with a sequence length of 512, as we found this to work better. After 4 hours, training is cut off and the checkpoint is saved.

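For reference, the deviations from the base MosaicBERT recipe described above can be summarized as follows. This is an illustrative sketch only; the key names are assumptions loosely modeled on the MosaicBERT YAML, not the exact configuration file used:

```python
# Illustrative summary of the overrides described above.
# Key names are assumptions, not the exact MosaicBERT YAML keys.
bertchen_overrides = {
    "max_duration": 2500,             # training goal of 2500 (steps/batches assumed)
    "global_train_batch_size": 1024,  # batch size used for pretraining
    "max_seq_len": 512,               # sequence length used for pretraining
    "wall_clock_budget": "4h",        # hypothetical key: training is cut after 4 hours
}
```
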
## Evaluation results
| Task | GermanQuAD (F1/EM) | GermEval 2017 B | GermEval 2024 Subtask 1 (majority vote) |
|:-----:|:------------------:|:---------------:|:---------------------------------------:|
| Score | 96.4/93.6 | 0.96 | 0.887 |

## Model variations
For the creation of BERTchen we tested different datasets and training setups. Two notable variants are:

- [`BERTchen-v0.1`](https://huggingface.co/frederic-sadrieh/BERTchen-v0.1): the same pretraining, but on the [CulturaX](https://huggingface.co/datasets/uonlp/CulturaX) dataset.
- [`hybrid_BERTchen-v0.1`](https://huggingface.co/frederic-sadrieh/hybrid_BERTchen-v0.1): pretrained on [CulturaX](https://huggingface.co/datasets/uonlp/CulturaX) with our own hybrid approach that changes the sequence length during pretraining (see its model card or the paper for more information).
|