---
license: cc-by-sa-4.0
language:
- en
- ar
- zh
- cs
- de
- fr
- ru
---

# ZeroMMT

ZeroMMT is compatible with Python 3.9; we have not tested other Python versions.

[Read the paper (arXiv)](https://arxiv.org/abs/2407.13579)

<p align="justify"> ZeroMMT is a zero-shot multilingual multimodal machine translation (MMT) system trained on English text-image pairs only. It starts from a pretrained NLLB model (more info <a href="https://github.com/facebookresearch/fairseq/tree/nllb">here</a>) and adapts it with lightweight modules (<a href="https://github.com/adapter-hub/adapters">adapters</a> and a visual projector) while keeping the original weights frozen during training. Training combines visually conditioned masked language modeling with a KL-divergence penalty between the original text-only MT outputs and the new MMT outputs. ZeroMMT is available in 3 sizes: 600M, 1.3B and 3.3B parameters. The largest model achieves state-of-the-art performance on <a href="https://github.com/MatthieuFP/CoMMuTE">CoMMuTE</a>, a benchmark that evaluates the ability of multimodal translation systems to exploit image information to disambiguate the English sentence to be translated. ZeroMMT is multilingual and supports English-to-{Arabic, Chinese, Czech, German, French, Russian}.</p>
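
Concretely, the objective can be read as the sum of those two terms. Below is a minimal PyTorch sketch of that combination; all tensor names and the uniform weighting are illustrative assumptions, not the actual zerommt training code:

```
import torch.nn.functional as F

# Illustrative sketch of the training objective described above
# (hypothetical names; not the actual zerommt training code).
def zerommt_training_loss(vmlm_logits, masked_labels,
                          mmt_logits, frozen_mt_logits,
                          kl_weight=1.0):
    # Visually conditioned masked language modeling: predict the masked
    # English tokens given the image and the masked caption.
    vmlm_loss = F.cross_entropy(
        vmlm_logits.view(-1, vmlm_logits.size(-1)),
        masked_labels.view(-1),
        ignore_index=-100,  # positions that were not masked
    )
    # KL term keeping the multimodal translation distribution close to
    # the frozen text-only NLLB distribution.
    kl_loss = F.kl_div(
        F.log_softmax(mmt_logits, dim=-1),        # log p_MMT
        F.log_softmax(frozen_mt_logits, dim=-1),  # log p_MT (frozen)
        log_target=True,
        reduction="batchmean",
    )
    return vmlm_loss + kl_weight * kl_loss
```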


If you use this package or like our work, please cite:
```
@misc{futeral2024zeroshotmultimodalmachinetranslation,
      title={Towards Zero-Shot Multimodal Machine Translation}, 
      author={Matthieu Futeral and Cordelia Schmid and Benoît Sagot and Rachel Bawden},
      year={2024},
      eprint={2407.13579},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2407.13579}, 
}
```

### Installation

```
pip install zerommt
```

### Example

**without cfg**
```
import requests
from PIL import Image
import torch
from zerommt import create_model

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = create_model(model_path="matthieufp/ZeroMMT-1.3B",
                     enable_cfg=False).to(device)
model.eval()

image = Image.open(
    requests.get(
        "http://images.cocodataset.org/val2017/000000002153.jpg", stream=True
    ).raw
)

src_text = "He's got a bat in his hands."
src_lang = "eng_Latn"
tgt_lang = "fra_Latn"

# Compute cross-entropy loss given translation
tgt_text = "Il a une batte dans ses mains."

with torch.inference_mode():
    loss = model(imgs=[image],
                 src_text=[src_text],
                 src_lang=src_lang,
                 tgt_text=[tgt_text],
                 tgt_lang=tgt_lang,
                 output_loss=True)

print(loss)

# Generate translation with beam search
beam_size = 4

image2 = Image.open(
    requests.get(
        "https://zupimages.net/up/24/29/7r3s.jpg", stream=True
    ).raw
)

with torch.inference_mode():
    generated = model.generate(imgs=[image, image2],
                               src_text=[src_text, src_text],
                               src_lang=src_lang,
                               tgt_lang=tgt_lang,
                               beam_size=beam_size)

translation = model.tokenizer.batch_decode(generated, skip_special_tokens=True)
print(translation)
```
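
In the batched generation call above, the same ambiguous English sentence is paired with two different images; the intended behavior is that each image disambiguates its translation (e.g. French "batte" for a baseball bat vs. "chauve-souris" for the animal).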

**with cfg** (classifier-free guidance; WARNING: enabling cfg requires approximately twice as much memory!)

```
import requests
from PIL import Image
import torch
from zerommt import create_model

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = create_model(model_path="matthieufp/ZeroMMT-1.3B",
                     enable_cfg=True).to(device)
model.eval()

image = Image.open(
    requests.get(
        "http://images.cocodataset.org/val2017/000000002153.jpg", stream=True
    ).raw
)

src_text = "He's got a bat in his hands."
src_lang = "eng_Latn"
tgt_lang = "fra_Latn"

# Compute cross-entropy loss given translation
tgt_text = "Il a une batte dans ses mains."
cfg_value = 1.25

with torch.inference_mode():
    loss = model(imgs=[image],
                 src_text=[src_text],
                 src_lang=src_lang,
                 tgt_text=[tgt_text],
                 tgt_lang=tgt_lang,
                 output_loss=True,
                 cfg_value=cfg_value)
print(loss)

# Generate translation with beam search and cfg
beam_size = 4

with torch.inference_mode():
    generated = model.generate(imgs=[image],
                               src_text=[src_text],
                               src_lang=src_lang,
                               tgt_lang=tgt_lang,
                               beam_size=beam_size,
                               cfg_value=cfg_value)
                               
translation = model.tokenizer.batch_decode(generated, skip_special_tokens=True)[0]
print(translation)
```
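
For context, `cfg_value` sets the classifier-free guidance strength: at each decoding step the model is evaluated both with and without the image (hence the roughly doubled memory), and the two output distributions are combined. A minimal sketch of the standard logit-space formulation; the function name is hypothetical and the actual zerommt internals may differ:

```
import torch

def apply_cfg(cond_logits: torch.Tensor,
              uncond_logits: torch.Tensor,
              cfg_value: float) -> torch.Tensor:
    # Classifier-free guidance: extrapolate from the text-only (unconditional)
    # logits toward the image-conditioned ones. cfg_value == 1.0 recovers the
    # purely image-conditioned logits; larger values rely more on the image.
    return uncond_logits + cfg_value * (cond_logits - uncond_logits)
```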