Model Card for llm-jp-clip-vit-base-patch16

Model Details

A Japanese CLIP model trained with OpenCLIP on relaion2B-en-research-safe-japanese-translation, a Japanese translation of the English subset of ReLAION-5B (https://huggingface.co/datasets/laion/relaion2B-en-research-safe) produced by gemma-2-9b-it.

The total number of parameters of this model is 248M.

How to Use

Installation

$ pip install open_clip_torch

Zero-shot Image Classification

import open_clip
import requests
import torch
from PIL import Image

# Load the model, its preprocessing transform, and the tokenizer from the Hugging Face Hub
model, preprocess = open_clip.create_model_from_pretrained('hf-hub:llm-jp/llm-jp-clip-vit-base-patch16')
tokenizer = open_clip.get_tokenizer('hf-hub:llm-jp/llm-jp-clip-vit-base-patch16')

# Download a sample image and prepare the image and text inputs
url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = Image.open(requests.get(url, stream=True).raw)
image = preprocess(image).unsqueeze(0)  # add a batch dimension
text = tokenizer(["猫", "犬", "鳥"])  # "cat", "dog", "bird"

# Encode both modalities, L2-normalize the embeddings, and compute label probabilities
with torch.no_grad(), torch.cuda.amp.autocast():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)

    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", text_probs)
# Label probs: tensor([[9.9425e-01, 5.2273e-03, 5.2600e-04]])
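
The example above runs on the CPU; torch.cuda.amp.autocast only has an effect when computation happens on a CUDA device. The following is a minimal sketch of running the same inference on a GPU, reusing the model, tokenizer, and image objects from above; it assumes a CUDA-capable GPU is available.

device = "cuda" if torch.cuda.is_available() else "cpu"

# Move the model and both inputs onto the selected device
model = model.to(device)
image = image.to(device)
text = tokenizer(["猫", "犬", "鳥"]).to(device)  # "cat", "dog", "bird"

with torch.no_grad(), torch.autocast(device_type=device):
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", text_probs.cpu())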

Training Details

Model Architecture

  • Text Encoder: RoBERTa base with llm-jp-tokenizer
  • Image Encoder: ViT-B/16
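
As a quick check on the sizes above, the 248M total parameter count can be recomputed from the loaded checkpoint. A minimal sketch, assuming open_clip is installed as in the usage example:

import open_clip

# Load the model and sum the parameter counts of all its weights
model, _ = open_clip.create_model_from_pretrained('hf-hub:llm-jp/llm-jp-clip-vit-base-patch16')
total_params = sum(p.numel() for p in model.parameters())
print(f"Total parameters: {total_params / 1e6:.0f}M")  # roughly 248M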

Training Data

This model was trained on relaion2B-en-research-safe-japanese-translation. Because roughly 70% of the image URLs could be downloaded successfully, the usable dataset contained 1.45 billion samples; we trained for 9 epochs, i.e., about 13 billion samples seen in total.

Evaluation

Evaluation Code: https://github.com/llm-jp/clip-eval
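
In the table below, XM3600 I → T and T → I denote image-to-text and text-to-image retrieval. As a rough illustration, Recall@1 for both directions can be computed from paired CLIP embeddings as in the sketch below; this is a hypothetical helper, not necessarily the metric or implementation used in clip-eval.

import torch

def recall_at_1(image_features: torch.Tensor, text_features: torch.Tensor):
    """Recall@1 for L2-normalized embeddings of shape (N, D), where row i of each input forms a matching pair."""
    sim = image_features @ text_features.T                 # (N, N) cosine similarity matrix
    gt = torch.arange(sim.size(0))                         # image i is paired with text i
    i2t = (sim.argmax(dim=1) == gt).float().mean().item()  # image -> text retrieval
    t2i = (sim.argmax(dim=0) == gt).float().mean().item()  # text -> image retrieval
    return i2t, t2i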

Table: Performance of each model in zero-shot image classification and image-text retrieval tasks. Bold indicates first place, and underline indicates second place.

| Model | Params (M) | ImageNet | Recruit | CIFAR10 | CIFAR100 | Food101 | Caltech101 | XM3600 I → T | XM3600 T → I | Avg. |
|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| **Japanese CLIP** | | | | | | | | | | |
| Rinna ViT-B/16 | 196 | 50.6 | 39.9 | 90.7 | 64.0 | 53.2 | 84.6 | 53.8 | 54.0 | 61.4 |
| Rinna ViT-B/16 cloob | 196 | 54.6 | 41.6 | 88.2 | 60.3 | 57.2 | 80.2 | 53.4 | 53.4 | 61.1 |
| LY ViT-B/16 | 196 | 52.0 | **83.8** | 96.3 | 76.7 | 73.9 | **88.4** | **76.9** | **78.0** | **78.3** |
| llm-jp-ViT-B/16 | 248 | 54.2 | 59.4 | 91.8 | 69.2 | <u>82.2</u> | 85.6 | 73.6 | 72.7 | 73.6 |
| StabilityAI ViT-L/16 | 414 | **62.4** | 70.5 | <u>97.6</u> | **84.1** | 74.0 | 86.7 | 67.3 | 66.0 | 76.1 |
| llm-jp-ViT-L/14 | 467 | <u>59.5</u> | 62.9 | 96.4 | 77.0 | **88.2** | <u>87.8</u> | 74.1 | <u>74.1</u> | <u>77.5</u> |
| **Multilingual CLIP** | | | | | | | | | | |
| SigLIP B/16-256 multi | 370 | 51.9 | 71.2 | 92.4 | 65.8 | 78.6 | 85.6 | 45.9 | 43.0 | 66.8 |
| jina-clip-v2 | 865 | 35.8 | 48.1 | 95.1 | 58.3 | 52.0 | 69.4 | 67.3 | 66.4 | 61.6 |
| LAION ViT-H/14 multi | 1193 | 53.0 | <u>74.5</u> | **97.9** | <u>78.4</u> | 74.3 | 85.1 | <u>75.0</u> | 72.0 | 76.3 |

LICENSE

The Apache License, Version 2.0

Please refer to the Gemma Terms of Use, as the training data was translated using gemma-2-9b-it. We used Gemma solely for translation. According to the definition of "Model Derivatives" in Section 1.1(e), our model does not fall under the category of models trained "in order to cause that model to perform similarly to Gemma." We have therefore concluded that it is not necessary to inherit the Gemma license.

Citation

Bibtex:

@inproceedings{sugiura2025clip,
author = {杉浦 一瑳 and 栗田 修平 and 小田 悠介 and 河原大輔 and 岡崎 直観},
month = mar,
series = {言語処理学会第31回年次大会 (NLP2025)},
title = {オープンLLMによる翻訳を活用した日本語 CLIP の開発},
year = {2025}
}