Update README.md

StreetCLIP is a robust foundation model for open-domain image geolocalization and other
geographic and climate-related tasks.

Trained on an original dataset of 1.1 million street-level urban and rural geo-tagged images, it achieves
state-of-the-art performance on multiple open-domain image geolocalization benchmarks in zero-shot,
outperforming supervised models trained on millions of images.

# Model Description

StreetCLIP is a model pretrained by deriving image captions synthetically from image class labels using
a domain-specific caption template. This allows StreetCLIP to transfer its generalized zero-shot learning
capabilities to a specific domain (i.e. the domain of image geolocalization).
StreetCLIP builds on OpenAI's pretrained large version of CLIP ViT, using 14x14 pixel
patches and images with a 336 pixel side length.
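
As an illustration of the idea, synthetic captions can be derived from geographic class labels roughly as sketched below; the template wording here is an assumption for illustration only, since the exact template is specified in the paper.

```python
# Assumed example of a domain-specific caption template; the template actually
# used to pretrain StreetCLIP is defined in the accompanying paper.
def synthetic_caption(city: str, region: str, country: str) -> str:
    return f"A street-level photo in {city}, {region}, {country}."

print(synthetic_caption("Nairobi", "Nairobi County", "Kenya"))
# A street-level photo in Nairobi, Nairobi County, Kenya.
```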

## Model Details

- **Model type:** [CLIP](https://openai.com/blog/clip/)
- **Language:** English
- **License:** Creative Commons Attribution-NonCommercial 4.0 (CC BY-NC 4.0)
- **Trained from model:** [openai/clip-vit-large-patch14-336](https://huggingface.co/openai/clip-vit-large-patch14-336)

## Model Sources

- **Paper:** Pre-print available soon ...

# Uses

StreetCLIP has a deep understanding of the visual features found in street-level urban and rural scenes
and knows how to relate these concepts to specific countries, regions, and cities. Given its training setup,
the following use cases are recommended for StreetCLIP.

## Direct Use

StreetCLIP can be used out of the box with zero-shot learning to infer the geolocation of images at the country, region,
or city level. Given that StreetCLIP was pretrained on a dataset of street-level urban and rural images,
the best performance can be expected on images from a similar distribution.

Broader direct use cases ...
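
A minimal zero-shot sketch using the Hugging Face `transformers` CLIP classes is shown below. The checkpoint id, image URL, candidate countries, and prompt wording are illustrative assumptions, not the model card's official quick-start snippet.

```python
import requests
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Assumed checkpoint id; substitute the repository this model card belongs to.
model = CLIPModel.from_pretrained("geolocal/StreetCLIP")
processor = CLIPProcessor.from_pretrained("geolocal/StreetCLIP")

# Stand-in image; replace with a street-level photo for meaningful predictions.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

countries = ["France", "Japan", "Brazil", "Kenya", "United States"]
prompts = [f"A street-level photo taken in {c}." for c in countries]

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits_per_image = model(**inputs).logits_per_image

# Softmax over image-text similarities gives a probability per candidate country.
probs = logits_per_image.softmax(dim=1)
print(dict(zip(countries, probs[0].tolist())))
```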

## Downstream Use

StreetCLIP can be finetuned for any downstream applications that require geographic or street-level urban or rural
scene understanding.

## Out-of-Scope Use

Any use cases attempting to geolocate users' private images are out-of-scope and discouraged.

# Bias, Risks, and Limitations

StreetCLIP was deliberately not trained on social media images or images of identifiable people. As such, any use case
attempting to geolocalize users' private images is out-of-scope and strongly discouraged.

## Recommendations

We encourage the community to apply StreetCLIP to applications with significant social impact, of which there are many.
Examples include analyzing the built environment (e.g. building quality, type, or energy efficiency classification),
infrastructure (e.g. road quality, utility pole maintenance, identifying damage from natural disasters), and the natural
environment (e.g. image segmentation, vegetation mapping and classification, tracking deforestation).

## How to Get Started with the Model

## Training Data

StreetCLIP was trained on an original, unreleased street-level dataset of 1.1 million real-world,
urban and rural images. The data used to train the model comes from 101 countries and is biased towards
Western countries; India and China are not included.

## Preprocessing

Same preprocessing as [openai/clip-vit-large-patch14-336](https://huggingface.co/openai/clip-vit-large-patch14-336).

## Training Procedure

StreetCLIP is initialized with OpenAI's pretrained large version of CLIP ViT and then pretrained using the synthetic
caption domain-specific pretraining method described in the paper corresponding to this work. StreetCLIP was trained
for 3 epochs using an AdamW optimizer with a learning rate of 1e-6 on 3 NVIDIA A100 80GB GPUs, with a batch size of 32
and gradient accumulation over 12 steps.

StreetCLIP was trained with the goal of matching images in the batch
with the caption corresponding to the correct city, region, and country of each image's origin.
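
A rough sketch of one epoch of this pretraining setup is given below, assuming the Hugging Face `transformers` CLIP classes. The caption template, data loading, and variable names are illustrative assumptions, not the released training code.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

# Initialize from OpenAI's pretrained CLIP ViT-L/14-336 checkpoint.
model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14-336")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14-336")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-6)

ACCUM_STEPS = 12  # gradient accumulation over 12 steps, as described above

def train_one_epoch(train_loader):
    """train_loader is assumed to yield batches of 32 (images, cities, regions, countries)."""
    model.train()
    optimizer.zero_grad()
    for step, (images, cities, regions, countries) in enumerate(train_loader):
        # Same assumed caption template as in the earlier sketch.
        captions = [f"A street-level photo in {c}, {r}, {k}."
                    for c, r, k in zip(cities, regions, countries)]
        inputs = processor(text=captions, images=images, return_tensors="pt", padding=True)
        # CLIP's contrastive loss matches each image to its own caption within the batch.
        loss = model(**inputs, return_loss=True).loss / ACCUM_STEPS
        loss.backward()
        if (step + 1) % ACCUM_STEPS == 0:
            optimizer.step()
            optimizer.zero_grad()
```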

# Evaluation

### Testing Data

StreetCLIP was evaluated on the following two open-domain image geolocalization benchmarks.

* [IM2GPS](http://graphics.cs.cmu.edu/projects/im2gps/)
* [IM2GPS3K](https://github.com/lugiavn/revisiting-im2gps)

### Metrics

The objective of the listed benchmark datasets is to predict the images' coordinates of origin with as
little deviation as possible. A common metric set forth in prior literature is Percentage at Kilometer (% @ KM).
The Percentage at Kilometer metric first calculates the distance in kilometers between the predicted coordinates
and the ground-truth coordinates, and then reports what percentage of these error distances fall below a given kilometer threshold.
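
A minimal sketch of how % @ KM can be computed, assuming great-circle (haversine) distances and an illustrative threshold; the benchmarks' official thresholds are defined in the respective papers.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points in degrees."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def percentage_at_km(predictions, ground_truths, threshold_km):
    """Percentage of predictions whose error distance falls below threshold_km."""
    errors = [haversine_km(*p, *g) for p, g in zip(predictions, ground_truths)]
    return 100.0 * sum(e <= threshold_km for e in errors) / len(errors)

# Illustrative usage with made-up coordinates and a 25 km threshold.
preds = [(48.86, 2.35), (40.71, -74.01)]
truths = [(48.85, 2.35), (34.05, -118.24)]
print(percentage_at_km(preds, truths, threshold_km=25))  # -> 50.0
```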

## Results

- **Hardware Type:** 4 NVIDIA A100 GPUs
- **Hours used:** 12

# Citation

Preprint available soon ...