Geography-Aware Self-Supervised Learning

Abstract

Contrastive learning methods have significantly narrowed the gap between supervised and unsupervised learning on computer vision tasks. In this paper, we explore their application to geo-located datasets, e.g. remote sensing, where unlabeled data is often abundant but labeled data is scarce. We first show that due to their different characteristics, a non-trivial gap persists between contrastive and supervised learning on standard benchmarks. To close the gap, we propose novel training methods that exploit the spatio-temporal structure of remote sensing data. We leverage spatially aligned images over time to construct temporal positive pairs in contrastive learning and geo-location to design pre-text tasks. Our experiments show that our proposed method closes the gap between contrastive and supervised learning on image classification, object detection and semantic segmentation for remote sensing. Moreover, we demonstrate that the proposed method can also be applied to geo-tagged ImageNet images, improving downstream performance on various tasks.

Model

Top: the original MoCo-v2 framework. Bottom: a schematic overview of our approach.

By leveraging spatially aligned images over time to construct temporal positive pairs in contrastive learning and geo-location in the design of pre-text tasks, we are able to close the gap between self-supervised and supervised learning on image classification, object detection and semantic segmentation on remote sensing and other geo-tagged image datasets.
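The temporal positive pairs described above can be illustrated with a minimal sketch. It assumes an InfoNCE-style contrastive loss of the kind used by MoCo-v2, where the positive key is the embedding of the same location at a different time and the negatives come from other locations. The function names, the toy embeddings, and the temperature value are illustrative assumptions, not the authors' implementation.

```python
import math
import random


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sample_temporal_pair(images_at_location, rng=random):
    """Pick two images of the same location taken at different times.

    `images_at_location` is a hypothetical list holding one entry per
    timestamp; it must contain at least two entries.
    """
    i, j = rng.sample(range(len(images_at_location)), 2)
    return images_at_location[i], images_at_location[j]


def info_nce_loss(query, positive_key, negative_keys, temperature=0.2):
    """InfoNCE loss for one query embedding.

    The positive key is assumed to come from the temporal pair; the
    negatives are embeddings of other locations. The temperature is an
    assumed hyperparameter value.
    """
    q = l2_normalize(query)
    logits = [dot(q, l2_normalize(positive_key)) / temperature]
    logits += [dot(q, l2_normalize(k)) / temperature for k in negative_keys]
    # Numerically stable log-sum-exp over all logits.
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
    # Cross-entropy with the positive at index 0.
    return log_sum_exp - logits[0]


# Toy embeddings: the positive points the same way as the query.
loss_aligned = info_nce_loss([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
loss_misaligned = info_nce_loss([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])
```

In this toy setup the loss is small when the query and its temporal positive agree and large when the query is closer to a negative, which is the signal that pulls together views of the same place across time.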

Our experiments on the Functional Map of the World (fMoW) dataset, which consists of high spatial resolution satellite images, show that we improve significantly over the MoCo-v2 baseline. In particular, when evaluating the learned representations, we improve it by ~8% accuracy on image classification, ~2% AP on object detection, ~1% mIoU on semantic segmentation, and ~3% top-1 accuracy on land cover classification. Interestingly, our geography-aware learning can even outperform its supervised learning counterpart on temporal data classification by ~2%. The proposed method can thus improve accuracy in target applications that rely on object detection and semantic segmentation.

GeoImageNet

Some examples from the GeoImageNet dataset. Below each image, we list its latitude, longitude, city, and country name. In our study, we use the latitude and longitude information for unsupervised learning.

To further demonstrate the effectiveness of our geography-aware learning approach, we searched for geo-tagged images in ImageNet using the Flickr API and found 543,435 images with their associated coordinates (lat_i, lon_i) across 5150 class categories. This dataset is more challenging than ImageNet-1K, as it is highly imbalanced and contains about 5x more classes. We extend the proposed approaches to geo-located ImageNet and show that geography-aware learning improves the performance of MoCo-v2 by ~2% on image classification, which shows the potential application of our approach to any geo-tagged dataset.

We show some examples from GeoImageNet in the figure above. For some images, the geo-coordinates can be predicted from visual cues. For example, a picture of a person wearing a sombrero was captured in Mexico. Similarly, a picture of an Indian elephant was captured in India, where there is a large population of Indian elephants. Next to it, we show a picture of an African elephant (which is larger in size). If a model is trained to predict where in the world an image was taken, it should be able to identify visual cues that are transferable to other tasks (e.g., visual cues that differentiate Indian elephants from their African counterparts).
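One way to turn "predict where in the world the image was taken" into a classification pretext task is to map each (lat_i, lon_i) pair to a discrete pseudo-label. The sketch below does this with a fixed latitude/longitude grid. The grid scheme, the function name `geo_cell_label`, and the 10-degree cell size are assumptions made for illustration; the text above does not specify how coordinates are converted into training targets.

```python
import math


def geo_cell_label(lat, lon, cell_deg=10.0):
    """Map a coordinate to the integer index of its grid cell.

    Assumed scheme: the globe is split into cells of `cell_deg` degrees
    and numbered row by row, starting at (-90, -180).
    """
    if not (-90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0):
        raise ValueError("latitude/longitude out of range")
    n_rows = math.ceil(180.0 / cell_deg)
    n_cols = math.ceil(360.0 / cell_deg)
    # Clamp so that lat == 90 or lon == 180 falls into the last cell.
    row = min(int((lat + 90.0) // cell_deg), n_rows - 1)
    col = min(int((lon + 180.0) // cell_deg), n_cols - 1)
    return row * n_cols + col


# Approximate city coordinates, used only as example inputs.
label_mexico_city = geo_cell_label(19.43, -99.13)
label_delhi = geo_cell_label(28.61, 77.21)
label_nairobi = geo_cell_label(-1.29, 36.82)
```

Images taken far apart receive different pseudo-labels, so a classifier trained on these labels has to pick up location-dependent visual cues. A finer grid, or labels obtained by clustering the coordinates, would be alternative design choices under the same idea.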

Distribution of the fMoW dataset.

Distribution of the GeoImageNet dataset.

BibTeX

@article{ayush2021geography,
      title={Geography-Aware Self-Supervised Learning},
      author={Ayush, Kumar and Uzkent, Burak and Meng, Chenlin and Tanmay, Kumar and Burke, Marshall and Lobell, David and Ermon, Stefano},
      journal={ICCV},
      year={2021}
    }