Mapping Our Landscape: Fine-tuning a Geospatial Foundation Model for Land Cover Mapping

Mapping our landscape, i.e., how land is used and what covers it, termed Land Use Land Cover (LULC) mapping, plays a crucial role in environmental monitoring, urban planning, disaster management, agriculture, and climate change studies. While researchers have explored various AI/ML approaches to LULC mapping with remote sensing imagery over several decades, existing techniques face challenges related to accuracy, the need for substantial labeled training data, and adaptability to different geographical regions. Recent advances in foundation models have gained significant traction due to their ability to alleviate labeled data scarcity, and we address the need for enormous quantities of labeled data with a geospatial foundation model. Specifically, we propose a fine-tuning strategy for Land Use Land Cover classification using Prithvi, a cutting-edge geospatial foundation model jointly developed by IBM and NASA.


Overview of Prithvi:

Prithvi is a first-of-its-kind temporal vision transformer pre-trained by the IBM and NASA team on Harmonized Landsat Sentinel-2 (HLS) [1] data over the contiguous US. The model adopts a self-supervised encoder built on a ViT architecture [2] with a Masked Autoencoder (MAE) [3] learning strategy and a Mean Squared Error (MSE) loss function. The model includes spatial attention across multiple patches as well as temporal attention for each patch.
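
To make the pre-training objective concrete, here is a minimal sketch of an MAE-style masked-reconstruction loss in PyTorch. The `encoder` and `decoder` modules, the masking ratio, and the token shapes are illustrative stand-ins, not Prithvi's actual implementation.

```python
import torch

def mae_reconstruction_loss(patches, encoder, decoder, mask_ratio=0.75):
    """MAE-style objective: hide most patch tokens, reconstruct them,
    and score the reconstruction with MSE on the masked tokens only.

    patches: (batch, num_patches, patch_dim) tokens from an image.
    encoder/decoder: stand-in modules, not Prithvi's actual code.
    """
    b, n, d = patches.shape
    num_masked = int(n * mask_ratio)

    # Random per-sample permutation of patch indices; mask the first chunk.
    idx = torch.rand(b, n).argsort(dim=1)
    masked_idx, visible_idx = idx[:, :num_masked], idx[:, num_masked:]

    visible = torch.gather(
        patches, 1, visible_idx.unsqueeze(-1).expand(-1, -1, d))

    latent = encoder(visible)  # encode only the visible tokens
    # The decoder fills in mask tokens and predicts all n patch tokens
    # (detail omitted in this sketch); output shape (b, n, d).
    recon = decoder(latent)

    target = torch.gather(
        patches, 1, masked_idx.unsqueeze(-1).expand(-1, -1, d))
    pred = torch.gather(
        recon, 1, masked_idx.unsqueeze(-1).expand(-1, -1, d))

    # MSE is computed on the masked patches only, as in MAE [3].
    return torch.nn.functional.mse_loss(pred, target)
```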

Prithvi is an open-source model released on Hugging Face. Try it out yourself for inference or fine-tuning on interesting remote-sensing problems: https://huggingface.co/ibm-nasa-geospatial/Prithvi-100M. You can read more about the model here: https://arxiv.org/abs/2310.18660
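
If you want to inspect the released checkpoint, the sketch below uses the standard `huggingface_hub` download API. The checkpoint filename is an assumption on our part, so check the model card for the exact file names and the accompanying model code.

```python
import torch
from huggingface_hub import hf_hub_download

# Filename is an assumption -- see the model card on Hugging Face
# for the exact checkpoint name and loading instructions.
ckpt_path = hf_hub_download(
    repo_id="ibm-nasa-geospatial/Prithvi-100M",
    filename="Prithvi_100M.pt",
)
state_dict = torch.load(ckpt_path, map_location="cpu")
print(list(state_dict.keys())[:5])  # peek at the encoder weight names
```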


Fine-tuning for LULC mapping:

We approached the LULC mapping task as pixel-wise semantic segmentation: we initialized the encoder with Prithvi's pre-trained weights and trained a decoder head on top of it. The semantic segmentation decoder used for our downstream task is a fully convolutional network (FCN).
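
As a rough illustration of this setup, the sketch below puts a small fully convolutional head on top of ViT-style patch tokens. The channel counts, patch grid, and class count are illustrative assumptions, not the exact head used in our experiments.

```python
import torch
import torch.nn as nn

class FCNHead(nn.Module):
    """Minimal fully convolutional decoder: reshape ViT patch tokens back
    into a 2-D feature map, then convolve and upsample to per-pixel logits.
    The 14x14 grid, channel counts, and 9 classes are illustrative."""

    def __init__(self, embed_dim=768, num_classes=9, grid=14):
        super().__init__()
        self.grid = grid
        self.head = nn.Sequential(
            nn.Conv2d(embed_dim, 256, kernel_size=3, padding=1),
            nn.BatchNorm2d(256),
            nn.ReLU(inplace=True),
            nn.Conv2d(256, num_classes, kernel_size=1),
        )

    def forward(self, tokens):  # tokens: (batch, grid*grid, embed_dim)
        b, n, d = tokens.shape
        fmap = tokens.transpose(1, 2).reshape(b, d, self.grid, self.grid)
        logits = self.head(fmap)
        # Upsample patch-level logits to full image resolution (224x224 here).
        return nn.functional.interpolate(
            logits, scale_factor=16, mode="bilinear", align_corners=False)

# Fine-tuning sketch: encoder initialized from Prithvi, decoder trained
# from scratch with pixel-wise cross-entropy:
# loss = nn.CrossEntropyLoss()(decoder(encoder(images)), label_masks)
```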

Data: We used images from the harmonized surface reflectance product HLS (version 2.0) as input, with Sentinel-2 (S2) LULC labels [4] as ground truth, to fine-tune the Prithvi model. Since the S2 LULC labels are naturally imbalanced, we took steps to obtain a balanced training dataset. Creating a balanced dataset for pixel-wise segmentation poses more challenges than for a classification task, so we used heuristics to build a dataset that is as balanced as possible. We compared the performance of the fine-tuned Prithvi model with a traditional deep learning model, U-Net, as well as with ViT, a state-of-the-art foundation model proposed for natural images.
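
We don't spell out the balancing heuristics here, but one common approach, sketched below purely as an illustration, is to oversample tiles that contain rare classes using per-tile class histograms.

```python
import numpy as np

def tile_sampling_weights(label_tiles, num_classes):
    """One possible balancing heuristic (illustrative, not our exact recipe):
    weight each tile by the rarity of the classes it contains, so tiles with
    under-represented classes are drawn more often during training.

    label_tiles: list of (H, W) integer label arrays, one per training tile.
    """
    # Global pixel frequency of each class across the training set.
    global_counts = np.zeros(num_classes)
    for tile in label_tiles:
        global_counts += np.bincount(tile.ravel(), minlength=num_classes)
    class_rarity = 1.0 / np.maximum(global_counts / global_counts.sum(), 1e-6)

    # A tile's weight is the rarity-weighted share of its pixels.
    weights = []
    for tile in label_tiles:
        hist = np.bincount(tile.ravel(), minlength=num_classes) / tile.size
        weights.append(float(hist @ class_rarity))
    weights = np.asarray(weights)
    return weights / weights.sum()  # sampling probabilities

# These probabilities can feed a weighted sampler, e.g.
# torch.utils.data.WeightedRandomSampler(weights, num_samples=len(weights)).
```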

We observed that Prithvi achieved a mean Intersection over Union (mIoU) of 62.37%, outperforming the baseline models U-Net and ViT, which achieved mIoUs of 36.36% and 46.8% respectively. Prithvi also performs better than the baselines on each individual class. We found that the model could easily detect water, trees, bare ground, and rangeland; however, it struggled with crops, flooded vegetation, built areas, snow, and clouds, yielding low IoU for these classes. Crops and flooded vegetation are difficult to predict because of the insufficient temporal resolution of the labels: cropping patterns change frequently, but the S2 LULC data provides only one map per year for a given region. The mIoU increases from 46.8% to 62.37% as we go from Prithvi (pre-trained) to Prithvi (fine-tuned), which clearly demonstrates the effectiveness of fine-tuning the pre-trained weights.
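
For reference, mIoU averages the per-class intersection-over-union between the predicted and ground-truth maps. A minimal NumPy version is below; the handling of absent classes is one common convention and may differ from our evaluation code.

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Per-class IoU = |pred ∩ target| / |pred ∪ target|, averaged over
    classes that appear in either map (a common convention)."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, target == c).sum()
        union = np.logical_or(pred == c, target == c).sum()
        if union > 0:
            ious.append(inter / union)
    return float(np.mean(ious))
```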

Figure 1: Impact of training-data size on model performance (mIoU), mean over 5 random seeds with ±σ bands.

We ran experiments to understand the impact of training data size on the performance of the fine-tuned geospatial FM. The figure above summarizes these experiments: the darker lines show the mean over 5 different random seeds, and the lighter bands show the region from -σ to +σ. In these experiments, we reduced the training data size while keeping the validation and test sets constant. As the training set shrinks, the mIoU decreases; the performance of the Prithvi model trained on reduced data is lower but still satisfactory. Even with an 87.5% reduction in data, it attains an mIoU of 37%.
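
A sketch of how such a study can be organized is below. Here `train_and_eval` is a hypothetical helper standing in for the full fine-tuning and evaluation pipeline, and the fractions mirror the reductions described above.

```python
import numpy as np

def data_reduction_study(train_tiles, train_and_eval,
                         fractions=(1.0, 0.5, 0.25, 0.125), seeds=range(5)):
    """Hypothetical experiment driver: train_and_eval(subset, seed) is a
    stand-in for fine-tuning plus evaluation, returning mIoU on a fixed
    validation/test split."""
    results = {}
    for frac in fractions:
        scores = []
        for seed in seeds:
            rng = np.random.default_rng(seed)
            n = int(len(train_tiles) * frac)
            subset = rng.choice(len(train_tiles), size=n, replace=False)
            # Validation and test sets stay fixed inside train_and_eval.
            scores.append(train_and_eval([train_tiles[i] for i in subset], seed))
        results[frac] = (np.mean(scores), np.std(scores))  # mean ± sigma bands
    return results
```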

The data reduction experiments were repeated with the ViT and U-Net baselines. Prithvi consistently performed best across all data reduction settings, and it picks up the task well within the first few epochs compared to U-Net and ViT. U-Net's performance drops drastically as the dataset is reduced from 100% to 50% of its size. This set of experiments shows that pre-trained models have a clear advantage when labeled samples are limited.

This blog is based on a paper recently accepted at the Fragile Earth workshop at KDD.

Co-authors: Ayush Jain, Ranjini Guruprasad, Kamal Das, Johannes Jakubik, Bianca Zadrozny

References

[1] https://hls.gsfc.nasa.gov/

[2] https://huggingface.co/docs/transformers/en/model_doc/vit

[3] https://arxiv.org/abs/2111.06377

[4] https://www.esri.com/partners/impact-observatory-a2T5x0000084pJXEAY/sentinel-2-10m-land--a2d5x000005kRoNAAU
