https://github.com/chen-yang-liu/awesome-rs-spatiotemporal-vlms

🔥Remote Sensing Spatio-Temporal Vision-Language Models: A Comprehensive Survey

Keywords

change-detetion foundation-models large-language-models remote-sensing spatio-temporal-analysis vision-language

Last synced: 10 months ago

Repository

🔥Remote Sensing Spatio-Temporal Vision-Language Models: A Comprehensive Survey

Basic Info

Host: GitHub
Owner: Chen-Yang-Liu
Default Branch: main
Homepage:
Size: 13.7 MB

Statistics

Stars: 153
Watchers: 4
Forks: 8
Open Issues: 1
Releases: 0

Created over 1 year ago · Last pushed 10 months ago

Metadata Files

Readme

Remote Sensing Spatio-Temporal Vision-Language Models: A Comprehensive Survey

Chenyang Liu · Jiafan Zhang · Keyan Chen · Man Wang · Zhengxia Zou · Zhenwei Shi*✉

This repo records and tracks recent Remote Sensing Spatio-Temporal Vision-Language Models (RS-STVLMs). If you find any work missing or have any suggestions (papers, implementations, and other resources), feel free to open a pull request.

⭐ Give us a ⭐

Give us a ⭐ if you're interested in this repo. We will continue to track relevant progress and update this repository.

🙌 Add Your Paper to Our Repo and Survey!

You are welcome to open an issue or PR for your RS-STVLM work! We will record it for the next version of our survey.

🥳 News

🔥🔥🔥 The repo is being updated 🔥🔥🔥

✨ Highlight!!

✅ The first survey for Remote Sensing Spatio-Temporal Vision-Language Models.

✅ Some public datasets and code links are provided.

✅ We will continue to track related work in this repository.

📖 Introduction

[Figure: Timeline of RS-STVLMs]

📖 Table of Contents

📚 Remote Sensing Spatio-Temporal Vision-language Tasks and Methods
👨‍🏫 Large Language Models Meet Temporal Images
🛰️ Dataset
💻 Others
🖊️ Citation
🐲 Contact

📚 Remote Sensing Spatio-Temporal Vision-language Tasks and Methods

Change Captioning

Multitask Learning of Change Detection and Change Captioning

| Time | Model Name | Paper Title | Visual Encoder | Language Decoder | Code/Project |
|:---:|:---:|---|:---:|:---:|:---:|
| 2024.01 | Pix4Cap | Pixel-Level Change Detection Pseudo-Label Learning for Remote Sensing Change Captioning | ViT-B/32 | Transformer Decoder | link |
| 2024.03 | Change-Agent | Change-Agent: Toward Interactive Comprehensive Remote Sensing Change Interpretation and Analysis | ViT-B/32 | Transformer Decoder | link |
| 2024.07 | Semantic-CC | Semantic-CC: Boosting Remote Sensing Image Change Captioning via Foundational Knowledge and Semantic Guidance | SAM | Vicuna | N/A |
| 2024.09 | DetACC * | Detection Assisted Change Captioning for Remote Sensing Image | ResNet-101 | Transformer Decoder | N/A |
| 2024.09 | KCFI | Enhancing Perception of Key Changes in Remote Sensing Image Change Captioning | ViT | Qwen | link |
| 2024.10 | MV-CC * | MV-CC: Mask Enhanced Video Model for Remote Sensing Change Caption | InternVideo2 | Transformer Decoder | link |
| 2024.10 | ChangeMinds | ChangeMinds: Multi-task Framework for Detecting and Describing Changes in Remote Sensing | Swin Transformer | Transformer Decoder | link |
| 2024.10 | CTMTNet | A Multi-Task Network and Two Large Scale Datasets for Change Detection and Captioning in Remote Sensing Images | ResNet-101 | Transformer Decoder | N/A |
| 2024.12 | Mask Approx Net | Mask Approximation Net: A Novel Diffusion Model Approach for Remote Sensing Change Captioning | ResNet | Transformer Decoder | link |
| 2025.01 | MModalCC * | Robust Change Captioning in Remote Sensing: SECOND-CC Dataset and MModalCC Framework | ResNet-101 | Transformer Decoder | link |
| 2025.03 | CD4C * | CD4C: Change Detection for Remote Sensing Image Change Captioning | ResNet-101 | Transformer Decoder | N/A |
| 2025.04 | FST-Net | Frequency–Spatial–Temporal Domain Fusion Network for Remote Sensing Image Change Captioning | Segformer | Transformer Decoder | N/A |
| ...... |

Change Question Answering

| Time | Model Name | Paper Title | Visual Encoder | Language Decoder | Code/Project |
|:---:|:---:|---|:---:|:---:|:---:|
| 2022.07 | change-aware VQA | Change-Aware Visual Question Answering | CNN | RNN | N/A |
| 2022.09 | CDVQA-Net | Change Detection Meets Visual Question Answering | CNN | RNN | link |
| 2024.09 | ChangeChat | ChangeChat: An Interactive Model for Remote Sensing Change Analysis via Multimodal Instruction Tuning | CLIP-ViT | Vicuna-v1.5 | link |
| 2024.09 | CDChat | CDChat: A Large Multimodal Model for Remote Sensing Change Description | CLIP ViT-L/14 | Vicuna-v1.5 | link |
| 2024.10 | TEOChat | TEOChat: A Large Vision-Language Assistant for Temporal Earth Observation Data | CLIP ViT-L/14 | LLaMA-2 | link |
| 2024.10 | GeoLLaVA | GeoLLaVA: Efficient Fine-Tuned Vision-Language Models for Temporal Change Detection in Remote Sensing | Video encoder | LLaVA-NeXT, Video-LLaVA | link |
| 2024.10 | VisTA | Show Me What and Where has Changed? Question Answering and Grounding for Remote Sensing Change Detection | CLIP image Encoder | CLIP Text Encoder | link |
| 2024.12 | RSUniVLM | RSUniVLM: A Unified Vision Language Model for Remote Sensing via Granularity-oriented Mixture of Experts | Siglip-400m | Qwen2-0.5B | link |
| 2024.12 | EarthDial | EarthDial: Turning Multi-sensory Earth Observations to Interactive Dialogues | InternViT-300M | Phi-3-mini | link |
| 2024.12 | UniRS | UniRS: Unifying Multi-temporal Remote Sensing Tasks through Vision Language Models | Siglip-400m | Sheared-LLAMA-3B | link |
| 2025.05 | DVLChat | DynamicVL: Benchmarking Multimodal Large Language Models for Dynamic City Understanding | SAM | Qwen2.5-VL | N/A |
| ...... |

Text-driven Temporal Images Retrieval

| Time | Model Name | Paper Title | Code/Project |
|:---:|:---:|---|:---:|
| 2024.06 | ChangeRetCap | Towards a multimodal framework for remote sensing image change retrieval and captioning | link |
| 2025.01 | text-ITSR | Self-Supervised Cross-Modal Text-Image Time Series Retrieval in Remote Sensing | N/A |
| ........ |

Change Grounding

Text-driven Temporal Images Generation

| Time | Model Name | Paper Title | Code/Project |
|:---:|:---:|---|:---:|
| 2025.02 | TGIPG | Image Editing based on Diffusion Model for Remote Sensing Image Change Captioning | N/A |
| 2025.04 | ChangeDiff | ChangeDiff: A Multi-Temporal Change Detection Data Generator with Flexible Text Prompts via Diffusion Model | link |
| 2025.07 | -- | Open-vocabulary generative vision-language models for creating a large-scale remote sensing change detection dataset | link |
| 2025.07 | ChangeBridge | ChangeBridge: Spatiotemporal Image Generation with Multimodal Controls for Remote Sensing | N/A |
| ........ |

👨‍🏫 Large Language Models Meet Temporal Images

LLM-driven Task-Specific Spatio-Temporal VLMs

| Time | Method | Paper Title | Visual Encoder | LLM | Fine-tuning | Code/Project |
|:---:|:---:|---|:---:|:---:|:---:|:---:|
| 2023.10 | PromptCC | A Decoupling Paradigm with Prompt Learning for Remote Sensing Image Change Captioning | CLIP-ViT-B/32 | GPT-2 | Prompt Tuning | link |
| 2024.06 | ChangeExp | Towards Temporal Change Explanations from Bi-Temporal Satellite Images | CLIP-ViT-L | LLaVA-1.5 | Prompt Method | N/A |
| 2024.07 | Semantic-CC | Semantic-CC: Boosting Remote Sensing Image Change Captioning via Foundational Knowledge and Semantic Guidance | SAM | Vicuna | LoRA | N/A |
| 2024.09 | KCFI | Enhancing Perception of Key Changes in Remote Sensing Image Change Captioning | ViT | Qwen | Prompt Tuning | link |
| 2024.09 | CDChat | CDChat: A Large Multimodal Model for Remote Sensing Change Description | CLIP-ViT-L/14 | Vicuna-v1.5 | LoRA | link |
| 2024.10 | GeoLLaVA | GeoLLaVA: Efficient Fine-Tuned Vision-Language Models for Temporal Change Detection in Remote Sensing | Siglip-400m | LLaVA-NeXT | LoRA | link |
| 2024.10 | Chareption | Chareption: Change-Aware Adaption Empowers Large Language Model for Effective Remote Sensing Image Change Captioning | CLIP-ViT-L/14 | LLaMA-7B | Adapter | N/A |
| 2024.11 | CCExpert | CCExpert: Advancing MLLM Capability in Remote Sensing Change Captioning with Difference-Aware Integration and a Foundational Dataset | Siglip-400m | Qwen-2 | LoRA | link |
| ........ |

Unified Spatio-Temporal Vision-Language Foundation Models

| Time | Method | Paper Title | Visual Encoder | LLM | Fine-tuning | Code/Project |
|:---:|:---:|---|:---:|:---:|:---:|:---:|
| 2024.03 | Change-Agent | Change-Agent: Toward Interactive Comprehensive Remote Sensing Change Interpretation and Analysis | Segformer | ChatGPT | Frozen | link |
| 2024.09 | ChangeChat | ChangeChat: An Interactive Model for Remote Sensing Change Analysis via Multimodal Instruction Tuning | CLIP-ViT | Vicuna-v1.5 | LoRA | link |
| 2024.10 | TEOChat | TEOChat: A Large Vision-Language Assistant for Temporal Earth Observation Data | CLIP ViT-L/14 | LLaMA-2 | LoRA | link |
| 2024.12 | RingMoGPT | RingMoGPT: A Unified Remote Sensing Foundation Model for Vision, Language, and grounded tasks | ViT-g/14(EVA-CLIP) | Vicuna-13B | Frozen | N/A |
| 2024.12 | RSUniVLM | RSUniVLM: A Unified Vision Language Model for Remote Sensing via Granularity-oriented Mixture of Experts | Siglip-400m | Qwen2-0.5B | MoE | link |
| 2024.12 | EarthDial | EarthDial: Turning Multi-sensory Earth Observations to Interactive Dialogues | InternViT-300M | Phi-3-mini | Fully Fine-tuning | link |
| 2024.12 | UniRS | UniRS: Unifying Multi-temporal Remote Sensing Tasks through Vision Language Models | Siglip-400m | Sheared-LLAMA-3B | Fully Fine-tuning | link |
| 2025.03 | Falcon | Falcon: A Remote Sensing Vision-Language Foundation Model | DaViT | Florence-2 | Fully Fine-tuning | link |
| 2025.03 | GeoRSMLLM | GeoRSMLLM: A Multimodal Large Language Model for Vision-Language Tasks in Geoscience and Remote Sensing | SigLIP | Qwen2-7B | N/A | N/A |
| 2025.05 | DVLChat | DynamicVL: Benchmarking Multimodal Large Language Models for Dynamic City Understanding | SAM | Qwen2.5-VL | LoRA | N/A |
| ........ |

LLM-driven Remote Sensing Vision-Language Agents

🛰️ Dataset

Matching Temporal Images, Text, and Masks

| Dataset | Time | Image Size | Image Resolution | Image Pairs | Captions* | Masks | Temporal Image Data Source | Anno. | Link |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| DUBAI CCD | 2022.08 | 50×50 | 30m | 500 | 2,500 | - | Landsat-7 imagery | Manual | Link |
| LEVIR CCD | 2022.08 | 256×256 | 0.5m | 500 | 2,500 | - | LEVIR-CD | Manual | Link |
| LEVIR-CC | 2022.11 | 256×256 | 0.5m | 10,077 | 50,385 | - | LEVIR-CD | Manual | Link |
| CCExpert | 2024.11 | - | - | 200K | 1.2M | - | LEVIR-CC, CLEVR-Change, ImageEdit, Spot-the-Diff, STVchrono, Vismin, ChangeSim, SYSU-CD, SECOND | Auto. | Link |
| SECTION | 2025.07 | 256×256 | 0.3~3m | 4,059 | 12,200 | - | SECOND | Manual | Link |
| LEVIR-MCI | 2024.03 | 256×256 | 0.5m | 10,077 | 50,385 | building, road | LEVIR-CC | Manual | Link |
| LEVIR-CDC | 2024.11 | 256×256 | 0.5m | 10,077 | 50,385 | building | LEVIR-CC | Manual | Link |
| WHU-CDC | 2024.11 | 256×256 | 0.075m | 7,434 | 37,170 | building | WHU-CD | Manual | Link |
| SECOND-CC | 2025.01 | 256×256 | 0.3~3m | 6,041 | 30,205 | 6 classes | SECOND | Manual | Link |
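
The caption and image-pair counts in the table above imply how densely each dataset is annotated. A minimal Python sketch of that arithmetic follows. The counts are copied from the table; the `DATASETS` dictionary and the `captions_per_pair` helper are illustrative names, not part of any dataset's tooling. CCExpert is left out because the table only gives rounded figures (200K pairs, 1.2M captions).

```python
# Image-pair and caption counts copied from the dataset table above.
# The dictionary layout and the helper name are illustrative assumptions.
DATASETS = {
    "DUBAI CCD": (500, 2_500),
    "LEVIR CCD": (500, 2_500),
    "LEVIR-CC": (10_077, 50_385),
    "SECTION": (4_059, 12_200),
    "LEVIR-MCI": (10_077, 50_385),
    "LEVIR-CDC": (10_077, 50_385),
    "WHU-CDC": (7_434, 37_170),
    "SECOND-CC": (6_041, 30_205),
}


def captions_per_pair(name: str) -> float:
    """Return the average number of captions per bi-temporal image pair."""
    pairs, captions = DATASETS[name]
    return captions / pairs


if __name__ == "__main__":
    for name in DATASETS:
        print(f"{name}: {captions_per_pair(name):.2f} captions per pair")
```

By these counts, most of the listed datasets work out to exactly five captions per image pair, while SECTION comes to roughly three.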

Matching Temporal Images, Instruction and Response

| Dataset | Time | Instruction Samples | Number of Images | Temporal Length | Temporal Image Data Source | Anno. | Link |
|---|---|---|---|---|---|---|---|
| CDVQA | 2022.09 | 122,000 | 2,968 | 2 | SECOND | Manual | Link |
| ChangeChat-87k | 2024.09 | 87,195 | 10,077 | 2 | LEVIR-CC, LEVIR-MCI | Auto. | Link |
| QAG-360K | 2024.10 | 360,000 | 6,810 | 2 | Hi-UCD, SECOND, LEVIR-CD | Auto. | Link |
| GeoLLaVA | 2024.10 | 100,000 | 100,000 | 2 | fMoW | Auto. | Link |
| TEOChatlas | 2024.10 | 554,071 | - | 1~8 | xBD, S2Looking, QFabric, fMoW | Auto. | Link |
| EarthDial | 2024.12 | 11.11 Million | - | 1~4 | fMoW, TreeSatAI-Time-Series, MUDS, xBD, QuakeSet | Manual & Auto. | Link |
| UniRS | 2024.12 | 318.8 K | - | 1~T (T>2) | LEVIR-CC, ERA-Video | Auto. | Link |
| Falcon_SFT | 2025.03 | 78 Million | 5.6 Million | 1~2 | CDD, EGY-BCD, HRSCD, LEVIR-CD, MSBC, MSOSCD, NJDS, S2Looking, SYSU-CD, WHU-CD | Auto. | Link |
| DVL-Suite | 2025.05 | 69,926 | 15,063 | 6.9 (Average) | U.S. National Agriculture Imagery Program (NAIP) | Manual & Auto. | N/A |
| .... |

💻 Others

Some CLIP Models in Remote Sensing

🖊️ Citation

If you find our survey and repository useful for your research, please consider citing our paper:

@misc{liu2024remotesensingtemporalvisionlanguage,
  title={Remote Sensing Spatio-Temporal Vision-Language Models: A Comprehensive Survey},
  author={Chenyang Liu and Jiafan Zhang and Keyan Chen and Man Wang and Zhengxia Zou and Zhenwei Shi},
  year={2024},
  eprint={2412.02573},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2412.02573},
}

🐲 Contact

liuchenyang@buaa.edu.cn

Owner

Name: Liu Chenyang
Login: Chen-Yang-Liu
Kind: user
Location: Beijing

Website: https://Chen-Yang-Liu.github.io
Repositories: 15
Profile: https://github.com/Chen-Yang-Liu

GitHub Events

Total

Watch event: 39
Push event: 17
Fork event: 2

Last Year

Watch event: 39
Push event: 17
Fork event: 2

Science Score: 49.0%
