https://github.com/chen-yang-liu/awesome-rs-spatiotemporal-vlms

🔥Remote Sensing Spatio-Temporal Vision-Language Models: A Comprehensive Survey


Science Score: 49.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 6 DOI reference(s) in README
  • Academic publication links
    Links to: arxiv.org, scholar.google, sciencedirect.com, springer.com, mdpi.com, ieee.org, iop.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (10.3%) to scientific vocabulary

Keywords

change-detetion foundation-models large-language-models remote-sensing spatio-temporal-analysis vision-language
Last synced: 6 months ago

Repository

🔥Remote Sensing Spatio-Temporal Vision-Language Models: A Comprehensive Survey

Basic Info
  • Host: GitHub
  • Owner: Chen-Yang-Liu
  • Default Branch: main
  • Homepage:
  • Size: 13.7 MB
Statistics
  • Stars: 153
  • Watchers: 4
  • Forks: 8
  • Open Issues: 1
  • Releases: 0
Topics
change-detetion foundation-models large-language-models remote-sensing spatio-temporal-analysis vision-language
Created about 1 year ago · Last pushed 6 months ago
Metadata Files
Readme

README.md


Remote Sensing Spatio-Temporal Vision-Language Models: A Comprehensive Survey


Chenyang Liu · Jiafan Zhang · Keyan Chen · Man Wang · Zhengxia Zou · Zhenwei Shi*✉

arXiv PDF


This repo records and tracks recent Remote Sensing Spatio-Temporal Vision-Language Models (RS-STVLMs). If you find any work missing or have suggestions (papers, implementations, or other resources), feel free to open a pull request.

⭐ Give us a ⭐

Give us a ⭐ if you find this repo useful. We will keep tracking relevant progress and updating this repository.

🙌 Add Your Paper to Our Repo and Survey!

  • You are welcome to open an issue or PR for your RS-STVLM work! We will record it for the next version of our survey.

🥳 News

🔥🔥🔥 The repo is being updated 🔥🔥🔥

✨ Highlight!!

✅ The first survey for Remote Sensing Spatio-Temporal Vision-Language Models.

✅ Some public datasets and code links are provided.

✅ We will continue to track related work in this repository.

📖 Introduction

Timeline of RS-STVLMs:

[Figure: timeline of RS-STVLMs]

📖 Table of Contents

📚 Remote Sensing Spatio-Temporal Vision-language Tasks and Methods

Change Captioning

| Time | Model Name | Paper Title | Visual Encoder | Language Decoder | Code/Project |
|:---:|:---:|---|:---:|:---:|:---:|
| 2021.10 | CNN-RNN | Captioning changes in bi-temporal remote sensing images | VGG-16 | RNN | N/A |
| 2022.08 | CC-RNN/SVM | Change captioning: A new paradigm for multitemporal remote sensing image analysis | VGG-16 | RNN, SVM | N/A |
| 2022.11 | RSICCformer | Remote sensing image change captioning with dual-branch transformers: A new method and a large scale dataset | ResNet-101 | Transformer Decoder | link |
| 2023.07 | PSNet | Progressive Scale-aware Network for Remote sensing Image Change Captioning | ViT-B/32 | Transformer Decoder | link |
| 2023.10 | PromptCC | A Decoupling Paradigm with Prompt Learning for Remote Sensing Image Change Captioning | ViT-B/32 | GPT-2 | link |
| 2023.11 | Chg2Cap | Changes to Captions: An Attentive Network for Remote Sensing Change Captioning | ResNet-101 | Transformer Decoder | link |
| 2023.11 | ICT-Net | Interactive Change-Aware Transformer Network for Remote Sensing Image Change Captioning | ResNet-101 | Transformer Decoder | link |
| 2024.03 | SITS-CC | Change Caption for Satellite Images Time Series | ResNet-101 | Transformer Decoder | link |
| 2024.05 | RSCaMa | RSCaMa: Remote Sensing Image Change Captioning with State Space Model | ViT-B/32 | Mamba, Transformer Decoder, GPT-2 | link |
| 2024.05 | SparseFocus | A Lightweight Sparse Focus Transformer for Remote Sensing Image Change Captioning | ResNet-101 | Transformer Decoder | link |
| 2024.05 | SEN | Single-stream Extractor Network with Contrastive Pre-training for Remote Sensing Change Captioning | ResNet with 6-channel | Transformer Decoder | link |
| 2024.05 | Diffusion-RSCC | Diffusion model for learning cross-modal data distribution | ResNet-101 | Diffusion | link |
| 2024.05 | CARD | Context-aware Difference Distilling for Multi-change Captioning | ResNet-101 | Transformer Decoder | link |
| 2024.06 | ChangeRetCap | Towards a multimodal framework for remote sensing image change retrieval and captioning | ResNet-101 | Transformer Decoder | link |
| 2024.06 | Intelli-Change | Intelli-Change Remote Sensing - A Novel Transformer Approach | ResNet-101 | Transformer Decoder | N/A |
| 2024.06 | ChangeExp | Towards Temporal Change Explanations from Bi-Temporal Satellite Images | LLaVA-1.5 | LLaVA-1.5 | N/A |
| 2024.07 | MAF-Net | Multi-scale Attentive Fusion Network for Remote Sensing Image Change Captioning | ResNet-101 | Transformer Decoder | N/A |
| 2024.07 | SFEN | Scale-wised feature enhancement network for change captioning of remote sensing images | WideResNet | Transformer Decoder | N/A |
| 2024.09 | MfrNet | MfrNet: A New Multi-Scale Feature Refining Method for Remote Sensing Image Change Captioning | ResNet-18 | Transformer Decoder | N/A |
| 2024.09 | SEIFNet | Inter-Temporal Interaction and Symmetric Difference Learning for Remote Sensing Image Change Captioning | ResNet-101 | Transformer Decoder | link |
| 2024.10 | MV-CC | MV-CC: Mask Enhanced Video Model for Remote Sensing Change Caption | InternVideo2 | Transformer Decoder | link |
| 2024.10 | Chareption | Chareption: Change-Aware Adaption Empowers Large Language Model for Effective Remote Sensing Image Change Captioning | CLIP ViT-L/14 | LLaMA-7B | N/A |
| 2024.11 | MADiffCC | Remote Sensing Image Change Captioning Using Multi-Attentive Network with Diffusion Model | Diffusion | Transformer Decoder | N/A |
| 2024.11 | CCExpert | CCExpert: Advancing MLLM Capability in Remote Sensing Change Captioning with Difference-Aware Integration and a Foundational Dataset | Siglip-400m | Qwen-2 | link |
| 2024.12 | --- | Data Augmentation in Remote Sensing Image Change Captioning | ViT-B/32 | Transformer Decoder | N/A |
| 2024.12 | Mask Approx Net | Mask Approximation Net: A Novel Diffusion Model Approach for Remote Sensing Change Captioning | ResNet | Transformer Decoder | link |
| 2025.01 | SAT-Cap | Change Captioning in Remote Sensing: Evolution to SAT-Cap -- A Single-Stage Transformer Approach | ResNet-101 | Transformer Decoder | link |
| 2025.01 | MModalCC | Robust Change Captioning in Remote Sensing: SECOND-CC Dataset and MModalCC Framework | ResNet-101 | Transformer Decoder | link |
| 2025.01 | SGD-RSCCN | Scene Graph and Dependency Grammar Enhanced Remote Sensing Change Caption Network (SGD-RSCCN) | ResNet-101 | Transformer Decoder | N/A |
| 2025.02 | TGIPG | Image Editing based on Diffusion Model for Remote Sensing Image Change Captioning | // | // | N/A |
| 2025.03 | Change3D | Change3D: Revisiting Change Detection and Captioning from A Video Modeling Perspective | X3D-L (video) | Transformer Decoder | link |
| 2025.03 | CD4C | CD4C: Change Detection for Remote Sensing Image Change Captioning | ResNet-101 | Transformer Decoder | N/A |
| 2025.04 | RDD+ACR | Region-aware Difference Distilling with Attribute-guided Contrastive Regularization for Change Captioning | ResNet-101 | Transformer Decoder | N/A |
| 2025.04 | FST-Net | Frequency–Spatial–Temporal Domain Fusion Network for Remote Sensing Image Change Captioning | SegFormer | Transformer Decoder | N/A |
| 2025.05 | CTSD-Net | A Cross-Spatial Differential Localization Network for Remote Sensing Change Captioning | SegFormer | Transformer Decoder | N/A |
| 2025.06 | CTM | Cross-Temporal Remote Sensing Image Change Captioning: A Manifold Mapping and Bayesian Diffusion Approach for Land Use Monitoring | CLIP | Transformer Decoder | N/A |
| 2025.06 | IHM-SNet | IHM-SNet: An Interactive Hierarchical Mamba-Based Screening Network for Remote Sensing Image Change Captioning | CLIP-ViT | Transformer Decoder | N/A |
| 2025.07 | MTI-CC | Cross-layer Attention Enhanced Remote Sensing Image Change Captioning via Mamba-Transformer Interaction | CLIP-ViT | Transformer Decoder | N/A |
| 2025.08 | CI-Net | Restricted supervised Cascade Information Network for remote sensing change captioning with serial sentences | Asymmetric Siamese Network | Cascade Linguistic Module | N/A |
| 2025.08 | SCCNet | SCCNet: Siamese Networks for Selective Change Captioning in Bi-Temporal Remote Sensing Images | ViT | Transformer Decoder | N/A |
| 2025.08 | -- | Text-Augmented Semantic Feature Extraction and Difference Information Learning for Remote Sensing Image Change Captioning | FastSAM+CLIP | Transformer Decoder | link |
| 2025.08 | C3aptioner | C3aptioner: Improving Change Captioning by Leveraging Momentum Cross-view and Cross-modality Contrastive Learning | ResNet-101 | Transformer Decoder | N/A |
| ........

Multitask Learning of Change Detection and Change Captioning

| Time | Model Name | Paper Title | Visual Encoder | Language Decoder | Code/Project |
|:---:|:---:|---|:---:|:---:|:---:|
| 2024.01 | Pix4Cap | Pixel-Level Change Detection Pseudo-Label Learning for Remote Sensing Change Captioning | ViT-B/32 | Transformer Decoder | link |
| 2024.03 | Change-Agent | Change-Agent: Toward Interactive Comprehensive Remote Sensing Change Interpretation and Analysis | ViT-B/32 | Transformer Decoder | link |
| 2024.07 | Semantic-CC | Semantic-CC: Boosting Remote Sensing Image Change Captioning via Foundational Knowledge and Semantic Guidance | SAM | Vicuna | N/A |
| 2024.09 | DetACC * | Detection Assisted Change Captioning for Remote Sensing Image | ResNet-101 | Transformer Decoder | N/A |
| 2024.09 | KCFI | Enhancing Perception of Key Changes in Remote Sensing Image Change Captioning | ViT | Qwen | link |
| 2024.10 | MV-CC * | MV-CC: Mask Enhanced Video Model for Remote Sensing Change Caption | InternVideo2 | Transformer Decoder | link |
| 2024.10 | ChangeMinds | ChangeMinds: Multi-task Framework for Detecting and Describing Changes in Remote Sensing | Swin Transformer | Transformer Decoder | link |
| 2024.10 | CTMTNet | A Multi-Task Network and Two Large Scale Datasets for Change Detection and Captioning in Remote Sensing Images | ResNet-101 | Transformer Decoder | N/A |
| 2024.12 | Mask Approx Net | Mask Approximation Net: A Novel Diffusion Model Approach for Remote Sensing Change Captioning | ResNet | Transformer Decoder | link |
| 2025.01 | MModalCC * | Robust Change Captioning in Remote Sensing: SECOND-CC Dataset and MModalCC Framework | ResNet-101 | Transformer Decoder | link |
| 2025.03 | CD4C * | CD4C: Change Detection for Remote Sensing Image Change Captioning | ResNet-101 | Transformer Decoder | N/A |
| 2025.04 | FST-Net | Frequency–Spatial–Temporal Domain Fusion Network for Remote Sensing Image Change Captioning | Segformer | Transformer Decoder | N/A |
| ......

Change Question Answering

| Time | Model Name | Paper Title | Visual Encoder | Language Decoder | Code/Project |
|:---:|:---:|---|:---:|:---:|:---:|
| 2022.07 | change-aware VQA | Change-Aware Visual Question Answering | CNN | RNN | N/A |
| 2022.09 | CDVQA-Net | Change Detection Meets Visual Question Answering | CNN | RNN | link |
| 2024.09 | ChangeChat | ChangeChat: An Interactive Model for Remote Sensing Change Analysis via Multimodal Instruction Tuning | CLIP-ViT | Vicuna-v1.5 | link |
| 2024.09 | CDChat | CDChat: A Large Multimodal Model for Remote Sensing Change Description | CLIP ViT-L/14 | Vicuna-v1.5 | link |
| 2024.10 | TEOChat | TEOChat: A Large Vision-Language Assistant for Temporal Earth Observation Data | CLIP ViT-L/14 | LLaMA-2 | link |
| 2024.10 | GeoLLaVA | GeoLLaVA: Efficient Fine-Tuned Vision-Language Models for Temporal Change Detection in Remote Sensing | Video encoder | LLaVA-NeXT, Video-LLaVA | link |
| 2024.10 | VisTA | Show Me What and Where has Changed? Question Answering and Grounding for Remote Sensing Change Detection | CLIP image Encoder | CLIP Text Encoder | link |
| 2024.12 | RSUniVLM | RSUniVLM: A Unified Vision Language Model for Remote Sensing via Granularity-oriented Mixture of Experts | Siglip-400m | Qwen2-0.5B | link |
| 2024.12 | EarthDial | EarthDial: Turning Multi-sensory Earth Observations to Interactive Dialogues | InternViT-300M | Phi-3-mini | link |
| 2024.12 | UniRS | UniRS: Unifying Multi-temporal Remote Sensing Tasks through Vision Language Models | Siglip-400m | Sheared-LLAMA-3B | link |
| 2025.05 | DVLChat | DynamicVL: Benchmarking Multimodal Large Language Models for Dynamic City Understanding | SAM | Qwen2.5-VL | N/A |
| ......

Text-driven Temporal Images Retrieval

| Time | Model Name | Paper Title | Code/Project |
|:---:|:---:|---|:---:|
| 2024.06 | ChangeRetCap | Towards a multimodal framework for remote sensing image change retrieval and captioning | link |
| 2025.01 | text-ITSR | Self-Supervised Cross-Modal Text-Image Time Series Retrieval in Remote Sensing | N/A |
| ........

Change Grounding

| Time | Model Name | Grounding Output | Paper Title | Code/Project |
|:---:|:---:|:---:|---|:---:|
| 2024.09 | ChangeChat | mask | ChangeChat: An Interactive Model for Remote Sensing Change Analysis via Multimodal Instruction Tuning | link |
| 2024.10 | TEOChat | bbox | TEOChat: A Large Vision-Language Assistant for Temporal Earth Observation Data | link |
| 2024.10 | VisTA | mask | Show Me What and Where has Changed? Question Answering and Grounding for Remote Sensing Change Detection | link |
| 2024.12 | RSUniVLM | mask | RSUniVLM: A Unified Vision Language Model for Remote Sensing via Granularity-oriented Mixture of Experts | link |
| 2024.12 | EarthDial | bbox | EarthDial: Turning Multi-sensory Earth Observations to Interactive Dialogues | link |
| 2025.03 | Falcon | mask | Falcon: A Remote Sensing Vision-Language Foundation Model | link |
| 2025.03 | GeoRSMLLM | mask | GeoRSMLLM: A Multimodal Large Language Model for Vision-Language Tasks in Geoscience and Remote Sensing | N/A |
| ........

Text-driven Temporal Images Generation

| Time | Model Name | Paper Title | Code/Project |
|:---:|:---:|---|:---:|
| 2025.02 | TGIPG | Image Editing based on Diffusion Model for Remote Sensing Image Change Captioning | N/A |
| 2025.04 | ChangeDiff | ChangeDiff: A Multi-Temporal Change Detection Data Generator with Flexible Text Prompts via Diffusion Model | link |
| 2025.07 | -- | Open-vocabulary generative vision-language models for creating a large-scale remote sensing change detection dataset | link |
| 2025.07 | ChangeBridge | ChangeBridge: Spatiotemporal Image Generation with Multimodal Controls for Remote Sensing | N/A |
| ........

👨‍🏫 Large Language Models Meet Temporal Images

LLM-driven Task-Specific Spatio-Temporal VLMs

| Time | Method | Paper Title | Visual Encoder | LLM | Fine-tuning | Code/Project |
|:---:|:---:|---|:---:|:---:|:---:|:---:|
| 2023.10 | PromptCC | A Decoupling Paradigm with Prompt Learning for Remote Sensing Image Change Captioning | CLIP-ViT-B/32 | GPT-2 | Prompt Tuning | link |
| 2024.06 | ChangeExp | Towards Temporal Change Explanations from Bi-Temporal Satellite Images | CLIP-ViT-L | LLaVA-1.5 | Prompt Method | N/A |
| 2024.07 | Semantic-CC | Semantic-CC: Boosting Remote Sensing Image Change Captioning via Foundational Knowledge and Semantic Guidance | SAM | Vicuna | LoRA | N/A |
| 2024.09 | KCFI | Enhancing Perception of Key Changes in Remote Sensing Image Change Captioning | ViT | Qwen | Prompt Tuning | link |
| 2024.09 | CDChat | CDChat: A Large Multimodal Model for Remote Sensing Change Description | CLIP-ViT-L/14 | Vicuna-v1.5 | LoRA | link |
| 2024.10 | GeoLLaVA | GeoLLaVA: Efficient Fine-Tuned Vision-Language Models for Temporal Change Detection in Remote Sensing | Siglip-400m | LLaVA-NeXT | LoRA | link |
| 2024.10 | Chareption | Chareption: Change-Aware Adaption Empowers Large Language Model for Effective Remote Sensing Image Change Captioning | CLIP-ViT-L/14 | LLaMA-7B | Adapter | N/A |
| 2024.11 | CCExpert | CCExpert: Advancing MLLM Capability in Remote Sensing Change Captioning with Difference-Aware Integration and a Foundational Dataset | Siglip-400m | Qwen-2 | LoRA | link |
| ........

Unified Spatio-Temporal Vision-Language Foundation Models

| Time | Method | Paper Title | Visual Encoder | LLM | Fine-tuning | Code/Project |
|:---:|:---:|---|:---:|:---:|:---:|:---:|
| 2024.03 | Change-Agent | Change-Agent: Toward Interactive Comprehensive Remote Sensing Change Interpretation and Analysis | SegFormer | ChatGPT | Frozen | link |
| 2024.09 | ChangeChat | ChangeChat: An Interactive Model for Remote Sensing Change Analysis via Multimodal Instruction Tuning | CLIP-ViT | Vicuna-v1.5 | LoRA | link |
| 2024.10 | TEOChat | TEOChat: A Large Vision-Language Assistant for Temporal Earth Observation Data | CLIP ViT-L/14 | LLaMA-2 | LoRA | link |
| 2024.12 | RingMoGPT | RingMoGPT: A Unified Remote Sensing Foundation Model for Vision, Language, and grounded tasks | ViT-g/14 (EVA-CLIP) | Vicuna-13B | Frozen | N/A |
| 2024.12 | RSUniVLM | RSUniVLM: A Unified Vision Language Model for Remote Sensing via Granularity-oriented Mixture of Experts | Siglip-400m | Qwen2-0.5B | MoE | link |
| 2024.12 | EarthDial | EarthDial: Turning Multi-sensory Earth Observations to Interactive Dialogues | InternViT-300M | Phi-3-mini | Full Fine-tuning | link |
| 2024.12 | UniRS | UniRS: Unifying Multi-temporal Remote Sensing Tasks through Vision Language Models | Siglip-400m | Sheared-LLAMA-3B | Full Fine-tuning | link |
| 2025.03 | Falcon | Falcon: A Remote Sensing Vision-Language Foundation Model | DaViT | Florence-2 | Full Fine-tuning | link |
| 2025.03 | GeoRSMLLM | GeoRSMLLM: A Multimodal Large Language Model for Vision-Language Tasks in Geoscience and Remote Sensing | SigLIP | Qwen2-7B | N/A | N/A |
| 2025.05 | DVLChat | DynamicVL: Benchmarking Multimodal Large Language Models for Dynamic City Understanding | SAM | Qwen2.5-VL | LoRA | N/A |
| ........

LLM-driven Remote Sensing Vision-Language Agents

| Time | Method | Paper Title | Function | Code |
|:---:|:---:|:---:|:---:|:---:|
| 2024.01 | RSChatgpt | Remote Sensing ChatGPT: Solving Remote Sensing Tasks with ChatGPT and Visual Models | Single-image analysis | Link |
| 2024.03 | Change-Agent | Change-Agent: Toward Interactive Comprehensive Remote Sensing Change Interpretation and Analysis | Spatio-Temporal Change Interpretation | Link |
| 2024.06 | RS-Agent | RS-Agent: Automating Remote Sensing Tasks through Intelligent Agent | Tool selection and knowledge search | Link |
| 2024.07 | RS-AGENT | RS-AGENT: Large Language Models Guided Agent System for Remote Sensing Image Generation | Image Generation | N/A |
| 2024.12 | GeoTool-GPT | GeoTool-GPT: a trainable method for facilitating Large Language Models to master GIS tools | Master GIS tools | N/A |
| 2025.01 | RescueADI | RescueADI: Adaptive Disaster Interpretation in Remote Sensing Images With Autonomous Agents | Disaster Interpretation | N/A |
| ........

🛰️ Dataset

Matching Temporal Images, Text, and Masks

| Dataset | Time | Image Size | Image Resolution | Image Pairs | Captions* | Masks | Temporal Image Data Source | Anno. | Link |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| DUBAI CCD | 2022.08 | 50×50 | 30m | 500 | 2,500 | - | Landsat-7 imagery | Manual | Link |
| LEVIR CCD | 2022.08 | 256×256 | 0.5m | 500 | 2,500 | - | LEVIR-CD | Manual | Link |
| LEVIR-CC | 2022.11 | 256×256 | 0.5m | 10,077 | 50,385 | - | LEVIR-CD | Manual | Link |
| CCExpert | 2024.11 | - | - | 200K | 1.2M | - | LEVIR-CC, CLEVR-Change, ImageEdit, Spot-the-Diff, STVchrono, Vismin, ChangeSim, SYSU-CD, SECOND | Auto. | Link |
| SECTION | 2025.07 | 256×256 | 0.3∼3m | 4,059 | 12,200 | - | SECOND | Manual | Link |
| LEVIR-MCI | 2024.03 | 256×256 | 0.5m | 10,077 | 50,385 | building, road | LEVIR-CC | Manual | Link |
| LEVIR-CDC | 2024.11 | 256×256 | 0.5m | 10,077 | 50,385 | building | LEVIR-CC | Manual | Link |
| WHU-CDC | 2024.11 | 256×256 | 0.075m | 7,434 | 37,170 | building | WHU-CD | Manual | Link |
| SECOND-CC | 2025.01 | 256×256 | 0.3∼3m | 6,041 | 30,205 | 6 classes | SECOND | Manual | Link |

Matching Temporal Images, Instruction and Response

| Dataset | Time | Instruction Samples | Number of Images | Temporal Length | Temporal Image Data Source | Anno. | Link |
|---|---|---|---|---|---|---|---|
| CDVQA | 2022.09 | 122,000 | 2,968 | 2 | SECOND | Manual | Link |
| ChangeChat-87k | 2024.09 | 87,195 | 10,077 | 2 | LEVIR-CC, LEVIR-MCI | Auto. | Link |
| QAG-360K | 2024.10 | 360,000 | 6,810 | 2 | Hi-UCD, SECOND, LEVIR-CD | Auto. | Link |
| GeoLLaVA | 2024.10 | 100,000 | 100,000 | 2 | fMoW | Auto. | Link |
| TEOChatlas | 2024.10 | 554,071 | - | 1~8 | xBD, S2Looking, QFabric, fMoW | Auto. | Link |
| EarthDial | 2024.12 | 11.11 Million | - | 1~4 | fMoW, TreeSatAI-Time-Series, MUDS, xBD, QuakeSet | Manual & Auto. | Link |
| UniRS | 2024.12 | 318.8 K | - | 1~T (T>2) | LEVIR-CC, ERA-Video | Auto. | Link |
| Falcon_SFT | 2025.03 | 78 Million | 5.6 Million | 1~2 | CDD, EGY-BCD, HRSCD, LEVIR-CD, MSBC, MSOSCD, NJDS, S2Looking, SYSU-CD, WHU-CD | Auto. | Link |
| DVL-Suite | 2025.05 | 69,926 | 15,063 | 6.9 (Average) | U.S. National Agriculture Imagery Program (NAIP) | Manual & Auto. | N/A |
| ....

💻 Others

Some CLIP Models in Remote Sensing

| Time | Model Name | Paper Title | Code/Project |
|:---:|:---:|---|:---:|
| 2023.06 | RemoteCLIP | RemoteCLIP: A Vision Language Foundation Model for Remote Sensing | link |
| 2023.06 | GeoRSCLIP | RS5M and GeoRSCLIP: A Large-Scale Vision-Language Dataset and a Large Vision-Language Model for Remote Sensing | link |
| 2023.12 | SkyCLIP | SkyScript: a large and semantically diverse vision-language dataset for remote sensing | link |
| 2025.01 | Git-RSCLIP | Text2Earth: Unlocking text-driven remote sensing image generation with a global-scale dataset and a foundation model | link |


🖊️ Citation

If you find our survey and repository useful for your research, please consider citing our paper:

@misc{liu2024remotesensingtemporalvisionlanguage,
  title={Remote Sensing Spatio-Temporal Vision-Language Models: A Comprehensive Survey},
  author={Chenyang Liu and Jiafan Zhang and Keyan Chen and Man Wang and Zhengxia Zou and Zhenwei Shi},
  year={2024},
  eprint={2412.02573},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2412.02573},
}

🐲 Contact

liuchenyang@buaa.edu.cn

Owner

  • Name: Liu Chenyang
  • Login: Chen-Yang-Liu
  • Kind: user
  • Location: Beijing


GitHub Events

Total
  • Watch event: 39
  • Push event: 17
  • Fork event: 2
Last Year
  • Watch event: 39
  • Push event: 17
  • Fork event: 2