Updated 10 months ago
lmms-finetune
A minimal codebase for finetuning large multimodal models, supporting llava-1.5/1.6, llava-interleave, llava-next-video, llava-onevision, llama-3.2-vision, qwen-vl, qwen2-vl, phi3-v, etc.
Updated 9 months ago
https://github.com/ahwang16/grounded-intuition-gpt-vision
Resources for Grounded Intuition of GPT-Vision's Abilities with Scientific Images
Updated 10 months ago
text2earth
[IEEE GRSM 2025 🔥] "Text2Earth: Unlocking Text-driven Remote Sensing Image Generation with a Global-Scale Dataset and a Foundation Model"
Updated 9 months ago
https://github.com/amazon-science/mix-generation
MixGen: A New Multi-Modal Data Augmentation
Updated 9 months ago
https://github.com/chen-yang-liu/awesome-rs-spatiotemporal-vlms
🔥 Remote Sensing Spatio-Temporal Vision-Language Models: A Comprehensive Survey
Updated 9 months ago
https://github.com/bytedance/shot2story
Shot2Story: a new multi-shot video understanding benchmark with comprehensive video summaries and detailed shot-level captions.
Updated 10 months ago
drivelm
[ECCV 2024 Oral] DriveLM: Driving with Graph Visual Question Answering