https://github.com/google-research/foundation-model-embedded-3dgs

Science Score: 36.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: arxiv.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (11.1%) to scientific vocabulary
Last synced: 4 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: google-research
  • License: apache-2.0
  • Language: Python
  • Default Branch: main
  • Size: 11 MB
Statistics
  • Stars: 31
  • Watchers: 2
  • Forks: 4
  • Open Issues: 1
  • Releases: 0
Created almost 2 years ago · Last pushed 7 months ago
Metadata Files
Readme Contributing License

README.md

This is not an officially supported Google product. This project is not eligible for the Google Open Source Software Vulnerability Rewards Program.

FMGS: Foundation Model Embedded 3D Gaussian Splatting for Holistic 3D Scene Understanding

Xingxing Zuo, Pouya Samangouei, Yunwen Zhou, Yan Di, Mingyang Li
Google AR
arXiv Paper

TL;DR

FMGS embeds foundation models into a 3D scene representation that seamlessly integrates 3D Gaussians and multi-resolution hash encodings (MHE). The trained scene representation supports open-vocabulary querying of objects and unsupervised semantic segmentation.

Paper | Project Page

(Figures: teaser image and object-query example)

Abstract: Precisely perceiving the geometric and semantic properties of real-world 3D objects is crucial for the continued evolution of augmented reality and robotic applications. To this end, we present Foundation Model Embedded Gaussian Splatting (FMGS), which incorporates vision-language embeddings of foundation models into 3D Gaussian Splatting (GS). The key contribution of this work is an efficient method to reconstruct and represent 3D vision-language models. This is achieved by distilling feature maps generated from image-based foundation models into those rendered from our 3D model. To ensure high-quality rendering and fast training, we introduce a novel scene representation by integrating strengths from both GS and multi-resolution hash encodings (MHE). Our effective training procedure also introduces a pixel alignment loss that makes the rendered feature distance of same semantic entities close, following the pixel-level semantic boundaries. Our results demonstrate remarkable multi-view semantic consistency, facilitating diverse downstream tasks, like open-vocabulary language-based object detection and unsupervised semantic segmentation at a fast inference speed. This research explores the intersection of vision, language, and 3D scene representation, paving the way for enhanced scene understanding in uncontrolled real-world environments.
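
The multi-resolution hash encoding mentioned in the abstract is the Instant-NGP-style structure. As a rough illustration only (not the repository's actual implementation), a single-level spatial-hash feature lookup can be sketched as follows; the table size, feature dimension, and resolution below are made-up values:

```python
import numpy as np

# Per-axis hashing primes from Instant-NGP; values here are illustrative.
PRIMES = np.array([1, 2654435761, 805459861], dtype=np.uint64)

def hash_encode(coords, table, resolution):
    """Single-level spatial-hash feature lookup for 3D points in [0, 1)."""
    grid = np.floor(coords * resolution).astype(np.uint64)  # voxel indices
    h = grid * PRIMES                                        # wraps mod 2^64
    idx = (h[:, 0] ^ h[:, 1] ^ h[:, 2]) % np.uint64(len(table))
    return table[idx]

# Usage: 4 random points, a learnable table of 2^14 entries of 8-dim features.
rng = np.random.default_rng(0)
table = rng.normal(size=(2**14, 8)).astype(np.float32)
pts = rng.uniform(size=(4, 3))
feats = hash_encode(pts, table, resolution=64)
print(feats.shape)  # (4, 8)
```

A full MHE additionally interpolates the features of the eight voxel corners and concatenates lookups across many resolutions; this sketch shows only the hashing step.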

Update Log:
**June 27, 2024**: Initial release of the FMGS repository.

BibTeX

@Article{zuo2024fmgs,
      title={Fmgs: Foundation model embedded 3d gaussian splatting for holistic 3d scene understanding},
      author={Zuo, Xingxing and Samangouei, Pouya and Zhou, Yunwen and Di, Yan and Li, Mingyang},
      journal={arXiv preprint arXiv:2401.01970},
      year={2024},
      url={https://xingxingzuo.github.io/fmgs/}
}

Cloning the Repository

The repository contains submodules, so please check it out with:

```shell
# SSH
git clone https://github.com/googlestaging/foundation-model-embedded-3dgs.git --recursive
```

If you forgot `--recursive`:

```shell
git submodule update --init --recursive
```

Note that the simple-diff-gaussian-rasterization submodule for rendering high-dimensional feature maps is already included in the "submodules" directory.

Overview

The components have different requirements w.r.t. both hardware and software. They have been tested on Ubuntu Linux 20.04. Instructions for setting up and running each of them are found in the sections below.

Hardware and Software Requirements

  • CUDA-ready GPU with Compute Capability 7.0+
  • 24 GB VRAM (to train to paper evaluation quality)

  • Conda (recommended for easy setup)

  • C++ Compiler for PyTorch extensions

  • CUDA SDK 11 for PyTorch extensions, install after Visual Studio (we used 11.8)

  • C++ Compiler and CUDA SDK must be compatible

Local Setup

Our default, provided install method is based on Conda package and environment management:

```shell
conda env create --file environment.yml
conda activate fmgs
```

DATA and Pretrained Weights

For the open-vocabulary object detection task, we conducted experiments on the LERF dataset. The raw LERF dataset is in NS (NerfStudio) format, which is not compatible with Gaussian Splatting, since the latter operates on Colmap-format datasets. The difference between the two data formats is illustrated here.
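
One concrete difference between the formats, sketched below with made-up values: NerfStudio's transforms.json stores camera-to-world matrices, whereas Colmap stores world-to-camera poses, so converting between them involves inverting each pose (the two toolchains also differ in camera-axis conventions, which a real converter must additionally handle):

```python
import numpy as np

# A NerfStudio-style camera-to-world pose (4x4 matrix from transforms.json);
# the translation is a hypothetical camera centre for illustration.
c2w = np.eye(4)
c2w[:3, 3] = [1.0, 2.0, 3.0]

# Colmap's convention is the inverse: world-to-camera rotation R and
# translation t (R is stored as a quaternion in images.bin/txt).
w2c = np.linalg.inv(c2w)
R, t = w2c[:3, :3], w2c[:3, 3]
print(t)  # [-1. -2. -3.] since R is the identity here
```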

The provided poses in the LERF dataset can sometimes be poor, so we ran Colmap ourselves to obtain more accurate trajectories. We then transformed the object bounding-box labels of the LERF dataset to our Colmap trajectories for evaluation. We also share our post-processed dataset, our pretrained weights, and results (relevancy maps corresponding to various queries) on this Hugging Face page. Please download the archive, then unzip and save it in the 'data/tidylerf' folder:

```shell
mkdir -p data/tidylerf
unzip fmgs_postprocessed_lerf_data_trained_weights.zip -d data/tidylerf
```

Running on LERF Dataset

*Inference for open-vocabulary object detection with trained scene representations:*

To render the feature maps and images and obtain the relevancy maps corresponding to the given open-vocabulary queries, run:

```shell
python ./render_lerf_relavancy_eval.py -s $sequence_folder -m $ckpts_path/${sequences[i]} --data_format colmap --eval_keyframe_path_filename $eval_path/${sequences[i]}/keyframes_reversed_transform2colmap.json --iteration ${iterations[iter]}
```

Or simply run the script:

```shell
bash ./scripts/run_eval_on_lerf.sh
```
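
For intuition about what a relevancy map contains: a LERF-style relevancy score (which FMGS follows for its queries) compares the rendered CLIP feature of a pixel against the query embedding and a set of canonical negative phrases via pairwise softmaxes. A minimal numpy sketch with toy embeddings and a hypothetical `relevancy_score` helper, not the repository's code:

```python
import numpy as np

def relevancy_score(feat, query, canonicals, temp=10.0):
    """Pairwise-softmax relevancy: min over canonical phrases of
    softmax([sim(feat, query), sim(feat, canonical)])[0]."""
    unit = lambda v: v / np.linalg.norm(v)
    f, q = unit(feat), unit(query)
    s_q = np.exp(temp * (f @ q))
    scores = [s_q / (s_q + np.exp(temp * (f @ unit(c)))) for c in canonicals]
    return min(scores)

# Toy 4-dim "embeddings": the rendered feature aligns with the query.
query = np.array([1.0, 0.0, 0.0, 0.0])
canon = [np.array([0.0, 1.0, 0.0, 0.0]), np.array([0.0, 0.0, 1.0, 0.0])]
feat = np.array([0.9, 0.1, 0.1, 0.0])
print(relevancy_score(feat, query, canon) > 0.5)  # True: pixel is relevant
```

In practice the canonical phrases are generic words like "object" or "stuff" embedded with CLIP's text encoder, and the score is computed per pixel of the rendered feature map.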

Train the scene representation:

We first train the vanilla Gaussian Splatting model by:

```shell
python train.py -s ${sequence_folder} --model_path $save_path --test_iterations 7000 30000 --save_iterations 7000 30000 --iterations 30000 --checkpoint_iterations 7000 30000 --port 6009
```

Then we start from the trained vanilla Gaussian Splatting checkpoint and embed the CLIP and DINO semantic features from foundation models:

```shell
python train.py -s ${sequence_folder} --model_path $save_path --opt_vlrenderfeat_from 30000 --test_iterations 32000 32500 --save_iterations 32000 32500 --iterations 32500 --checkpoint_iterations 32000 32500 --start_checkpoint $save_path/chkpnt30000.pth --fmap_resolution 2 --lambda_clip 0.2 --fmap_lr 0.005 --fmap_render_radiithre ${sequences_radiithre[i]} --port 6009
```

After training for several iterations, enable the `dotp_sim_loss_w` weight. Note that there is some randomness across different training trials.

```shell
python train.py -s ${sequence_folder} --model_path $save_path --opt_vlrenderfeat_from 30000 --test_iterations 33800 34000 34200 --save_iterations 33800 34000 34200 --iterations 34200 --checkpoint_iterations 33800 34000 34200 --start_checkpoint $save_path/chkpnt32500.pth --fmap_resolution 2 --lambda_clip 0.2 --fmap_lr 0.005 --fmap_render_radiithre ${sequences_radiithre[i]} --dotp_sim_loss_w 0.01 --port 6009
```
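
For intuition about the weight enabled above: the paper's pixel-alignment idea encourages pixels that are similar under the frozen DINO features to also have similar rendered features. A hypothetical numpy sketch of such a loss; the actual loss in train.py may be formulated differently:

```python
import numpy as np

def pixel_alignment_loss(rendered, dino, n_pairs=512, seed=0):
    """Sketch: for random pixel pairs, push the cosine similarity of
    rendered features toward the similarity of frozen DINO features."""
    rng = np.random.default_rng(seed)
    unit = lambda x: x / np.linalg.norm(x, axis=-1, keepdims=True)
    rf = unit(rendered.reshape(-1, rendered.shape[-1]))
    df = unit(dino.reshape(-1, dino.shape[-1]))
    i, j = rng.integers(0, len(rf), size=(2, n_pairs))
    s_rend = (rf[i] * rf[j]).sum(-1)   # rendered-feature similarities
    s_dino = (df[i] * df[j]).sum(-1)   # target similarities from DINO
    return np.abs(s_rend - s_dino).mean()

# When rendered features already reproduce the DINO similarities, loss is 0.
x = np.random.default_rng(1).normal(size=(8, 8, 16))
print(pixel_alignment_loss(x, x))  # 0.0
```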

Acknowledgement

Our code partially depends on 3D Gaussian Splatting and LERF; we thank the authors for their excellent contributions.

Owner

  • Name: Google Research
  • Login: google-research
  • Kind: organization
  • Location: Earth

Dependencies

environment.yml pypi
  • ftfy *
  • jaxtyping *
  • open_clip_torch *
  • regex *
  • third_party *
  • timm *
third_party/simple-diff-gaussian-rasterization/setup.py pypi