spatialfusion-lm
SpatialFusion-LM is a real-time spatial reasoning framework that combines neural depth, 3D reconstruction, and language-driven scene understanding.
Science Score: 26.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file: found codemeta.json file
- ✓ .zenodo.json file: found .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (11.7%) to scientific vocabulary
Keywords
Repository
SpatialFusion-LM is a real-time spatial reasoning framework that combines neural depth, 3D reconstruction, and language-driven scene understanding.
Basic Info
- Host: GitHub
- Owner: jagennath-hari
- License: gpl-3.0
- Language: Python
- Default Branch: main
- Homepage: https://github.com/jagennath-hari/SpatialFusion-LM
- Size: 84.9 MB
Statistics
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 0
Topics
Metadata Files
README.md
Replica scene office3 | TUM scene office
Tested Configuration
SpatialFusion-LM has been tested on:
- Ubuntu: 24.04
- GPU: NVIDIA RTX A6000
- CUDA: 12.8
- Environment: Docker container with GPU support
Other modern Ubuntu + CUDA setups may work, but this is the validated reference configuration.
A GPU with 24GB of VRAM is recommended to ensure stable real-time inference and efficient handling of high-resolution inputs across all components.
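To confirm the container actually sees the GPU before launching, a quick check from Python can help. The sketch below assumes PyTorch is installed in the image (the depth backbones rely on it); it is not a script shipped with this repository.

```python
# Quick sanity check (assumes PyTorch is available inside the container).
import torch

print(torch.__version__)              # PyTorch build
print(torch.version.cuda)             # CUDA version PyTorch was compiled against
print(torch.cuda.is_available())      # True if the GPU is visible to the container
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(props.name, f"{props.total_memory / 1024**3:.1f} GiB VRAM")
```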
Quick Start
- Clone the repo
```shell
git clone --recursive https://github.com/jagennath-hari/SpatialFusion-LM.git && cd SpatialFusion-LM
```
- Download model weights and sample dataset
```shell
bash scripts/download_weights.sh && bash scripts/download_sample.sh
```
- Run the demo inside Docker
```shell
bash run_container.sh ros2 launch llm_ros llm_demo.launch.py
```
Change modes using `mode:=mono/mono+/stereo`
Supported Modes
| Mode | Input | Depth Estimator | Use Case |
|:-------------|:-------:|:---------------------------------------------------------------------------:|---------------------:|
| mono | RGB Only | UniK3D (ViT-L) | Uncalibrated monocular |
| mono+ | RGB + camera intrinsics | UniK3D (ViT-L) | Calibrated monocular |
| stereo | Rectified left + right + intrinsics + baseline | FoundationStereo (ViT-S) | Accurate stereo depth |
Overview
Features
- Supports monocular, monocular+ and stereo vision
- Neural depth estimation with metric 3D reconstruction
- Differentiable point cloud generation in the camera frame (see the back-projection sketch below)
- Language-conditioned spatial layout prediction
- Modular ROS 2 architecture (plug-and-play components)
- Real-time inference and visualization
- Integrated logging via Rerun
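The metric-depth and camera-frame point cloud features above boil down to pinhole back-projection of a depth map. The following is a minimal, illustrative NumPy sketch of that step, not the project's differentiable implementation; the function name and intrinsics are made up.

```python
# Illustrative only: back-project a metric depth map into camera-frame points
# using a pinhole model. Not the project's implementation.
import numpy as np

def depth_to_points(depth: np.ndarray, fx: float, fy: float, cx: float, cy: float) -> np.ndarray:
    """depth: (H, W) metric depth in meters -> (H*W, 3) points in the camera frame."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel grid
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)

# Example with a dummy depth map and made-up intrinsics.
pts = depth_to_points(np.full((480, 640), 2.0, dtype=np.float32),
                      fx=525.0, fy=525.0, cx=319.5, cy=239.5)
print(pts.shape)  # (307200, 3)
```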
TODO
- [ ] Create end-to-end inference using Triton Inference Server via ensemble models
- [ ] Quantize UniK3D and FoundationStereo to TensorRT engines
- [ ] Fix UniK3D bug in the CUDA memory stream causing an async error while running SpatialLM in parallel
- [x] (2025-04-24) Expand evaluation to TUM, Replica, and with one custom recorded dataset
Download TUM and Replica ROS 2 datasets
This script will prompt you to select one or more datasets to download:
```shell
bash scripts/download_dataset.sh
```
Launch Configuration Options
The llm_demo.launch.py file accepts the following arguments:
| Argument | Type | Description | Default |
|:-------------|:-------:|:---------------------------------------------------------------------------:|---------------------:|
| mode | string | Input mode: mono, mono+, or stereo | stereo |
| spatialLM | bool | Enable or disable layout prediction via SpatialLM | true |
| rerun | bool | Enable or disable logging to Rerun | true |
| rviz | bool | Enable or disable RVIZ visualization | true |
Mono, Mono+, Stereo?
```
            +---------------------------------+
            |             mode=?              |
            +---------------------------------+
              |              |              |
            mono           mono+          stereo
             rgb        rgb + camera   camera intr.
                           intr.       + stereo pair
                                       + baseline
```
Mode Descriptions
mono: Only an RGB image is provided.
UniK3D internally estimates camera intrinsics and uses them to predict metric (absolute) depth.
While this enables 3D reconstruction without calibration, the accuracy depends on the quality of intrinsic estimation.
Suitable for quick deployment or uncalibrated cameras.

mono+: RGB image and accurate camera intrinsics are provided.
UniK3D uses the supplied intrinsics to produce more accurate metric depth, with better scale alignment.
Ideal for calibrated cameras (e.g., using /camera_info).

stereo: Left and right rectified images, intrinsics, and baseline are required.
FoundationStereo uses a ViT-based architecture to predict dense disparity maps from stereo pairs.
Metric depth is then computed using the stereo baseline and intrinsics, and converted to a 3D point cloud.
This mode provides the most robust and accurate depth, especially in structured or texture-rich environments.
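For reference, the disparity-to-depth conversion mentioned above is the standard pinhole-stereo relation depth = fx * baseline / disparity. Below is a minimal illustrative sketch of that conversion, not the project's code; the function name is made up.

```python
# Illustrative only: convert a dense disparity map (pixels) to metric depth (meters).
import numpy as np

def disparity_to_depth(disparity: np.ndarray, fx: float, baseline_m: float) -> np.ndarray:
    depth = np.zeros_like(disparity, dtype=np.float32)
    valid = disparity > 0                       # zero disparity means no match
    depth[valid] = fx * baseline_m / disparity[valid]
    return depth
```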
Demo Gallery
Below are example configurations showing how SpatialFusion-LM behaves with different launch options.
Mono SpatialLM Disabled (mono, rerun)
```shell
ros2 launch llm_ros llm_demo.launch.py mode:=mono spatialLM:=false rerun:=true rviz:=false
```
SpatialFusion-LM performing monocular estimation and 3D reconstruction on TUM scene xyz.
Stereo SpatialLM Disabled (stereo, rerun)
```shell
ros2 launch llm_ros llm_demo.launch.py mode:=stereo spatialLM:=false rerun:=true rviz:=false
```
SpatialFusion-LM performing stereo estimation and 3D reconstruction on indoor scene indoor_0.
Run with TUM Dataset
SpatialFusion-LM supports pre-recorded ROS 2 bags from the TUM RGB-D dataset. The llm_demo_tum.launch.py launch file is preconfigured and supports mono or mono+ modes depending on the available intrinsics.
```shell
ros2 launch llm_ros llm_demo_tum.launch.py \
    mode:=mono+ \
    bag_path:=/datasets/tum_office \
    spatialLM:=true \
    rerun:=true \
    rviz:=true
```
This assumes you have already downloaded the ROS 2 TUM dataset. If not, you can use the provided script scripts/download_dataset.sh to do this.
TUM scene office (Mono+) | TUM scene desk (Mono)
Run with Replica Dataset
SpatialFusion-LM supports pre-recorded ROS 2 bags from the Replica dataset. The llm_demo_replica.launch.py launch file is preconfigured and supports mono or mono+ modes depending on the available intrinsics.
```shell
ros2 launch llm_ros llm_demo_replica.launch.py \
    mode:=mono+ \
    bag_path:=/datasets/replica_office2 \
    spatialLM:=true \
    rerun:=true \
    rviz:=true
```
This assumes you have already downloaded the ROS 2 Replica dataset. If not, you can use the provided script scripts/download_dataset.sh to do this.
Replica scene office2 (Mono+) | Replica scene room0 (Mono)
Using SpatialFusion-LM with Your Own ROS 2 Topics
To run SpatialFusion-LM on a live ROS 2 system or your own dataset:
1. Use llm.launch.py for direct topic-level control
This version of the launch file allows you to specify raw topic names directly (no bag playback or auto setup). Examples:
This simulates mode:=mono as there is no rgb_info provided.
```shell
ros2 launch llm_ros llm.launch.py \
    rgb_image:=/your_camera/image_rect \
    rerun:=true \
    spatialLM:=true
```
This simulates mode:=mono+ as rgb_info is provided.
```shell
ros2 launch llm_ros llm.launch.py \
    rgb_image:=/your_camera/image_rect \
    rgb_info:=/your_camera/camera_info \
    rerun:=true \
    spatialLM:=true
```
This simulates mode:=stereo as the left and right image topics, left_info and right_info, and the baseline are provided.
```shell
ros2 launch llm_ros llm.launch.py \
    left_image:=/stereo/left/image_rect \
    right_image:=/stereo/right/image_rect \
    left_info:=/stereo/left/camera_info \
    right_info:=/stereo/right/camera_info \
    baseline:=0.12 \
    rerun:=true \
    spatialLM:=true
```
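If the baseline of your rig is not known offhand, it can usually be recovered from the right camera's rectified projection matrix, where P[0,3] = -fx * baseline by the common ROS convention. The one-off helper below is an illustrative sketch under that assumption and is not part of this package; the topic name is an example.

```python
# One-off sketch: recover the stereo baseline (meters) from the right camera's
# CameraInfo, assuming the usual ROS rectified-stereo convention P[0,3] = -fx * baseline.
import rclpy
from rclpy.node import Node
from sensor_msgs.msg import CameraInfo


def on_info(msg: CameraInfo, node: Node):
    fx, tx = msg.p[0], msg.p[3]              # p is the row-major 3x4 projection matrix
    node.get_logger().info(f"baseline ~= {-tx / fx:.4f} m")


rclpy.init()
node = Node("baseline_printer")
node.create_subscription(CameraInfo, "/stereo/right/camera_info",
                         lambda m: on_info(m, node), 1)
rclpy.spin_once(node, timeout_sec=5.0)        # process at most one message, then exit
node.destroy_node()
rclpy.shutdown()
```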
2. Parameter Descriptions
```shell
ros2 launch llm_ros llm.launch.py -s
```
|Parameter | Description | Default | ROS 2 Msg Type |
|:----------|:-----------:|:-------:|--------------:|
|rgb_image | RGB image topic | '' | sensor_msgs/msg/Image |
|rgb_info | RGB camera info topic | '' | sensor_msgs/msg/CameraInfo |
|left_image | Left stereo image topic | '' | sensor_msgs/msg/Image |
|right_image | Right stereo image topic | '' | sensor_msgs/msg/Image |
|left_info | Left camera info topic | '' | sensor_msgs/msg/CameraInfo |
|right_info | Right camera info topic | '' | sensor_msgs/msg/CameraInfo |
|baseline | Stereo camera baseline (in meters) | 0.0 | float (launch param) |
|rerun | Enable Rerun logging | true | bool (launch param) |
|spatialLM | Enable 3D layout prediction via SpatialLM | true | bool (launch param) |
Output Topics
These are the outputs published by the core_node.py:
|Topic|Description|ROS 2 Msg Type|
|:----|:---------:|-----------:|
|/spatialLM/depth| Predicted depth map (1-channel float)| sensor_msgs/msg/Image|
|/spatialLM/cloud| Reconstructed 3D point cloud| sensor_msgs/msg/PointCloud2|
|/spatialLM/image| RGB image with projected 3D layout| sensor_msgs/msg/Image|
|/spatialLM/boxes| Predicted 3D layout objects (e.g., boxes)| visualization_msgs/msg/MarkerArray|
|/tf| Transform tree (e.g., map → camera)| tf2_msgs/msg/TFMessage|
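As a starting point for consuming these outputs in your own node, a minimal rclpy subscriber could look like the sketch below. Topic names and message types are taken from the table above; this listener is illustrative and not part of the package.

```python
# Minimal consumer sketch for the published outputs (names/types as in the table above).
import rclpy
from rclpy.node import Node
from sensor_msgs.msg import Image, PointCloud2
from visualization_msgs.msg import MarkerArray


class SpatialLMListener(Node):
    def __init__(self):
        super().__init__("spatiallm_listener")
        self.create_subscription(Image, "/spatialLM/depth", self.on_depth, 10)
        self.create_subscription(PointCloud2, "/spatialLM/cloud", self.on_cloud, 10)
        self.create_subscription(MarkerArray, "/spatialLM/boxes", self.on_boxes, 10)

    def on_depth(self, msg: Image):
        self.get_logger().info(f"depth {msg.width}x{msg.height} ({msg.encoding})")

    def on_cloud(self, msg: PointCloud2):
        self.get_logger().info(f"cloud with {msg.width * msg.height} points")

    def on_boxes(self, msg: MarkerArray):
        self.get_logger().info(f"{len(msg.markers)} layout markers")


rclpy.init()
rclpy.spin(SpatialLMListener())
```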
Depth Comparison

Comparison of predicted depth maps from stereo, mono+, and mono modes respectively.
Performance Benchmarks
Measured on a single NVIDIA RTX A6000 (48GB VRAM) at 1920×1080 resolution. Input images are automatically resized to the model-specific inference resolution, and values may vary depending on hardware, backend load, and ROS 2 message overhead.
Measured using Headless mode without Rerun or RVIZ.
| Mode | Inference Time (ms) | Avg FPS | VRAM Used | Backbone |
|:-------:|:-------------------:|:-------:|:---------:|:--------:|
| Mono | 169.6 | 5.85 | 4440 MiB | UniK3D (ViT-L) |
| Mono+ | 126.1 | 7.87 | 4460 MiB | UniK3D (ViT-L) |
| Stereo | 292.5 | 3.35 | 2126 MiB | FoundationStereo (ViT-S) |
A significant amount of VRAM (~8192 MiB) is needed when spatialLM:=true.
Contributing
I welcome pull requests and suggestions! If you want to add a new dataset, model backend, or visualization utility, open an issue or fork this repo!
If you find any bug in the code, please report it to jh7454@nyu.edu
Citation
If you found this code/work to be useful in your own research, please consider citing the following:
```bibtex
@article{wen2025stereo,
  title={FoundationStereo: Zero-Shot Stereo Matching},
  author={Bowen Wen and Matthew Trepte and Joseph Aribido and Jan Kautz and Orazio Gallo and Stan Birchfield},
  journal={CVPR},
  year={2025}
}
```
```bibtex
@inproceedings{piccinelli2025unik3d,
  title = {{U}ni{K3D}: Universal Camera Monocular 3D Estimation},
  author = {Piccinelli, Luigi and Sakaridis, Christos and Segu, Mattia and Yang, Yung-Hsu and Li, Siyuan and Abbeloos, Wim and Van Gool, Luc},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year = {2025}
}
```
```bibtex
@misc{spatiallm,
  title = {SpatialLM: Large Language Model for Spatial Understanding},
  author = {ManyCore Research Team},
  howpublished = {\url{https://github.com/manycore-research/SpatialLM}},
  year = {2025}
}
```
License
This software is released under the GNU General Public License v3.0 (GPL-3.0). You are free to use, modify, and distribute this code under the terms of the license, but derivative works must also be open-sourced under GPL-3.0.
Acknowledgement
This work integrates several powerful research papers, libraries, and open-source tools.
Owner
- Name: Jagennath Hari
- Login: jagennath-hari
- Kind: user
- Website: https://www.linkedin.com/in/jagennath-hari/
- Repositories: 2
- Profile: https://github.com/jagennath-hari
GitHub Events
Total
- Watch event: 7
- Push event: 25
- Public event: 1
Last Year
- Watch event: 7
- Push event: 25
- Public event: 1
Dependencies
- nvidia/cuda 12.8.0-devel-ubuntu22.04 build




