Science Score: 54.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: arxiv.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (10.9%) to scientific vocabulary
Last synced: 6 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: jungletada
  • License: apache-2.0
  • Language: Python
  • Default Branch: master
  • Size: 62.5 KB
Statistics
  • Stars: 0
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created 10 months ago · Last pushed 10 months ago
Metadata Files
Readme License Citation

README.md

Exploiting Diffusion Prior for Generalizable Dense Prediction

Hsin-Ying Lee, Hung-Yu Tseng, Hsin-Ying Lee, Ming-Hsuan Yang

CVPR 2024

Project Page | arXiv

This is the official implementation of Exploiting Diffusion Prior for Generalizable Dense Prediction.

Quick Start

Installation

Our implementation is based on Python 3.10 and CUDA 11.3.

Required

```
diffusers==0.20.0
pytorch==1.12.1
torchvision==0.13.1
transformers==4.31.0
```

Optional

```shell
accelerate  # for training
gradio      # for demo
omegaconf   # for configuration
xformers    # for acceleration
```

Checkpoints

We provide the model weights of five tasks for reproducing the results in the paper. These checkpoints are trained with 10K synthesized bedroom images, prompts, and pseudo ground truths.

In addition, for normal and depth prediction, we provide weights trained on more diverse scenes and without prompts, which are better suited to practical use cases.

Download the weights from this Google Drive link and place them in the root directory.

Run

For checkpoints with -notext, set disable_prompts=True.

```python
from PIL import Image
from pipeline import Pipeline

LORA_DIR = 'ckpt/normal-scene100-notext'
disable_prompts = LORA_DIR.endswith('-notext')
ppl = Pipeline(
    disable_prompts=disable_prompts,
    lora_ckpt=LORA_DIR,
    device='cuda',
    mixed_precision='fp16',
)
img = Image.open('/path/to/img')
```
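The `-notext` naming rule above can be captured in a small helper; this is an illustrative sketch, not part of the repo (`needs_disable_prompts` is a hypothetical name):

```python
# Hypothetical helper: per the note above, prompts must be disabled for
# checkpoints whose directory names end in '-notext'.
def needs_disable_prompts(ckpt_dir):
    return ckpt_dir.rstrip('/').endswith('-notext')
```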

For depth prediction,

```python
output_np_array = ppl(img, inference_step=5, target_mode='F')
```

Otherwise,

```python
output_pil_img = ppl(img, inference_step=5, target_mode='RGB')
```
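Depth outputs returned with `target_mode='F'` are floating-point arrays, so they need normalizing before viewing. A minimal pure-Python sketch (operating on nested lists for illustration; the actual output is a NumPy array):

```python
# Illustrative only: scale a float depth map (nested lists standing in for
# the returned array) to 0-255 integers for visualization.
def normalize_depth(depth, lo=None, hi=None):
    flat = [v for row in depth for v in row]
    lo = min(flat) if lo is None else lo
    hi = max(flat) if hi is None else hi
    scale = 255.0 / (hi - lo) if hi > lo else 0.0
    return [[int(round((v - lo) * scale)) for v in row] for row in depth]
```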

Alternatively, we provide a Gradio demo. You can launch it with

```shell
python app.py
```

and access the app at localhost:7860.


Data Generation

Images

We conduct our experiments with synthetic images so that we can control and analyze performance across different data domains. We first generate prompts from scene keywords, then generate images from those prompts.

To generate prompts,

```shell
python tools/gencap.py KEYWORD -n NUMBER_OF_PROMPTS -o OUTPUT_TXT
```

KEYWORD can be a single word or a text file containing multiple words, one per line.

To generate images,

```shell
python tools/txt2img.py --from-file PROMPTS_TXT --output OUTPUT_DIR --batch-size BSZ
```

These two scripts are thin wrappers around Hugging Face's transformers and diffusers.

Then make a meta file to record the images and prompts. Prompts are not necessary if you set --disable_prompts (see the Training section).

```shell
python tools/makemeta.py --imgs IMAGE_DIR [--captions PROMPTS]
```

It collects the png and jpg files in IMAGE_DIR, sorts them by file name, and generates a metadata.jsonl in IMAGE_DIR in the same format as Hugging Face's ImageFolder. If prompts are provided, they should be in the same order as the sorted file names.
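The metadata layout described above can be sketched as follows; the helper name and field names (`file_name`, `text`) follow Hugging Face's ImageFolder convention, but treat this as an illustration rather than the repo's actual makemeta.py:

```python
import json

# Illustrative sketch of the metadata.jsonl layout: one JSON object per
# image, in ImageFolder style. Captions are assumed to already be in
# sorted-file-name order, as the text above requires.
def write_meta(image_names, captions=None, path='metadata.jsonl'):
    names = sorted(image_names)          # sorted by file name
    with open(path, 'w') as f:
        for i, name in enumerate(names):
            record = {'file_name': name}
            if captions is not None:
                record['text'] = captions[i]
            f.write(json.dumps(record) + '\n')
```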

Pseudo Ground Truths

Then we generate pseudo ground truths with the following code bases.

For normals, albedo, and shading, clone the corresponding repos, set up their environments, and put getnorm.py and getintr.py in each directory.

For depths, getdepth.py can be run in isolation.

```shell
python tools/get{norm,depth,intr}.py -i INPUT_IMG_DIR -o OUTPUT_DIR
```

These scripts store the predictions in lmdb by default. The keys are the file names without extensions; albedo and shading outputs get an extra -r (reflectance) or -s (shading) suffix. Use --save-files to save the outputs as individual files instead.
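The key scheme above can be sketched as a small helper (hypothetical, not from the repo):

```python
import os

# Hypothetical helper mirroring the key scheme above: file name without
# extension, plus '-r' (reflectance) or '-s' (shading) for intrinsics.
def lmdb_key(filename, component=None):
    stem = os.path.splitext(os.path.basename(filename))[0]
    return stem if component is None else f'{stem}-{component}'
```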

For semantic segmentation, generate segmentation maps with eva02_L_ade_seg_upernet_sz512 in EVA-02.

Detailed instructions:

1. Download `eva02_L_ade_seg_upernet_sz512.pth`.
2. Make a directory with an arbitrary name, e.g. `dummy`, and make another directory named `images` under it.

   ```shell
   mkdir -p dummy/images
   ```

3. Link the directory of input images as `validation` in `dummy/images`:

   ```shell
   ln -s INPUT_IMG_DIR dummy/images/validation
   ```

4. Modify `data_root` at line 3 in `configs/_base_/datasets/ade20k.py` to be `dummy`.
5. Run `test.py` in the [EVA-02 repo](https://github.com/baaivision/EVA/tree/master/EVA-02/seg) with

   ```shell
   python test.py \
       configs/eva02/upernet/upernet_eva02_large_24_512_slide_80k.py \
       eva02_L_ade_seg_upernet_sz512.pth \
       --show-dir OUTPUT_DIR \
       --opacity 1
   ```

6. Convert the segmentation maps with a better color mapping for classes commonly seen in bedrooms:

   ```shell
   python tools/color2cls.py INPUT_DIR OUTPUT_DIR --pal 1 --ext png
   python tools/cls2color.py INPUT_DIR OUTPUT_DIR --pal 2
   ```

Then collect the segmentation maps in lmdb.

```shell
python tools/makedb.py INPUT_DIR OUTPUT_DB
```

Training

To reproduce the trained models, the following script is the basic setting for all tasks. It is adapted from an example provided by Hugging Face.

```bash
accelerate launch \
    --num_processes=2 --mixed_precision="fp16" train.py \
    --train_batch_size=8 \
    --max_train_steps=50000 \
    --learning_rate=1e-04 \
    --lr_scheduler="cosine" \
    --lr_warmup_steps=0 \
    --prediction_type="v_prediction"
```

Additionally, for depths, set --target_mode=F and --target_scale=8.

For depths, albedo, shading, and segmentation, set --random_flip.

For albedo, set --target_extra_key=r.

For shading, set --target_extra_key=s.

To add and train LoRA on self-attention layers only, set --self_attn_only.

To disable prompts, set --disable_prompts.

To enable xformers, set --enable_xformers_memory_efficient_attention.
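The per-task flag rules above can be collected in one place; this is an illustrative summary (the `TASK_FLAGS` table and `train_args` helper are not part of the repo):

```python
# Illustrative summary of the per-task train.py flags listed above.
TASK_FLAGS = {
    'normal':       [],
    'depth':        ['--target_mode=F', '--target_scale=8', '--random_flip'],
    'albedo':       ['--target_extra_key=r', '--random_flip'],
    'shading':      ['--target_extra_key=s', '--random_flip'],
    'segmentation': ['--random_flip'],
}

def train_args(task, disable_prompts=False, self_attn_only=False):
    args = list(TASK_FLAGS[task])
    if disable_prompts:
        args.append('--disable_prompts')
    if self_attn_only:
        args.append('--self_attn_only')
    return args
```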

Inference

To generate predictions, run infer.py with the same options you used for train.py.

```shell
DATA_DIR="/path/to/source/images"   # optional
PROMPTS="/path/to/prompts.txt"      # optional
LORA_DIR="/path/to/train/output"
OUTPUT_DIR="/path/to/output"

python infer.py \
    --src $DATA_DIR \
    --prompts $PROMPTS \
    --lora-ckpt $LORA_DIR \
    --output $OUTPUT_DIR \
    --config config.yaml \
    --batch-size 4
```

For depths, set --target-mode=F and --target-scale=8. This generates depth maps and saves them in NumPy's compressed npz format under the key x.

Optionally set --target-pred-type, --self-attn-only, and --disable-prompts to match training. If you don't provide --src, it will generate images with the original (no LoRA) model from --prompts. If you don't set --disable-prompts but forget to provide --prompts, it will raise an error.

More settings for the generation process such as the number of generation steps and guidance scales are in config.yaml.

Besides, in the paper we construct the samples of previous diffusion steps from the input images and the estimated output predictions, but we empirically found that using the original DDIM, which estimates both input images and output predictions, gives slightly worse in-domain performance but slightly better generalizability; the difference is small, though. The results in the paper were generated by the original DDIM. Set --use-oracle-ddim to use exactly the same generation process as the paper.

Also note that infer.py options use hyphens (-), not underscores (_).
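As an illustration of the hyphen/underscore convention, a hypothetical helper that converts a train.py-style flag name into its infer.py counterpart:

```python
# Hypothetical convenience: train.py options use underscores, infer.py uses
# hyphens; convert the flag name while leaving any '=value' part untouched.
def to_infer_flag(train_flag):
    name, sep, value = train_flag.partition('=')
    return name.replace('_', '-') + sep + value
```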

Evaluation

The evaluation script runs on GPU. For normals,

```shell
python test/eval.py PRED GROUND_TRUTH --metrics l1 angular
```

For depths,

```shell
python test/eval.py PRED GROUND_TRUTH --metrics rel delta --ext npz --abs
python test/eval.py PRED GROUND_TRUTH --metrics rmse --ext npz --abs --norm
```
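For reference, the standard definitions of the rel and delta depth metrics can be sketched as follows (illustrative only; eval.py's exact implementation may differ in masking and scaling):

```python
# Illustrative textbook definitions of two depth metrics over flat lists
# of predicted and ground-truth depths.
def rel_error(pred, gt):
    """Mean absolute relative error."""
    return sum(abs(p - g) / g for p, g in zip(pred, gt)) / len(gt)

def delta_accuracy(pred, gt, thr=1.25):
    """Fraction of pixels with max(p/g, g/p) below the threshold."""
    return sum(1 for p, g in zip(pred, gt) if max(p / g, g / p) < thr) / len(gt)
```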

For albedo and shading,

```shell
python test/eval.py PRED GROUND_TRUTH --metrics mse
```

For segmentation, turn output images into class maps.

```shell
python tools/color2cls.py INPUT_DIR OUTPUT_DIR --pal 2 --ext npy --filter
```

Then calculate mIoU.

```shell
python test/miou.py PRED GROUND_TRUTH
```

The mIoU evaluation is borrowed from MMSegmentation.
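For reference, mIoU is the per-class intersection-over-union averaged over the classes present; a minimal sketch on nested-list class maps (illustrative, not the MMSegmentation implementation):

```python
# Illustrative mIoU over integer class maps given as nested lists: per-class
# IoU, averaged over classes that appear in prediction or ground truth.
def miou(pred, gt, num_classes):
    p = [v for row in pred for v in row]
    g = [v for row in gt for v in row]
    ious = []
    for c in range(num_classes):
        inter = sum(1 for a, b in zip(p, g) if a == c and b == c)
        union = sum(1 for a, b in zip(p, g) if a == c or b == c)
        if union:
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0
```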

Acknowledgement

This repo contains code from diffusers and MMSegmentation.

Citation

```bibtex
@InProceedings{lee2024dmp,
    author    = {Lee, Hsin-Ying and Tseng, Hung-Yu and Lee, Hsin-Ying and Yang, Ming-Hsuan},
    title     = {Exploiting Diffusion Prior for Generalizable Dense Prediction},
    booktitle = {Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
    year      = {2024},
}
```

Owner

  • Name: DINGJIE PENG
  • Login: jungletada
  • Kind: user
  • Location: Tokyo, Japan

Citation (CITATION.cff)

cff-version: 1.2.0
title: >-
  Exploiting Diffusion Prior for Generalizable Dense Prediction
message: >-
  If you find this work interesting, please cite it as below.
authors:
  - given-names: Hsin-Ying
    family-names: Lee
    affiliation: UC Merced
  - given-names: Hung-Yu
    family-names: Tseng
    affiliation: Meta
  - given-names: Hsin-Ying
    family-names: Lee
    affiliation: Snap
  - given-names: Ming-Hsuan
    family-names: Yang
    affiliation: UC Merced
preferred-citation:
  type: conference-paper
  title: Exploiting Diffusion Prior for Generalizable Dense Prediction
  authors:
  - given-names: Hsin-Ying
    family-names: Lee
    affiliation: UC Merced
  - given-names: Hung-Yu
    family-names: Tseng
    affiliation: Meta
  - given-names: Hsin-Ying
    family-names: Lee
    affiliation: Snap
  - given-names: Ming-Hsuan
    family-names: Yang
    affiliation: UC Merced
  year: 2024
  collection-title: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

GitHub Events

Total
  • Push event: 2
  • Create event: 2
Last Year
  • Push event: 2
  • Create event: 2