ffcv-pl

[FFCV-PL] manage fast data loading with ffcv and pytorch lightning

https://github.com/serezd/ffcv_pytorch_lightning

Science Score: 57.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓
CITATION.cff file
Found CITATION.cff file
✓
codemeta.json file
Found codemeta.json file
✓
.zenodo.json file
Found .zenodo.json file
✓
DOI references
Found 1 DOI reference(s) in README
○
Academic publication links
○
Committers with academic emails
○
Institutional organization owner
○
JOSS paper metadata
○
Scientific vocabulary similarity
Low similarity (13.2%) to scientific vocabulary

Keywords

dataloader ffcv pytorch pytorch-lightning

Last synced: 6 months ago

Repository

[FFCV-PL] manage fast data loading with ffcv and pytorch lightning

Basic Info

Host: GitHub
Owner: SerezD
License: mit
Language: Python
Default Branch: master
Homepage:
Size: 64.5 KB

Statistics

Stars: 15
Watchers: 1
Forks: 0
Open Issues: 0
Releases: 0

Topics

dataloader ffcv pytorch pytorch-lightning

Created about 3 years ago · Last pushed over 2 years ago

Metadata Files

Readme License Citation

FFCV Dataloader with Pytorch Lightning

FFCV is a fast data loader for neural network training: https://github.com/libffcv/ffcv

This repository presents all the steps needed to install FFCV and configure it with pytorch-lightning.
The idea is to provide generic methods and utilities while leaving every configuration choice to the user.

Installation

Tested with:

- Ubuntu 22.04.2 LTS
- python 3.11
- ffcv==1.0.2
- pytorch==2.0.1
- pytorch-lightning==2.0.4

Dependencies

You can install the dependencies (FFCV, Pytorch) with the provided environment.yml file: run `conda env create --file environment.yml`, then `conda activate ffcv-pl`.
This should create a conda environment named ffcv-pl.

Note: Modify the pytorch-cuda version to the one compatible with your system.

Note: Solving the environment can take quite a long time. I suggest using the libmamba solver to speed up the process.

If the above does not work, then another option is manual installation:

1. Create the conda environment:

```
conda create --name ffcv-pl
conda activate ffcv-pl
```

2. Install pytorch according to the official website:

```
# in my environment the command is the following
conda install pytorch torchvision torchaudio pytorch-cuda=[your-version] -c pytorch -c nvidia
```

3. Install the ffcv dependencies and pytorch-lightning:

```
# can take some time for solving, but should not create conflicts
conda install cupy pkg-config libjpeg-turbo">=2.1.4" opencv numba pytorch-lightning">=2.0.0" -c pytorch -c conda-forge
```

4. Install ffcv:

```
pip install ffcv
```

For further help, check out FFCV installation guidelines: ffcv official page

Package

Once dependencies are installed, it is safe to install the package: pip install ffcv_pl
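After installation, it can be useful to confirm that the environment exposes the packages used in this README. The snippet below is a minimal sketch and is not part of ffcv_pl: the helper name `missing_modules` is hypothetical, and the import names in `required` (in particular `ffcv_pl` for the pip package of the same name) are assumptions.

```python
from importlib.util import find_spec


def missing_modules(names):
    """Return the subset of top-level module names that cannot be found."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    # Import names assumed from the packages installed above.
    required = ["torch", "pytorch_lightning", "ffcv", "ffcv_pl"]
    missing = missing_modules(required)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules were found.")
```

The check only looks up module specs and does not import anything, so it stays fast and does not need a GPU.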

Dataset Creation

You need to save your dataset in ffcv format (.beton).
Official FFCV docs.

This package provides the create_beton_wrapper method, which makes it easy to create a .beton dataset from a torch dataset.

Example from the dataset_creation.py script:

```
from ffcv.fields import RGBImageField

from ffcv_pl.generate_dataset import create_beton_wrapper
from torch.utils.data.dataset import Dataset
import numpy as np
from PIL import Image


class ToyImageLabelDataset(Dataset):

    def __init__(self, n_samples: int):
        self.samples = [Image.fromarray((np.random.rand(32, 32, 3) * 255).astype('uint8')).convert('RGB')
                        for _ in range(n_samples)]

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        return (self.samples[idx], int(idx))


def main():

    # 1. Instantiate the torch dataset that you want to create
    # Important: the dataset's __getitem__ must return tuples! (This depends on FFCV library)
    image_label_dataset = ToyImageLabelDataset(n_samples=256)

    # 2. Optional: create Field objects.
    # here overwrites only RGBImageField, leave default IntField.
    fields = (RGBImageField(write_mode='jpg', max_resolution=32), None)

    # 3. call the method, and it will automatically create the .beton dataset for you.
    create_beton_wrapper(image_label_dataset, "./data/image_label.beton", fields)


if __name__ == '__main__':

    main()
```
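Because the dataset must return tuples, a dataset whose `__getitem__` returns dicts needs a thin adapter before it can be converted. The following is a minimal sketch of such an adapter in plain Python. The class name `TupleView` and the keys `"image"` and `"label"` are hypothetical, and whether create_beton_wrapper requires an actual `torch.utils.data.Dataset` subclass is not checked here; in practice you would likely inherit from `Dataset`.

```python
class TupleView:
    """Expose a dict-returning, map-style dataset as a tuple-returning one."""

    def __init__(self, dataset, keys):
        self.dataset = dataset
        self.keys = tuple(keys)  # fixed field order, e.g. ("image", "label")

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, idx):
        sample = self.dataset[idx]
        # Emit the fields in the declared order, one tuple entry per field.
        return tuple(sample[key] for key in self.keys)


# Toy stand-in for a dataset whose __getitem__ returns dicts.
dict_samples = [{"image": f"img_{i}", "label": i} for i in range(4)]
tuple_dataset = TupleView(dict_samples, keys=("image", "label"))
print(tuple_dataset[2])  # ('img_2', 2)
```

The order of `keys` matters: it should match the order of the Field objects passed when creating the .beton file.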

Dataloader and Datamodule

Merge the PL Datamodule with the FFCV Loader object.
Official FFCV Loader docs.
Official Pytorch-Lightning DataModule docs.

In main.py, a complete example shows how to use the FFCVDataModule class and train a Lightning model.

The main steps to follow are:

1. Create an FFCVPipelineManager object, which needs the path to a previously created .beton file, a list of operations to perform on each item returned by your dataset, and an ordering option for loading.
2. Create the FFCVDataModule object, which is a Lightning DataModule backed by the FFCV Loader.
3. Pass the data module to the Pytorch Lightning trainer, and run!

Suggestion: read the FFCV performance guide to better understand which options fit your needs.
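The "list of operations to perform on each item" in step 1 holds one pipeline per field of the tuple returned by the dataset, and a `None` entry falls back to a default pipeline. The sketch below only illustrates that idea in plain Python. It is not how FFCV or FFCVPipelineManager work internally, and the names `apply_pipelines`, `image_ops` and `label_defaults` are hypothetical.

```python
from typing import Callable, Optional, Sequence

Pipeline = Optional[Sequence[Callable]]


def apply_pipelines(sample: tuple, pipelines: Sequence[Pipeline],
                    defaults: Sequence[Sequence[Callable]]) -> tuple:
    """Run each field of `sample` through its own list of operations."""
    if not (len(sample) == len(pipelines) == len(defaults)):
        raise ValueError("need exactly one pipeline per field")
    out = []
    for value, pipeline, default in zip(sample, pipelines, defaults):
        ops = default if pipeline is None else pipeline  # None -> default
        for op in ops:
            value = op(value)
        out.append(value)
    return tuple(out)


# Toy sample: (pixel values, label). Real pipelines hold FFCV decoders and transforms.
sample = ([0, 128, 255], "7")
image_ops = [lambda px: [p / 255 for p in px]]  # scale pixel values to [0, 1]
label_defaults = [int]                          # stand-in default for the label field
result = apply_pipelines(sample, [image_ops, None], [[], label_defaults])
print(result)
```

Here the image field uses an explicit pipeline, while the label field passes `None` and therefore runs through the default operations.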

Complete Example from the main.py script:

```
import pytorch_lightning as pl
import torch
from ffcv.fields.basics import IntDecoder
from ffcv.fields.rgb_image import RandomResizedCropRGBImageDecoder, CenterCropRGBImageDecoder
from ffcv.loader import OrderOption
from ffcv.transforms import ToTensor, ToTorchImage
from pytorch_lightning.strategies.ddp import DDPStrategy

from torch import nn
from torch.optim import Adam
from torchvision.transforms import RandomHorizontalFlip

from ffcv_pl.data_loading import FFCVDataModule
from ffcv_pl.ffcv_utils.augmentations import DivideImage255

from ffcv_pl.ffcv_utils.utils import FFCVPipelineManager


# define the LightningModule
class LitAutoEncoder(pl.LightningModule):

    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(32 * 32 * 3, 64), nn.ReLU(), nn.Linear(64, 3))
        self.decoder = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 32 * 32 * 3))

    def training_step(self, batch, batch_idx):

        x = batch[0]

        b, c, h, w = x.shape
        x = x.reshape(b, -1)
        z = self.encoder(x)
        x_hat = self.decoder(z)
        loss = nn.functional.mse_loss(x_hat, x)

        # Logging to TensorBoard by default
        self.log("train_loss", loss)
        return loss

    def validation_step(self, batch, batch_idx):
        pass

    def configure_optimizers(self):
        optimizer = Adam(self.parameters(), lr=1e-3)
        return optimizer


def main():

    seed = 1234

    pl.seed_everything(seed, workers=True)

    batch_size = 16
    gpus = 2
    nodes = 1
    workers = 8

    # image label dataset
    train_manager = FFCVPipelineManager("./data/image_label.beton",  # previously defined using dataset_creation.py
                                        pipeline_transforms=[

                                            # image pipeline
                                            [RandomResizedCropRGBImageDecoder((32, 32)),
                                             ToTensor(),
                                             ToTorchImage(),
                                             DivideImage255(dtype=torch.float32),
                                             RandomHorizontalFlip(p=0.5)],

                                            # label (int) pipeline
                                            [IntDecoder(),
                                             ToTensor()
                                             ]
                                        ],
                                        ordering=OrderOption.RANDOM)  # random ordering for training

    val_manager = FFCVPipelineManager("./data/image_label.beton",
                                      pipeline_transforms=[

                                          # image pipeline (different from train)
                                          [CenterCropRGBImageDecoder((32, 32), ratio=1.),
                                           ToTensor(),
                                           ToTorchImage(),
                                           DivideImage255(dtype=torch.float32)],

                                          # label (int) pipeline
                                          None  # if None, uses default
                                      ],
                                      ordering=OrderOption.SEQUENTIAL)  # sequential ordering for validation

    # datamodule creation
    # ignore test and predict steps, since managers are not defined.
    data_module = FFCVDataModule(batch_size, workers, train_manager=train_manager, val_manager=val_manager,
                                 is_dist=True, seed=seed)

    # define model
    model = LitAutoEncoder()

    # trainer
    trainer = pl.Trainer(strategy=DDPStrategy(find_unused_parameters=False), deterministic=True,
                         accelerator='gpu', devices=gpus, num_nodes=nodes, max_epochs=5, logger=False)

    # start training!
    trainer.fit(model, data_module)


if __name__ == '__main__':

    main()
```
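With distributed training, it helps to work out the effective batch size and the number of steps per epoch. The sketch below does this arithmetic for the values used in the example (256 toy samples, batch_size=16, 2 GPUs, 1 node). It assumes that batch_size is applied per process and that samples are split evenly across processes; neither assumption is verified against ffcv_pl here, and the helper name `ddp_batch_stats` is hypothetical.

```python
import math


def ddp_batch_stats(n_samples, batch_size, gpus, nodes):
    """Return (world_size, global_batch, samples_per_process, steps_per_epoch)."""
    world_size = gpus * nodes
    global_batch = batch_size * world_size
    # Assumption: samples are split evenly across processes and the last
    # partial batch is kept (ceil); a loader that drops it would use floor.
    samples_per_process = math.ceil(n_samples / world_size)
    steps_per_epoch = math.ceil(samples_per_process / batch_size)
    return world_size, global_batch, samples_per_process, steps_per_epoch


# Values from the example above: 256 toy samples, batch_size=16, 2 GPUs, 1 node.
print(ddp_batch_stats(n_samples=256, batch_size=16, gpus=2, nodes=1))
# (2, 32, 128, 8)
```

Under these assumptions each optimizer step sees 32 samples in total, which is worth keeping in mind when choosing the learning rate.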

Code Citations

Pytorch-Lightning:

@software{Falcon_PyTorch_Lightning_2019,
  author = {Falcon, William and {The PyTorch Lightning team}},
  doi = {10.5281/zenodo.3828935},
  license = {Apache-2.0},
  month = mar,
  title = {{PyTorch Lightning}},
  url = {https://github.com/Lightning-AI/lightning},
  version = {1.4},
  year = {2019}
}

FFCV:

@misc{leclerc2022ffcv,
  author = {Guillaume Leclerc and Andrew Ilyas and Logan Engstrom and Sung Min Park and Hadi Salman and Aleksander Madry},
  title = {{FFCV}: Accelerating Training by Removing Data Bottlenecks},
  year = {2022},
  howpublished = {\url{https://github.com/libffcv/ffcv/}},
  note = {commit 2544abdcc9ce77db12fecfcf9135496c648a7cd5}
}

Owner

Name: Dario Serez
Login: SerezD
Kind: user
Location: Genoa, Italy
Company: Italian Institute of Technology (IIT)

Repositories: 3
Profile: https://github.com/SerezD

Ph.D. student at "Istituto Italiano di Tecnologia" - PAVIS research line, Genoa, Italy

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Serez"
  given-names: "Dario"
title: "FFCV Pytorch Lightning"
version: 0.2.0
date-released: 2023-05-18
url: "https://github.com/SerezD/ffcv_pytorch_lightning"

GitHub Events

Total

Watch event: 4

Last Year

Watch event: 4

Committers

Last synced: about 1 year ago

All Time

Total Commits: 28
Total Committers: 2
Avg Commits per committer: 14.0
Development Distribution Score (DDS): 0.036

Past Year

Commits: 0
Committers: 0
Avg Commits per committer: 0.0
Development Distribution Score (DDS): 0.0

Top Committers

Name	Email	Commits
dserez	s**7@g**m	27
Dario Serez	6****D	1
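The committer figures above can be reproduced from the table. The sketch below assumes the common definition of the Development Distribution Score, DDS = 1 - (commits by the top committer / total commits); the exact formula used by this page is not stated, so treat it as an assumption that happens to match the reported values.

```python
def development_distribution_score(commits_per_committer):
    """DDS = 1 - (commits by the top committer / total commits)."""
    total = sum(commits_per_committer)
    if total == 0:
        return 0.0  # no commits in the period
    return 1 - max(commits_per_committer) / total


commits = [27, 1]  # from the Top Committers table
print(round(development_distribution_score(commits), 3))  # 0.036
print(sum(commits) / len(commits))                         # 14.0
```

A score close to 0 means that almost all commits come from a single committer, which is the case here.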

Issues and Pull Requests

Last synced: 6 months ago

All Time

Total issues: 0
Total pull requests: 3
Average time to close issues: N/A
Average time to close pull requests: 4 minutes
Total issue authors: 0
Total pull request authors: 1
Average comments per issue: 0
Average comments per pull request: 0.0
Merged pull requests: 3
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 0
Pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Issue authors: 0
Pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0


Top Authors

Issue Authors

Pull Request Authors

SerezD (3)

Top Labels

Issue Labels

Pull Request Labels

Packages

Total packages: 1
Total downloads:
- pypi 29 last-month

Total dependent packages: 0
Total dependent repositories: 0
Total versions: 13
Total maintainers: 1

pypi.org: ffcv-pl

manage fast data loading with ffcv and pytorch lightning

Homepage: https://github.com/SerezD/ffcv_pytorch_lightning
Documentation: https://ffcv-pl.readthedocs.io/
License: MIT
Latest release: 0.3.2
published over 2 years ago

Versions: 13
Dependent Packages: 0
Dependent Repositories: 0
Downloads: 29 Last month

Rankings

Dependent packages count: 6.6%

Downloads: 17.7%

Average: 21.4%

Stargazers count: 21.8%

Forks count: 30.5%

Dependent repos count: 30.6%

Maintainers (1)

dserez

Last synced: 6 months ago

Dependencies

environment.yml conda

cupy
libjpeg-turbo >=2.1.4
numba
opencv
pip
pkg-config
pytorch >=2.0.0
pytorch-cuda 11.8.*
pytorch-lightning >=2.0.0
torchaudio >=2.0.1
torchvision >=0.15.1

requirements.txt pypi

PyTurboJPEG *
cupy-cuda11x *
ffcv >=1.0.0
numba *
opencv-python *
pkgconfig *
pytorch-lightning >=2.0.0
torch *
torchaudio *
torchvision *

src/setup.py pypi
