ffcv-pl

[FFCV-PL] manage fast data loading with ffcv and pytorch lightning

https://github.com/serezd/ffcv_pytorch_lightning

Science Score: 57.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 1 DOI reference(s) in README
  • Academic publication links
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (13.2%) to scientific vocabulary

Keywords

dataloader ffcv pytorch pytorch-lightning
Last synced: 6 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: SerezD
  • License: mit
  • Language: Python
  • Default Branch: master
  • Homepage:
  • Size: 64.5 KB
Statistics
  • Stars: 15
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Topics
dataloader ffcv pytorch pytorch-lightning
Created about 3 years ago · Last pushed over 2 years ago
Metadata Files
Readme License Citation

readme.md

FFCV Dataloader with Pytorch Lightning

FFCV is a fast dataloader for neural network training: https://github.com/libffcv/ffcv

This repository presents the steps to install FFCV and configure it with pytorch-lightning.
The idea is to provide generic methods and utilities while leaving every configuration choice to the user.

Installation

Tested with: Ubuntu 22.04.2 LTS, python 3.11, ffcv==1.0.2, pytorch==2.0.1, pytorch-lightning==2.0.4

Dependencies

You can install dependencies (FFCV, Pytorch) with the provided environment.yml file:

    conda env create --file environment.yml
    conda activate ffcv-pl

This should correctly create a conda environment named ffcv-pl.

Note: Modify the pytorch-cuda version to the one compatible with your system.

Note: Solving the environment can take quite a long time. I suggest using the libmamba solver to speed up the process.

If the above does not work, then another option is manual installation:

  1. create conda environment

        conda create --name ffcv-pl
        conda activate ffcv-pl

  2. install pytorch according to official website

    ```
    # in my environment the command is the following
    conda install pytorch torchvision torchaudio pytorch-cuda=[your-version] -c pytorch -c nvidia
    ```

  3. install ffcv dependencies and pytorch-lightning

    ```
    # can take some time for solving, but should not create conflicts
    conda install cupy pkg-config libjpeg-turbo">=2.1.4" opencv numba pytorch-lightning">=2.0.0" -c pytorch -c conda-forge
    ```

  4. install ffcv

        pip install ffcv

For further help, check out FFCV installation guidelines: ffcv official page

Package

Once dependencies are installed, it is safe to install the package:

    pip install ffcv_pl
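After installation, it can help to compare what is actually installed against the versions listed under "Tested with". The following is a minimal sketch using only the standard library. The helper names (`installed_versions`, `compare_with_tested`) are hypothetical and not part of ffcv_pl, and the distribution names `ffcv`, `torch` and `pytorch-lightning` are assumed to be how the packages register themselves in your environment.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions reported as tested in this README.
# Assumption: these are the distribution names registered in the environment.
TESTED = {"ffcv": "1.0.2", "torch": "2.0.1", "pytorch-lightning": "2.0.4"}


def installed_versions(names):
    """Return {distribution name: installed version, or None if it is missing}."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


def compare_with_tested(tested=TESTED):
    """Return one human-readable line per package, flagging any mismatch."""
    lines = []
    for name, have in installed_versions(tested).items():
        want = tested[name]
        if have is None:
            status = "not installed"
        elif have.split("+")[0] == want:  # ignore local build tags such as "+cu118"
            status = "matches tested version"
        else:
            status = f"differs from tested version {want}"
        lines.append(f"{name}: {have} ({status})")
    return lines


if __name__ == "__main__":
    print("\n".join(compare_with_tested()))
```

A differing version is not necessarily a problem; the check only makes it easier to spot where an environment departs from the tested one.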

Dataset Creation

You need to save your dataset in ffcv format (.beton).
Official FFCV docs.

This package provides the create_beton_wrapper function, which makes it easy to create a .beton dataset from a torch dataset.

Example from the dataset_creation.py script:

```
from ffcv.fields import RGBImageField

from ffcv_pl.generate_dataset import create_beton_wrapper
from torch.utils.data.dataset import Dataset
import numpy as np
from PIL import Image


class ToyImageLabelDataset(Dataset):

    def __init__(self, n_samples: int):
        self.samples = [Image.fromarray((np.random.rand(32, 32, 3) * 255).astype('uint8')).convert('RGB')
                        for _ in range(n_samples)]

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        return (self.samples[idx], int(idx))


def main():

    # 1. Instantiate the torch dataset that you want to create
    # Important: the dataset's __getitem__ must return tuples! (This depends on FFCV library)
    image_label_dataset = ToyImageLabelDataset(n_samples=256)

    # 2. Optional: create Field objects.
    # here overwrites only RGBImageField, leave default IntField.
    fields = (RGBImageField(write_mode='jpg', max_resolution=32), None)

    # 3. call the method, and it will automatically create the .beton dataset for you.
    create_beton_wrapper(image_label_dataset, "./data/image_label.beton", fields)


if __name__ == '__main__':

    main()

```
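The example above requires the dataset to return tuples. If an existing dataset returns dictionaries instead, a thin adapter can convert each sample before the .beton file is written. This is a minimal sketch: `TupleAdapter` and the key names are hypothetical and not part of ffcv_pl, and the toy list stands in for a real dataset. The `try`/`except` import only lets the sketch run where torch is not installed.

```python
try:
    from torch.utils.data import Dataset
except ImportError:  # fallback so the sketch also runs without torch
    Dataset = object


class TupleAdapter(Dataset):
    """Wrap a dict-returning dataset so that __getitem__ returns a tuple.

    `keys` fixes the order of the tuple entries; any Field objects passed
    alongside the dataset should be listed in the same order.
    """

    def __init__(self, base, keys):
        self.base = base
        self.keys = tuple(keys)

    def __len__(self):
        return len(self.base)

    def __getitem__(self, idx):
        sample = self.base[idx]
        return tuple(sample[key] for key in self.keys)


# Toy stand-in for a dataset whose samples are dictionaries.
toy = [{"image": f"img{i}", "label": i} for i in range(4)]
adapted = TupleAdapter(toy, keys=("image", "label"))
```

The adapted dataset can then be used wherever the tuple-returning dataset is expected, with one field per tuple position.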

Dataloader and Datamodule

Merge the PL Datamodule with the FFCV Loader object.
Official FFCV Loader docs.
Official Pytorch-Lightning DataModule docs.

main.py gives a complete example of how to use the FFCVDataModule class and train a Lightning model.

The main steps to follow are:

  1. create an FFCVPipelineManager object, which needs the path to a previously created .beton file, a list of operations to perform on each item returned by your dataset, and an ordering option for loading.
  2. create the FFCVDataModule object, which is a Lightning DataModule with an FFCV Loader.
  3. pass the data module to the Pytorch Lightning trainer, and run!

Suggestion: read the FFCV performance guide to better understand which options fit your needs.

Complete Example from the main.py script:

```
import pytorch_lightning as pl
import torch
from ffcv.fields.basics import IntDecoder
from ffcv.fields.rgb_image import RandomResizedCropRGBImageDecoder, CenterCropRGBImageDecoder
from ffcv.loader import OrderOption
from ffcv.transforms import ToTensor, ToTorchImage
from pytorch_lightning.strategies.ddp import DDPStrategy

from torch import nn
from torch.optim import Adam
from torchvision.transforms import RandomHorizontalFlip

from ffcv_pl.data_loading import FFCVDataModule
from ffcv_pl.ffcv_utils.augmentations import DivideImage255

from ffcv_pl.ffcv_utils.utils import FFCVPipelineManager


# define the LightningModule
class LitAutoEncoder(pl.LightningModule):

    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(32 * 32 * 3, 64), nn.ReLU(), nn.Linear(64, 3))
        self.decoder = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 32 * 32 * 3))

    def training_step(self, batch, batch_idx):

        x = batch[0]

        b, c, h, w = x.shape
        x = x.reshape(b, -1)
        z = self.encoder(x)
        x_hat = self.decoder(z)
        loss = nn.functional.mse_loss(x_hat, x)

        # Logging to TensorBoard by default
        self.log("train_loss", loss)
        return loss

    def validation_step(self, batch, batch_idx):
        pass

    def configure_optimizers(self):
        optimizer = Adam(self.parameters(), lr=1e-3)
        return optimizer


def main():

    seed = 1234

    pl.seed_everything(seed, workers=True)

    batch_size = 16
    gpus = 2
    nodes = 1
    workers = 8

    # image label dataset
    train_manager = FFCVPipelineManager("./data/image_label.beton",  # previously defined using dataset_creation.py
                                        pipeline_transforms=[

                                            # image pipeline
                                            [RandomResizedCropRGBImageDecoder((32, 32)),
                                             ToTensor(),
                                             ToTorchImage(),
                                             DivideImage255(dtype=torch.float32),
                                             RandomHorizontalFlip(p=0.5)],

                                            # label (int) pipeline
                                            [IntDecoder(),
                                             ToTensor()
                                             ]
                                        ],
                                        ordering=OrderOption.RANDOM)  # random ordering for training

    val_manager = FFCVPipelineManager("./data/image_label.beton",
                                      pipeline_transforms=[

                                          # image pipeline (different from train)
                                          [CenterCropRGBImageDecoder((32, 32), ratio=1.),
                                           ToTensor(),
                                           ToTorchImage(),
                                           DivideImage255(dtype=torch.float32)],

                                          # label (int) pipeline
                                          None  # if None, uses default
                                      ],
                                      ordering=OrderOption.SEQUENTIAL)  # sequential ordering for validation

    # datamodule creation
    # ignore test and predict steps, since managers are not defined.
    data_module = FFCVDataModule(batch_size, workers, train_manager=train_manager, val_manager=val_manager,
                                 is_dist=True, seed=seed)

    # define model
    model = LitAutoEncoder()

    # trainer
    trainer = pl.Trainer(strategy=DDPStrategy(find_unused_parameters=False), deterministic=True,
                         accelerator='gpu', devices=gpus, num_nodes=nodes, max_epochs=5, logger=False)

    # start training!
    trainer.fit(model, data_module)


if __name__ == '__main__':

    main()

```
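To make the shapes in the example easier to follow, the sketch below mimics in plain NumPy what the image pipeline is expected to hand to `training_step`: a channels-first float batch scaled to [0, 1], which is then flattened to match `nn.Linear(32 * 32 * 3, 64)`. This is only an illustration, not the FFCV implementation. The function name is hypothetical, and the sketch assumes that `ToTorchImage` moves channels before height and width and that `DivideImage255` casts to the given dtype and divides by 255.

```python
import numpy as np


def to_channels_first_unit_range(batch_hwc: np.ndarray) -> np.ndarray:
    """NumPy stand-in for the layout change and rescaling of the image pipeline."""
    if batch_hwc.dtype != np.uint8 or batch_hwc.ndim != 4:
        raise ValueError("expected a uint8 batch shaped (B, H, W, C)")
    batch_chw = batch_hwc.transpose(0, 3, 1, 2)   # (B, H, W, C) -> (B, C, H, W)
    return batch_chw.astype(np.float32) / 255.0   # cast, then scale to [0, 1]


# Fake batch of decoded images: batch_size=16, 32x32 pixels, RGB.
rng = np.random.default_rng(1234)
decoded = rng.integers(0, 256, size=(16, 32, 32, 3), dtype=np.uint8)

images = to_channels_first_unit_range(decoded)

# Same flattening as x.reshape(b, -1) in training_step.
flat = images.reshape(images.shape[0], -1)
```

Each flattened row has 32 * 32 * 3 = 3072 entries, which is the input size of the first `nn.Linear` layer in the encoder.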

Code Citations

  1. Pytorch-Lightning:

        @software{Falcon_PyTorch_Lightning_2019,
          author = {Falcon, William and {The PyTorch Lightning team}},
          doi = {10.5281/zenodo.3828935},
          license = {Apache-2.0},
          month = mar,
          title = {{PyTorch Lightning}},
          url = {https://github.com/Lightning-AI/lightning},
          version = {1.4},
          year = {2019}
        }

  2. FFCV:

        @misc{leclerc2022ffcv,
          author = {Guillaume Leclerc and Andrew Ilyas and Logan Engstrom and Sung Min Park and Hadi Salman and Aleksander Madry},
          title = {{FFCV}: Accelerating Training by Removing Data Bottlenecks},
          year = {2022},
          howpublished = {\url{https://github.com/libffcv/ffcv/}},
          note = {commit 2544abdcc9ce77db12fecfcf9135496c648a7cd5}
        }

Owner

  • Name: Dario Serez
  • Login: SerezD
  • Kind: user
  • Location: Genoa, Italy
  • Company: Italian Institute of Technology (IIT)

Ph.D. student at "Istituto Italiano di Tecnologia" - PAVIS research line, Genoa, Italy

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Serez"
  given-names: "Dario"
title: "FFCV Pytorch Lightning"
version: 0.2.0
date-released: 2023-05-18
url: "https://github.com/SerezD/ffcv_pytorch_lightning"

GitHub Events

Total
  • Watch event: 4
Last Year
  • Watch event: 4

Committers

Last synced: about 1 year ago

All Time
  • Total Commits: 28
  • Total Committers: 2
  • Avg Commits per committer: 14.0
  • Development Distribution Score (DDS): 0.036
Past Year
  • Commits: 0
  • Committers: 0
  • Avg Commits per committer: 0.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
dserez s****7@g****m 27
Dario Serez 6****D 1

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 0
  • Total pull requests: 3
  • Average time to close issues: N/A
  • Average time to close pull requests: 4 minutes
  • Total issue authors: 0
  • Total pull request authors: 1
  • Average comments per issue: 0
  • Average comments per pull request: 0.0
  • Merged pull requests: 3
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
Pull Request Authors
  • SerezD (3)
Top Labels
Issue Labels
Pull Request Labels

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 29 last-month
  • Total dependent packages: 0
  • Total dependent repositories: 0
  • Total versions: 13
  • Total maintainers: 1
pypi.org: ffcv-pl

manage fast data loading with ffcv and pytorch lightning

  • Versions: 13
  • Dependent Packages: 0
  • Dependent Repositories: 0
  • Downloads: 29 Last month
Rankings
Dependent packages count: 6.6%
Downloads: 17.7%
Average: 21.4%
Stargazers count: 21.8%
Forks count: 30.5%
Dependent repos count: 30.6%
Maintainers (1)
Last synced: 6 months ago

Dependencies

environment.yml conda
  • cupy
  • libjpeg-turbo >=2.1.4
  • numba
  • opencv
  • pip
  • pkg-config
  • pytorch >=2.0.0
  • pytorch-cuda 11.8.*
  • pytorch-lightning >=2.0.0
  • torchaudio >=2.0.1
  • torchvision >=0.15.1
requirements.txt pypi
  • PyTurboJPEG *
  • cupy-cuda11x *
  • ffcv >=1.0.0
  • numba *
  • opencv-python *
  • pkgconfig *
  • pytorch-lightning >=2.0.0
  • torch *
  • torchaudio *
  • torchvision *
src/setup.py pypi