https://github.com/fgnt/lazy_dataset

lazy_dataset: Process large datasets as if it was an iterable.

Science Score: 31.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
    7 of 20 committers (35.0%) from academic institutions
  • Institutional organization owner
    Organization fgnt has institutional domain (nt.uni-paderborn.de)
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (9.2%) to scientific vocabulary

Keywords from Contributors

audio speech toolbox dereverberation enhancement signal-processing asr der wer
Last synced: 6 months ago

Repository

lazy_dataset: Process large datasets as if it was an iterable.

Basic Info
Statistics
  • Stars: 18
  • Watchers: 9
  • Forks: 8
  • Open Issues: 5
  • Releases: 0
Created about 7 years ago · Last pushed over 1 year ago
Metadata Files
Readme License

README.md

lazy_dataset

lazy_dataset is a helper to deal with large datasets that do not fit into memory. It allows you to define transformations that are applied lazily (e.g. a mapping function that reads data from disk). All transformations are applied only when someone iterates over the dataset.

Supported transformations:

  • dataset.map(map_fn): Apply the function map_fn to each example (builtins.map).
  • dataset[2]: Get the example at index 2.
  • dataset['example_id']: Get the example that has the example ID 'example_id'.
  • dataset[10:20]: Get a sub-dataset that contains only the examples in the slice 10 to 20.
  • dataset.filter(filter_fn, lazy=True): Drops examples for which filter_fn(example) is false (builtins.filter).
  • dataset.concatenate(*others): Concatenates two or more datasets (numpy.concatenate).
  • dataset.intersperse(*others): Combines two or more datasets such that examples of each input dataset are evenly spaced (https://stackoverflow.com/a/19293603).
  • dataset.zip(*others): Zips two or more datasets.
  • dataset.shuffle(reshuffle=False): Shuffles the dataset. When reshuffle is True, it reshuffles every time you iterate over the data.
  • dataset.tile(reps, shuffle=False): Repeats the dataset reps times and concatenates the copies (numpy.tile).
  • dataset.cycle(): Repeats the dataset endlessly (itertools.cycle, but without caching).
  • dataset.groupby(group_fn): Groups examples together. In contrast to itertools.groupby, a prior sort is not necessary, as in pandas (itertools.groupby, pandas.DataFrame.groupby).
  • dataset.sort(key_fn, sort_fn=sorted): Sorts the examples by the values key_fn(example) (list.sort).
  • dataset.batch(batch_size, drop_last=False): Batches batch_size examples together as a list. Usually followed by a map (tensorflow.data.Dataset.batch).
  • dataset.random_choice(): Gets a random example (numpy.random.choice).
  • dataset.cache(): Caches the dataset in RAM (similar to ESPnet's keep_all_data_on_mem).
  • dataset.diskcache(): Caches the dataset in a cache directory on the local filesystem (useful on clusters with slow network filesystems).
  • ...
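The laziness described above can be sketched with plain Python generators: chained map/filter calls only record the transformations, and nothing runs until the dataset is iterated. The class below is an illustrative stand-in, not the lazy_dataset implementation.

```python
# Minimal sketch (NOT the lazy_dataset implementation) of lazily chained
# map/filter transformations: each call returns a new object that records
# the transformation; work happens only on iteration.
class LazySketch:
    def __init__(self, examples):
        self._examples = list(examples)
        self._transforms = []  # list of ("map" | "filter", fn) pairs

    def _derive(self, kind, fn):
        new = LazySketch(self._examples)
        new._transforms = self._transforms + [(kind, fn)]
        return new

    def map(self, fn):
        return self._derive("map", fn)

    def filter(self, fn):
        return self._derive("filter", fn)

    def __iter__(self):
        it = iter(self._examples)
        # Apply the recorded transformations lazily, in order.
        for kind, fn in self._transforms:
            it = map(fn, it) if kind == "map" else filter(fn, it)
        yield from it


ds = LazySketch([1, 2, 3, 4]).map(lambda x: x * 10).filter(lambda x: x > 15)
print(list(ds))  # [20, 30, 40]
```

Because each method returns a new object, the base data can be shared by several derived datasets, and no transformed example is materialized before iteration.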

```python
>>> from IPython.lib.pretty import pprint
>>> import lazy_dataset
>>> examples = {
...     'example_id_1': {
...         'observation': [1, 2, 3],
...         'label': 1,
...     },
...     'example_id_2': {
...         'observation': [4, 5, 6],
...         'label': 2,
...     },
...     'example_id_3': {
...         'observation': [7, 8, 9],
...         'label': 3,
...     },
... }
>>> for example_id, example in examples.items():
...     example['example_id'] = example_id
>>> ds = lazy_dataset.new(examples)
>>> ds
  DictDataset(len=3)
MapDataset(_pickle.loads)
>>> ds.keys()
('example_id_1', 'example_id_2', 'example_id_3')
>>> for example in ds:
...     print(example)
{'observation': [1, 2, 3], 'label': 1, 'example_id': 'example_id_1'}
{'observation': [4, 5, 6], 'label': 2, 'example_id': 'example_id_2'}
{'observation': [7, 8, 9], 'label': 3, 'example_id': 'example_id_3'}
>>> def transform(example):
...     example['label'] *= 10
...     return example
>>> ds = ds.map(transform)
>>> for example in ds:
...     print(example)
{'observation': [1, 2, 3], 'label': 10, 'example_id': 'example_id_1'}
{'observation': [4, 5, 6], 'label': 20, 'example_id': 'example_id_2'}
{'observation': [7, 8, 9], 'label': 30, 'example_id': 'example_id_3'}
>>> ds = ds.filter(lambda example: example['label'] > 15)
>>> for example in ds:
...     print(example)
{'observation': [4, 5, 6], 'label': 20, 'example_id': 'example_id_2'}
{'observation': [7, 8, 9], 'label': 30, 'example_id': 'example_id_3'}
>>> ds['example_id_2']
{'observation': [4, 5, 6], 'label': 20, 'example_id': 'example_id_2'}
>>> ds
      DictDataset(len=3)
    MapDataset(_pickle.loads)
  MapDataset(<function transform at 0x...>)
FilterDataset(<function <lambda> at 0x7ff74efb67b8>)
```
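The batch(batch_size, drop_last=False) semantics listed above can be sketched with the standard library alone. This is an illustrative stand-in for the documented behavior, not the lazy_dataset code: consecutive examples are grouped into lists of batch_size, and a trailing partial batch is kept unless drop_last is set.

```python
from itertools import islice

def batch(iterable, batch_size, drop_last=False):
    """Sketch of dataset.batch semantics (not the lazy_dataset source):
    yield lists of batch_size consecutive examples; a final partial
    batch is yielded unless drop_last=True."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, batch_size))
        if not chunk:
            return
        if len(chunk) < batch_size and drop_last:
            return
        yield chunk


print(list(batch(range(5), 2)))                  # [[0, 1], [2, 3], [4]]
print(list(batch(range(5), 2, drop_last=True)))  # [[0, 1], [2, 3]]
```

As the README notes, a batch step is usually followed by a map that collates each list of examples (e.g. stacking the observations) into a single training batch.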

Comparison with PyTorch's DataLoader

See here for a feature and throughput comparison of lazy_dataset with PyTorch's DataLoader.

Installation

If you just want to use it, install it directly with pip:

```bash
pip install lazy_dataset
```

If you want to make changes or need the most recent version, clone the repository and install it as follows:

```bash
git clone https://github.com/fgnt/lazy_dataset.git
cd lazy_dataset
pip install --editable .
```

Owner

  • Name: Department of Communications Engineering University of Paderborn
  • Login: fgnt
  • Kind: organization
  • Location: Paderborn, Germany

GitHub Events

Total
  • Watch event: 1
  • Issue comment event: 1
  • Push event: 2
  • Pull request review event: 1
  • Pull request event: 3
Last Year
  • Watch event: 1
  • Issue comment event: 1
  • Push event: 2
  • Pull request review event: 1
  • Pull request event: 3

Committers

Last synced: 7 months ago

All Time
  • Total Commits: 463
  • Total Committers: 20
  • Avg Commits per committer: 23.15
  • Development Distribution Score (DDS): 0.559
Past Year
  • Commits: 7
  • Committers: 1
  • Avg Commits per committer: 7.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
Christoph c****j@m****e 204
Thilo von Neumann t****n@m****e 67
Janek Ebbers e****s@n****e 46
Lukas Drude m****l@l****e 30
Thomas Glarner t****r@m****e 26
Christoph Boeddeker b****r@u****m 21
Jahn Heymann h****n@n****e 18
jensheit h****r@n****e 13
mkuhlmann m****l@m****e 12
Lukas Drude l****e@m****e 9
deegen m****n@w****e 6
jensheit 3****t@u****m 2
mdeegen 8****n@u****m 2
Jahn Heymann j****n@m****e 1
Janek Ebbers j****2@g****m 1
Juan Manuel Martin-Donas j****s@n****e 1
Michael Kuhlmann k****n@n****e 1
Oliver Walter w****r@n****e 1
danielha d****r@g****e 1
raphaelk r****k@m****e 1

Issues and Pull Requests

Last synced: 8 months ago

All Time
  • Total issues: 6
  • Total pull requests: 60
  • Average time to close issues: 11 months
  • Average time to close pull requests: 11 days
  • Total issue authors: 4
  • Total pull request authors: 8
  • Average comments per issue: 1.5
  • Average comments per pull request: 1.25
  • Merged pull requests: 58
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 2
  • Average time to close issues: N/A
  • Average time to close pull requests: about 15 hours
  • Issue authors: 0
  • Pull request authors: 1
  • Average comments per issue: 0
  • Average comments per pull request: 0.5
  • Merged pull requests: 2
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • thequilo (2)
  • boeddeker (2)
  • LukasDrude (1)
  • tglarner (1)
Pull Request Authors
  • boeddeker (30)
  • JanekEbb (12)
  • thequilo (10)
  • tglarner (3)
  • michael-kuhlmann (2)
  • jensheit (2)
  • alexanderwerning (1)
  • mdeegen (1)
Top Labels
Issue Labels
Pull Request Labels

Packages

  • Total packages: 1
  • Total downloads:
    • pypi: 6,190 last month
  • Total dependent packages: 1
  • Total dependent repositories: 1
  • Total versions: 15
  • Total maintainers: 4
pypi.org: lazy-dataset

Process large datasets as if it was an iterable.

  • Versions: 15
  • Dependent Packages: 1
  • Dependent Repositories: 1
  • Downloads: 6,190 Last month
Rankings
Dependent packages count: 4.7%
Forks count: 11.9%
Average: 13.5%
Downloads: 14.2%
Stargazers count: 14.9%
Dependent repos count: 21.8%
Maintainers (4)
Last synced: 7 months ago

Dependencies

.github/workflows/run_python_tests.yml actions
  • actions/checkout v2 composite
  • actions/setup-python v2 composite
setup.py pypi