https://github.com/fgnt/lazy_dataset
lazy_dataset: Process large datasets as if it was an iterable.
Science Score: 31.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file (found codemeta.json file)
- ○ .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ✓ Committers with academic emails (7 of 20 committers, 35.0%, from academic institutions)
- ✓ Institutional organization owner (organization fgnt has institutional domain nt.uni-paderborn.de)
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity (low similarity, 9.2%, to scientific vocabulary)
Repository
lazy_dataset: Process large datasets as if it was an iterable.
Basic Info
- Host: GitHub
- Owner: fgnt
- License: mit
- Language: Python
- Default Branch: master
- Homepage: https://pypi.org/project/lazy-dataset/
- Size: 1.2 MB
Statistics
- Stars: 18
- Watchers: 9
- Forks: 8
- Open Issues: 5
- Releases: 0
Metadata Files
README.md
lazy_dataset
lazy_dataset is a helper for dealing with large datasets that do not fit into memory. It allows you to define transformations that are applied lazily (e.g., a mapping function that reads data from disk). All transformations are applied only when you iterate over the dataset.
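The lazy-map idea can be illustrated with a minimal stand-in class (this is a sketch of the pattern, not the library's actual implementation): the function is only stored when `map` is called, and applied per example during iteration.

```python
# Minimal sketch of the lazy-map pattern: the transformation is recorded,
# not executed, until someone iterates over the dataset.
class LazyMap:
    def __init__(self, data, fn):
        self.data = data
        self.fn = fn  # stored, not applied yet

    def __iter__(self):
        for example in self.data:
            yield self.fn(example)  # applied only now, one example at a time

loads = []  # records which examples have actually been "read"
ds = LazyMap([1, 2, 3], lambda x: loads.append(x) or x * 10)

assert loads == []            # defining the map did no work
out = list(ds)                # iteration triggers the transformation
assert out == [10, 20, 30]
assert loads == [1, 2, 3]     # each example was processed exactly once
```

This is why chaining many transformations stays cheap: nothing touches the data until iteration.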
Supported transformations:
- dataset.map(map_fn): Apply the function map_fn to each example (builtins.map).
- dataset[2]: Get the example at index 2.
- dataset['example_id']: Get the example that has the example ID 'example_id'.
- dataset[10:20]: Get a sub-dataset that contains only the examples in the slice 10 to 20.
- dataset.filter(filter_fn, lazy=True): Drop examples for which filter_fn(example) is false (builtins.filter).
- dataset.concatenate(*others): Concatenate two or more datasets (numpy.concatenate).
- dataset.intersperse(*others): Combine two or more datasets such that the examples of each input dataset are evenly spaced (https://stackoverflow.com/a/19293603).
- dataset.zip(*others): Zip two or more datasets.
- dataset.shuffle(reshuffle=False): Shuffle the dataset. When reshuffle is True, it reshuffles each time you iterate over the data.
- dataset.tile(reps, shuffle=False): Repeat the dataset reps times and concatenate the copies (numpy.tile).
- dataset.cycle(): Repeat the dataset endlessly (itertools.cycle, but without caching).
- dataset.groupby(group_fn): Group examples together. In contrast to itertools.groupby, a prior sort is not necessary, as in pandas (itertools.groupby, pandas.DataFrame.groupby).
- dataset.sort(key_fn, sort_fn=sorted): Sort the examples by the values key_fn(example) (list.sort).
- dataset.batch(batch_size, drop_last=False): Batch batch_size examples together as a list. Usually followed by a map (tensorflow.data.Dataset.batch).
- dataset.random_choice(): Get a random example (numpy.random.choice).
- dataset.cache(): Cache in RAM (similar to ESPnet's keep_all_data_on_mem).
- dataset.diskcache(): Cache to a cache directory on the local filesystem (useful on clusters with slow network filesystems).
- ...
```python
>>> from IPython.lib.pretty import pprint
>>> import lazy_dataset
>>> examples = {
...     'example_id_1': {
...         'observation': [1, 2, 3],
...         'label': 1,
...     },
...     'example_id_2': {
...         'observation': [4, 5, 6],
...         'label': 2,
...     },
...     'example_id_3': {
...         'observation': [7, 8, 9],
...         'label': 3,
...     },
... }
>>> for example_id, example in examples.items():
...     example['example_id'] = example_id
>>> ds = lazy_dataset.new(examples)
>>> ds
  DictDataset(len=3)
MapDataset(_pickle.loads)
>>> ds.keys()
('example_id_1', 'example_id_2', 'example_id_3')
>>> for example in ds:
...     print(example)
{'observation': [1, 2, 3], 'label': 1, 'example_id': 'example_id_1'}
{'observation': [4, 5, 6], 'label': 2, 'example_id': 'example_id_2'}
{'observation': [7, 8, 9], 'label': 3, 'example_id': 'example_id_3'}
>>> def transform(example):
...     example['label'] *= 10
...     return example
>>> ds = ds.map(transform)
>>> for example in ds:
...     print(example)
{'observation': [1, 2, 3], 'label': 10, 'example_id': 'example_id_1'}
{'observation': [4, 5, 6], 'label': 20, 'example_id': 'example_id_2'}
{'observation': [7, 8, 9], 'label': 30, 'example_id': 'example_id_3'}
>>> ds = ds.filter(lambda example: example['label'] > 15)
>>> for example in ds:
...     print(example)
{'observation': [4, 5, 6], 'label': 20, 'example_id': 'example_id_2'}
{'observation': [7, 8, 9], 'label': 30, 'example_id': 'example_id_3'}
>>> ds['example_id_2']
{'observation': [4, 5, 6], 'label': 20, 'example_id': 'example_id_2'}
>>> ds
      DictDataset(len=3)
    MapDataset(_pickle.loads)
  MapDataset(<function transform at ...>)
FilterDataset(<function <lambda> at 0x7ff74efb67b8>)
```
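The batch(batch_size, drop_last=False) transformation from the list above can be sketched in plain Python (a minimal stand-in that mirrors the README's description, not the library's actual implementation):

```python
# Sketch of batch semantics: group consecutive examples into lists of
# batch_size; keep or drop the final short batch depending on drop_last.
def batch(iterable, batch_size, drop_last=False):
    buf = []
    for example in iterable:
        buf.append(example)
        if len(buf) == batch_size:
            yield buf
            buf = []
    if buf and not drop_last:
        yield buf  # last, possibly incomplete, batch

print(list(batch(range(5), 2)))        # [[0, 1], [2, 3], [4]]
print(list(batch(range(5), 2, True)))  # [[0, 1], [2, 3]]
```

In the real library the result is again a lazy dataset, so batching is typically followed by a map that collates each list into arrays.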
Comparison with PyTorch's DataLoader
See here for a feature and throughput comparison of lazy_dataset with PyTorch's DataLoader.
Installation
Install it directly with pip if you just want to use it:

```bash
pip install lazy_dataset
```

If you want to make changes or need the most recent version, clone the repository and install it as follows:

```bash
git clone https://github.com/fgnt/lazy_dataset.git
cd lazy_dataset
pip install --editable .
```
Owner
- Name: Department of Communications Engineering University of Paderborn
- Login: fgnt
- Kind: organization
- Location: Paderborn, Germany
- Website: http://nt.uni-paderborn.de
- Repositories: 37
- Profile: https://github.com/fgnt
GitHub Events
Total
- Watch event: 1
- Issue comment event: 1
- Push event: 2
- Pull request review event: 1
- Pull request event: 3
Last Year
- Watch event: 1
- Issue comment event: 1
- Push event: 2
- Pull request review event: 1
- Pull request event: 3
Committers
Last synced: 7 months ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| Christoph | c****j@m****e | 204 |
| Thilo von Neumann | t****n@m****e | 67 |
| Janek Ebbers | e****s@n****e | 46 |
| Lukas Drude | m****l@l****e | 30 |
| Thomas Glarner | t****r@m****e | 26 |
| Christoph Boeddeker | b****r@u****m | 21 |
| Jahn Heymann | h****n@n****e | 18 |
| jensheit | h****r@n****e | 13 |
| mkuhlmann | m****l@m****e | 12 |
| Lukas Drude | l****e@m****e | 9 |
| deegen | m****n@w****e | 6 |
| jensheit | 3****t@u****m | 2 |
| mdeegen | 8****n@u****m | 2 |
| Jahn Heymann | j****n@m****e | 1 |
| Janek Ebbers | j****2@g****m | 1 |
| Juan Manuel Martin-Donas | j****s@n****e | 1 |
| Michael Kuhlmann | k****n@n****e | 1 |
| Oliver Walter | w****r@n****e | 1 |
| danielha | d****r@g****e | 1 |
| raphaelk | r****k@m****e | 1 |
Committer Domains (Top 20 + Academic)
Issues and Pull Requests
Last synced: 8 months ago
All Time
- Total issues: 6
- Total pull requests: 60
- Average time to close issues: 11 months
- Average time to close pull requests: 11 days
- Total issue authors: 4
- Total pull request authors: 8
- Average comments per issue: 1.5
- Average comments per pull request: 1.25
- Merged pull requests: 58
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 2
- Average time to close issues: N/A
- Average time to close pull requests: about 15 hours
- Issue authors: 0
- Pull request authors: 1
- Average comments per issue: 0
- Average comments per pull request: 0.5
- Merged pull requests: 2
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- thequilo (2)
- boeddeker (2)
- LukasDrude (1)
- tglarner (1)
Pull Request Authors
- boeddeker (30)
- JanekEbb (12)
- thequilo (10)
- tglarner (3)
- michael-kuhlmann (2)
- jensheit (2)
- alexanderwerning (1)
- mdeegen (1)
Packages
- Total packages: 1
- Total downloads: 6,190 last month (pypi)
- Total dependent packages: 1
- Total dependent repositories: 1
- Total versions: 15
- Total maintainers: 4
pypi.org: lazy-dataset
Process large datasets as if it was an iterable.
- Homepage: https://github.com/fgnt/lazy_dataset
- Documentation: https://lazy-dataset.readthedocs.io/
- License: MIT
- Latest release: 0.0.15 (published almost 2 years ago)
Rankings
Maintainers (4)
Dependencies
- actions/checkout v2 composite
- actions/setup-python v2 composite