https://github.com/cosmaadrian/acumen-indexer

Utility for constructing highly efficient in-memory / on-disk datasets.

Science Score: 23.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
✓ codemeta.json file: found codemeta.json file
○ .zenodo.json file
○ DOI references
✓ Academic publication links: links to scholar.google
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: low similarity (15.3%) to scientific vocabulary

Keywords

dataset-manager utility

Last synced: 11 months ago · JSON representation

Repository

Basic Info

Host: GitHub
Owner: cosmaadrian
License: mit
Language: Python
Default Branch: master
Homepage: https://github.com/cosmaadrian/acumen-indexer
Size: 20.5 KB

Statistics

Stars: 3
Watchers: 1
Forks: 0
Open Issues: 2
Releases: 0

Topics

dataset-manager utility

Created over 2 years ago · Last pushed over 1 year ago

Metadata Files

Readme License

Acumen 👉🏻 Indexer 👈🏻

Coded with love and coffee ☕ by Adrian Cosma. But I need more coffee!

Description

AcumenIndexer helps organize big datasets made of many small instances into a common, highly efficient format that supports random access, using either RAM or HDD to store the binary data chunks.

But why?

Current approaches to storing and accessing data are often inefficient, especially among beginner data scientists, with each practitioner having their own way of doing things. It is not always possible to hold the whole dataset in RAM, so a common approach is to store each training instance in a separate file. Datasets composed of many images or other small files are difficult to handle in practice (e.g., transferring the dataset over ssh or zipping it takes a long time). Keeping many files in a single folder can also cause performance issues on certain filesystems and even lead to crashes.

But how?

A simple way to overcome the issue of big datasets with many small instances is to keep only the metadata and the index in RAM, and to use a random-access mechanism for the big binary chunks of data on disk.
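To make the idea concrete, here is a minimal standard-library sketch of a byte-offset index. It is not AcumenIndexer's implementation: `write_chunk`, `read_item`, and the index layout (`id`, `chunk`, `start`, `end`) are hypothetical names chosen for illustration, and the payloads are plain bytes rather than arrays.

```python
import os
import tempfile


def write_chunk(chunk_path, payloads):
    """Write payloads back to back into one file and return an in-RAM index."""
    index = []
    with open(chunk_path, "wb") as f:
        for item_id, payload in enumerate(payloads):
            start = f.tell()  # byte offset where this item begins
            f.write(payload)
            index.append(
                {"id": item_id, "chunk": chunk_path, "start": start, "end": f.tell()}
            )
    return index


def read_item(index, item_id):
    """Fetch one item with a single seek and read; the rest of the file is untouched."""
    entry = index[item_id]
    with open(entry["chunk"], "rb") as f:
        f.seek(entry["start"])
        return f.read(entry["end"] - entry["start"])


# Five items of growing size stand in for many small training instances.
payloads = [bytes([k]) * (k + 1) for k in range(5)]
chunk_path = os.path.join(tempfile.mkdtemp(), "chunk_0.bin")

index = write_chunk(chunk_path, payloads)
item = read_item(index, 3)
```

Only `index` lives in RAM here; the data stays in a single file on disk, so the dataset no longer consists of thousands of tiny files.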

Say what?

We use Python's native file I/O operations, f.seek() and f.read(), to read from and write to large binary chunk files. We build a custom index based on byte offsets so that any training instance can be accessed in O(1). If enough memory is available, chunks can be mmap()-ed into RAM to speed up I/O operations.
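The memory-mapped variant can be sketched with Python's standard `mmap` module. This is an illustration of the general technique under assumed names (`records`, `offsets`, `chunk_0.bin`), not the library's own code.

```python
import mmap
import os
import tempfile

records = [b"first", b"second-record", b"third"]
chunk_path = os.path.join(tempfile.mkdtemp(), "chunk_0.bin")

# Write the records back to back and remember each (start, end) byte offset.
offsets = []
with open(chunk_path, "wb") as f:
    for payload in records:
        start = f.tell()
        f.write(payload)
        offsets.append((start, f.tell()))

# Map the whole chunk read-only; slicing the map copies out only the requested bytes.
with open(chunk_path, "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
        start, end = offsets[1]
        second = mapped[start:end]
```

With a mapping, the operating system pages the chunk in on demand and caches it, so repeated reads of the same region avoid explicit seek and read calls.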

Installation

Install the pypi package via pip:

pip install -U acumenindexer

Alternatively, install directly via git:

pip install -U git+https://github.com/cosmaadrian/acumen-indexer

Usage

Building an index

```python
import os

import cv2
import numpy as np
import acumenindexer as ai

def data_read_fn(path):
    # read image from file
    image = cv2.imread(path) # or something like this
    x = {'path': path}

    # must return (data:numpy.ndarray, metadata:dict)
    return image, x

# build full paths so that cv2.imread can locate each file
file_names = [os.path.join('images', x) for x in os.listdir('images/')]

ai.split_into_chunks(
    data_list = file_names,
    read_fn = data_read_fn,
    output_path = 'my_data',
    chunk_size_bytes = 5 * 1024 * 1024, # 5MB
    use_gzip = False,
    dtype = np.float16,
    n_jobs = 1,
)
```

Reading from index

```python
import numpy as np
import acumenindexer as ai

the_index = ai.load_index('index.csv') # just a pd.DataFrame

# in_memory = False reads directly from chunk in O(1) using f.seek()
# in_memory = True uses mmap to map the data into RAM
read_fn = ai.read_from_index(the_index, dtype = np.float16, in_memory = True, use_gzip = False)

for i in range(10):
    data = read_fn(i)
    print(data) # contains both metadata and actual binary data
```

Use with PyTorch Datasets

```python
import numpy as np
from torch.utils.data import Dataset
import acumenindexer as ai

class CustomDataset(Dataset):
    def __init__(self, index_path):
        # load the index first, then build the reader from it
        self.index = ai.load_index(index_path)
        self.read_fn = ai.read_from_index(self.index, dtype = np.float16, in_memory = True, use_gzip = False)

    def __len__(self):
        return len(self.index)

    def __getitem__(self, idx):
        data = self.read_fn(idx)
        return data
```

License

This repository uses the MIT License.

Owner

Name: Adrian Cosma
Login: cosmaadrian
Kind: user
Location: Bucharest, Romania
Company: University Politehnica of Bucharest

Repositories: 21
Profile: https://github.com/cosmaadrian

Mercenary Researcher

GitHub Events

Total

Issues event: 2
Push event: 1

Last Year

Issues event: 2
Push event: 1

Issues and Pull Requests

Last synced: 11 months ago

All Time

Total issues: 2
Total pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Total issue authors: 1
Total pull request authors: 0
Average comments per issue: 0.0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 2
Pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Issue authors: 1
Pull request authors: 0
Average comments per issue: 0.0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Top Authors

Issue Authors

cosmaadrian (2)

Pull Request Authors

Top Labels

Issue Labels

Pull Request Labels

Packages

Total packages: 2
Total downloads:
- pypi 24 last-month

Total dependent packages: 0
(may contain duplicates)
Total dependent repositories: 0
(may contain duplicates)
Total versions: 2
Total maintainers: 1

pypi.org: acumenindexer

The AcumenIndexer

Homepage: https://github.com/cosmaadrian/acumen-indexer
Documentation: https://acumenindexer.readthedocs.io/
License: MIT License
Latest release: 0.0.1
published about 2 years ago

Versions: 1
Dependent Packages: 0
Dependent Repositories: 0
Downloads: 13 Last month

Rankings

Dependent packages count: 10.6%

Average: 35.1%

Dependent repos count: 59.6%

Maintainers (1)

cosmaadrian

Last synced: 12 months ago

pypi.org: acumenindexer-cosmaadrian

The AcumenIndexer

Homepage: https://github.com/cosmaadrian/acumen-indexer
Documentation: https://acumenindexer-cosmaadrian.readthedocs.io/
License: MIT License
Latest release: 0.0.1
published about 2 years ago

Versions: 1
Dependent Packages: 0
Dependent Repositories: 0
Downloads: 11 Last month

Rankings

Dependent packages count: 10.6%

Average: 35.1%

Dependent repos count: 59.6%

Maintainers (1)

cosmaadrian

Last synced: 12 months ago
