https://github.com/cumbof/chopin2

Domain-Agnostic Supervised Learning with Hyperdimensional Computing

Science Score: 36.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
  • DOI references
    Found 18 DOI reference(s) in README
  • Academic publication links
  • Committers with academic emails
    1 of 2 committers (50.0%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (11.9%) to scientific vocabulary

Keywords

apache-spark backward-elimination feature-selection gpgpu hd-computing machine-learning supervised-learning vsa
Last synced: 5 months ago

Repository

Domain-Agnostic Supervised Learning with Hyperdimensional Computing

Basic Info
  • Host: GitHub
  • Owner: cumbof
  • License: gpl-3.0
  • Language: Python
  • Default Branch: master
  • Homepage:
  • Size: 35.1 MB
Statistics
  • Stars: 12
  • Watchers: 2
  • Forks: 2
  • Open Issues: 0
  • Releases: 7
Topics
apache-spark backward-elimination feature-selection gpgpu hd-computing machine-learning supervised-learning vsa
Created about 5 years ago · Last pushed over 1 year ago
Metadata Files
Readme Contributing License

README.md

chopin2

Supervised Classification with Hyperdimensional Computing.


Originally forked from https://github.com/moimani/HD-Permutaion

This repository includes some Python 3.8 utilities to build a Hyperdimensional Computing classification model according to the architecture originally introduced in https://doi.org/10.1109/DAC.2018.8465708
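
In short, the architecture maps every observation to a high-dimensional bipolar vector: each feature value is quantised to one of a set of correlated level hypervectors, bound to that feature's position hypervector, and the results are bundled (summed) into a single sample hypervector; a class hypervector is the bundle of its training samples, and prediction picks the class with the highest cosine similarity. The following NumPy sketch illustrates this general scheme; it is a minimal illustration under our own naming, not chopin2's actual implementation.

```
import numpy as np

rng = np.random.default_rng(42)
D, LEVELS = 10000, 100                      # --dimensionality and --levels

def make_level_hvs():
    """Correlated bipolar level hypervectors: adjacent levels differ by a
    small number of flipped components, so nearby values stay similar."""
    hvs = [rng.choice([-1, 1], size=D)]
    flips = D // (2 * (LEVELS - 1))
    for _ in range(LEVELS - 1):
        hv = hvs[-1].copy()
        idx = rng.choice(D, size=flips, replace=False)
        hv[idx] *= -1
        hvs.append(hv)
    return np.stack(hvs)

def encode(x, level_hvs, pos_hvs, lo, hi):
    """Quantise each feature to a level, bind (elementwise multiply) it with
    the feature's position hypervector, and bundle into one sample HV."""
    q = np.clip(((x - lo) / (hi - lo) * (LEVELS - 1)).astype(int), 0, LEVELS - 1)
    return (level_hvs[q] * pos_hvs).sum(axis=0)

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

# Toy data: two classes, eight features.
X = rng.normal(size=(60, 8)); X[30:] += 2.0
y = np.array([0] * 30 + [1] * 30)

level_hvs = make_level_hvs()
pos_hvs = rng.choice([-1, 1], size=(X.shape[1], D))   # one HV per feature
lo, hi = X.min(), X.max()

# Training: bundle every encoded sample into its class hypervector.
class_hvs = {c: np.zeros(D) for c in np.unique(y)}
for xi, yi in zip(X, y):
    class_hvs[yi] += encode(xi, level_hvs, pos_hvs, lo, hi)

# Inference: assign the class whose hypervector is most similar.
query = encode(X[0], level_hvs, pos_hvs, lo, hi)
pred = max(class_hvs, key=lambda c: cosine(query, class_hvs[c]))
print("predicted class:", pred)
```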

The src/generators folder contains two Python 3.8 scripts able to create training and test datasets with randomly selected samples from:

- BRCA, KIRP, and THCA DNA-methylation data from the paper "Classification of Large DNA Methylation Datasets for Identifying Cancer Drivers" by Fabrizio Celli, Fabio Cumbo, and Emanuel Weitschek;
- Gene-expression quantification and Methylation Beta Value experiments provided by OpenGDC for all 33 tumor types of the TCGA program.

Due to their size, the datasets are not included in this repository, but they can be retrieved from:

- ftp://bioinformatics.iasi.cnr.it/public/bigbiocldna-methdata/
- http://geco.deib.polimi.it/opengdc/ and https://github.com/cumbof/OpenGDC/

The isolet dataset is part of the originally forked repository and has been kept to provide a simple toy model for testing purposes only.

Install

We provide chopin2 as a Python 3.8 package that can be installed through pip and conda; it is also available as a Docker image.

Please use one of the following commands to start playing with chopin2:

```
# Install chopin2 with pip
pip install chopin2

# Install chopin2 with conda
conda install -c conda-forge chopin2

# Initialise the Docker image
docker build -t chopin2 .
docker run -it chopin2
```

Please note that chopin2 is also available as a Galaxy tool. Its wrapper is available on the official Galaxy ToolShed at https://toolshed.g2.bx.psu.edu/view/fabio/chopin2

Usage

Once installed, you are ready to start playing with chopin2.

Try the following command to run chopin2 on the isolet dataset:

```
chopin2 --dimensionality 10000 \
        --levels 100 \
        --retrain 10 \
        --pickle ../dataset/isolet/isolet.pkl \
        --psplit_training 80 \
        --dump \
        --nproc 4 \
        --verbose
```

In order to run it on Spark, a few additional arguments must be specified:

```
chopin2 --dimensionality 10000 \
        --levels 100 \
        --retrain 10 \
        --pickle ../dataset/isolet/isolet.pkl \
        --psplit_training 80 \
        --dump \
        --spark \
        --slices 10 \
        --master local \
        --memory 2048m \
        --verbose
```

List of standard arguments:

- --dimensionality -- Dimensionality of the HD model (default 10000)
- --levels -- Number of level hypervectors (default 2)
- --retrain -- Number of retraining iterations (default 0); see the sketch after this list
- --stop -- Stop retraining if the error rate does not change (default False)
- --dataset -- Path to the dataset file
- --fieldsep -- Field separator (default ",")
- --psplit_training -- Percentage of observations used to train the model; the remaining percentage is used to test the classification model
- --crossv_k -- Number of folds for cross-validation; HD models are cross-validated if this value is greater than 1
- --seed -- Seed for reproducing the random sampling of observations used to build the training and test sets (default 0)
- --pickle -- Path to the pickle file; if specified, the --dataset, --fieldsep, and --training parameters are not used
- --dump -- Build summary and log files (default False)
- --cleanup -- Delete the classification model as soon as it produces the prediction accuracy (default False)
- --keep_levels -- Do not delete the level hypervectors; works in conjunction with --cleanup only (default True)
- --nproc -- Number of parallel jobs for the creation of the HD model; ignored if --spark is enabled (default 1)
- --verbose -- Print results in real time (default False)
- --cite -- Print references and exit
- -v, --version -- Print the current chopin2 version and exit
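
--retrain and --stop correspond to the usual HD refinement loop: after the initial single-pass training, the hypervector of each misclassified sample is added to its true class and subtracted from the wrongly predicted one, and retraining halts early if the error rate stops changing. A minimal sketch of that loop, reusing the cosine/encode helpers from the earlier sketch (the exact update rule in chopin2 may differ):

```
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

def retrain(X, y, class_hvs, encode_fn, iterations=10, stop=False):
    """Refine class hypervectors: on each mistake, add the sample
    hypervector to the true class and subtract it from the predicted one."""
    prev_rate = None
    for _ in range(iterations):                    # --retrain
        errors = 0
        for xi, yi in zip(X, y):
            hv = encode_fn(xi)
            pred = max(class_hvs, key=lambda c: cosine(hv, class_hvs[c]))
            if pred != yi:
                class_hvs[yi] = class_hvs[yi] + hv
                class_hvs[pred] = class_hvs[pred] - hv
                errors += 1
        rate = errors / len(X)
        if stop and rate == prev_rate:             # --stop semantics
            break
        prev_rate = rate
    return class_hvs
```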

List of arguments to enable backward variable selection:

- --features -- Path to a file with a single column containing the whole set or a subset of features
- --select_features -- Trigger the backward variable selection method for the identification of the most significant features (see the sketch after this list). Warning: computationally intense!
- --group_min -- Minimum number of features among those specified with the --features argument (default 1)
- --accuracy_threshold -- Stop the execution if the best accuracy achieved during the previous group of runs is lower than this number (default 60.0)
- --accuracy_uncertainty_perc -- Take a run into account even if its accuracy is lower than the best accuracy achieved in the same group minus this percentage
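
Conceptually, backward variable selection starts from the full feature set and greedily removes the feature whose absence hurts accuracy least, stopping when --group_min features remain or accuracy falls below --accuracy_threshold. A schematic sketch, where evaluate(subset) stands in for a full chopin2 train/test run restricted to those features (a hypothetical helper, not part of the chopin2 API):

```
def backward_selection(features, evaluate, group_min=1, accuracy_threshold=60.0):
    """Greedy backward elimination: drop one feature per round, keeping the
    subset with the best accuracy. `evaluate(subset) -> accuracy` is a
    placeholder for an end-to-end train/test run on that feature subset."""
    selected = list(features)
    best_acc = evaluate(selected)
    while len(selected) > group_min:
        # Score the removal of each remaining feature; computationally
        # intense: one full training run per candidate subset.
        trials = [(evaluate([f for f in selected if f != g]), g) for g in selected]
        acc, worst = max(trials)
        if acc < accuracy_threshold:     # --accuracy_threshold: stop if too low
            break
        selected.remove(worst)           # drop the least useful feature
        best_acc = acc
    return selected, best_acc
```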

List of arguments for the execution of the classifier on a Spark distributed environment:

- --spark -- Build the classification model in an Apache Spark distributed environment (see the sketch after this list)
- --slices -- Number of slices when --spark is enabled; ignored if --gpu is enabled
- --master -- Master node address
- --memory -- Executor memory
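
With --spark, the encode-and-bundle step parallelises naturally: samples are spread across --slices partitions, encoded independently, and reduced per class. A rough pyspark illustration of that pattern, reusing X, y, encode, and the hypervector tables from the first sketch (again, an assumed shape of the computation, not chopin2's actual code):

```
from pyspark import SparkContext

# --master selects the cluster; --memory maps to executor memory settings.
sc = SparkContext(master="local", appName="hd-bundling")

# Distribute (sample, label) pairs over --slices partitions, encode each
# sample into a hypervector, then sum the hypervectors of every class.
rdd = sc.parallelize(list(zip(X, y)), numSlices=10)       # --slices 10
class_hvs = dict(
    rdd.map(lambda p: (int(p[1]), encode(p[0], level_hvs, pos_hvs, lo, hi)))
       .reduceByKey(lambda a, b: a + b)                   # bundle per class
       .collect()
)
sc.stop()
```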

List of arguments for the execution of the classifier on NVidia-powered GPUs:

- --gpu -- Build the classification model on an NVidia-powered GPU (see the sketch after this list); ignored if --spark is specified
- --tblock -- Number of threads per block when --gpu is enabled; ignored if --spark is enabled
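
On the GPU side, the same bundling step can be written as a CUDA kernel; since chopin2 depends on numba, here is a small numba.cuda sketch showing where --tblock (threads per block) enters a kernel launch. It is illustrative only, requires a CUDA-capable NVidia GPU, and is not chopin2's kernel:

```
import numpy as np
from numba import cuda

@cuda.jit
def bundle(sample_hvs, out):
    """out[j] = sum over samples i of sample_hvs[i, j]."""
    j = cuda.grid(1)                      # absolute thread index
    if j < out.shape[0]:
        acc = 0
        for i in range(sample_hvs.shape[0]):
            acc += sample_hvs[i, j]
        out[j] = acc

hvs = np.random.choice([-1, 1], size=(64, 10000)).astype(np.int32)
out = np.zeros(10000, dtype=np.int32)

tblock = 32                               # --tblock: threads per block
blocks = (out.shape[0] + tblock - 1) // tblock
bundle[blocks, tblock](hvs, out)          # launch enough blocks to cover D
```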

Credits

Please credit our work in your manuscript by citing:

Fabio Cumbo, Eleonora Cappelli, and Emanuel Weitschek, "A brain-inspired hyperdimensional computing approach for classifying massive DNA methylation data of cancer", MDPI Algorithms, 2020 https://doi.org/10.3390/a13090233

Fabio Cumbo, Emanuel Weitschek, and Daniel Blankenberg, "hdlib: A Python library for designing Vector-Symbolic Architectures", Journal of Open Source Software, 2023 https://doi.org/10.21105/joss.05704

Do not forget to also cite the following paper, from which this work takes inspiration:

Mohsen Imani, Chenyu Huang, Deqian Kong, and Tajana Rosing, "Hierarchical Hyperdimensional Computing for Energy Efficient Classification", IEEE/ACM Design Automation Conference (DAC), 2018 https://doi.org/10.1109/DAC.2018.8465708

Owner

  • Name: Fabio Cumbo
  • Login: cumbof
  • Kind: user
  • Location: Cleveland, OH, USA
  • Company: Genomic Medicine Institute, Lerner Research Institute, Cleveland Clinic

Ph.D. in Computer Science and Automation Engineering, Postdoctoral Research Fellow @BlankenbergLab, GMI, LRI, Cleveland Clinic, USA

GitHub Events

Total
  • Watch event: 1
  • Fork event: 1
Last Year
  • Watch event: 1
  • Fork event: 1

Committers

Last synced: 8 months ago

All Time
  • Total Commits: 182
  • Total Committers: 2
  • Avg Commits per committer: 91.0
  • Development Distribution Score (DDS): 0.016 (see the worked formula below)
Past Year
  • Commits: 0
  • Committers: 0
  • Avg Commits per committer: 0.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
fabio-cumbo f****o@g****m 179
Mohsen m****i@e****u 3
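
The DDS reported above follows directly from this table: it is commonly computed as one minus the top committer's share of commits, i.e.

$\mathrm{DDS} = 1 - \frac{\text{top committer commits}}{\text{total commits}} = 1 - \frac{179}{182} \approx 0.016$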
Committer Domains (Top 20 + Academic)

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 1
  • Total pull requests: 0
  • Average time to close issues: 6 days
  • Average time to close pull requests: N/A
  • Total issue authors: 1
  • Total pull request authors: 0
  • Average comments per issue: 5.0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • bernt-matthias (1)
Pull Request Authors

Packages

  • Total packages: 2
  • Total downloads:
    • pypi 17 last-month
  • Total dependent packages: 0
    (may contain duplicates)
  • Total dependent repositories: 1
    (may contain duplicates)
  • Total versions: 15
  • Total maintainers: 1
pypi.org: chopin2

Supervised Classification with Hyperdimensional Computing

  • Versions: 13
  • Dependent Packages: 0
  • Dependent Repositories: 1
  • Downloads: 17 Last month
Rankings
Dependent packages count: 10.0%
Stargazers count: 20.3%
Dependent repos count: 21.8%
Forks count: 22.6%
Average: 25.8%
Downloads: 54.3%
Maintainers (1)
Last synced: 6 months ago
conda-forge.org: chopin2

Supervised Classification with Hyperdimensional Computing for massive datasets with commodity hardware. Supporting k-fold cross-validation and feature selection as backward variable elimination.

  • Versions: 2
  • Dependent Packages: 0
  • Dependent Repositories: 0
Rankings
Dependent repos count: 34.0%
Average: 42.6%
Dependent packages count: 51.2%
Last synced: 6 months ago

Dependencies

chopin2/requirements.txt pypi
  • numba ==0.51.2
  • numpy ==1.16.3
  • pyspark ==2.4.0
setup.py pypi
  • numba *
  • numpy *
  • pyspark *
Dockerfile docker
  • ubuntu 18.04 build