https://github.com/cumbof/chopin2

Domain-Agnostic Supervised Learning with Hyperdimensional Computing

Science Score: 36.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
  • DOI references
    Found 18 DOI reference(s) in README
  • Academic publication links
  • Committers with academic emails
    1 of 2 committers (50.0%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (11.9%) to scientific vocabulary

Keywords

apache-spark backward-elimination feature-selection gpgpu hd-computing machine-learning supervised-learning vsa
Last synced: 5 months ago

Repository

Domain-Agnostic Supervised Learning with Hyperdimensional Computing

Basic Info
  • Host: GitHub
  • Owner: cumbof
  • License: gpl-3.0
  • Language: Python
  • Default Branch: master
  • Homepage:
  • Size: 35.1 MB
Statistics
  • Stars: 12
  • Watchers: 2
  • Forks: 2
  • Open Issues: 0
  • Releases: 7
Topics
apache-spark backward-elimination feature-selection gpgpu hd-computing machine-learning supervised-learning vsa
Created about 5 years ago · Last pushed over 1 year ago
Metadata Files
Readme Contributing License

README.md

chopin2

Supervised Classification with Hyperdimensional Computing.


Originally forked from https://github.com/moimani/HD-Permutaion

This repository includes some Python 3.8 utilities to build a Hyperdimensional Computing classification model according to the architecture originally introduced in https://doi.org/10.1109/DAC.2018.8465708
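
In short, the architecture maps every observation to a high-dimensional bipolar vector: each feature value is quantised to one of a set of correlated level hypervectors, bound to that feature's position hypervector, and the results are bundled (summed) into a single sample hypervector; a class hypervector is the bundle of its training samples, and prediction picks the class with the highest cosine similarity. The following NumPy sketch illustrates this general scheme; it is a minimal illustration under our own naming, not chopin2's actual implementation.

```
import numpy as np

rng = np.random.default_rng(42)
D, LEVELS = 10000, 100                      # --dimensionality and --levels

def make_level_hvs():
    """Correlated bipolar level hypervectors: adjacent levels differ by a
    small number of flipped components, so nearby values stay similar."""
    hvs = [rng.choice([-1, 1], size=D)]
    flips = D // (2 * (LEVELS - 1))
    for _ in range(LEVELS - 1):
        hv = hvs[-1].copy()
        idx = rng.choice(D, size=flips, replace=False)
        hv[idx] *= -1
        hvs.append(hv)
    return np.stack(hvs)

def encode(x, level_hvs, pos_hvs, lo, hi):
    """Quantise each feature to a level, bind (elementwise multiply) it with
    the feature's position hypervector, and bundle into one sample HV."""
    q = np.clip(((x - lo) / (hi - lo) * (LEVELS - 1)).astype(int), 0, LEVELS - 1)
    return (level_hvs[q] * pos_hvs).sum(axis=0)

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

# Toy data: two classes, eight features.
X = rng.normal(size=(60, 8)); X[30:] += 2.0
y = np.array([0] * 30 + [1] * 30)

level_hvs = make_level_hvs()
pos_hvs = rng.choice([-1, 1], size=(X.shape[1], D))   # one HV per feature
lo, hi = X.min(), X.max()

# Training: bundle every encoded sample into its class hypervector.
class_hvs = {c: np.zeros(D) for c in np.unique(y)}
for xi, yi in zip(X, y):
    class_hvs[yi] += encode(xi, level_hvs, pos_hvs, lo, hi)

# Inference: assign the class whose hypervector is most similar.
query = encode(X[0], level_hvs, pos_hvs, lo, hi)
pred = max(class_hvs, key=lambda c: cosine(query, class_hvs[c]))
print("predicted class:", pred)
```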

The src/generators folder contains two Python 3.8 scripts able to create training and test datasets with randomly selected samples from:

- BRCA, KIRP, and THCA DNA-methylation data from the paper "Classification of Large DNA Methylation Datasets for Identifying Cancer Drivers" by Fabrizio Celli, Fabio Cumbo, and Emanuel Weitschek;
- Gene-expression quantification and Methylation Beta Value experiments provided by OpenGDC for all 33 tumor types of the TCGA program.

Due to their size, the datasets are not included in this repository, but they can be retrieved from:

- ftp://bioinformatics.iasi.cnr.it/public/bigbiocldna-methdata/
- http://geco.deib.polimi.it/opengdc/ and https://github.com/cumbof/OpenGDC/

The isolet dataset is part of the originally forked repository and has been kept to provide a simple toy model for testing purposes only.

Install

We provide chopin2 as a Python 3.8 package that can be installed through pip and conda; it is also available as a Docker image.

Please use one of the following commands to start playing with chopin2:

```
# Install chopin2 with pip
pip install chopin2

# Install chopin2 with conda
conda install -c conda-forge chopin2

# Initialise the Docker image
docker build -t chopin2 .
docker run -it chopin2
```

Please note that chopin2 is also available as a Galaxy tool. Its wrapper is available on the official Galaxy ToolShed at https://toolshed.g2.bx.psu.edu/view/fabio/chopin2

Usage

Once installed, you are ready to start playing with chopin2.

Try the following command to run chopin2 on the isolet dataset:

```
chopin2 --dimensionality 10000 \
        --levels 100 \
        --retrain 10 \
        --pickle ../dataset/isolet/isolet.pkl \
        --psplit_training 80 \
        --dump \
        --nproc 4 \
        --verbose
```

In order to run it on Spark, a few additional arguments must be specified:

```
chopin2 --dimensionality 10000 \
        --levels 100 \
        --retrain 10 \
        --pickle ../dataset/isolet/isolet.pkl \
        --psplit_training 80 \
        --dump \
        --spark \
        --slices 10 \
        --master local \
        --memory 2048m \
        --verbose
```

List of standard arguments:

- --dimensionality -- Dimensionality of the HD model (default 10000)
- --levels -- Number of level hypervectors (default 2)
- --retrain -- Number of retraining iterations (default 0); see the sketch after this list
- --stop -- Stop retraining if the error rate does not change (default False)
- --dataset -- Path to the dataset file
- --fieldsep -- Field separator (default ",")
- --psplit_training -- Percentage of observations used to train the model; the remaining percentage is used to test the classification model
- --crossv_k -- Number of folds for cross-validation; HD models are cross-validated if this value is greater than 1
- --seed -- Seed for reproducing the random sampling of observations used to build the training and test sets (default 0)
- --pickle -- Path to the pickle file; if specified, the --dataset, --fieldsep, and --training parameters are not used
- --dump -- Build summary and log files (default False)
- --cleanup -- Delete the classification model as soon as it produces the prediction accuracy (default False)
- --keep_levels -- Do not delete the level hypervectors; works in conjunction with --cleanup only (default True)
- --nproc -- Number of parallel jobs for the creation of the HD model; ignored if --spark is enabled (default 1)
- --verbose -- Print results in real time (default False)
- --cite -- Print references and exit
- -v, --version -- Print the current chopin2 version and exit
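
--retrain and --stop correspond to the usual HD refinement loop: after the initial single-pass training, the hypervector of each misclassified sample is added to its true class and subtracted from the wrongly predicted one, and retraining halts early if the error rate stops changing. A minimal sketch of that loop, reusing the cosine/encode helpers from the earlier sketch (the exact update rule in chopin2 may differ):

```
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

def retrain(X, y, class_hvs, encode_fn, iterations=10, stop=False):
    """Refine class hypervectors: on each mistake, add the sample
    hypervector to the true class and subtract it from the predicted one."""
    prev_rate = None
    for _ in range(iterations):                    # --retrain
        errors = 0
        for xi, yi in zip(X, y):
            hv = encode_fn(xi)
            pred = max(class_hvs, key=lambda c: cosine(hv, class_hvs[c]))
            if pred != yi:
                class_hvs[yi] = class_hvs[yi] + hv
                class_hvs[pred] = class_hvs[pred] - hv
                errors += 1
        rate = errors / len(X)
        if stop and rate == prev_rate:             # --stop semantics
            break
        prev_rate = rate
    return class_hvs
```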

List of arguments to enable backward variable selection:

- --features -- Path to a file with a single column containing the whole set or a subset of features
- --select_features -- Trigger the backward variable selection method for the identification of the most significant features (see the sketch after this list). Warning: computationally intense!
- --group_min -- Minimum number of features among those specified with the --features argument (default 1)
- --accuracy_threshold -- Stop the execution if the best accuracy achieved during the previous group of runs is lower than this number (default 60.0)
- --accuracy_uncertainty_perc -- Take a run into account even if its accuracy is lower than the best accuracy achieved in the same group minus this percentage
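
Conceptually, backward variable selection starts from the full feature set and greedily removes the feature whose absence hurts accuracy least, stopping when --group_min features remain or accuracy falls below --accuracy_threshold. A schematic sketch, where evaluate(subset) stands in for a full chopin2 train/test run restricted to those features (a hypothetical helper, not part of the chopin2 API):

```
def backward_selection(features, evaluate, group_min=1, accuracy_threshold=60.0):
    """Greedy backward elimination: drop one feature per round, keeping the
    subset with the best accuracy. `evaluate(subset) -> accuracy` is a
    placeholder for an end-to-end train/test run on that feature subset."""
    selected = list(features)
    best_acc = evaluate(selected)
    while len(selected) > group_min:
        # Score the removal of each remaining feature; computationally
        # intense: one full training run per candidate subset.
        trials = [(evaluate([f for f in selected if f != g]), g) for g in selected]
        acc, worst = max(trials)
        if acc < accuracy_threshold:     # --accuracy_threshold: stop if too low
            break
        selected.remove(worst)           # drop the least useful feature
        best_acc = acc
    return selected, best_acc
```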

List of arguments for the execution of the classifier on a Spark distributed environment:

- --spark -- Build the classification model in an Apache Spark distributed environment (see the sketch after this list)
- --slices -- Number of slices when --spark is enabled; ignored if --gpu is enabled
- --master -- Master node address
- --memory -- Executor memory
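
With --spark, the encode-and-bundle step parallelises naturally: samples are spread across --slices partitions, encoded independently, and reduced per class. A rough pyspark illustration of that pattern, reusing X, y, encode, and the hypervector tables from the first sketch (again, an assumed shape of the computation, not chopin2's actual code):

```
from pyspark import SparkContext

# --master selects the cluster; --memory maps to executor memory settings.
sc = SparkContext(master="local", appName="hd-bundling")

# Distribute (sample, label) pairs over --slices partitions, encode each
# sample into a hypervector, then sum the hypervectors of every class.
rdd = sc.parallelize(list(zip(X, y)), numSlices=10)       # --slices 10
class_hvs = dict(
    rdd.map(lambda p: (int(p[1]), encode(p[0], level_hvs, pos_hvs, lo, hi)))
       .reduceByKey(lambda a, b: a + b)                   # bundle per class
       .collect()
)
sc.stop()
```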

List of arguments for the execution of the classifier on NVidia-powered GPUs:

- --gpu -- Build the classification model on an NVidia-powered GPU (see the sketch after this list); ignored if --spark is specified
- --tblock -- Number of threads per block when --gpu is enabled; ignored if --spark is enabled
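
On the GPU side, the same bundling step can be written as a CUDA kernel; since chopin2 depends on numba, here is a small numba.cuda sketch showing where --tblock (threads per block) enters a kernel launch. It is illustrative only, requires a CUDA-capable NVidia GPU, and is not chopin2's kernel:

```
import numpy as np
from numba import cuda

@cuda.jit
def bundle(sample_hvs, out):
    """out[j] = sum over samples i of sample_hvs[i, j]."""
    j = cuda.grid(1)                      # absolute thread index
    if j < out.shape[0]:
        acc = 0
        for i in range(sample_hvs.shape[0]):
            acc += sample_hvs[i, j]
        out[j] = acc

hvs = np.random.choice([-1, 1], size=(64, 10000)).astype(np.int32)
out = np.zeros(10000, dtype=np.int32)

tblock = 32                               # --tblock: threads per block
blocks = (out.shape[0] + tblock - 1) // tblock
bundle[blocks, tblock](hvs, out)          # launch enough blocks to cover D
```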

Credits

Please credit our work in your manuscript by citing:

Fabio Cumbo, Eleonora Cappelli, and Emanuel Weitschek, "A brain-inspired hyperdimensional computing approach for classifying massive DNA methylation data of cancer", MDPI Algorithms, 2020 https://doi.org/10.3390/a13090233

Fabio Cumbo, Emanuel Weitschek, and Daniel Blankenberg, "hdlib: A Python library for designing Vector-Symbolic Architectures", Journal of Open Source Software, 2023 https://doi.org/10.21105/joss.05704

Do not forget to also cite the following paper, from which this work takes inspiration:

Mohsen Imani, Chenyu Huang, Deqian Kong, and Tajana Rosing, "Hierarchical Hyperdimensional Computing for Energy Efficient Classification", IEEE/ACM Design Automation Conference (DAC), 2018 https://doi.org/10.1109/DAC.2018.8465708

Owner

  • Name: Fabio Cumbo
  • Login: cumbof
  • Kind: user
  • Location: Cleveland, OH, USA
  • Company: Genomic Medicine Institute, Lerner Research Institute, Cleveland Clinic

Ph.D. in Computer Science and Automation Engineering, Postdoctoral Research Fellow @BlankenbergLab, GMI, LRI, Cleveland Clinic, USA

GitHub Events

Total
  • Watch event: 1
  • Fork event: 1
Last Year
  • Watch event: 1
  • Fork event: 1

Committers

Last synced: 8 months ago

All Time
  • Total Commits: 182
  • Total Committers: 2
  • Avg Commits per committer: 91.0
  • Development Distribution Score (DDS): 0.016 (see the worked formula below)
Past Year
  • Commits: 0
  • Committers: 0
  • Avg Commits per committer: 0.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
fabio-cumbo f****o@g****m 179
Mohsen m****i@e****u 3
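
The DDS reported above follows directly from this table: it is commonly computed as one minus the top committer's share of commits, i.e.

$\mathrm{DDS} = 1 - \frac{\text{top committer commits}}{\text{total commits}} = 1 - \frac{179}{182} \approx 0.016$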
Committer Domains (Top 20 + Academic)

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 1
  • Total pull requests: 0
  • Average time to close issues: 6 days
  • Average time to close pull requests: N/A
  • Total issue authors: 1
  • Total pull request authors: 0
  • Average comments per issue: 5.0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • bernt-matthias (1)
Pull Request Authors

Packages

  • Total packages: 2
  • Total downloads:
    • pypi 17 last-month
  • Total dependent packages: 0
    (may contain duplicates)
  • Total dependent repositories: 1
    (may contain duplicates)
  • Total versions: 15
  • Total maintainers: 1
pypi.org: chopin2

Supervised Classification with Hyperdimensional Computing

  • Versions: 13
  • Dependent Packages: 0
  • Dependent Repositories: 1
  • Downloads: 17 Last month
Rankings
Dependent packages count: 10.0%
Stargazers count: 20.3%
Dependent repos count: 21.8%
Forks count: 22.6%
Average: 25.8%
Downloads: 54.3%
Maintainers (1)
Last synced: 6 months ago
conda-forge.org: chopin2

Supervised Classification with Hyperdimensional Computing for massive datasets with commodity hardware. Supporting k-fold cross-validation and feature selection as backward variable elimination.

  • Versions: 2
  • Dependent Packages: 0
  • Dependent Repositories: 0
Rankings
Dependent repos count: 34.0%
Average: 42.6%
Dependent packages count: 51.2%
Last synced: 6 months ago

Dependencies

chopin2/requirements.txt pypi
  • numba ==0.51.2
  • numpy ==1.16.3
  • pyspark ==2.4.0
setup.py pypi
  • numba *
  • numpy *
  • pyspark *
Dockerfile docker
  • ubuntu 18.04 build